Computational Intelligence and Neuroscience

Volume 2017, Article ID 5874896, 15 pages

https://doi.org/10.1155/2017/5874896

## Modified Mahalanobis Taguchi System for Imbalance Data Classification

Industrial Engineering Department, German Jordanian University, P.O. Box 35247, Amman 11180, Jordan

Correspondence should be addressed to Mahmoud El-Banna; malbanna@gmail.com

Received 7 March 2017; Revised 14 May 2017; Accepted 22 May 2017; Published 24 July 2017

Academic Editor: Massimo Panella

Copyright © 2017 Mahmoud El-Banna. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model, named the Modified Mahalanobis Taguchi System (MMTS), is formulated based on minimizing the distance between the MTS Receiver Operating Characteristics (ROC) curve and the theoretical optimal point. To validate the classification efficacy of MMTS, it has been benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), the Probabilistic Thresholding Method (PTM), the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study from the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA).

#### 1. Introduction

Classification is one of the supervised learning approaches in which a new observation needs to be assigned to one of the predetermined classes or categories. If the number of the predetermined classes is more than two, it is a multiclass classification problem; otherwise, the problem is known as the binary classification problem. At present, these problems have found applications in different domains such as product quality [1] and speech recognition [2].

The classification accuracy depends on both the classifier and the data types. Classifier types can be categorized according to supervised versus unsupervised learning, linear versus nonlinear hyperplane, and feature selection versus feature extraction based approaches [3]. On the other hand, Sun et al. [4] reported that the parameters affecting the classification are the overlap between classes (i.e., class separability), small sample size, the within-class concept (i.e., a single class may consist of various subclasses, which do not necessarily have the same size), and the data distribution of each class. If the data distribution of one class differs from the distributions of the others, the data is considered imbalanced. The border separating balanced from imbalanced data is vague; for example, the imbalance ratio, defined as the ratio of majority to minority class observations, has been reported anywhere from values as small as 100:1 up to 10000:1 [5].

The assumption of an equal number of observations in each class is elementary in the common classification methods such as decision tree analysis, Support Vector Machines, discriminant analysis, and neural networks [6]. Imbalanced data occurs often in real life, for example in text classification [7]. Applying the common classifiers to applications with imbalanced data leads to bias in the classification accuracy (i.e., the predictive accuracy for the minority class will be much lower than for the majority class) and/or to treating minority observations as noise or outliers, which the classifier then ignores.

To handle the classification of imbalanced data, the research community uses data-level approaches, algorithmic approaches, or a combination of both. In the data approach, the main idea is to balance the class density randomly or informatively (i.e., targeted), either by eliminating (downsampling) majority class observations, by replicating (oversampling) minority class observations, or both. In the algorithmic approach, the main idea is to adapt the classifier algorithm toward the small class. A combination of the data-level and algorithmic-level approaches is also used and is known as cost-sensitive learning.
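The two resampling directions mentioned above can be sketched in a few lines. The following is an illustrative sketch (the function name and structure are my own, not from the paper): it balances a binary dataset either by randomly dropping majority-class rows or by randomly replicating minority-class rows.

```python
import numpy as np

def random_resample(X, y, mode="under", seed=0):
    """Balance a binary dataset by randomly dropping majority rows
    ("under") or randomly replicating minority rows ("over")."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    if mode == "under":   # shrink the majority down to the minority size
        maj_idx = rng.choice(maj_idx, size=min_idx.size, replace=False)
    else:                 # replicate the minority up to the majority size
        min_idx = rng.choice(min_idx, size=maj_idx.size, replace=True)
    keep = np.concatenate([min_idx, maj_idx])
    return X[keep], y[keep]
```

Targeted (informative) variants replace the random `rng.choice` with a selection rule, for example keeping majority points near the class border.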

The problems reported [4] with the data approach are as follows: deleting significant information from certain instances in the case of downsampling, introducing noise into the original data in the case of oversampling, determining the appropriate sample size for within-class concept data, specifying the ideal class distribution, and defining clear criteria for selecting samples.

The problem reported [4] with the algorithmic approach is that it requires a deep understanding of the classifier itself and of the application area (i.e., why a classifier deteriorates when the data is imbalanced).

Finally, the problem with the cost-sensitive learning approach is that it assumes prior knowledge of the various error types and imposes a higher cost on the minority class to improve prediction accuracy. Knowing the cost matrices is, in most cases, practically difficult.

While data and algorithmic approaches constitute the majority of the efforts in the area of imbalanced data, several other approaches have also been proposed; these are reviewed in the Literature Review.

To overcome the pitfalls of the data and algorithmic approaches, a classification algorithm for imbalanced data needs to handle the imbalance directly, without resampling, and should have a systematic foundation for determining the cost matrices or the threshold. One promising classifier is the Mahalanobis Taguchi System (MTS): it has shown good classification results for imbalanced data without resampling, it does not require any distribution assumption for the input variables, and it can measure the degree of abnormality (which is proportional to the magnitude of the Mahalanobis Distance for positive observations). Unfortunately, it lacks a systematic foundation for threshold determination [8].

A Receiver Operating Characteristics (ROC) based approach has been reported in the literature [9] for Support Vector Machines (SVMs) and random forests (RF), used as a cost function to trade off the required metrics (i.e., sensitivity versus specificity). Three operating point selection criteria, shortest distance, harmonic mean, and antiharmonic mean, were compared, and the results in [9] showed no difference among the classifiers' performances. Based on that, and to the author's knowledge, no previous work has used an ROC based approach to find the optimum threshold for the Mahalanobis Taguchi System (MTS); therefore, a Modified Mahalanobis Taguchi System (MMTS) methodology is proposed in this paper.
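For illustration, the three operating-point selection criteria can be sketched as follows. This is a hypothetical helper, not code from [9]; it assumes the ROC curve is given as per-threshold sensitivity and specificity arrays and returns the index of the best candidate point under the chosen criterion.

```python
import numpy as np

def pick_operating_point(sens, spec, criterion="distance"):
    """Select an index on a ROC curve from per-threshold sensitivity
    and specificity arrays, using one of three criteria."""
    sens = np.asarray(sens, float)
    spec = np.asarray(spec, float)
    if criterion == "distance":
        # shortest Euclidean distance to the ideal point (FPR=0, TPR=1);
        # negated so that argmax picks the smallest distance
        score = -np.hypot(1 - sens, 1 - spec)
    elif criterion == "harmonic":
        # harmonic mean of sensitivity and specificity
        score = 2 * sens * spec / (sens + spec + 1e-12)
    else:
        # antiharmonic (contraharmonic) mean
        score = (sens**2 + spec**2) / (sens + spec + 1e-12)
    return int(np.argmax(score))
```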

The aim of this work is to enhance the Mahalanobis Taguchi System (MTS) classifier performance by providing a scientific, rigorous, and systematic method using the ROC curve for determining the threshold that discriminates between the classes.

The organization of the paper is as follows: Section 2 reviews previous work on imbalanced data classification methods, the Mahalanobis Taguchi System, and its applications. In Section 3, the proposed Modified Mahalanobis Taguchi System (MMTS) methodology is described. In Section 4, results are presented comparing the suggested MMTS algorithm with the Probabilistic Thresholding Method (PTM), Naive Bayes (NB), and Support Vector Machines (SVMs) over several datasets. Section 5 presents a case study to demonstrate the applicability of the proposed research. Finally, Section 6 summarizes the results obtained from this research.

#### 2. Literature Review

In this section, an overview of the imbalanced classification approaches and of the Mahalanobis Taguchi System, including its concept, areas of application, weaknesses, and variants, is presented.

Solutions to deal with the imbalanced learning problem can be summarized into the following approaches [10]: sampling (sometimes called the data level approach), algorithmic, and cost-sensitive approaches.

The data-level approach [11] mainly restores the balance of the distribution between the classes through resampling techniques. It includes the following types:

1. Random undersampling/oversampling of the negative/positive observations
2. Targeted undersampling/oversampling of the negative/positive observations
3. A mix of the above two types

The problems reported in data approaches are as follows:

- Determining the best class distribution or imbalance ratio for given observations: Weiss and Provost [12] investigated the relation between classifier performance and class distribution; the results showed that a balanced class distribution does not necessarily produce optimal classification performance.
- Undersampling the negative data can lead to losing important information, whereas oversampling the positive data may introduce noise [13].
- The criterion for selecting samples under the within-class concept is uncertain: that is, when the class itself consists of several subclasses, how should oversampling and/or undersampling be performed?

Algorithmic-level solutions are based upon biasing the algorithm towards the positive class. The algorithmic-level approach has been used in many popular classifiers such as decision trees, Support Vector Machines (SVMs), association rule mining, back-propagation (BP) neural networks, one-class learning, active learning methods, and the Mahalanobis Taguchi System (MTS).

The adaptation of the decision tree classifier to suit imbalanced data can be accomplished by adjusting the probabilistic estimates at the tree leaves or by developing new pruning approaches [14].

Support Vector Machines (SVMs) showed good classification results for slightly imbalanced data [15], while for highly imbalanced data researchers [16, 17] reported poor classification performance, since SVMs try to reduce the total error, which produces results shifted towards the negative (majority) class. To handle imbalanced data, there are proposals such as using different penalty constants for different classes, as in Lin et al. [18], or changing the class border based on kernel adjustment, as in Wu and Chang [19].

Therefore, in this paper, SVM was selected as one of the benchmark algorithms; the results show that SVM classification performance degrades considerably at high imbalance ratios, which supports the previous findings of the researchers (more details are presented in Results).

Association rule mining is a recent classification approach combining association mining and classification into one approach [20–22]. To handle imbalanced data, it requires determining multiple minimum supports for the different classes to reflect their varied frequencies [23].

On the other hand, one-class learning [24, 25] uses only the target class to determine whether a new observation belongs to that class. BP neural networks [26] and SVMs [27] have been examined as one-class learning approaches. For highly imbalanced data, one-class learning has shown good classification results [28]. Unfortunately, its drawbacks are that the required training data is relatively larger than for multiclass approaches and that it is hard to reduce the dimension of the features used for separation.

The active learning approach is used to handle problems related to unlabeled training data. Research on active learning for imbalanced data by Ertekin et al. [29] is based on iteratively training the classifier on the data near the classification boundary instead of the whole training dataset, since the imbalance ratio near the boundary differs from that away from it. Unfortunately, one of the pitfalls of this approach is that it can be computationally expensive [30].

The problem with the algorithmic approach is that it needs extensive knowledge of the specific classifier (i.e., why the algorithm fails to detect the positive cases); understanding the application domain (i.e., the effect of misclassification in that domain) is also critical.

Cost-sensitive methods use both data and algorithmic approaches, where the objective is to optimize (i.e., minimize) the total misclassification cost while giving a positive class a higher misclassification cost [31, 32].

Cost-sensitive methods use different costs or penalties for the different misclassification types. For example, let C(+, −) be the cost of wrongly classifying a positive instance as a negative one, and let C(−, +) be the cost of the contrary case. In imbalanced data classification, detecting a positive instance is usually more important than detecting a negative one; hence, the cost of misclassifying a positive instance outweighs that of a negative one (i.e., C(+, −) > C(−, +)), with the cost of correct classification equal to zero (i.e., C(+, +) = C(−, −) = 0).
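The asymmetric cost structure above can be made concrete with a short sketch (the function names and the specific cost values are illustrative assumptions, not from the paper): the total cost charges false negatives more than false positives, and the standard cost-minimizing decision rule predicts positive whenever the posterior probability exceeds C(−, +) / (C(−, +) + C(+, −)).

```python
def total_cost(fn, fp, c_fn, c_fp):
    """Total misclassification cost: false negatives cost c_fn = C(+,-),
    false positives cost c_fp = C(-,+); correct predictions cost zero."""
    return fn * c_fn + fp * c_fp

def bayes_threshold(c_fn, c_fp):
    """Cost-minimizing probability threshold: predict positive when
    P(+|x) > c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)
```

With c_fn = 9 and c_fp = 1, for instance, the decision threshold drops from the cost-blind 0.5 to 0.1, which is exactly how the higher cost on the minority class biases the classifier towards positive predictions.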

Different types of cost-sensitive approaches have been reported in the literature:

- Modifying the weights of the data space: the training data density is modified using the misclassification cost criteria, so that the density is adjusted towards the costly class.
- Making the classifier's objective cost-sensitive: instead of minimizing the misclassification error, the objective is tuned to minimize the misclassification cost [32].
- Using a risk minimization approach: in a binary C4.5 (i.e., decision tree) classifier, a leaf is assigned the class that reaches it most frequently, while in the cost-sensitive version the class label is assigned so as to minimize the classification cost [33].

The problem with the cost-sensitive approach is that it relies on prior knowledge of the cost matrix for the misclassification types, which in most cases is unavailable.

##### 2.1. Mahalanobis Taguchi System (MTS)

MTS is a multivariate supervised learning approach that aims to classify a new observation into one of two classes (i.e., healthy and unhealthy). MTS has previously been used in predicting weld quality [3], exploring the influence of chemical composition on hot-rolled products [34], and selecting the significant features in automotive handling [35]. The MTS approach starts with collecting a considerable number of observations from the investigated dataset, followed by separating the unhealthy (i.e., positive or abnormal) observations from the healthy (i.e., negative or normal) ones. The Mahalanobis Distance (MD) is first calculated using the negative observations and then scaled (i.e., divided by the number of features used), which yields an average MD of around one for the negative observations. The scaled MDs for the positive dataset are supposed to differ from those for the negative dataset. Since many features are used to calculate the MD, and not all of them are necessarily significant in a multivariable dataset, a Taguchi orthogonal array is used to screen the features. The criterion for selecting the appropriate features is to retain those that yield high MD values for the positive observations. It is worth noting that MTS constructs a continuous scale from the observations of a single class, unlike other classification techniques, where learning is done directly from both the positive and negative observations. This characteristic helps the MTS classifier deal with imbalanced data problems.
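The scaled MD computation described above can be sketched as follows. This is a minimal sketch under the usual MTS conventions (standardize with the healthy-class mean and standard deviation, use the inverse correlation matrix, divide the quadratic form by the number of features k); the function name is my own.

```python
import numpy as np

def scaled_mahalanobis(X_healthy, X_new):
    """Scaled Mahalanobis Distance: standardize with the healthy
    (negative) class statistics, apply the inverse of the healthy-class
    correlation matrix, and divide by the number of features k so that
    the healthy observations average an MD of about one."""
    mu = X_healthy.mean(axis=0)
    sd = X_healthy.std(axis=0, ddof=1)
    Z_ref = (X_healthy - mu) / sd                       # healthy, standardized
    R_inv = np.linalg.inv(np.corrcoef(Z_ref, rowvar=False))
    Z = (np.atleast_2d(X_new) - mu) / sd                # new data, same scaling
    k = X_healthy.shape[1]
    # quadratic form z_i^T R^{-1} z_i for each row, scaled by k
    return np.einsum("ij,jk,ik->i", Z, R_inv, Z) / k
```

Scoring the healthy observations themselves gives scaled MDs averaging (n-1)/n, i.e., approximately one, matching the property the text describes; abnormal observations land noticeably above one.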

Determining the optimal threshold is a critical step for an effective MTS classifier. To determine the appropriate threshold, a loss function approach was proposed in [36]; however, it is not practical because of the difficulty of specifying the relative cost [37]. To overcome this problem, Su and Hsiao [6] used Chebyshev's theorem to specify the threshold and called their method the "probabilistic thresholding method (PTM)", whereas in MTS the threshold is assumed to be one. It was shown in [6] that the PTM classifier outperformed the MTS classifier; it has therefore been selected for benchmarking against the proposed classifier. Unfortunately, the PTM method is based on previously assumed parameters, and its classification accuracy was lower than that of the benchmarked classifiers (one of the findings of this research, discussed in Results).
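The Chebyshev idea behind PTM can be sketched roughly as follows. This is only an illustration of the principle, not the exact formula from Su and Hsiao [6]: by Chebyshev's inequality, at most a fraction alpha of any distribution lies more than 1/sqrt(alpha) standard deviations above the mean, so a threshold at mean + std/sqrt(alpha) bounds the healthy-class false alarm rate by alpha regardless of the MD distribution.

```python
import numpy as np

def chebyshev_threshold(md_healthy, alpha=0.05):
    """Distribution-free threshold on healthy-class MD values: by
    Chebyshev's inequality, at most a fraction alpha of healthy points
    exceed mean + std / sqrt(alpha).  (A rough sketch of the PTM idea,
    not the exact formula from Su and Hsiao [6].)"""
    md = np.asarray(md_healthy, float)
    return md.mean() + md.std(ddof=1) / np.sqrt(alpha)
```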

The other research area in MTS concerns modification of the Taguchi part rather than the threshold determination. Due to the lack of a statistical foundation for the Taguchi method [37], the Mahalanobis Genetic Algorithm (MGA) [3] and the Mahalanobis Taguchi System using Particle Swarm Optimization (PSO) [38] have been proposed. Both methods address the Taguchi (orthogonal array) part, while the threshold determination still lacks a solid foundation or is hard to carry out in practice.

Finally, the aim of this research is to enhance the performance of the Mahalanobis Taguchi System (MTS) classifier by providing a scientific, rigorous, and systematic method of determining the binary classification threshold that discriminates between the two classes, applicable to MTS and its variants (e.g., MGA).

#### 3. Modified Mahalanobis Taguchi System (MMTS)

The proposed model, Algorithm 1, provides an easy, reliable, and systematic way to determine the threshold for the Mahalanobis Taguchi System (MTS) and its variants (e.g., the Mahalanobis Genetic Algorithm, MGA) so that the classification can be carried out effectively. The currently used approaches either are difficult to apply in practice, such as the loss function [36], due to the difficulty of evaluating the cost in each case, or are based on previously assumed parameters [6].
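The core idea, choosing the MD cut-off whose ROC point lies closest to the ideal corner (FPR = 0, TPR = 1), can be illustrated with a simple search over the observed MD values. The paper formulates this as a nonlinear optimization model; the code below is only a discrete sketch of the same shortest-distance criterion, with a function name of my own.

```python
import numpy as np

def mmts_threshold(md_neg, md_pos):
    """Pick the MD cut-off whose (sensitivity, specificity) point lies
    closest to the ideal ROC corner.  Candidate thresholds are the
    observed scaled MD values of both classes."""
    md_neg = np.asarray(md_neg, float)
    md_pos = np.asarray(md_pos, float)
    candidates = np.unique(np.concatenate([md_neg, md_pos]))
    best_t, best_d = None, np.inf
    for t in candidates:                   # classify MD > t as positive
        sens = np.mean(md_pos > t)         # true positive rate
        spec = np.mean(md_neg <= t)        # true negative rate
        d = np.hypot(1 - sens, 1 - spec)   # distance to the (0, 1) corner
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```

For well-separated scaled MDs (healthy near one, abnormal much larger), the selected threshold falls between the two groups, achieving both sensitivity and specificity of one.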