Research Article  Open Access
Mahmoud ElBanna, "Modified Mahalanobis Taguchi System for Imbalance Data Classification", Computational Intelligence and Neuroscience, vol. 2017, Article ID 5874896, 15 pages, 2017. https://doi.org/10.1155/2017/5874896
Modified Mahalanobis Taguchi System for Imbalance Data Classification
Abstract
The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model, named the Modified Mahalanobis Taguchi System (MMTS), is formulated based on minimizing the distance between the MTS Receiver Operating Characteristics (ROC) curve and the theoretical optimal point. To validate the MMTS classification efficacy, it has been benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi Systems (PTM), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study in the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA).
1. Introduction
Classification is one of the supervised learning approaches in which a new observation needs to be assigned to one of the predetermined classes or categories. If the number of the predetermined classes is more than two, it is a multiclass classification problem; otherwise, the problem is known as the binary classification problem. At present, these problems have found applications in different domains such as product quality [1] and speech recognition [2].
The classification accuracy depends on both the classifier and the data types. The classifier types can be categorized according to supervised versus unsupervised learning, linear versus nonlinear hyperplane, and feature selection versus feature extraction based approach [3]. On the other hand, Sun et al. [4] reported that the parameters affecting the classification are the overlapping between data (i.e., class separability), small sample size, within-class concept (i.e., a single class may consist of various subclasses, which do not necessarily have the same size), and the data distribution for each class. If the data distribution of one class is different from the distributions of the others, then the data is considered imbalanced. The border that separates balanced from imbalanced data is vague; for example, the imbalance ratio, which is the ratio of majority to minority class observations, has been reported to range from values as small as 100:1 up to 10000:1 [5].
The assumption of an equal number of observations in each class is elementary in using the common classification methods such as decision tree analysis, Support Vector Machines, discriminant analysis, and neural networks [6]. Imbalanced data occurs often in real life, such as in text classification [7]. Treating applications that have imbalanced data with the common classifiers leads to bias in the classification accuracy (i.e., the predictive accuracy for the minority class will be much less than for the majority class) and/or to the minority observations being treated as noise or outliers, which results in the classifier ignoring them.
To handle the classification of imbalanced data, the research community uses data approaches, algorithmic approaches, or both. For the data approach, the main idea is to balance the class density, randomly or informatively (i.e., targeted), by eliminating (downsampling) majority class observations, replicating (oversampling) minority class observations, or doing both. For the algorithmic approach, the main idea is to adapt the classifier algorithms towards the small class. A combination of the data and algorithmic level approaches is also used and is known as cost-sensitive learning.
The problems reported [4] with the data approach are as follows: deleting significant information for certain instances in the case of downsampling, bringing noise into the original data in the case of oversampling, determining the appropriate sample size for within-class concept data, specifying the ideal class distribution, and using clear criteria for selecting samples.
The problem reported [4] with the algorithmic approach is that it needs a deep understanding of the classifier itself and of the application area (i.e., why a classifier deteriorates when imbalanced data occurs).
Finally, the problem with the cost-sensitive learning approach is that it assumes previous knowledge of the costs of the different error types and imposes a higher cost on the minority class to improve the prediction accuracy. Knowing the cost matrices is in most cases practically difficult.
While data and algorithmic approaches constitute the majority of the efforts in the area of imbalanced data, several other approaches have also been proposed, which will be reviewed in Literature Review.
To overcome the pitfalls of data and algorithmic approaches to solve the problem of imbalanced data classification, the classification algorithm needs to be capable of dealing with imbalanced data directly without resampling and should have a systematic foundation for determining the cost matrices or the threshold. One of the promising classifiers is the Mahalanobis Taguchi System (MTS), which has shown good classification results for imbalanced data without resampling, does not require any distribution assumption for the input variables, and can be used to measure the degree of abnormality (i.e., the degree of abnormality is proportional to the magnitude of the Mahalanobis Distance for the positive observations). Unfortunately, it lacks a systematic foundation for threshold determination [8].
The Receiver Operating Characteristics (ROC) based approach has been reported in the research domain [9] for Support Vector Machines (SVMs) and random forests (RF) as a cost function to trade off the required metrics (i.e., sensitivity versus specificity). Three operating point selection criteria, shortest distance, harmonic mean, and anti-harmonic mean, have been compared, and the results in [9] showed that there is no difference among the classifiers' performances. Building on that, and to the best of the author's knowledge, no previous work has been reported on using an ROC based approach to find the optimum threshold for the Mahalanobis Taguchi System (MTS); therefore, a Modified Mahalanobis Taguchi System (MMTS) methodology is proposed in this paper.
The aim of this work is to enhance the Mahalanobis Taguchi System (MTS) classifier performance by providing a scientific, rigorous, and systematic method using the ROC curve for determining the threshold that discriminates between the classes.
The organization of the paper is as follows: Section 2 reviews the previous work on imbalanced data classification methods, the Mahalanobis Taguchi System, and its applications. In Section 3, the proposed Modified Mahalanobis Taguchi System (MMTS) methodology is described. In Section 4, results are presented for the comparison of the suggested MMTS algorithm with the Mahalanobis Taguchi System based on the probabilistic thresholding method (PTM), Naive Bayes (NB), and Support Vector Machine (SVM) on several datasets. Section 5 presents a case study to demonstrate the applicability of the proposed research. Finally, in Section 6, the results obtained from this research are summarized.
2. Literature Review
In this section, an overview of the imbalance classification approaches, the Mahalanobis Taguchi System concept, its different areas of applications, weakness points, and its variants is presented.
Solutions to deal with the imbalanced learning problem can be summarized into the following approaches [10]: sampling (sometimes called the data level approach), algorithmic, and cost-sensitive approaches.
The data level approach [11] mainly restores a balanced distribution between the classes through resampling techniques. It includes the following types:
(1) Random undersampling/oversampling of the negative/positive observations
(2) Targeted undersampling/oversampling of the negative/positive observations
(3) A mixture of the above two approaches
The problems reported in data approaches are as follows:
(i) Determining the best class distribution or imbalance ratio for given observations: in Weiss and Provost [12], the relation between the classifier performance and the class distribution was investigated; the results showed that a balanced class distribution does not necessarily produce optimal classification performance.
(ii) Undersampling the negative data can lead to losing important information, whereas oversampling the positive data may cause noise interference [13].
(iii) The uncertain criterion for selecting samples for the within-class concept: that is, the class itself consists of several subclasses (i.e., how oversampling and/or undersampling will be performed for the within-class concept).
Algorithmic level approach solutions are based upon creating an algorithm biased towards the positive class. The algorithmic level approach has been used in many popular classifiers such as decision trees, Support Vector Machines (SVMs), association rule mining, backpropagation (BP) neural network, one-class learning, active learning methods, and the Mahalanobis Taguchi System (MTS).
The adaptation of decision tree classifier to suit the imbalance data can be accomplished by adjusting the probabilistic estimate of the tree leaf or developing new trimming approaches [14].
Support Vector Machines (SVMs) showed good classification results for slightly imbalanced data [15], while for highly imbalanced data researchers [16, 17] reported poor classification performance, since SVMs try to reduce the total error, which produces results shifted towards the negative (majority) class. To handle imbalanced data, there are proposals such as using penalty constants for different classes, found in Lin et al. [18], or changing the class border based on kernel adjustment, as in Wu and Chang [19].
Therefore, in this paper, SVM was selected as one of the benchmarked algorithms to compare with ours; the results showed that SVM classification performance largely degrades with a high imbalance ratio, which supports the previous findings of the researchers (more details will be presented in Results).
Association rule mining is a recent classification approach combining association mining and classification into one approach [20–22]. To handle imbalanced data, multiple minimum supports must be determined for the different classes to reflect their varied frequencies [23].
On the other hand, one-class learning [24, 25] uses the target class only to determine whether a new observation belongs to this class or not. BP neural networks [26] and SVMs [27] have been examined as one-class learning approaches. In the case of highly imbalanced data, one-class learning showed good classification results [28]. Unfortunately, the drawbacks of one-class learning algorithms are that the size of the training data is relatively larger than that for multiclass approaches and that it is hard to reduce the dimension of the features used for separation.
The active learning approach is used to handle the problems related to unlabeled training data. Research on active learning for imbalanced data reported by Ertekin et al. [29] is based on an iterative approach that trains the classifier on the data near the classification boundary instead of the whole training dataset, since the imbalance ratio for the data near the boundary is different from that for the data away from the boundary. Unfortunately, one of the pitfalls of using this approach is that it can be computationally expensive [30].
The problem with the algorithmic approach is that it needs an extensive knowledge of specific classifier (i.e., why the algorithm fails to detect the positive cases), also understanding the application domain is critical (i.e., the effect of misclassification on the domain).
Cost-sensitive methods use both data and algorithmic approaches, where the objective is to optimize (i.e., minimize) the total misclassification cost while giving the positive class a higher misclassification cost [31, 32].
Cost-sensitive methods use different costs or penalties for different misclassification types. For example, let $C(+,-)$ be the cost of wrongly classifying a positive instance as a negative one, while $C(-,+)$ is the cost of the contrary case. In imbalanced data classification, detecting a positive instance is usually more important than detecting a negative one; hence, the cost of misclassifying a positive instance outweighs the cost of misclassifying a negative one (i.e., $C(+,-) > C(-,+)$), with the cost of a correct classification equal to zero (i.e., $C(+,+) = C(-,-) = 0$).
Different types of cost-sensitive approaches have been reported in the literature:
(i) Modifying the weights of the data space: in this approach, the training data density is modified using the misclassification cost criteria, in a way that the density is adjusted towards the costly class.
(ii) Making the classifier objective cost-sensitive: instead of minimizing the misclassification error, the objective is tuned to reduce the misclassification cost [32].
(iii) Using a risk minimization approach: in a binary C4.5 (i.e., decision tree) classifier, the assignment of a class type to a leaf is based on the most frequent class that reaches the leaf, while for the cost-sensitive classifier, the assignment of the class label is based on minimizing the classification cost [33].
The problem with the cost-sensitive approach is that it is based on previous knowledge of the cost matrix for the misclassification types, which in most cases is unavailable.
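As a minimal illustration of the cost-sensitive objective, the following sketch computes the total misclassification cost of a set of predictions. The function name `total_misclassification_cost` and the two cost values are hypothetical, chosen only so that missing a positive observation is penalized more heavily than raising a false alarm.

```python
import numpy as np

def total_misclassification_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Sum the misclassification costs; correct decisions cost zero.

    cost_fn: cost of labelling a positive (1) as negative (0).
    cost_fp: cost of labelling a negative (0) as positive (1).
    Both values are illustrative assumptions, not taken from the paper.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_fn = np.sum((y_true == 1) & (y_pred == 0))
    n_fp = np.sum((y_true == 0) & (y_pred == 1))
    return cost_fn * n_fn + cost_fp * n_fp

y_true = [0, 0, 0, 0, 1, 1]
pred_a = [0, 0, 0, 0, 0, 1]   # one missed positive
pred_b = [1, 0, 0, 0, 1, 1]   # one false alarm
cost_a = total_misclassification_cost(y_true, pred_a)
cost_b = total_misclassification_cost(y_true, pred_b)
```

Both prediction vectors contain exactly one error, yet their costs differ, which is the behaviour a cost-sensitive objective is meant to capture and which a plain error count hides.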
2.1. Mahalanobis Taguchi System (MTS)
MTS is a multivariate supervised learning approach, which aims to classify a new observation into one of two classes (i.e., healthy and unhealthy classes). MTS was used previously in predicting weld quality [3], exploring the influence of chemical constitution on hot rolling manufactured products [34], and selecting the significant features in automotive handling [35]. The MTS approach starts with collecting a considerable number of observations from the investigated dataset, followed by separating the unhealthy dataset (i.e., positive or abnormal) from the healthy one (i.e., negative or normal). The Mahalanobis Distance (MD) is first calculated using the negative observations and then scaled (i.e., the calculated MD is divided by the number of features used), which results in an average MD of around one for the negative observations. The scaled MDs for the positive dataset are supposed to be different from those for the negative dataset. Because many features are used to calculate the MD, the probability that the multivariable dataset contains significant features is high, so a Taguchi orthogonal array is used to screen these features. The criterion for selecting the appropriate features is to select the features that possess high MD values for the positive observations. It is worth noticing that MTS constructs a continuous scale from the observations of a single class, unlike other classification techniques, in which learning is done directly from both the positive and negative observations. This characteristic helps the MTS classifier to deal with imbalanced data problems.
The step of determining the optimal threshold is critical for an effective MTS classifier. To determine the appropriate threshold, a loss function approach was proposed by [36]; however, it is not a practical approach because of the difficulty in specifying the relative cost [37]. In order to overcome this problem, Su and Hsiao [6] used Chebyshev's theorem to specify the threshold and called their method a "probabilistic thresholding method (PTM)" for the MTS, whereas in MTS the threshold is assumed to be one. It has been shown in [6] that the PTM classifier outperformed the MTS classifier; therefore, it has been selected to be benchmarked against the proposed classifier. Unfortunately, the PTM method is based on previously assumed parameters, and the accuracy of its classification results was lower than that of the other benchmarked classifiers (this is one of the findings of this research, which will be discussed in Results).
The other research area in the MTS is related to the modification of the Taguchi method rather than to the threshold determination. Due to the lack of a statistical foundation [37] for the Taguchi method, the Mahalanobis Genetic Algorithm (MGA) [3] and the Mahalanobis Taguchi System using Particle Swarm Optimization (PSO) [38] have been used. Both the MGA and the MTS Particle Swarm Optimization methods deal with the Taguchi system (orthogonal array) part, while the threshold determination still lacks a solid foundation or is hard to carry out in practice.
Finally, the aim of this research is to enhance the Mahalanobis Taguchi System (MTS) classifier performance through providing a scientific, rigorous, and systematic method of determining the binary classification threshold that discriminates between the two classes, which can be applied to the MTS and its variants (i.e., MGA).
3. Modified Mahalanobis Taguchi System (MMTS)
The proposed model, Algorithm 1, provides an easy, reliable, and systematic way to determine the threshold for the Mahalanobis Taguchi System (MTS) and its variants (i.e., Mahalanobis Genetic Algorithm, MGA) to carry out the classification process effectively. The currently used approaches either are difficult to use in practice such as the loss function [36] due to the difficulty in evaluating the cost in each case or are based on previously assumed parameters [6].

The proposed model is based on using the Receiver Operating Characteristics (ROC) curve [39] for the MTS threshold determination. As shown in Figure 1, point (0, 1), that is, a false positive rate of zero and a true positive rate of one, represents the optimum theoretical solution (best performance) for any classifier. The closer the classifier performance is to this point, the better it is. The curve drawn in the figure represents the MTS classifier performance for different threshold values. Changing the threshold changes the point location on the curve (i.e., the points marked on the curve in Figure 1). Therefore, the problem of finding the optimum threshold can be reformulated into the problem of finding the point on the curve that is closest to point (0, 1).
MMTS can be summarized in the following steps.
Step 1 (construction of the initial model stage). Assume there are two classes: negative (the one with majority observations) and positive (the one with the minority observations). A set of data is sampled from both classes. Using the negative observations only, reference Mahalanobis Distances are calculated using (1) with all features used. The Mahalanobis Distances (MD) for the positive observations are also calculated by using the same equation with all features, with the inverse of the correlation matrix of the negative observation used. Selection of the new features is performed by using the orthogonal array approach; then a recalculation of MDs for the negative and the positive observation is performed. An arbitrary threshold is assumed (i.e., one), and accordingly the true positive rate, the true negative rate, and the fitness function can be estimated.
Step 2 (optimization stage). If the stopping criteria (i.e., fitness function value is zero, the number of maximum iterations is reached, and/or the differences among successive fitness value are less than a certain value) are not met yet, an optimization model (i.e., genetic algorithm) is invoked to obtain a better threshold value that minimizes the desired fitness function. Accordingly, new features will be selected using the orthogonal array approach, and true positive rate, false positive rate, and the fitness function will be also updated.
If the stopping criteria are met, then the training stage is done, and the model is ready for testing observations.
Step 3 (testing stage). In this stage, the optimum threshold and the associated features are determined from the previous stage and the Mahalanobis Distance for the new observation is calculated based on those parameters. If the Mahalanobis Distance for this observation is less than the optimum threshold, then it will be classified as negative; otherwise, it will be classified as positive.
Now that an overview of how the MMTS algorithm works has been provided, the detailed calculation of the Mahalanobis Distance, the true positive and false positive rates, and the fitness function will be presented in the following subsections.
3.1. Mahalanobis Distance (MD)
In order to demonstrate the MTS threshold determination mathematically, let us assume that negative data (also called healthy or normal observations) and positive data (also called unhealthy or abnormal observations) are available, where the number of positive observations is $m$ and the number of negative observations is $n$, and both positive and negative observations consist of $k$ variables (or features).
Given a sample of size $n$, the Mahalanobis Distance (MD) for the $j$th observation can be calculated by
$$MD_j = \frac{1}{k} Z_{ij}^{T} C^{-1} Z_{ij}, \tag{1}$$
where $i = 1, \ldots, k$, $j = 1, \ldots, n$, $k$ is the total number of features (or variables), $Z_{ij}$ is the normalized vector obtained by normalizing the values of $X_{ij}$: that is, $Z_{ij} = (X_{ij} - m_i)/s_i$, where $m_i$ and $s_i$ are the average and the sample standard deviation of variable $i$, respectively, $Z_{ij}^{T}$ is the transpose of $Z_{ij}$, and $C^{-1}$ is the inverse of the correlation matrix of the negative variables.
Using (1) with $C^{-1}$, $m_i$, and $s_i$ (the inverse of the correlation matrix, the mean, and the sample standard deviation of feature $i$ for the negative data, respectively), the MD of the positive observations can be calculated.
The next step is to determine the threshold $T$ that will be used to discriminate the negative observations from the positive ones based on the MD magnitude, which means that a new observation can be classified as either positive or negative according to the following criterion: if $MD_j < T$, the observation is negative; otherwise, it is positive.
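The scaled Mahalanobis Distance calculation in (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's MATLAB implementation: the function names `fit_reference` and `scaled_md` and the synthetic data are assumptions. The reference statistics (mean, sample standard deviation, and inverse correlation matrix) are estimated from the negative observations only and then reused for the positive ones.

```python
import numpy as np

def fit_reference(x_neg):
    """Estimate mean, sample std, and inverse correlation matrix from negatives."""
    x_neg = np.asarray(x_neg, dtype=float)
    mean = x_neg.mean(axis=0)
    std = x_neg.std(axis=0, ddof=1)
    corr = np.corrcoef(x_neg, rowvar=False)
    return mean, std, np.linalg.inv(corr)

def scaled_md(x, mean, std, corr_inv):
    """Scaled Mahalanobis Distance: z^T C^{-1} z divided by the feature count."""
    z = (np.asarray(x, dtype=float) - mean) / std
    k = z.shape[1]
    # Row-wise quadratic form z_j^T C^{-1} z_j for every observation j.
    return np.einsum("ij,jk,ik->i", z, corr_inv, z) / k

# Synthetic data (hypothetical): 200 negatives, 10 shifted positives, 3 features.
rng = np.random.default_rng(0)
x_neg = rng.normal(size=(200, 3))
x_pos = rng.normal(loc=3.0, size=(10, 3))

mean, std, corr_inv = fit_reference(x_neg)
md_neg = scaled_md(x_neg, mean, std, corr_inv)
md_pos = scaled_md(x_pos, mean, std, corr_inv)
```

With the sample standard deviation and the sample correlation matrix, the average scaled MD of the negative observations equals $(n-1)/n$, which is why the negative group sits close to one on the MD scale.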
The contribution of this paper mainly is in the area of establishing a reliable and systematic threshold for classification. A rough method for determining the threshold is to plot the positive and negative MD observations versus their orders and decide upon the threshold manually. This method is not accurate, especially when dealing with the overlapping values of the MDs.
3.2. Proposed Threshold Determination
The essential classifier performance can be explained by examining the confusion matrix (Table 1). The ratio of negative to positive observations (left to right columns in Table 1) represents the class distribution (i.e., imbalance ratio). In that sense, any performance metric that uses both columns, such as accuracy and error rate ((14) and (15), respectively), will be sensitive to the imbalanced data issue. To overcome this problem, the Receiver Operating Characteristic (ROC) curves are recommended by the research community.
 
Table 1: Confusion matrix at threshold $T$.
                          Negative observations    Positive observations
Classified as negative    $TN_T$                   $FN_T$
Classified as positive    $FP_T$                   $TP_T$
Total                     $N$                      $P$
$TN_T$: true negative, $FN_T$: false negative, $FP_T$: false positive, $TP_T$: true positive, based on threshold $T$; $N$: negative observations; $P$: positive observations.
From the confusion matrix, Table 1, the following can be defined:
(i) $TN_T$ is the total number of observations classified as negative from the pool of the negative observations (i.e., the negative observations whose $MD < T$).
(ii) $FN_T$ is the total number of observations classified as negative from the pool of the positive observations (i.e., the positive observations whose $MD < T$).
(iii) $FP_T$ is the total number of observations classified as positive from the pool of the negative observations (i.e., the negative observations whose $MD \geq T$).
(iv) $TP_T$ is the total number of observations classified as positive from the pool of the positive observations (i.e., the positive observations whose $MD \geq T$).
Now, the true positive rate and the false positive rate at the threshold $T$ can be defined as
$$TPR_T = \frac{TP_T}{TP_T + FN_T} = \frac{TP_T}{P}, \tag{2}$$
$$FPR_T = \frac{FP_T}{FP_T + TN_T} = \frac{FP_T}{N}. \tag{3}$$
Using $TPR_T$ and $FPR_T$ for different values of the threshold $T$, the ROC for the MMTS can be constructed.
The ROC plot is a two-dimensional plot in which (2) is plotted on the vertical axis and (3) is plotted on the horizontal axis.
Since $TPR_T$ uses only the right column of the confusion matrix and $FPR_T$ uses only the left column, they are unaffected by the imbalanced data problem. The ROC is beneficial because it provides a tool to show the advantages (represented by true positives) versus the disadvantages (represented by false positives) of the classifier relating to data density.
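The construction of ROC points from the Mahalanobis Distances can be sketched as follows. The helper name `rates_at_threshold` and the MD values are hypothetical; the sketch assumes that an observation is classified as positive when its MD is at or above the threshold.

```python
import numpy as np

def rates_at_threshold(md_neg, md_pos, threshold):
    """Return (TPR, FPR) when MD >= threshold is classified as positive."""
    md_neg = np.asarray(md_neg, dtype=float)
    md_pos = np.asarray(md_pos, dtype=float)
    tp = np.sum(md_pos >= threshold)   # positives correctly flagged
    fp = np.sum(md_neg >= threshold)   # negatives wrongly flagged
    return tp / md_pos.size, fp / md_neg.size

# Hypothetical scaled MDs for five negative and three positive observations.
md_neg = [0.4, 0.7, 0.9, 1.1, 1.6]
md_pos = [1.4, 2.5, 3.0]

# One ROC point per candidate threshold.
roc = [rates_at_threshold(md_neg, md_pos, t) for t in (0.5, 1.0, 1.5, 2.0)]
```

Each rate is normalized by the size of its own class, so replicating the negative observations a hundred times would leave every ROC point unchanged.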
Figure 1 represents the MTS classifier ROC curve, created by changing the MTS threshold (i.e., each point marked on the curve corresponds to a different threshold for the MTS classifier). The point on the curve that is closest to point (0, 1) corresponds to the optimum threshold among the candidates. Mathematically, this can be converted into the following optimization model.
3.2.1. Nonlinear Optimization Model
The following optimization model is used to determine the optimum threshold that discriminates between the negative and the positive observations, depending on minimizing the Cartesian distance between the MMTS ROC classifier curve and the theoretical optimum point (i.e., $FPR = 0$, $TPR = 1$):
$$D = \sqrt{(FPR^{*} - FPR_T)^2 + (TPR^{*} - TPR_T)^2}, \tag{4}$$
where $D$ is the Euclidean distance between the theoretical optimum point and any point that lies on the ROC curve. $FPR^{*}$ is the false positive rate at the theoretical optimum point, which is equal to zero. $TPR^{*}$ is the true positive rate at the theoretical optimum point, which is equal to one. $FPR_T$ is the false positive rate at the threshold $T$. $TPR_T$ is the true positive rate at the threshold $T$.
Accordingly, the optimization model becomes
$$\min_{T} \; D = \sqrt{(FPR^{*} - FPR_T)^2 + (TPR^{*} - TPR_T)^2} \tag{5}$$
subject to
$$FPR^{*} = 0, \tag{6}$$
$$TPR^{*} = 1, \tag{7}$$
$$0 \leq TPR_T \leq 1, \tag{8}$$
$$0 \leq FPR_T \leq 1. \tag{9}$$
The optimization model is a nonlinear one, where the objective function is the Euclidean distance between points on the MMTS ROC curve and the theoretical optimum point (i.e., $FPR = 0$, $TPR = 1$). The first two constraints ((6) and (7)) are the theoretical optimum values of the false and true positive rates, while the last two constraints (inequalities (8) and (9)) are the lower and the upper boundaries of the true positive rate and the false positive rate.
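The threshold search can be illustrated with a small sketch. Note the substitution: an exhaustive sweep over the observed MD values stands in for the genetic algorithm used in the paper, and the feature selection step is omitted. The function names and MD values are hypothetical.

```python
import numpy as np

def distance_to_optimum(md_neg, md_pos, threshold):
    """Euclidean distance from the ROC point at `threshold` to (FPR=0, TPR=1)."""
    tpr = np.mean(np.asarray(md_pos) >= threshold)
    fpr = np.mean(np.asarray(md_neg) >= threshold)
    return np.hypot(0.0 - fpr, 1.0 - tpr)

def best_threshold(md_neg, md_pos):
    """Sweep every observed MD as a candidate threshold and keep the closest one.

    An exhaustive sweep is an assumption made for brevity; the paper invokes
    a genetic algorithm for this step.
    """
    candidates = np.unique(np.concatenate([md_neg, md_pos]))
    distances = [distance_to_optimum(md_neg, md_pos, t) for t in candidates]
    best = int(np.argmin(distances))
    return candidates[best], distances[best]

# Hypothetical scaled MDs with one overlapping negative observation (1.6).
md_neg = np.array([0.4, 0.7, 0.9, 1.1, 1.6])
md_pos = np.array([1.4, 2.5, 3.0])
threshold, distance = best_threshold(md_neg, md_pos)
```

On this toy data the sweep selects a threshold of 1.4, which detects all three positives at the price of one false positive out of five negatives, giving a distance of 0.2 from the optimum point. Because the ROC curve only changes at observed MD values, sweeping those values is sufficient for a fixed feature set.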
3.2.2. Taguchi System
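Since more features mean a higher cost of monitoring and require more processing time, it is important to exclude the unnecessary features in order to have an efficient classifier. The MTS approach uses orthogonal array (OA) experiments to screen the important features. The effect of each factor in the orthogonal array design can be calculated independently of all other factors since the design is balanced (i.e., the factor levels are weighted equally) (readers are referred to Woodall et al. [37] for further information about an OA).
The metric of the Taguchi orthogonal array is the signal-to-noise ratio ($SN$), which uses (in our case) the "larger the better" criterion and can be calculated for each treatment using
$$SN_q = -10 \log_{10}\left[\frac{1}{m} \sum_{j=1}^{m} \frac{1}{MD_j}\right], \tag{10}$$
where $q$ is an index that represents the run (row) in the orthogonal design (the number of runs is set by the orthogonal array chosen for the $k$ features), $MD_j$ is the Mahalanobis Distance of the $j$th positive observation calculated with the features included in run $q$, and $m$ is the number of positive observations. Based on the above equation, the mean gain of a feature can be calculated by
$$Gain_i = \overline{SN}_i^{+} - \overline{SN}_i^{-}, \tag{11}$$
where $i$ is an index that represents the feature, $i = 1, \ldots, k$, $k$ is the total number of features, and $\overline{SN}_i^{+}$ and $\overline{SN}_i^{-}$ are the mean signal-to-noise ratios of the runs in which feature $i$ is included and excluded, respectively. The feature will be included if it has a positive gain; otherwise, it should be excluded.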
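The feature screening step can be sketched with a small two-level orthogonal array. The sketch assumes the standard larger-the-better signal-to-noise ratio computed over the positive-class MDs and uses the $L_4(2^3)$ array for three features; the per-run MD values and the function names are hypothetical.

```python
import numpy as np

# L4(2^3) two-level orthogonal array: 1 = feature included, 2 = feature excluded.
L4 = np.array([[1, 1, 1],
               [1, 2, 2],
               [2, 1, 2],
               [2, 2, 1]])

def sn_larger_the_better(md_pos):
    """Larger-the-better signal-to-noise ratio (dB) over positive-class MDs."""
    md_pos = np.asarray(md_pos, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / md_pos))

def feature_gains(oa, sn):
    """Mean SN of runs that include a feature minus mean SN of runs that exclude it."""
    sn = np.asarray(sn, dtype=float)
    return np.array([sn[oa[:, i] == 1].mean() - sn[oa[:, i] == 2].mean()
                     for i in range(oa.shape[1])])

# Hypothetical positive-class MDs, one list per run of the array.
md_per_run = [[8.0, 10.0], [6.0, 9.0], [2.0, 2.5], [1.5, 2.0]]
sn = [sn_larger_the_better(md) for md in md_per_run]
gains = feature_gains(L4, sn)
keep = gains > 0   # features with a positive gain are retained
```

In this toy example the first two features have positive gains and are kept, while the third has a slightly negative gain and is dropped. In a full implementation the MDs of each run would be recomputed from the data using only the features that the run includes.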
4. Results
In this section, the description of the dataset used in this study, brief of the used benchmarked classifiers, an overview of the metrics used for imbalanced data classifiers, and the results of classifiers performance for different datasets will be presented.
4.1. Dataset
The imbalance ratio (IR) threshold for binary or multiclass data, that is, the ratio of negative to positive observations that separates balanced from imbalanced datasets, is still an open question for the research community. In this paper, we investigated a wide range of IR, from 1.25 up to 2088, considering a dataset to be imbalanced if its IR is equal to or higher than 1.25. Table 2 contains a description of the properties of the selected datasets. All the datasets (except for the welding dataset) were obtained from the UCI machine learning repository [41].
 
Fisher discriminant ratio: data overlapping index; imbalance ratio = negative observations/positive observations; based on the Kruskal-Wallis nonparametric test: is there any statistically significant difference among the classifiers' performance (yes/no)? [40].
One goal of this study is to explore the effect of the imbalance ratio on the classification results. Accordingly, the datasets were selected according to this criterion (i.e., to cover a wide range of IR). Unfortunately, the imbalance ratio is not the only reason for degradation in classifier performance. The maximum Fisher's Discriminant Ratio is also considered a major factor in classifier performance degradation. A low value of this ratio means that the observations are mixed together and the overlapped regions are large, and therefore it is difficult to discriminate between these observations. Estimates of the different metrics were obtained by means of 10 repetitions; for each repetition, the data was randomly partitioned into 35% as the training set and the remainder as the testing set. MMTS and the benchmarked algorithms were evaluated on the same partition in each of the ten repetitions.
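The repeated random partitioning can be sketched as below. One detail is an assumption rather than something stated in the text: the split is stratified by class, so that the few positive observations of a highly imbalanced dataset are represented in every training set. The function name `stratified_split` and the toy labels are hypothetical.

```python
import numpy as np

def stratified_split(y, train_fraction, rng):
    """Return train/test index arrays, sampling each class separately."""
    y = np.asarray(y)
    train_idx = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        # Keep at least one observation of every class in the training set.
        n_train = max(1, int(round(train_fraction * idx.size)))
        train_idx.append(idx[:n_train])
    train_idx = np.sort(np.concatenate(train_idx))
    test_idx = np.setdiff1d(np.arange(y.size), train_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)          # toy labels with IR = 19
# Ten repetitions with 35% of the data for training, the rest for testing.
splits = [stratified_split(y, 0.35, rng) for _ in range(10)]
```

Evaluating every classifier on the same ten partitions keeps the comparison paired, so differences in the metrics are not caused by differences in the sampled data.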
4.2. Benchmarked Classifiers Used in the Study
In this section, an overview of the benchmarked classifiers, with their parameters, and the machine specifications used for analysis will be presented.
4.2.1. Support Vector Machines (SVMs)
The first work regarding SVMs was published by Cortes and Vapnik [42], continued by significant contributions from other researchers [43]. SVMs showed a good classification performance for the rare and noisy data, which makes them favorable in a number of applications from cancer detection [44] to text classification [45].
The idea of the SVMs classifier is based on establishing the most appropriate hyperplane that separates class observations from each other (Figure 2). The most appropriate hyperplane means the one with the largest width of the margin parallel to the hyperplane with no interior points.
More details about SVMs methodology can be found in [46].
4.2.2. Mahalanobis Taguchi System (MTS) Based on Probabilistic Thresholding Method (PTM)
In the PTM method, Chebyshev's theorem is employed to determine the threshold (12) that separates the normal observations from the abnormal ones; see [6]. The threshold in (12) is expressed in terms of the mean of the negative-data MDs, the standard deviation of the negative-data MDs, a small value, and the portion of the negative observations whose MDs are less than the lowest MD of the positive observations.
4.2.3. Naive Bayesian Classifier
Bayes' theorem is at the center of the Naive Bayesian classifier (NB), in which class conditional independence is assumed. This assumption means that the influences of the features on a given class are independent of each other. Mathematically,
$$P(X \mid c) = \prod_{i=1}^{k} P(x_i \mid c), \tag{13}$$
where $X = (x_1, \ldots, x_k)$ is a variable vector of size $k$ and $c$ is the class.
Even with such an unrealistic assumption, Naive Bayes has had noticeable success and is comparable with more sophisticated classifiers; for example, NB has been used in text classification [47], medical diagnosis [48], and systems performance management [49].
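The class conditional independence assumption can be made concrete with a short sketch. For brevity it fits a Gaussian density to each feature, which is an assumption that differs from the kernel distribution used in the experiments; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(x, y):
    """Per-class prior, feature means, and feature variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    model = {}
    for c in np.unique(y):
        x_c = x[y == c]
        model[int(c)] = (np.mean(y == c), x_c.mean(axis=0), x_c.var(axis=0, ddof=1))
    return model

def log_posterior(model, x_new):
    """Unnormalised log P(c) + sum_i log P(x_i | c) for each class c."""
    x_new = np.asarray(x_new, dtype=float)
    scores = {}
    for c, (prior, mean, var) in model.items():
        # Independence assumption: the joint likelihood is a product over
        # features, so the log-likelihood is a sum of per-feature terms.
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var)
        scores[c] = np.log(prior) + log_lik.sum()
    return scores

x = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.7]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(x, y)
scores = log_posterior(model, [1.1, 2.0])
predicted = max(scores, key=scores.get)
```

Working in the log domain turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when the number of features is large.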
4.3. Experimental Settings
The parameter settings for the examined classifiers were selected from the suggestions of the corresponding authors as follows:
(i) MMTS: the MMTS does not need any tuning parameters, which is one of the important benefits of using MMTS over the traditional MTS.
(ii) PTM: for the PTM algorithm, the small-value parameter is set to 0.05, based on the recommendation from [6].
(iii) SVM: for the SVM algorithm, the linear function was used to map observations from the data space to the kernel space.
(iv) NB: for the NB algorithm, the kernel distribution was selected to fit the conditional feature distributions.
It is worth mentioning that no parameter tuning was performed for any of the examined classifiers; consequently, baseline comparisons among the classifiers with their default settings were established, which leads to the selection of the most robust classifier [50].
Finally, MATLAB R2013a was used for the data analysis on an HP machine with an Intel(R) Core(TM) i7 2.2 GHz CPU and 4.00 GB of RAM. For the genetic algorithm, the following parameters were used in the implementation: a population size of 20 chromosomes, with the bit number corresponding to the number of features; a crossover fraction of 0.8; a mutation rate of 0.01; a limit of 100 generations; and, as the stopping criterion, a cumulative change in the fitness function value of less than $10^{-6}$ over 50 iterations.
4.4. Metrics
Several metrics such as accuracy (14), error (15), specificity (16), precision (17), sensitivity or recall (18), G-means (19), and F-measure (20) are used by the research community as comprehensive assessments of classifier performance. The most important of these metrics are the sensitivity and the specificity. The first one (sometimes called recall) can be seen as the accuracy of the positive observations: that is, how many positive observations were classified correctly. On the other hand, specificity can be understood as the accuracy of the negative observations: that is, how many negative observations were classified correctly.
Unfortunately, the examination of accuracy and error rates ((14) and (15)) reveals that these metrics are not sensitive to the data distribution [10]. For example, suppose a dataset consists of ninety percent negative observations and ten percent positive ones. If the classifier ignores the positive observations and classifies all instances as negative, it achieves ninety percent accuracy (i.e., an error rate of 10 percent), which looks like good classification performance for the entire dataset, yet it cannot detect any positive instance, as if the positive class did not exist. In this context, it can be seen that the accuracy and error rate metrics are biased towards one class at the expense of the other.
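The ninety/ten example can be checked numerically. The short sketch below builds such a dataset, applies a classifier that labels everything as negative, and computes accuracy, sensitivity, and specificity from the confusion counts; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# 90 negative and 10 positive observations; the classifier predicts all negative.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, sensitivity, specificity)  # 0.9 0.0 1.0
```

The accuracy of 0.9 hides the fact that the sensitivity is zero, which is the bias described above.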
In order to overcome the above problem, several metrics such as G-means [51] (19), the area under the Receiver Operating Characteristic curve (AUCROC) [52], and F-measure [19] (20) are used to assess the performance of imbalance data classifiers.
The most commonly used metrics for evaluating imbalance data classification performance are G-means and F-measure. The latter weights the importance of recall and precision (controlled by β, whose default value is 1), which results in a better assessment than the accuracy metric but is still biased to one class [10]. Therefore, G-means will be used as the main metric for the analysis.
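For reference, the two metrics discussed above can be computed from confusion counts as in the following sketch. It assumes the usual definitions: the geometric mean of sensitivity and specificity, and the β-weighted harmonic mean of precision and recall. The function names and the example counts are hypothetical; undefined ratios are returned as NaN.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = safe_ratio(tp, tp + fn)
    specificity = safe_ratio(tn, tn + fp)
    return math.sqrt(sensitivity * specificity)


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 weights them equally)."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return safe_ratio((1.0 + beta ** 2) * precision * recall,
                      beta ** 2 * precision + recall)


# Hypothetical confusion counts: 10 positives and 90 negatives in total.
tp, fn, fp, tn = 8, 2, 4, 86
print(g_mean(tp, tn, fp, fn), f_beta(tp, fp, fn))

# A classifier that predicts every observation as negative.
print(g_mean(0, 90, 0, 10), f_beta(0, 0, 10))
```

For the all-negative classifier the geometric mean is zero, while the weighted harmonic mean is undefined because no observation is predicted positive.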
4.5. Classification Results
This section presents the classification results of the MMTS alongside the other four investigated classification algorithms: Support Vector Machines (SVMs), Probabilistic Mahalanobis Taguchi System (PTM), Naive Bayes (NB), and Mahalanobis Taguchi System (MTS) (based on the previously assumed threshold equal to one). In order to investigate the robustness of the studied classifiers with respect to class imbalance, fourteen different UCI [41] datasets and one dataset (welding) from ElBanna et al. [40] were used.
Table 3 summarizes the median G-means values, with the upper and lower limits of the 95% confidence interval based on the nonparametric Wilcoxon Signed Rank Test, of the investigated data for the five classifiers. In order to discriminate among the classifiers' performances, a nonparametric pairwise Wilcoxon test was performed to test the null hypothesis that the two classifiers have equal medians versus the alternative hypothesis that the first classifier's median is larger than the second one; the results of these comparisons are summarized in the ranking score of each classifier for each dataset. Based on this table, one can observe the following:
(i) The MMTS classifier has a higher classification performance than MTS across all fourteen investigated datasets.
(ii) The MMTS has a superior classification performance compared with the other benchmarked classifiers when the imbalance ratio (IR) is high (i.e., IR ≥ 463).
(iii) The MMTS and SVM have equal classification performance when the imbalance ratio (IR) is medium (i.e., 189 ≤ IR ≤ 417).
(iv) The SVM has a superior classification performance compared with the other benchmarked classifiers when the imbalance ratio (IR) is low (i.e., 1 ≤ IR ≤ 189).
(v) The MMTS has the most robust classification performance over the investigated IR range (i.e., the MMTS ranks first eight times and second six times).
(vi) The NB has the lowest classification performance compared with the other benchmarked classifiers over the investigated IR range.
(vii) The effect of the maximum Fisher's discriminant ratio is dominated by the imbalance ratio (IR) effect (i.e., the IR is more important than the discriminant ratio).
 
Fisher discriminant ratio; data overlapping index; imbalance ratio = ; MTS: Mahalanobis Taguchi System classifier at threshold = 1; MMTS: Modified Mahalanobis Taguchi System classifier; PTM: Probabilistic Mahalanobis Taguchi System classifier; SVM: Support Vector Machines classifier; NB: Naive Bayes classifier; LL: lower limit, Med.: median, and UL: upper limit based on 95% confidence interval by using one sample Wilcoxon method; [40]. 
4.5.1. MMTS versus Modified SVMs and NB Classifiers
Many published works [16, 19, 53, 54] have pointed out that the classification performance of SVMs drops significantly when dealing with imbalance data; therefore, modified SVM classifiers have been suggested to overcome this issue at both the data and algorithmic levels. At the data level, the Synthetic Minority Oversampling Technique (SMOTE) [11] has been applied successfully to handle the imbalance data issue, while at the algorithmic level, Adaptive Conformal Transformation (ACT) [54] and Kernel Boundary Alignment (KBA) [19] are among the most popular modified SVM classifiers for imbalance data handling.
Therefore, in order to assess the MMTS classification performance against imbalance data classifiers, the UCI datasets and the corresponding classification results for SVMs, SMOTE, ACT, and KBA reported in [19] were used; the same experimental settings were applied to the MMTS classifier so that its results could be compared with those of the benchmarked classifiers.
Using the classification results obtained from [19] and the test performed using the MMTS classifier, performance metrics in the form of 95% confidence intervals are reported in Table 4. It can be seen that the G-means values of the MMTS classifier are higher than those of the benchmarked classifiers at a relatively high imbalance ratio (i.e., for the Abalone dataset), while for the yeast dataset, the MMTS values were lower than those of KBA and ACT but better than those of SVM and SMOTE. Finally, the MMTS had the lowest performance among the classifiers for the car dataset.

Using the same datasets as in [19], the MMTS classification performance is also compared with modified NB algorithms such as Tree Augmented Naive Bayes (TAN), Hidden Naive Bayes (HNB), Averaged One-Dependence Estimators (AODE), and Weighted Average of One-Dependence Estimators (WAODE). Table 5 shows that the MMTS classification results for the examined datasets have the highest values compared with the others.

5. Case Study
The case presented is from the manufacturing sector, in the area of resistance spot welding. Due to its cost and simplicity, resistance spot welding is the dominant joining process in the auto industry. The reasons for choosing spot welding over other joining processes can be summarized as follows: it is an inexpensive and fast process, it is applicable to joining different types of materials (coated steel, low carbon steel, aluminum, etc.) with varying thickness, and it is relatively robust to the different noise factors existing in the plant, such as fit-up variations. Despite the abovementioned advantages, weld quality cannot be estimated with high certainty due to factors such as tip wear, sheet metal debris, and variation in the power supply; therefore, it is common practice in the auto industry to add extra welds to increase confidence in the structural integrity of the welded assembly [40].
Recently, worldwide competition has pushed automotive OEMs to improve their productivity, reduce non-value-added activities, and reduce cost. Therefore, the auto industry is extremely concerned with the elimination of these redundant welds. To achieve the objective of using the optimum number of welds that sustains the required strength of the structure, an acceptable weld quality must be achieved.
To achieve an acceptable weld quality, nondestructive weld assessment should be performed. This assessment can be translated into the problem of classifying the dynamic resistance profile (input signal) for those welds into normal or abnormal welds.
The welding data used for this case, summarized in Table 6, were collected under conditions similar to those used in ElBanna et al. [40]. The experimental setup, the materials used, and all the other related information can be found in the same reference. The data consisted of 3,294 welds, of which 3,288 were normal welds, and the others were expulsion welds performed by an alternating current (AC) constant current controller. Each weld has 28 features, which represent the dynamic resistance values in the 28 half cycles of the welding time. The welds were performed by an alternating current (AC) welding machine that has a capacity of 180 kVA, with 680 lb of welding force provided by a pneumatic gun. An HWPAL25 truncated electrode type with a 6.4 mm face diameter was used, with a welding time of 14 cycles and 11.3 kA as the initial input secondary current. Tip dressing was performed 10 times (approximately every 300 welds) in order to return the electrode tip to its original diameter by removing the excess material. The constant current control applied a current stepper, one ampere per weld, to compensate for the increase in the electrode diameter, which is known as the mushrooming effect.

5.1. Implementation
The first step after obtaining the dataset was to split it into training and testing groups. In this case, the training data consisted of 1,153 observations (i.e., a training ratio of 35%), of which two were expulsion welds (i.e., positive observations) and the others were normal welds (i.e., negative observations).
The MMTS and the other benchmarked algorithms, in addition to the Mahalanobis Genetic Algorithm (MGA) [3], were run over the welding data. Table 7 shows the results for the 10 repetitions in terms of the following metrics: specificity, sensitivity, precision, G-means, and F-measure. In addition, the suggested threshold is reported for the MMTS and PTM algorithms. As mentioned before, G-means will be used as the main metric, but the results for the other metrics are reported here for future researchers to use.
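The counts quoted for the welding data can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section (3,294 welds, 3,288 normal welds, and 1,153 training observations containing two expulsion welds) and takes the imbalance ratio as the number of majority-class observations divided by the number of minority-class observations; the variable names are illustrative.

```python
total_welds = 3294
normal_welds = 3288
expulsion_welds = total_welds - normal_welds      # positive observations

# Imbalance ratio of the full dataset: majority count over minority count.
imbalance_ratio = normal_welds / expulsion_welds

train_size = 1153
train_positive = 2
train_negative = train_size - train_positive
train_ratio = train_size / total_welds

# Whatever is not used for training is left for testing.
test_positive = expulsion_welds - train_positive
test_negative = normal_welds - train_negative

print(expulsion_welds, imbalance_ratio)           # 6 548.0
print(round(train_ratio, 3), train_negative)      # 0.35 1151
print(test_positive, test_negative)               # 4 2137
```

With only six positive welds in total, each split contains very few positive observations, so the per-repetition metrics can be expected to move in coarse steps.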
 
NaN since the denominator is zero.
In order to determine whether there is a significant difference among the classifiers' performances (i.e., the G-means values in Table 7), the nonparametric Kruskal–Wallis test is used. The p value obtained from performing this test on the welding data is 0.000, which reveals that at least one classifier's performance is significantly different from the others. In order to rank the classifiers, the pairwise Mann–Whitney test is used.
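A pairwise comparison of this kind can be sketched with the standard library alone. The code below computes the Mann–Whitney U statistic by direct pair counting and a one-sided p value from the normal approximation, without tie or continuity corrections; these simplifications, the function names, and the two score lists are assumptions for illustration and are not the values of Table 7.

```python
import math


def mann_whitney_u(first, second):
    """Count the pairs in which a value from `first` exceeds one from `second`.

    Ties contribute one half each.
    """
    u = 0.0
    for a in first:
        for b in second:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def one_sided_p_value(first, second):
    """Normal-approximation p value for H1: `first` tends to be larger."""
    n1, n2 = len(first), len(second)
    u = mann_whitney_u(first, second)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical per-repetition scores for two classifiers.
scores_a = [0.95, 0.97, 0.93, 0.96, 0.94]
scores_b = [0.80, 0.85, 0.78, 0.82, 0.88]

u_stat = mann_whitney_u(scores_a, scores_b)
p_value = one_sided_p_value(scores_a, scores_b)
print(u_stat, p_value)
```

For small samples an exact test from a statistics package would be preferable to the normal approximation used here.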
Table 8 shows the p values obtained from comparing the performances of each pair of classifiers using the Mann–Whitney test, together with the resulting classifier ranks. It can be seen clearly that the MMTS outperforms the other classifiers.
 
The null hypothesis is tested versus the alternative hypothesis , at a specified level of significance 0.05; Mahalanobis Genetic Algorithm [3]. 
This result is also emphasized in the ROC curves and the area under the curve (AUC) values for the examined classifiers (Figure 3).
6. Conclusions
The Mahalanobis Taguchi System (MTS) is one of the most promising binary classification approaches to handling the imbalance data problem. Unfortunately, the MTS suffers from the lack of a systematic, rigorous method for determining the threshold to discriminate between the two classes. In this paper, a nonlinear optimization model with the objective of minimizing the Euclidean distance between the MTS classifier ROC curve and the theoretical optimal point (i.e., true positive rate = 100% and false positive rate = 0%) is used to determine this threshold.
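The idea of choosing the threshold closest to the ideal ROC corner can be sketched as follows. Instead of the nonlinear optimization model used in this paper, the sketch simply scans the observed Mahalanobis distances as candidate thresholds, which is a simpler substitute chosen for illustration. It assumes that an observation is labelled positive when its MD exceeds the threshold; the function names and the MD values are hypothetical.

```python
import math


def roc_distance(threshold, negative_mds, positive_mds):
    """Euclidean distance from the ROC point at `threshold` to the ideal point.

    The ideal point has a true positive rate of 1 and a false positive rate of 0.
    """
    tpr = sum(md > threshold for md in positive_mds) / len(positive_mds)
    fpr = sum(md > threshold for md in negative_mds) / len(negative_mds)
    return math.hypot(fpr, 1.0 - tpr)


def best_threshold(negative_mds, positive_mds):
    """Scan every observed MD as a candidate and keep the closest to the ideal point."""
    candidates = sorted(set(negative_mds) | set(positive_mds))
    return min(candidates,
               key=lambda t: roc_distance(t, negative_mds, positive_mds))


# Hypothetical Mahalanobis distances for each class.
negative_mds = [0.5, 0.8, 1.0, 1.2, 1.5, 2.5]
positive_mds = [2.0, 3.0, 4.0]

threshold = best_threshold(negative_mds, positive_mds)
distance = roc_distance(threshold, negative_mds, positive_mds)
print(threshold, distance)
```

In this toy example the selected threshold keeps every positive observation above it while only one negative observation is misclassified; a fixed threshold of one would misclassify half of the negative observations.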
In order to assess the suggested algorithm, the MMTS has been benchmarked with several popular algorithms: Mahalanobis Taguchi System (MTS), Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi System (PTM), Synthetic Minority Oversampling Technique (SMOTE) with SVM, Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms, over benchmark datasets with a wide range of imbalance ratios (i.e., 1.25 ≤ IR ≤ 2088). The results showed that the MMTS has a superior performance for high imbalance ratios (i.e., IR ≥ 463), while for medium imbalance ratios (i.e., 189 ≤ IR ≤ 417), the MMTS has an equal classification performance with the SVMs. For low imbalance ratios (i.e., 1 ≤ IR ≤ 189), the SVM was the best among the classifiers. It has been noticed that the effect of the maximum Fisher's discriminant ratio is dominated by the imbalance ratio (IR) effect (i.e., the IR is more important than the discriminant ratio). The MMTS showed a very robust classification performance across the range of imbalance ratios; it also showed better classification results compared with KBA and ACT (i.e., state-of-the-art modified SVM classifiers for imbalance data), HNB, NBTree, and other modified Naive Bayes classifiers when the imbalance ratio is relatively high.
In order to demonstrate the applicability of the MMTS, a case study in the welding area was used. The results showed that the MMTS classifier outperformed the benchmarked classifiers and MGA. The case results emphasize that the MMTS is one of the most suitable classification algorithms when there is a high imbalance ratio.
For future research work, the problems of multiclass imbalanced data and the mixed data need to be tackled thoroughly.
Disclosure
Permanent address of Mahmoud ElBanna is as follows: Industrial Engineering Department, German Jordanian University, P.O. Box 35247, Amman 11180, Jordan.
Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This project was funded by the German Jordanian University, Deanship of Graduate Studies and Scientific Research, Seed Fund no. 10/2015.
References
[1] T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on feature distributions,” Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.
[2] G. Hinton, L. Deng, D. Yu et al., “Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[3] M. El-Banna, “A novel approach for classifying imbalance welding data: Mahalanobis genetic algorithm (MGA),” International Journal of Advanced Manufacturing Technology, vol. 77, no. 1–4, pp. 407–425, 2015.
[4] Y. Sun, A. K. C. Wong, and M. S. Kamel, “Classification of imbalanced data: a review,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pp. 687–719, 2009.
[5] F. Provost and T. Fawcett, “Robust classification for imprecise environments,” Machine Learning, vol. 42, no. 3, pp. 203–231, 2001.
[6] C.-T. Su and Y.-H. Hsiao, “An evaluation of the robustness of MTS for imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 10, pp. 1321–1332, 2007.
[7] Z. Zheng, X. Wu, and R. Srihari, “Feature selection for text categorization on imbalanced data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 80–89, 2004.
[8] C.-T. Su and Y.-H. Hsiao, “Multiclass MTS for simultaneous feature selection and classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 2, pp. 192–205, 2009.
[9] B. Song, G. Zhang, W. Zhu, and Z. Liang, “ROC operating point selection for classification of imbalanced data with application to computer-aided polyp detection in CT colonography,” International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 1, pp. 79–89, 2014.
[10] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[12] G. M. Weiss and F. Provost, “Learning when training data are costly: the effect of class distribution on tree induction,” Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.
[13] I. Mani and I. Zhang, “knn approach to unbalanced data distributions: a case study involving information extraction,” in Proceedings of the Workshop on Learning from Imbalanced Datasets, 2003.
[14] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown,” in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, ACM, August 2001.
[15] N. Japkowicz and S. Stephen, “The class imbalance problem: a systematic study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
[16] R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proceedings of the 15th European Conference on Machine Learning (ECML '04), vol. 3201 of Lecture Notes in Computer Science, pp. 39–50, Springer, September 2004.
[17] B. M. Abidine, L. Fergani, B. Fergani, and M. Oussalah, “The joint use of sequence features combination and modified weighted SVM for improving daily activity recognition,” Pattern Analysis and Applications, pp. 1–20, 2016.
[18] Y. Lin, Y. Lee, and G. Wahba, “Support vector machines for classification in nonstandard situations,” Machine Learning, vol. 46, no. 1–3, pp. 191–202, 2002.
[19] G. Wu and E. Y. Chang, “KBA: kernel boundary alignment considering imbalanced data distribution,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 786–795, 2005.
[20] X. Yin and J. Han, “CPAR: classification based on predictive association rules,” in Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 331–335, SIAM, 2003.
[21] W. Li, J. Han, and J. Pei, “CMAR: accurate and efficient classification based on multiple class-association rules,” in Proceedings of the 1st IEEE International Conference on Data Mining, ICDM'01, pp. 369–376, IEEE, December 2001.
[22] B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[23] B. Liu, Y. Ma, and C. K. Wong, “Improving an association rule based classifier,” in Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, pp. 504–509, Springer, 2000.
[24] S. S. Khan and M. G. Madden, “A survey of recent trends in one class classification,” in Proceedings of the Irish Conference on Artificial Intelligence and Cognitive Science, pp. 188–197, Springer, 2009.
[25] B. Krawczyk and B. Cyganek, “Selecting locally specialized classifiers for one-class classification ensembles,” Pattern Analysis and Applications, pp. 1–13, 2015.
[26] N. Japkowicz, “Supervised versus unsupervised binary-learning by feedforward neural networks,” Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.
[27] L. M. Manevitz and M. Yousef, “One-class SVMs for document classification,” Journal of Machine Learning Research, vol. 2, pp. 139–154, 2001.
[28] S. Rajasegarar, C. Leckie, J. C. Bezdek, and M. Palaniswami, “Centered hyperspherical and hyperellipsoidal one-class support vector machines for anomaly detection in sensor networks,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, pp. 518–533, 2010.
[29] S. Ertekin, J. Huang, and C. L. Giles, “Active learning for class imbalance problem,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07, pp. 823–824, ACM, July 2007.
[30] S. Ertekin, J. Huang, L. Bottou, and C. Lee Giles, “Learning on the border: active learning in imbalanced data classification,” in Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM 2007, pp. 127–136, ACM, November 2007.
[31] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM '03), pp. 435–442, Melbourne, Fla, USA, November 2003.
[32] C. X. Ling, Q. Yang, J. Wang, and S. Zhang, “Decision trees with minimal costs,” in Proceedings of the Twenty-First International Conference on Machine Learning, ICML 2004, ACM, July 2004.
[33] P. Domingos, “MetaCost: a general method for making classifiers cost-sensitive,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164, San Diego, Calif, USA, August 1999.
[34] P. Das and S. Datta, “Exploring the effects of chemical composition in hot rolled steel product using Mahalanobis distance scale under Mahalanobis-Taguchi system,” Computational Materials Science, vol. 38, no. 4, pp. 671–677, 2007.
[35] E. A. Cudney, K. Paryani, and M. K. Ragsdell, “Applying the Mahalanobis-Taguchi system to vehicle handling,” Concurrent Engineering Research and Applications, vol. 14, no. 4, pp. 343–354, 2006.
[36] G. Taguchi and R. Jugulum, The Mahalanobis-Taguchi Strategy: A Pattern Technology System, John Wiley & Sons, 2002.
[37] W. H. Woodall, R. Koudelik, K.-L. Tsui, S. B. Kim, Z. G. Stoumbos, and C. P. Carvounis, “A review and analysis of the Mahalanobis-Taguchi system,” Technometrics, vol. 45, no. 1, pp. 1–15, 2003.
[38] A. Pal and J. Maiti, “Development of a hybrid methodology for dimensionality reduction in Mahalanobis-Taguchi system using Mahalanobis distance and binary particle swarm optimization,” Expert Systems with Applications, vol. 37, no. 2, pp. 1286–1293, 2010.
[39] C. E. Metz, “Basic principles of ROC analysis,” Seminars in Nuclear Medicine, vol. 8, no. 4, pp. 283–298, 1978.
[40] M. El-Banna, D. Filev, and R. B. Chinnam, “Online qualitative nugget classification by using a linear vector quantization neural network for resistance spot welding,” International Journal of Advanced Manufacturing Technology, vol. 36, no. 3-4, pp. 237–248, 2008.
[41] A. Asuncion and D. J. Newman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[42] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[43] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[44] T. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.
[45] T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” in Proceedings of the European Conference on Machine Learning, pp. 137–142, Springer, 1998.
[46] L. Bottou and L. Chih-Jen, “Support vector machine solvers,” Large Scale Kernel Machines, pp. 301–320, 2007.
[47] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, vol. 752, pp. 41–48, Citeseer, 1998.
[48] I. Kononenko, “Inductive and Bayesian learning in medical diagnosis,” Applied Artificial Intelligence, vol. 7, no. 4, pp. 317–337, 1993.
[49] J. L. Hellerstein, T. S. Jayram, I. Rish et al., “Recognizing end-user transactions in performance management,” IBM Thomas J. Watson Research Division, 2000.
[50] A. Fernández, V. López, M. Galar, M. J. Del Jesus, and F. Herrera, “Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches,” Knowledge-Based Systems, vol. 42, pp. 97–110, 2013.
[51] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: one-sided selection,” in ICML, vol. 97, pp. 179–186, 1997.
[52] A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[53] G. Wu and E. Y. Chang, “Class-boundary alignment for imbalanced dataset learning,” in Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Data Sets II, pp. 49–56, Washington, DC, USA.
[54] G. Wu and E. Y. Chang, “Adaptive feature-space conformal transformation for imbalanced-data learning,” in ICML, pp. 816–823, 2003.
Copyright
Copyright © 2017 Mahmoud ElBanna. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.