Machine Learning with Applications to Autonomous Systems
A Structural SVM Based Approach for Binary Classification under Class Imbalance
Class imbalance situations, where one class is rare compared to the other, arise frequently in machine learning applications. It is well known that the usual misclassification error is not suitable in such settings. A wide range of performance measures such as AM and QM have been proposed for this problem. However, due to computational difficulties, few learning techniques have been developed to directly optimize the AM or QM metric. To fill the gap, in this paper, we present a general structural SVM framework for directly optimizing AM and QM. We define the loss functions oriented to AM and QM, respectively, and adopt the cutting plane algorithm to solve the outer optimization. For the inner problem of finding the most violated constraint, we propose two efficient algorithms, one for the AM problem and one for the QM problem. Empirical studies on various imbalanced datasets demonstrate the effectiveness of the proposed approach.
1. Introduction
Classification under class imbalance, where one class is rare compared to the other, is a common and important problem in supervised learning. It arises in many applications, ranging from medical diagnosis and text retrieval to credit risk prediction and fraud detection [1–4]. Due to its practical importance, it has been identified as one of the ten most challenging problems in data mining research. For simplicity and without loss of generality, only binary classification problems under class imbalance are considered in this paper. However, it is important to keep in mind that the class imbalance problem is pervasive in other areas as well, such as multiclass classification and association rule mining.
It is well known that the usual binary learning algorithms are ill-suited to imbalanced domains, because those classifiers are biased towards the majority class, which results in a lower sensitivity in detecting minority class examples. In the literature on class imbalance problems, a variety of approaches have been proposed, which can be mainly categorized into two groups: the data-oriented methods and the algorithm-oriented methods.
The data-oriented methods use various sampling techniques to oversample instances in the minor class [6–8] or undersample those in the major class [9, 10], so that the resulting data is balanced. A typical example is the SMOTE approach, which increases the number of minor class instances by creating synthetic samples.
The second group, the algorithm-oriented methods, aims at extending and modifying existing classification algorithms so that they can be more effective in dealing with imbalanced data. For example, Liu et al. and Kang and Ramamohanarao have presented two different modified decision tree algorithms that improve on the standard C4.5, namely CCPDT and HeDEx, while Köknar-Tezel et al., Joachims et al., and Lipton et al. have proposed various approaches to improve the performance of traditional SVM in imbalanced settings [13–22].
Those two groups are both effective, and it is difficult to say which one is better. However, since our goal in this paper is to improve an existing statistical learning algorithm, in the following we focus on the algorithm-oriented methods and propose a modified SVM approach that directly optimizes imbalance measures. Our algorithm may appear similar to the algorithms in [15–22]; however, we design different objective functions and use different optimization techniques from theirs. More specifically, this paper makes the following contributions.
(1) We adopt the 1-slack structural SVM as the framework and define the loss functions oriented to AM and QM, which are rarely considered in the literature on optimizing imbalance metrics.
(2) We show that the QM loss is a lower bound of the AM one, which means our QM classifier may be more accurate than the AM one.
(3) For the inner computational challenge of the AM loss, we propose to decompose it and apply the Hardy-Littlewood-Polya inequality to solve it in $O(n \log n)$ time, while, for the case of QM, such a decomposition is impossible. We present an efficient greedy method for solving this problem, which also requires $O(n \log n)$ time.
(4) Empirical evaluations on the imbalanced datasets demonstrate that the proposed algorithms are not only significantly better than the standard binary learning algorithm but also competitive with other existing imbalanced algorithms.
The remainder of the paper is organized as follows. In Section 2 the related work is presented. Section 3 discusses the details of our proposed algorithms, and the empirical results on the benchmark datasets are reported in Section 4. Section 5 concludes the paper and discusses the future work.
2. Related Work
2.1. Problem Setup and Notations
As discussed in the Introduction, in this paper, we only consider the binary classification problem. We are given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the $i$th example and $y_i \in \{+1, -1\}$ is the corresponding class label. The binary classification problem is to construct a classifier function $f$ which gives good generalization performance. We assume that the classifier function is of the form $f(x) = w^{\top} x$ and the decision function of the form $\operatorname{sign}(f(x))$ is used when finding the label of an unseen example. Note that we have not included the bias term in the classifier function for notational convenience. However, it can be incorporated in a straightforward way.
In the machine learning area, a common way to find the linear parameter $w$ is minimizing a regularized risk function:
$$\min_{w} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \ell\bigl(y_i, f(x_i)\bigr), \tag{1}$$
where $C$ is a constant that controls the trade-off between training error minimization and margin maximization. $\ell$ is a suitable loss function which measures the discrepancy between a true label $y_i$ and a predicted value $f(x_i)$ obtained using $w$. Different loss functions yield different learners. One of the most famous loss functions is the hinge loss in SVM, which has the form $\max\bigl(0, 1 - y_i w^{\top} x_i\bigr)$.
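As a minimal sketch of the regularized risk with the hinge loss, the following plain-Python snippet assumes the common form $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i w^{\top} x_i)$. The function names and the toy data are illustrative assumptions and are not taken from the paper.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {+1, -1} and a real-valued score w.x."""
    return max(0.0, 1.0 - y * score)


def regularized_risk(w, X, y, C=1.0):
    """Assumed form: 0.5 * ||w||^2 + C * sum of hinge losses (no bias term)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    total_loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi))
        total_loss += hinge_loss(yi, score)
    return reg + C * total_loss


# Toy data: three 2-D points; only the second one lies inside the margin.
w_toy = [1.0, 0.0]
X_toy = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y_toy = [1, 1, -1]
risk = regularized_risk(w_toy, X_toy, y_toy, C=1.0)
print(risk)
```

A larger `C` puts more weight on the training loss relative to the margin term, which is the trade-off described above.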
2.2. Relevant Background
Standard SVM has been used to optimize an estimate of the classification error on the training set and has been shown to be a very powerful tool for classification problems when data is balanced. However, if the data is highly imbalanced, classification error is not always a good measure, and the standard SVM can be misleading. To solve this problem, a number of modified algorithms have been proposed. For example, Köknar-Tezel and Latecki and Shao et al. proposed approaches to improve SVM on imbalanced datasets, which they called GP and WLTSVM, respectively. But their works both focus on improving sampling techniques (e.g., modifying SMOTE in GP) for SVM and do not address the training bias in the design of the SVM learning algorithm per se. Recently, with the advances in learning to rank, direct optimization of ranking measures has been extended to design SVMs for the imbalanced setting, and a variety of algorithm-oriented methods have been proposed. Joachims and Aiolli presented algorithms to optimize AUC for imbalanced data, and the experimental results on the unbalanced sets showed their effectiveness. Along the lines of the above works, Paisitkriangkrai et al. and Narasimhan and Agarwal further gave algorithms for optimizing partial AUC and successfully applied their approaches to real-world tasks [17–19]. Optimizing the F-measure is another popular method for imbalance learning. Joachims, Chinta et al., Maratea et al., and Lipton et al. used different approximations to the F-measure and designed different classifiers. Numerical experiments on the benchmark datasets demonstrated the effectiveness of their algorithms.
However, it is well known that, in evaluating imbalanced settings, there are many other performance measures besides AUC and F-measure, including AM (arithmetic mean) and QM (quadratic mean). The AM is the arithmetic mean of the true positive rate (TPR) and the true negative rate (TNR) and can be defined as
$$\mathrm{AM} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}.$$
The QM is a quadratic mean measure defined over the same two rates, where
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}.$$
Although AM and QM are popular in the imbalanced setting, surprisingly little work has focused on designing algorithms based on them. Only very recently, Menon et al. provided a consistent algorithm which aims at directly optimizing the AM measure. This approach is effective, but it is only suitable for the AM measure; whether it can be extended to other measures such as QM is still unknown. In contrast to the work of Menon et al., in this paper, we present a general learning framework whose loss function allows us to incorporate different imbalance measures. We exploit it for optimizing AM and QM. In the following, we discuss our approach in detail.
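To illustrate why the misclassification error is misleading under class imbalance, the following sketch computes the AM as the arithmetic mean of the true positive and true negative rates. The helper names and the toy labels are assumptions made for illustration; only the AM is covered here.

```python
def true_rates(y_true, y_pred):
    """Return (TPR, TNR) for labels in {+1, -1}.

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def am_score(y_true, y_pred):
    """Arithmetic mean of the true positive and true negative rates."""
    tpr, tnr = true_rates(y_true, y_pred)
    return 0.5 * (tpr + tnr)


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced set: 5 positives, 95 negatives, and a classifier that
# always predicts the majority (negative) class.
y_true = [1] * 5 + [-1] * 95
y_pred = [-1] * 100
print(accuracy(y_true, y_pred), am_score(y_true, y_pred))
```

On this toy set the majority-class predictor reaches 95% accuracy while its AM is only 0.5, the same value a random guess would be expected to obtain.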
3. DOPMID: Direct Optimization of Performance Measure for Imbalanced Dataset
3.1. The Framework of DOPMID
We refer to the classifier we present as DOPMID (Direct Optimization of Performance Measure for Imbalanced Dataset). The framework of DOPMID is based on the structural SVM proposed by Joachims et al. Specifically, we use the 1-slack SVM formulation, presented in (OP1) (optimization problem 1), to learn a linear classifier. Note that the following approach can be extended to nonlinear functions and non-Euclidean instance spaces by using kernels. For simplicity, in the paper, we assume that the training dataset has been ordered with the positive instances ahead of the negative ones, and we count the positive and the negative instances separately. In (OP1), the candidate output stands for any possible permutation of the predicted list obtained with the current weight vector. The mapping function maps an input list to an output list. The loss function measures the difference between the real output and the predicted output, and it must satisfy the conditions in (6). In contrast to the traditional SVM, which has one slack variable per training example, there is only a single slack variable in (OP1). We refer to it as the “1-slack” SVM.
3.2. The Loss Functions Oriented to AM and QM
For the framework above, we need to further define the mapping function and the loss function, in order to determine the optimization target.
In this paper, we first define the mapping function as (7). Then we define the loss function oriented to AM and QM, respectively, as (8) and (9). In equalities (8) and (9), the function is an indicator function, which can be written as (10). It is obvious that the loss functions defined in (8) and (9) satisfy the constraint conditions in (6). It has been proved that if the loss function satisfies (6), the slack is a convex upper bound on the training loss regularized by the norm of the weight vector.
In the following, we show that, although the AM and QM losses both yield upper bounds, the QM bound is lower than the AM bound.
Proof. Since the slack is a convex upper bound on the training loss, we can rewrite (OP1) accordingly. Then we substitute the losses (8) and (9) and get the AM bound and the QM bound, respectively. We can simplify (14) as (15). Since (16) holds, we obtain (17). Since (18) holds, we obtain (19). Combining inequalities (17) and (19), we get (20), which means that the QM bound is no larger than the AM bound and proves the claim.
We can solve (OP1) by substituting (7), (8), and (9) into (5), but unfortunately one difficulty remains: inequality (5) has an exponential number of constraints. To solve this problem, we propose to use the cutting plane algorithm, which is based on the fact that, for any desired precision, a small subset of the constraints is sufficient to find an approximate solution to the problem. The details of the cutting plane algorithm are shown in Algorithm 1.
The algorithm starts with no constraints and iteratively finds the output associated with the most violated constraint. If the corresponding constraint is violated by more than the desired precision, we add it to the working set and re-solve (OP1) with the updated working set. It can be shown that the outer loop of Algorithm 1 is guaranteed to halt within a bounded number of iterations for any desired precision.
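The outer loop described above can be sketched generically as follows. This is not a reproduction of Algorithm 1: the three callbacks (`solve_restricted`, `most_violated`, `violation`) are hypothetical placeholders for the restricted quadratic program, the search for the most violated constraint, and the amount by which a constraint is violated, and the one-dimensional toy problem with constraints of the form slack >= b - a*w is an assumption chosen only so that the sketch runs end to end.

```python
def cutting_plane(solve_restricted, most_violated, violation, eps=1e-3, max_iter=1000):
    """Generic 1-slack cutting-plane loop over a growing working set."""
    working_set = []
    w, xi = solve_restricted(working_set)      # start with no constraints
    for _ in range(max_iter):
        y_hat = most_violated(w)               # inner problem
        if violation(w, y_hat) <= xi + eps:    # nothing violated by more than eps
            break
        working_set.append(y_hat)
        w, xi = solve_restricted(working_set)  # re-solve on the working set
    return w, xi, working_set


# Toy 1-D instance (assumption): constraint k requires slack >= b_k - a_k * w.
CANDIDATES = [(1.0, 1.0), (-1.0, 0.5), (2.0, 1.5)]  # (a_k, b_k)
C = 10.0


def toy_violation(w, k):
    a, b = CANDIDATES[k]
    return b - a * w


def toy_most_violated(w):
    return max(range(len(CANDIDATES)), key=lambda k: toy_violation(w, k))


def toy_solve_restricted(working_set):
    """Minimize 0.5*w^2 + C*slack(w) by ternary search (the objective is convex)."""
    def slack(w):
        return max([0.0] + [toy_violation(w, k) for k in working_set])

    def objective(w):
        return 0.5 * w * w + C * slack(w)

    lo, hi = -10.0, 10.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    w = 0.5 * (lo + hi)
    return w, slack(w)


w_opt, xi_opt, used = cutting_plane(toy_solve_restricted, toy_most_violated, toy_violation)
print(w_opt, xi_opt, used)
```

In this toy run only two of the three candidate constraints enter the working set, which mirrors the point that a small subset of constraints can suffice for the requested precision.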
Since the quadratic program in each iteration of this algorithm is of constant size, the only bottleneck in the algorithm is how to solve the (OP2), which is known as the problem of “finding the most violated constraint.” In the following, we will show how it can be performed efficiently for the AM loss and QM loss, respectively.
3.3. Efficient Algorithms for Finding the Most Violated Constraint
First of all, we rewrite (7) as
3.3.1. The Algorithm for AM Loss
We can calculate the two required quantities according to the current weight vector; hence we apply the Hardy-Littlewood-Polya inequality and observe that the objective is maximized by sorting the terms in decreasing order. Note that this permutation is easily obtained by applying Quick Sort in $O(n \log n)$ time.
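The sorting argument rests on the Hardy-Littlewood-Polya (rearrangement) inequality: a sum of pairwise products of two sequences is largest when both sequences are arranged in the same order. The sketch below checks this on a small example by brute force over all permutations. The function names and numbers are illustrative assumptions and do not reproduce the exact terms of the AM objective.

```python
from itertools import permutations


def best_pairing_by_sorting(a, b):
    """Pair the largest a with the largest b, and so on (one sort per sequence)."""
    a_sorted = sorted(a, reverse=True)
    b_sorted = sorted(b, reverse=True)
    return sum(x * y for x, y in zip(a_sorted, b_sorted))


def best_pairing_brute_force(a, b):
    """Try every permutation of b; feasible only for very short sequences."""
    return max(sum(x * y for x, y in zip(a, perm)) for perm in permutations(b))


a = [3.0, -1.0, 2.0, 0.5]
b = [1.0, 4.0, -2.0, 0.0]
print(best_pairing_by_sorting(a, b), best_pairing_brute_force(a, b))
```

Sorting replaces an exhaustive search over all n! permutations with two sorts, which is what makes the inner problem tractable.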
3.3.2. The Algorithm for QM Loss
For the AM loss, (OP2) can be decomposed linearly in the instances. The (OP2) for the QM loss is quite different and needs a substantially extended algorithm, which we describe in the following. First, we substitute (9) and (21) into (OP2). From the resulting expression, we can see that the decomposition technique used for the AM loss is not suitable for the QM loss, since the loss terms can no longer be absorbed into the per-instance terms. To solve this problem, in the following, we provide a more intricate optimization for the QM loss. The algorithm we propose is based on the fact that each label in the output has only two possible values, +1 and −1. In the following we present an efficient algorithm for finding the most violated constraint (see Algorithm 2).
The idea behind Algorithm 2 is the fact that the most violated constraint must have the following form: positive instances are labeled positive and other positive ones are labeled negative; negative instances are classified as negative and other negative ones are classified as positive. So we can get by testing each with +1, −1. Specifically, Algorithm 2 starts with a “perfect classification” (Step to Step ) and then uses the greedy algorithm to find the most violated constraint with maximum value for (Step to Step ).
Algorithm 2 is very efficient, and its running time can be split into two parts. The first part is the sort (Step , Step ), which requires . The second part is the following steps, which requires time. Though in the worst case this is , the number of the positive instances in the imbalanced datasets is very small, which means the running time for the second part is simply . So Algorithm 2 has complexity of .
4. Experiments
4.1. Datasets and Baselines
The main goal of our experiments is to evaluate whether the classifiers we proposed can outperform the existing binary classifiers in the imbalanced setting. In particular, we select three datasets with varying degrees of class imbalance, taken from the libsvm dataset repository (downloaded from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), which are named as satimage, w1a, and vowel. The characteristics of those sets are summarized in Table 1.
The “#Examples” and “#Features” denote the number of examples and features, respectively. “Min(%)” represents the proportion of examples in the minority class.
On those imbalanced datasets, we compare our classifiers (DOPMID-AM, DOPMID-QM) against the following classifiers: SVM-Hinge, safe-level-SMOTE, CCPDT, SVM-F1, and AM-Consist. Among them, SVM-Hinge is a traditional balanced binary classifier which seeks to minimize the hinge loss, and the latter four algorithms are all for the imbalanced setting. We choose those four as comparisons because we are interested in how our algorithms perform, when compared with other imbalanced algorithms. In the following experiments, unless otherwise noted, the parameters of our algorithms are all chosen in the set , and the error parameters of algorithms are set to 0.001.
4.2. Experimental Results
The experimental results are reported in terms of AM, QM, and F1, which are all commonly adopted as the performance metrics for evaluating imbalanced learning classifiers. More specifically, we compare our proposed classifiers from the following two aspects.
4.2.1. Comparison with SVM-Hinge
We design this comparison in order to see whether the algorithms we proposed above can really be better-behaved than the standard SVM on the imbalanced datasets. The experimental results on the three sets are illustrated in Figure 1.
Figure 1: Comparison with SVM-Hinge on the three datasets. (a) The results in terms of AM. (b) The results in terms of QM. (c) The results in terms of F1.
From Figure 1, we can see that, as expected, on all three datasets, DOPMID-AM and DOPMID-QM are significantly better than SVM-Hinge in terms of all the measures. Due to space limitations, we only report the statistics measured by AM as an example. Compared with SVM-Hinge, DOPMID-AM improves AM by 32.21% on the satimage set, 14.42% on the w1a set, and 6.87% on the vowel set, while, for DOPMID-QM, the improvements are 33.80%, 14.42%, and 13.87%, respectively. All the results support the effectiveness of our method and once again indicate that a more accurate classifier for imbalanced data can be obtained by directly optimizing the imbalanced evaluation metrics.
Meanwhile, it can be seen from the figures that DOPMID-QM outperforms DOPMID-AM on most of the points. More specifically, on the experimental sets, DOPMID-QM is better than DOPMID-AM on 6 of the 9 points and is similar to DOPMID-AM on the other three points. These results thus suggest that DOPMID-QM can be more accurate than DOPMID-AM. This may be due to the fact that the loss function of DOPMID-QM is lower than that of DOPMID-AM, which can yield a more precise classifier. This observation is also consistent with the conjecture of Chen et al. that it is possible to create a more accurate model by defining a tighter bound.
4.2.2. Comparison with Other Imbalanced Algorithms
In the second comparison, we are interested in how our algorithms perform when compared with other imbalanced binary classifiers. More specifically, we compare our DOPMID with safe-level-SMOTE, CCPDT, SVM-F1, and AM-Consist. We choose those four algorithms because, as discussed in the Introduction, they represent two different methods for dealing with the imbalance problem. Safe-level-SMOTE uses an oversampling technique and belongs to the data-oriented methods. We adopt it instead of SMOTE, since it can produce better accuracy than SMOTE by using different weight degrees on the synthetic examples. The latter three algorithms all belong to the algorithm-oriented methods. CCPDT is an efficient decision tree algorithm, which improves C4.5 on imbalanced datasets by using Fisher’s exact test to prune branches of the tree. SVM-F1 and AM-Consist are both SVM based approaches. SVM-F1 modifies the traditional SVM by optimizing the F1 measure, while AM-Consist is a very recently proposed consistent classifier that aims at optimizing the AM measure, which is the same target as DOPMID-AM. We include it to make the effectiveness of our algorithms more convincing. It should be noted that the authors of AM-Consist have proposed two consistent AM algorithms, named Plugin and Balanced, respectively. In our experiments, we select Balanced as the AM-Consist, because a detailed analysis in their supplementary material shows that Balanced performs better than Plugin. Figure 2 depicts the behaviors of those algorithms on the satimage, w1a, and vowel datasets.
Figure 2: Comparison with other imbalanced algorithms on the three datasets. (a) The results in terms of AM. (b) The results in terms of QM. (c) The results in terms of F1.
As can be seen from Figure 2, the performance of the algorithms in comparison varies from one dataset to another, and no single algorithm outperforms the others on all the datasets. For example, when measured by AM, safe-level-SMOTE performs best on the vowel set, while it is the worst one on the other two sets. SVM-F1 performs well on satimage and w1a in terms of F1; however, it yields poor performance on w1a and vowel in terms of QM. Similarly, CCPDT achieves a better performance than AM-Consist on the satimage dataset, but it is worse than AM-Consist on the w1a set. Unlike those comparison algorithms, the proposed DOPMID-QM appears to perform more stably across all the datasets. Statistics show that, when compared with the other imbalanced classifiers, DOPMID-QM performs best on 4 of 9 points and is the second best on 4 of 9 points. It is the third best on w1a when measured by F1. All of this demonstrates that, even compared with other imbalanced binary classifiers, our DOPMID-QM approach is still effective and is suitable for imbalanced settings.
Finally, we compare our DOPMID-AM with AM-Consist, because they both improve the traditional SVM by optimizing the AM measure. The results from Figure 2 are somewhat surprising: DOPMID-AM wins on 5 of 9 points and loses on 4 points. More specifically, when compared with DOPMID-AM, AM-Consist performs better on the w1a set and yields poorer performance on the other two sets. One possible explanation is that AM-Consist is a consistent algorithm and may be more suitable for sets with a large number of examples (such as the w1a dataset).
5. Conclusion
AM and QM are widely used performance measures in the imbalanced setting. In this paper, we have proposed a structural SVM based method, termed DOPMID, for optimizing them. Specifically, we designed the objective functions oriented to AM and QM, respectively, and showed that the QM function is a tighter bound than the AM one. For the problem that the objective functions have an exponential number of constraints, we introduced the cutting plane algorithm for outer optimization, which only needs time. Then, for the inner computational challenge of the AM loss, we proposed to decompose it and applied the Hardy-Littlewood-Polya inequality to solve it, while, for the QM loss, we proposed an efficient greedy algorithm, which still only required time. Our experiments on the imbalanced datasets showed that DOPMID is superior to the existing baseline techniques in terms of performance and stability. In future work, we hope to extend our approach to multiclass classification under class imbalance.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the Humanities and Social Sciences Project of Chinese Ministry of Education (Grant no. 13YJC870003), the Natural Science Foundation of China (Grant no. 61402002), and Key Program of Natural Science Project of Educational Commission of Anhui Province (Grant no. KJ2015A070), Youth Foundation of Anhui University (Grant no. KJQN1119), the Doctor Foundation of Anhui University (Grant no. 01001902), and the Foundation for the Key Teacher by Anhui University.
References
D. Vassis, B. A. Kampouraki, and P. Belsis, “Using neural networks and SVMs for automatic medical diagnosis: a comprehensive review,” in 4th International Conference on Integrated Information (IC-ININFO '14), vol. 1644 of AIP Conference Proceedings, pp. 32–36, AIP Publishing, Madrid, Spain, September 2014.
A. Kshirsagar and L. Dole, “A review on data mining methods for identity crime detection,” International Journal of Electrical, Electronics and Computer Systems, vol. 2, no. 1, pp. 312–318, 2014.
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “Safe-level-SMOTE: safe-level-synthetic minority over-sampling TEchnique for handling the class imbalanced problem,” in Advances in Knowledge Discovery and Data Mining: Proceedings of the 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27–30, 2009, vol. 5476 of Lecture Notes in Computer Science, pp. 475–482, Springer, Berlin, Germany, 2009.
J. L. Hsu, P. C. Hung, and H. Y. Lin, “Applying under-sampling techniques and cost-sensitive learning methods on risk assessment of breast cancer,” Journal of Medical Systems, vol. 39, no. 4, pp. 1–13, 2015.
W. Liu, S. Chawla, D. A. Cieslak, and N. V. Chawla, “A robust decision tree algorithms for imbalanced data sets,” in Proceedings of the 10th SIAM International Conference on Data Mining (SDM '10), pp. 766–777, Sydney, Australia, December 2010.
S. Kang and K. Ramamohanarao, “A robust classifier for imbalanced datasets,” in Advances in Knowledge Discovery and Data Mining, pp. 212–223, Springer, 2014.
S. Paisitkriangkrai, C. Shen, and A. V. D. Hengel, “Efficient pedestrian detection by directly optimizing the partial area under the ROC curve,” in Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV '13), pp. 1057–1064, IEEE, Sydney, Australia, December 2013.
H. Narasimhan and S. Agarwal, “A structural SVM based approach for optimizing partial AUC,” in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 516–524, June 2013.
H. Narasimhan and S. Agarwal, “SVMpAUCtight: a new support vector method for optimizing partial AUC based on a tight convex upper bound,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), pp. 167–175, ACM, 2013.
A. Menon, H. Narasimhan, and S. Agarwal, “On the statistical consistency of algorithms for binary classification under class imbalance,” in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 603–611, 2013.
T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer Academic Publishers, 2002.
W. Chen, T. Y. Liu, and Y. Y. Lan, “Ranking measures and loss functions in learning to rank,” in Advances in Neural Information Processing Systems, pp. 315–323, 2009.