Abstract

In imbalanced learning, resampling methods modify an imbalanced dataset to form a balanced one, because many base classifiers perform better when trained on balanced data than on imbalanced data. This paper proposes a cost-sensitive ensemble method based on cost-sensitive support vector machine (SVM) and query-by-committee (QBC) to solve imbalanced data classification. The proposed method first divides the majority-class dataset into several subdatasets according to the proportion of imbalanced samples and trains subclassifiers using the AdaBoost method. Then, it generates candidate training samples with the QBC active learning method and uses cost-sensitive SVM to learn from these samples. Experimental results on 5 class-imbalanced datasets show that the proposed method achieves higher area under the ROC curve (AUC), F-measure, and G-mean than many existing class-imbalanced learning methods.

1. Introduction

In classification, a data set is imbalanced when the numbers of samples representing the different classes differ greatly [1]. Class-imbalanced problems arise widely in medical diagnosis, fraud detection, network intrusion detection, science and engineering problems, and so on. We consider binary class-imbalanced data sets, in which there is only one positive (minority) class and one negative (majority) class. Most of the data belong to the majority class, and only a small fraction belong to the minority class. Many traditional classification methods tend to be overwhelmed by the majority class and ignore the minority class, so the classification performance for the positive class becomes unsatisfactory.

It is important to select suitable training data in the class-imbalanced classification problem. Resampling is one of the effective techniques for adjusting the size of training sets. Many resampling methods are used to reduce or eliminate the imbalance of a data set, such as oversampling the minority class, undersampling the majority class, and combinations of both. Resampling techniques can be used with many base classifiers, such as support vector machine (SVM), C4.5, Naïve Bayes classifier, and AdaBoost, to address the class-imbalanced problem. Resampling therefore provides a convenient and effective way to deal with imbalanced learning problems using standard classifiers [2]. Modified learning algorithms are another effective approach to the imbalanced data classification problem. These solutions are obtained by modifying existing learning algorithms so that they can deal with imbalanced problems effectively. Integrated approaches, cost-sensitive learning, feature selection, and single-class learning belong to this category. Cost-sensitive learning deals with class imbalance by assigning different costs to the two classes and is considered an important class of methods for handling class imbalance. The difficulty with cost-sensitive classification is that the costs of misclassification are often unknown [3].

Although the existing imbalance-learning methods applied to standard SVMs can address the problem of class imbalance, they may ignore potentially useful information in the majority-class samples and may lead to overfitting. This paper presents a cost-sensitive ensemble method. The proposed method uses the AdaBoost method to train subclassifiers according to the ratio of imbalanced samples, integrates these subclassifiers into a classifier, and uses cost-sensitive SVM to train on the candidate data selected by a query-by-committee (QBC) algorithm.

The rest of the paper is organized as follows. Section 2 presents a comprehensive study of the class-imbalanced problem and discusses the existing class-imbalanced solutions. Section 3 briefly introduces cost-sensitive SVM. Section 4 proposes a cost-sensitive ensemble method for class-imbalanced data sets. In Section 5, we apply a statistical test to compare the performance of the proposed method with the existing methods. Finally, Section 6 concludes this paper.

Many techniques are proposed to solve classification problems based on imbalanced data sets. There are two major categories of techniques developed to address the class-imbalance issue. One is resampling and the other is modified learning algorithmic solutions [4].

Resampling is one of the effective techniques for adjusting the size of a training dataset. In general, it can be further divided into under-sampling and over-sampling approaches. Under-sampling balances a data set by removing samples of the majority class, so that only part of the majority class is used for training. The risk is that the reduced sample set may not represent the full characteristics of the majority class. Many studies discuss under-sampling methods. For example, Kim [5] proposes an under-sampling method based on a self-organizing map (SOM) neural network to obtain sampling data which retains the original data characteristics. Yen and Lee [6] present a cluster-based under-sampling approach for selecting representative data as training data. Their method improves the classification accuracy for the minority class. To address the deficiency of under-sampling, in which many majority-class samples are ignored, Liu et al. [7] propose two effective informed under-sampling methods, EasyEnsemble and BalanceCascade. The EasyEnsemble method samples several subsets from the majority class, trains a learner using each of them, and combines the outputs of those learners. The BalanceCascade method trains the learners sequentially. In each step of BalanceCascade, the majority-class samples that are correctly classified by the current trained learners are removed from further consideration.
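The subset-sampling idea behind EasyEnsemble can be illustrated with a minimal sketch. It covers only the sampling step, not the training or combination of learners, and it assumes that each majority subset has the same size as the minority class. The function name `sample_balanced_subsets` and the toy data are hypothetical.

```python
import random


def sample_balanced_subsets(majority, minority, n_subsets, seed=0):
    """Draw n_subsets random majority subsets, each paired with all minority samples.

    Assumption: every majority subset has the size of the minority class.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Sample without replacement inside one subset; different subsets may overlap.
        picked = rng.sample(majority, len(minority))
        subsets.append(picked + list(minority))
    return subsets


# Toy data: 20 majority (negative) samples and 4 minority (positive) samples.
majority = [("neg", i) for i in range(20)]
minority = [("pos", i) for i in range(4)]
balanced = sample_balanced_subsets(majority, minority, n_subsets=3)
```

Each element of `balanced` is a balanced training set on which one learner could be trained. Because the subsets are drawn independently, majority samples missed by one subset may still appear in another.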

The over-sampling approach adds new data instances to the minority class to balance a data set. These new data instances can be generated either by replicating the data instances of the minority class or by applying synthetic methods. However, over-sampling by replication makes exact copies of samples, which may lead to overfitting [8]. Synthetic minority oversampling technique (SMOTE) [1] is an intelligent over-sampling method that uses synthetic samples. The SMOTE method adds new synthetic samples to the minority class by randomly interpolating between pairs of closest neighbors in the minority class. The SMOTEBoost algorithm [9] combines the SMOTE technique and the standard boosting procedure. It uses SMOTE to improve the accuracy on the minority class and boosting to avoid sacrificing accuracy on the entire data set. Wang et al. [10] propose an adaptive over-sampling technique based on data density (ASMOBD), which adaptively synthesizes a different number of new samples around each minority sample according to its level of learning difficulty. Gao et al. [11] propose an over-sampling approach based on probability density function estimation for two-class imbalanced classification problems.
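The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration that generates a single synthetic point, assuming Euclidean distance and tuples of floats as samples. The function name `smote_sample` and its parameters are hypothetical and not taken from [1].

```python
import random


def smote_sample(minority, k=2, seed=0):
    """Create one synthetic minority point by interpolating towards a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # A random position on the segment between the base point and the chosen neighbour.
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 2.0), (1.2, 1.8)]
synthetic = smote_sample(minority, k=2, seed=1)
```

Because the new point lies on a segment between two existing minority samples, it is not an exact copy of either one, which is why this kind of over-sampling is less prone to the overfitting caused by replication.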

At the algorithmic level, the solutions mainly include cost-sensitive learning, integrated approaches, and modified algorithms. Many cost-sensitive learning methods have been proposed [12, 13]. A common strategy of these methods is to intentionally increase the weights of samples with higher misclassification cost in the boosting process. However, misclassification costs are often unknown, and a cost-sensitive classifier may overfit the training data. Sun et al. [14] investigate cost-sensitive boosting algorithms for advancing the classification of imbalanced data and propose three cost-sensitive boosting algorithms by introducing cost items into the learning framework of AdaBoost. Guo and Viktor [15] propose a modified boosting procedure, DataBoost, to solve the imbalanced problem. DataBoost combines boosting and ensemble-based learning algorithms. In terms of modified algorithms, several specific attempts have been made to improve the class prediction accuracy of SVMs in the presence of class imbalance [16, 17]. The results obtained with such methods show that SVMs have the particular advantage of being able to solve the problem of skewed vector spaces without introducing noise. Wang and Japkowicz [13] combine modification of the data distribution with modification of the classifier and use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem.

In addition, Wang et al. [18] develop two models to yield the feature extractors and propose a method for extracting minimum positive and maximum negative features for imbalanced binary classification. Based on the divide-and-conquer principle, the scalable instance selection approach OligoIS is proposed in [19] for class-imbalanced data sets. OligoIS handles the class-imbalanced problem and scales to data sets with many millions of instances and hundreds of features.

3. Cost-Sensitive SVM

SVM has been widely used in many application areas of machine learning. The goal of the SVM learning algorithm is to find a hyperplane that separates the data points into two classes. In order to find a better separation of classes, the data are first transformed into a higher-dimensional feature space. However, regular SVM performs poorly on imbalanced data sets, because the learned boundary lies too close to the minority (positive) samples. SVM should therefore be biased in a way that pushes the boundary away from the positive samples [16]. Using different error costs for the positive and negative classes, SVM can be extended to the cost-sensitive setting by introducing an additional parameter that penalizes the errors asymmetrically.

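Consider a binary classification problem represented by a data set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^k$ represents a k-dimensional data point and $y_i \in \{+1, -1\}$ represents the class of that data point, for $i = 1, \ldots, n$. Let $I^+ = \{i \mid y_i = +1\}$ and $I^- = \{i \mid y_i = -1\}$. The support vector technique requires the solution of the following quadratic programming problem [20]:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C^+ \sum_{i \in I^+} \xi_i + C^- \sum_{i \in I^-} \xi_i \tag{1}$$

subject to

$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \tag{2}$$

where the training vectors $x_i$ are mapped into a higher-dimensional space by the function $\phi$. Parameter $C^+$ represents the cost of misclassifying a positive sample, and $C^-$ represents the cost of misclassifying a negative sample. The optimal result can be obtained when $C^- / C^+$ equals the minority-to-majority class ratio. The slack variables $\xi_i > 0$ hold for misclassified samples, and therefore $\sum_i \xi_i$ can be thought of as a measure of the amount of misclassification. This quadratic optimization problem can be solved by constructing a Lagrangian representation and transforming it into the following dual problem:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{3}$$

subject to

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C^+ \ (i \in I^+), \quad 0 \le \alpha_i \le C^- \ (i \in I^-), \tag{4}$$

where $\alpha_i$ is the Lagrangian parameter and $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel function. Note that the kernel trick is used in (3).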
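To make the asymmetric penalty concrete, the following minimal sketch evaluates the cost-sensitive primal objective for a given linear decision function. It assumes a linear kernel (no feature map) purely for illustration. The names `class_costs`, `primal_objective`, `c_pos`, and `c_neg` are hypothetical, with `c_pos` and `c_neg` standing for the costs of misclassifying a positive and a negative sample.

```python
def class_costs(n_pos, n_neg, c_neg=1.0):
    """Choose the positive cost so that c_neg / c_pos equals n_pos / n_neg."""
    c_pos = c_neg * n_neg / n_pos
    return c_pos, c_neg


def primal_objective(w, b, X, y, c_pos, c_neg):
    """Return 0.5 * ||w||^2 plus the class-weighted sum of hinge slacks (linear kernel)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)  # zero when the margin constraint is satisfied
        penalty += (c_pos if yi == 1 else c_neg) * slack
    return regularizer + penalty


# One positive (minority) sample and three negative (majority) samples.
X = [(0.5,), (-1.0,), (-2.0,), (-3.0,)]
y = [1, -1, -1, -1]
c_pos, c_neg = class_costs(n_pos=1, n_neg=3)
objective = primal_objective((1.0,), 0.0, X, y, c_pos, c_neg)
```

In this toy case only the positive sample violates its margin, and its slack is weighted three times as heavily as a negative slack would be, which is what pushes the learned boundary away from the minority class.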

4. An Ensemble Method Based on Cost-Sensitive SVM and QBC

This paper presents an ensemble method based on cost-sensitive SVM and QBC, called CQEnsemble, specifically designed for imbalanced data classification. The proposed method applies division and boosting techniques to a simple QBC strategy [21, 22] and improves classification precision while keeping the data as balanced as possible. In order to overcome the shortcomings of over-sampling and under-sampling, the CQEnsemble method trains sub-classifiers using the AdaBoost algorithm [23] according to the ratio of imbalanced samples and integrates these sub-classifiers into a classifier. AdaBoost can be used in conjunction with many other learning algorithms to improve their performance. In this way, the proposed method not only makes full use of the minority-class information but also reflects different aspects of the majority-class information.

Suppose that an imbalanced dataset $D$ contains $N^-$ samples from the majority class and $N^+$ samples from the minority class, where $N^- > N^+$. First, the CQEnsemble method divides the training data set into $n$ equal-sized subsets, where $n$ is greater than or equal to 3. Then, we randomly select two subsets and generate two sub-classifiers that serve as the QBC committee and vote on the samples in the other $n - 2$ subsets. We add the samples on which the two committee members vote differently to the candidate data set. Because the category of these samples is difficult to decide, they probably carry abundant information. Last, we merge the candidate data set and the two selected subsets into a new training data set and train a classifier on it using the cost-sensitive SVM method. The experiments in this paper show that the CQEnsemble method can obtain comprehensive classification information when $n$ is 5.
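The committee vote described above amounts to keeping the samples on which two classifiers disagree. A minimal sketch, assuming the committee members are plain Python callables that return a class label; the threshold rules below are hypothetical stand-ins for the trained sub-classifiers.

```python
def select_disagreements(samples, member_a, member_b):
    """Keep the samples that the two committee members label differently."""
    return [s for s in samples if member_a(s) != member_b(s)]


def member_a(x):
    # Hypothetical stand-in for the first sub-classifier.
    return 1 if x > 0.5 else -1


def member_b(x):
    # Hypothetical stand-in for the second sub-classifier.
    return 1 if x > 0.3 else -1


remaining_samples = [0.1, 0.4, 0.55, 0.9]
candidates = select_disagreements(remaining_samples, member_a, member_b)
```

Here only the sample 0.4 falls between the two decision thresholds, so it is the only one added to the candidate set. Samples on which both members agree are treated as uninformative and left out.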

Based on the description above, the proposed CQEnsemble method is described as follows.

Algorithm 1 (the CQEnsemble method).
Input. Imbalanced data set $D$.
Output. An ensemble classifier $F$.
Step 1. Suppose that the training set is $D$ and the total number of samples is $N$. Divide $D$ randomly into $n$ equal-sized subsets, labeled $D_1, D_2, \ldots, D_n$.
Step 2. Select two subsets randomly and, for convenience, label them $D_1$ and $D_2$. For each subset $D_i$ ($i = 1, 2$) do
Step 2.1. Compute the ratio $r$ of the number of majority-class samples to the number of minority-class samples in $D_i$.
Step 2.2. Divide the majority-class samples into $r$ subsets.
Step 2.3. Merge the minority-class samples with each majority subset to obtain $r$ training sets.
Step 2.4. Train a weak classifier on each training set from Step 2.3 using the AdaBoost algorithm, obtaining weak classifiers $h_{i,j}$, where $j = 1, 2, \ldots, r$.
Step 2.5. Regard these weak classifiers as features, and integrate them into a classifier $H_i$.
End for
Step 3. Use classifiers $H_1$ and $H_2$ to classify the samples in the remaining subsets, and add the samples on which the two classifiers disagree to a new candidate set $S$.
Step 4. Merge the two selected subsets with the candidate set $S$ to obtain a new training set $T$.
Step 5. Train on $T$ using the cost-sensitive SVM method to obtain the classifier $F$.
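Steps 2.1 to 2.3 of the algorithm can be sketched as follows. The paper does not state how a non-integer ratio is handled or how the majority samples are assigned to subsets, so rounding the ratio and splitting a shuffled list are assumptions of this sketch. The function name `build_training_sets` is hypothetical.

```python
import random


def build_training_sets(majority, minority, seed=0):
    """Split the majority class into about (majority/minority) parts and add the minority class to each."""
    rng = random.Random(seed)
    # Step 2.1: imbalance ratio, rounded to an integer (assumption).
    ratio = max(1, round(len(majority) / len(minority)))
    # Step 2.2: partition the shuffled majority samples into `ratio` disjoint subsets.
    shuffled = list(majority)
    rng.shuffle(shuffled)
    chunks = [shuffled[i::ratio] for i in range(ratio)]
    # Step 2.3: merge the full minority class into every majority subset.
    return [chunk + list(minority) for chunk in chunks]


majority = [("neg", i) for i in range(12)]
minority = [("pos", i) for i in range(4)]
training_sets = build_training_sets(majority, minority)
```

Each resulting training set is roughly balanced, and together the sets cover every majority sample exactly once, so no majority-class information is discarded before the weak classifiers are trained.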

5. Experiment and Analysis

In this section, we first give several evaluation measures for class-imbalanced problem, and then present and discuss, in detail, the results obtained by the experiments carried out in this research.

5.1. Evaluation Measures

Accuracy is an important evaluation metric for assessing classification performance and guiding classifier modeling. However, accuracy is not a useful measure for imbalanced data, particularly when the number of instances of the minority class is very small compared with that of the majority class [24]. For example, if the class ratio is 1 : 100, a classifier that assigns all instances to the majority class will have about 99% accuracy. Such a measurement is meaningless for applications in which the learning concern is the identification of the rare cases.
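The 1 : 100 example can be checked with a few lines of arithmetic. The label encoding (1 for the minority class, 0 for the majority class) is an assumption of this sketch.

```python
# One minority (1) sample for every 100 majority (0) samples.
labels = [1] + [0] * 100
# A trivial classifier that assigns every instance to the majority class.
predictions = [0] * len(labels)

correct = sum(1 for p, t in zip(predictions, labels) if p == t)
accuracy = correct / len(labels)

true_positives = sum(1 for p, t in zip(predictions, labels) if p == 1 and t == 1)
minority_recall = true_positives / sum(labels)
```

The accuracy is 100/101, roughly 99%, while the recall on the minority class is zero: the classifier never identifies a single rare case.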

Several measures have been developed to deal with the classification problem with class imbalance, including F-measure, G-mean, and AUC [25]. Given the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), we can obtain the confusion matrix presented in Table 1 after a classification process. We can also define several common measures. The TP rate (TPR), recall ($R$), or sensitivity is defined as

$$\mathrm{TPR} = R = \frac{TP}{TP + FN}.$$

The TN rate (TNR) or specificity is defined as

$$\mathrm{TNR} = \frac{TN}{TN + FP}.$$

Precision ($P$) is defined as the fraction of retrieved instances that are relevant:

$$P = \frac{TP}{TP + FP}.$$

Based on these measures, other measures have been presented, such as F-measure and G-mean. F-measure is often used in the fields of information retrieval and machine learning for measuring search, document classification, and query classification performance. F-measure considers both the precision and the recall to compute the score [26]. It can be interpreted as a weighted average of the precision $P$ and the recall $R$ as follows:

$$F\text{-measure} = \frac{2 \times P \times R}{P + R}.$$

G-mean is defined by two parameters called sensitivity (TPR) and specificity (TNR). Sensitivity shows the performance on the positive class, and specificity shows the performance on the negative class. G-mean measures the balanced performance of a learning algorithm between these two classes. G-mean is defined as

$$G\text{-mean} = \sqrt{\mathrm{TPR} \times \mathrm{TNR}}.$$
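As a worked example, the measures above can be computed from the four confusion-matrix counts. The sketch assumes the balanced form of the F-measure (equal weight on precision and recall) and does not guard against zero denominators. The function name `imbalance_metrics` and the counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, precision, F-measure, and G-mean from counts."""
    tpr = tp / (tp + fn)          # sensitivity, also the recall
    tnr = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    g_mean = math.sqrt(tpr * tnr)
    return {
        "tpr": tpr,
        "tnr": tnr,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical counts: 10 positive and 90 negative test samples.
metrics = imbalance_metrics(tp=8, fp=4, tn=86, fn=2)
```

With these counts the accuracy would be 94%, yet the precision is only 8/12, which the F-measure reflects. G-mean drops to zero whenever either class is entirely misclassified, which is why it is preferred over accuracy for imbalanced data.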

A receiver operating characteristic (ROC) curve is a graphical plot which depicts the performance of a binary classifier as its discrimination threshold is varied. In an ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity) for different cut-off points. Each point on the ROC curve represents a (sensitivity, specificity) pair corresponding to a particular decision threshold. The ideal point on the ROC curve would be (0, 1); that is, all positive samples are classified correctly, and no negative samples are misclassified as positive. An ROC curve depicts relative trade-offs between benefits (true positives) and costs (false positives) across a range of thresholds of a classification model. However, when several classification models are compared, it is difficult to decide from the curves alone which one is best. AUC is the area under an ROC curve. It has been shown to be a reliable performance measure for imbalanced and cost-sensitive problems [25]. AUC provides a single measure of a classifier's performance for evaluating which model is better on average.
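AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of this pairwise computation follows; the function name `auc_score`, the scores, and the 1/0 label encoding are hypothetical.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, t in zip(scores, labels) if t == 1]
    negatives = [s for s, t in zip(scores, labels) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Classifier scores for two positive and four negative samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
auc = auc_score(scores, labels)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration. A rank-based formulation gives the same value more efficiently on large data sets.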

5.2. Experimental Results and Analysis

In our experiments, we used 5 data sets to test the performance of the proposed method. These data sets are from the UCI Machine Learning Repository [27]. Information about these data sets is summarized in Table 2. These data sets vary extensively in their sizes and class proportions. We take the minority class as the target class and all the other categories as majority class. When more than two classes exist in the data set, the target class is considered to be positive and all the other classes are considered to be negative. We compared the performance of 5 methods, including AdaBoost, SMOTE [1], SMOTEBoost [9], EasyEnsemble [7], and our proposed CQEnsemble method.

In our experiments, F-measure, G-mean, and AUC are used as metrics. For each data set, we perform a 5-fold cross validation. In each round, four of the five folds are used as the training set, and the remaining fold is used as the testing set. This process is repeated 5 times so that every sample is used for both training and testing.
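The 5-fold protocol can be sketched with index splitting alone. The paper does not say whether the folds are stratified by class, so this sketch uses a plain random split, which is an assumption. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


splits = list(k_fold_indices(23, k=5))
```

Every index appears in exactly one test fold and in the training set of the other four rounds. On strongly imbalanced data a stratified split is often preferable, since a plain random fold may contain very few minority samples.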

Figure 1 shows the average F-measure values of the compared methods. The results show that CQEnsemble has higher F-measure than other compared methods on haberman, pima, and letter data sets. EasyEnsemble achieves the highest F-measure on transfusion data set among these methods, and AdaBoost achieves the highest F-measure on phoneme data set. The results indicate that CQEnsemble can further improve the F-measure metric of imbalanced learning.

The average G-mean values of the compared methods are summarized in Figure 2. The results show that CQEnsemble has a higher G-mean than the other compared methods on most of the datasets, while EasyEnsemble has a slightly higher G-mean than CQEnsemble on the transfusion dataset. From Figures 1 and 2, EasyEnsemble has the highest F-measure and G-mean on the transfusion dataset among these methods.

Figure 3 shows the AUC metric of each method for the haberman, transfusion, pima, phoneme, and letter data sets. The results show that the proposed CQEnsemble method obtains the highest average AUC among the compared methods. The methods perform equivalently on the letter data set. Overall, SMOTE is the weakest of the 5 methods; EasyEnsemble is slightly better than AdaBoost, SMOTE, and SMOTEBoost, while CQEnsemble is better than EasyEnsemble. The results show that the CQEnsemble method effectively avoids the shortcomings of resampling methods.

CQEnsemble attains higher average F-measure, G-mean, and AUC than almost all the other methods, except that it is slightly worse than EasyEnsemble in F-measure, G-mean, and AUC on the transfusion data set. The experimental results imply that the proposed CQEnsemble method is better than the AdaBoost, SMOTE, SMOTEBoost, and EasyEnsemble methods on most of the data sets. These experiments also indicate that the combination of the division-boost method and cost-sensitive learning can further improve the performance of imbalanced learning.

6. Conclusions

In this paper, we propose the CQEnsemble method, based on cost-sensitive SVM and QBC, to solve imbalanced data classification. The CQEnsemble method divides the majority class into several subsets according to the proportion of imbalanced samples. It then selects effective training samples for the final training set using the QBC active learning algorithm, so it avoids the shortcomings of over-sampling and under-sampling. Experimental results show that the proposed method has higher F-measure, G-mean, and AUC than many existing class-imbalanced learning methods.

Acknowledgments

This work is supported by China Postdoctoral Science Foundation (no. 20110491530) and Science Research Plan of Liaoning Education Bureau (no. L2011186).