Abstract

Imbalanced datasets are frequently found in many real applications. Resampling is an effective solution because it produces a relatively balanced class distribution. In this paper, a hybrid sampling SVM approach that combines an oversampling technique and an undersampling technique is proposed for addressing the imbalanced data classification problem. The proposed approach first uses an undersampling technique to delete some samples of the majority class that carry less classification information and then applies an oversampling technique to gradually create new positive samples. Thus, a balanced training dataset is generated to replace the original imbalanced training dataset. Experimental results on real-world datasets show that the proposed approach can identify informative samples and effectively deal with the imbalanced data classification problem.

1. Introduction

In the literature and in real-world problems, an imbalanced data distribution appears when the number of samples in one class is much larger than the number of samples in the other class. Many applications such as fraud detection, intrusion prevention, risk management, and medical research often exhibit this imbalanced class distribution. Classifiers constructed on imbalanced datasets usually perform well on the majority class but poorly on the minority class [1]. However, in many cases, the minority class data are the most important ones to detect, for example, in the medical field for disease diagnosis or in the industrial field for fault diagnosis.

Class imbalance has been identified as one of the most challenging problems in the data mining field [2]. Many traditional classification methods tend to be overwhelmed by the majority class and ignore the minority class, so their classification performance on the minority class suffers. In fact, these traditional classifiers, such as support vector machines (SVMs), decision trees, and neural networks, are designed to optimize the overall performance on the whole dataset. To cope with the class imbalance problem, researchers have proposed many methods from the perspectives of data-level approaches and algorithmic approaches. Data-level approaches balance the training dataset of the classifier by resampling techniques, while algorithmic approaches develop new algorithms expressly designed to cope with uneven datasets. The two kinds of approaches are independent of each other and can be combined to enhance each other's performance [3].

Resampling is an effective approach for balancing the training dataset of a classifier and includes undersampling and oversampling techniques. In this paper, a new hybrid sampling approach combining oversampling and undersampling is presented to address the class imbalance problem. The proposed approach first uses undersampling to delete some samples of the majority class that carry less classification information and then applies oversampling to gradually create new positive samples. Thus, a balanced training dataset is generated to replace the original imbalanced training dataset. Experimental results on real-world datasets show that our approach can identify informative samples and deal with the imbalanced data classification problem. In addition, the proposed approach selects SVM as the base classifier. SVM is an effective approach for solving pattern recognition problems; it is an approximate implementation of the structural risk minimization principle based on statistical learning theory (SLT), rather than of the empirical risk minimization method [4].

The rest of the paper is organized as follows. Section 2 presents a comprehensive study of the class imbalance problem and discusses the existing solutions. Section 3 gives a brief description of the support vector machine and then proposes a hybrid sampling SVM approach for addressing the class imbalance problem. In Section 4, we compare the performance of the proposed approach with existing methods. Finally, Section 5 concludes the paper.

2. Related Work

Since many real applications encounter the class imbalance problem, researchers have proposed several methods to solve it. In general, there are two kinds of approaches to cope with the class imbalance problem: data-level approaches and algorithmic approaches [2]. In this section, we review some of the most effective methods that have been proposed within these two categories.

Among data-level approaches, resampling is an effective technique for obtaining a roughly balanced class distribution. Resampling techniques try to balance the dataset either randomly or deterministically and include undersampling methods, oversampling methods, and hybrid methods.

Undersampling methods create a subset of the original dataset by randomly or selectively deleting some of the samples of the majority class while keeping the original population of the minority class [5, 6]. Although this results in information loss for the majority class, undersampling is generally quite successful at countering the class imbalance problem, especially when it uses sophisticated data elimination methods. EasyEnsemble and BalanceCascade, proposed by Liu et al. [7], are two effective informed undersampling methods. Kim [8] proposes an undersampling method based on a self-organizing map (SOM) neural network to obtain sampled data that retain the original data characteristics. García and Herrera [9] present an evolutionary undersampling method for classification with imbalanced datasets. Yen and Lee [10] propose cluster-based undersampling approaches, whose basic idea is to select representative data as training data, improving the classification accuracy for the minority class in an imbalanced class distribution environment.

Oversampling methods [11] generate a superset of the original dataset by replicating some of the samples of the positive class or creating new samples from the original positive class instances. A widely used oversampling technique is SMOTE (synthetic minority oversampling technique) [11], which creates new synthetic samples for the minority class by randomly interpolating between pairs of close neighbors within the minority class. SMOTE is effective in increasing the significance of the positive class in the decision region. Many methods based on SMOTE have been proposed for generating more appropriate instances [12]. Borderline-SMOTE [13] is another approach based on the synthetic generation of instances proposed in SMOTE. Gao et al. [14] propose a probability density function estimation based oversampling approach for two-class imbalanced classification problems. RWO-sampling [15] is a random walk oversampling approach that balances different class sizes by creating synthetic samples through random walks from the real data. RWO-sampling also expands the minority class boundary after the synthetic samples have been generated.
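To illustrate the interpolation idea behind SMOTE, the following Python sketch generates one synthetic minority sample from a minority instance and one of its k nearest minority-class neighbors. The function name smote_sample, the choice of k = 5, and the toy data are illustrative assumptions, not the reference implementation of [11].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, seed=0):
    """Create one synthetic minority sample by SMOTE-style interpolation.

    X_min: array of shape (n_minority, n_features) holding minority-class samples.
    """
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because a point is its own neighbor
    i = rng.integers(len(X_min))                           # pick a random minority sample
    _, idx = nn.kneighbors(X_min[i:i + 1])
    j = rng.choice(idx[0][1:])                             # one of its k nearest minority neighbors
    gap = rng.random()                                     # interpolation factor in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])          # a point on the segment between the two

# Toy usage: ten minority samples in two dimensions
X_min = np.random.default_rng(1).normal(size=(10, 2))
print(smote_sample(X_min))
```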

Hybrid methods use the oversampling technique combined with the undersampling technique to balance the class size. AdaOUBoost [16] adaptively oversamples the minority positive samples and undersamples the majority negative samples to form different subclassifiers and combines these subclassifiers according to their accuracy to create a strong classifier. Cateni et al. [17] present a new resampling approach to address the class imbalance problem, which combines a normal distribution-based oversampling technique and a similarity-based undersampling technique. Cao et al. [18] propose a hybrid probabilistic sampling combined with diverse random subspace ensemble for imbalanced data learning. Luengo et al. [19] analyze the usefulness of the data complexity measures and propose an approach based on SMOTE-based oversampling and evolutionary undersampling to deal with the class imbalance problem.

Algorithmic approaches are another way to deal with the imbalanced data problem; they modify the classifiers themselves to suit imbalanced datasets. Cost-sensitive learning is an effective algorithmic solution, which can improve classification performance by assigning different misclassification costs to the majority and minority classes. In the cost-sensitive framework, the cost of misclassifying minority samples is higher than that of other kinds of errors in order to encourage their correct classification. Cost-sensitive learning is one of the most important topics in machine learning and data mining and has attracted considerable attention in recent years [3, 20, 21]. Many algorithms combining resampling and cost-sensitive learning have also been proposed [22].

Many works modify the classification algorithms themselves. Several specific attempts using SVMs have been made to improve their class prediction accuracy in the case of class imbalance [23–26]. Fu and Lee [27] present a certainty-based active learning algorithm to deal with the imbalanced data classification problem. In order to improve the classification of imbalanced data, Oh [28] proposes a new error function for the error back-propagation algorithm of multilayer perceptrons.

In recent years, ensembles of classifiers have emerged as a possible solution to the class imbalance problem, attracting great interest among researchers because of their flexibility [29, 30]. Ensembles are designed to increase the accuracy of a single classifier by training several different classifiers and combining their decisions to output a single class label. Liu et al. [31] present an ensemble of SVMs, incorporating both oversampling and undersampling, to improve prediction performance. Guo and Viktor [32] present DataBoost-IM, an approach that generates new data and classifies imbalanced data with an ensemble classifier. Oh et al. [33] propose an ensemble learning method combined with active example selection to deal with the class imbalance problem. Woźniak et al. [34] review ensemble classifiers from a new point of view, including approaches to imbalanced data classification.

3. The Proposed Hybrid Sampling SVM Approach

In this section, we first give a description of support vector machine, and then we present our proposed approach.

3.1. Review of Support Vector Machine

SVM was first introduced to solve pattern classification and regression problems by Vapnik and his colleagues [4, 35]. In recent years, SVM has drawn considerable attention due to its high generalization ability across a wide range of applications and its better performance compared with other traditional learning machines. The goal of the SVM learning algorithm is to find a separating hyperplane that separates the data points into two classes.

Consider a binary classification problem consisting of training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ represents a $d$-dimensional data point and $y_i \in \{-1, +1\}$ denotes its class label. The decision boundary of a linear classifier can be written in the following form:
$$\mathbf{w} \cdot \mathbf{x} + b = 0, \tag{1}$$
where $\mathbf{w}$ and $b$ are parameters of the model. To deal with nonlinearly separable data in a more flexible way, we can first transform the training samples into a high-dimensional feature space using a nonlinear mapping $\Phi(\cdot)$. Therefore, (1) can be rewritten as $\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$.

The support vector technique requires the solution of the following optimization problem:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i=1}^{N} \xi_i \tag{2}$$
subject to $y_i\left(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i$ and $\xi_i \geq 0$, for $i = 1, \ldots, N$, where $C$ is the penalty parameter and $\xi_i$ are slack variables.

This optimization problem can be solved by constructing a Lagrangian representation and transforming it into the following dual problem:
$$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \tag{3}$$
subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$ for $i = 1, \ldots, N$.

Once all the $\alpha_i$ are found using quadratic programming techniques, we can use the KKT conditions to express the optimal primal variable $\mathbf{w}$ in terms of the optimal dual variables, according to $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \Phi(\mathbf{x}_i)$. Note that $\mathbf{w}$ depends only on the samples $\mathbf{x}_i$ for which $\alpha_i > 0$, which are called the support vectors (SVs).

When the optimal pair $(\mathbf{w}, b)$ is determined, the SVM decision function is then given by
$$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}) + b. \tag{4}$$

If $f(\mathbf{x}) > 0$, then the test sample $\mathbf{x}$ is classified as the positive class; otherwise, it is classified as the negative class.

Furthermore, the dot product in the transformed space can be expressed as the kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. Thus, the kernel is the key factor that determines the performance of the SVM. Several typical kernel functions are the linear kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$, the polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \left(\gamma\, \mathbf{x}_i \cdot \mathbf{x}_j + r\right)^{p}$, and the RBF kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^{2}\right)$.
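As a concrete illustration of the decision function and kernels above, the following sketch trains a soft-margin SVM with an RBF kernel using scikit-learn; the toy data and the parameter values (C = 1.0, gamma = "scale") are illustrative choices, not settings prescribed in this paper.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: class +1 around (1, 1) and class -1 around (-1, -1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 0.5, (20, 2)), rng.normal(-1.0, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# Soft-margin SVM with the RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

x_test = np.array([[0.8, 1.2]])
print(clf.decision_function(x_test))   # value of sum_i alpha_i y_i K(x_i, x) + b
print(clf.predict(x_test))             # +1 if the decision value is positive, otherwise -1
```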

3.2. The Proposed Hybrid Sampling Approach

This paper proposes a hybrid sampling approach based on the support vector machine to address the imbalanced data classification problem. The proposed approach first uses the SVM method to generate a classification hyperplane and applies an undersampling technique to remove negative samples that carry less classification information. Then, we divide the training dataset into several subsets, in which we synthesize new positive samples using an oversampling technique. Once the majority class has been undersampled and the minority class has been oversampled, a new balanced training dataset is created and used to train an SVM classifier. The proposed approach effectively balances the initially imbalanced dataset and improves classification performance while keeping the data as balanced as possible.

The framework of our proposed approach is as follows. Given the training dataset $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$ represents the class label of a negative or positive sample, respectively, and $N$ is the size of the training dataset, suppose that the imbalanced dataset contains $N^{-}$ samples from the majority (negative) class and $N^{+}$ samples from the minority (positive) class, where $N^{-} + N^{+} = N$ and $N^{-} > N^{+}$. The imbalance ratio is $\mathrm{IR} = N^{-}/N^{+}$.

In the undersampling phase, the proposed approach first trains an SVM classifier on the training dataset $T$, obtains a classification hyperplane, and then deletes some negative samples with less information by undersampling. Our approach is based on the distance between a sample $\mathbf{x}_i$ and the hyperplane, computed as
$$d(\mathbf{x}_i) = \frac{\left|\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b\right|}{\|\mathbf{w}\|}. \tag{5}$$

We proportionately delete negative samples that are far away from the hyperplane according to the calculated distances. After undersampling, the imbalance ratio is reduced; for instance, it may become half of the original IR. We denote the training dataset after undersampling by $T'$, which contains $N'^{-}$ negative samples and $N^{+}$ positive samples.
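A minimal sketch of this undersampling step is given below, assuming the absolute SVM decision value is used as a proxy for the distance in (5) and that enough negatives are removed to roughly halve the imbalance ratio; the helper name undersample_far_negatives and the keep_fraction parameter are illustrative assumptions, not part of the original method description.

```python
import numpy as np
from sklearn.svm import SVC

def undersample_far_negatives(X, y, keep_fraction=0.5):
    """Delete majority (y == -1) samples farthest from an initial SVM hyperplane.

    The absolute SVM decision value is used as a proxy for the distance in (5);
    the negatives with the largest distances are removed so that roughly
    keep_fraction of them remain (halving the imbalance ratio when 0.5).
    """
    clf = SVC(kernel="rbf", gamma="scale").fit(X, y)       # initial classification hyperplane
    neg = np.where(y == -1)[0]
    pos = np.where(y == 1)[0]
    dist = np.abs(clf.decision_function(X[neg]))           # proxy for distance to the hyperplane
    n_keep = max(len(pos), int(round(len(neg) * keep_fraction)))
    keep_neg = neg[np.argsort(dist)[:n_keep]]              # keep negatives closest to the boundary
    keep = np.concatenate([keep_neg, pos])
    return X[keep], y[keep]
```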

In the oversampling phase, the proposed approach first randomly divides the training dataset $T'$ into $k$ disjoint subsets $T_1, T_2, \ldots, T_k$. Each $T_i$ contains both positive and negative samples. We randomly select one subset, for instance $T_1$, and oversample its positive samples using the SMOTE method. Then, a new training dataset $S_1$ is generated by merging the new synthetic samples into $T_1$. We train an initial SVM classifier $\mathrm{SVM}_1$ on dataset $S_1$.

For the $i$th of the remaining $k-1$ subsets, we select the negative samples near the hyperplane of classifier $\mathrm{SVM}_{i-1}$ according to (5) and generate synthetic positive instances using the SMOTE method. Similarly, a new training dataset $S_i$ is generated by merging all positive samples and the new synthetic samples into $S_{i-1}$. We train an SVM classifier $\mathrm{SVM}_i$ on dataset $S_i$.

In the oversampling of the positive class, the smaller the number of positive samples within a subset, the more instances are oversampled.

Based on the description above, the proposed hybrid sampling SVM approach is described in Algorithm 1.

Input: Imbalanced training dataset T
Output: An SVM classifier
Step 1. Train an SVM classifier on training dataset T and delete some negative samples using (5), obtaining T'.
Step 2. Randomly divide T' into k disjoint equal-sized subsets T_1, T_2, ..., T_k.
Step 3. Select subset T_1 and oversample its positive samples using the SMOTE method; generate a new training dataset S_1 by merging the new synthetic samples into T_1; train an initial SVM classifier SVM_1 on dataset S_1.
Step 4. For each subset T_i in the remaining k − 1 subsets do
 Step 4.1. Compute the distances between the negative samples and the hyperplane of classifier SVM_{i−1} according to (5).
 Step 4.2. Select the negative samples with the smallest distances; generate synthetic positive instances using the SMOTE method.
 Step 4.3. Merge all positive samples and the new synthetic samples into S_{i−1}, and obtain dataset S_i.
 Step 4.4. Train an SVM classifier SVM_i on dataset S_i.
Step 5. Classify the dataset using the SVM method, and obtain the final classifier SVM_k.
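
The following Python sketch loosely mirrors the steps of Algorithm 1 under several simplifying assumptions: the absolute SVM decision value stands in for the distance in (5), SMOTE from imbalanced-learn stands in for the oversampling step, each subset is assumed to contain enough positive samples for SMOTE's nearest-neighbor search, and k, the undersampling ratio, and all SVM parameters are illustrative choices rather than the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

def hybrid_sampling_svm(X, y, k=4, seed=0):
    """Loose sketch of Algorithm 1 (labels assumed to be +1 for positive, -1 for negative)."""
    rng = np.random.default_rng(seed)

    # Step 1: train an initial SVM and delete the negatives farthest from the
    # hyperplane, roughly halving the imbalance ratio (|decision value| is a
    # proxy for the distance in (5)).
    clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
    neg, pos = np.where(y == -1)[0], np.where(y == 1)[0]
    dist = np.abs(clf.decision_function(X[neg]))
    keep_neg = neg[np.argsort(dist)[: max(len(pos), len(neg) // 2)]]

    # Step 2: randomly divide the reduced training set into k disjoint subsets.
    keep = rng.permutation(np.concatenate([keep_neg, pos]))
    subsets = np.array_split(keep, k)

    # Step 3: oversample the positives of the first subset with SMOTE, train SVM_1.
    X_bal, y_bal = SMOTE(k_neighbors=3, random_state=seed).fit_resample(X[subsets[0]], y[subsets[0]])
    clf = SVC(kernel="rbf", gamma="scale").fit(X_bal, y_bal)

    # Step 4: for each remaining subset, add its negatives closest to the current
    # hyperplane together with its positives, re-balance with SMOTE, and retrain.
    for idx in subsets[1:]:
        neg_i = idx[y[idx] == -1]
        d = np.abs(clf.decision_function(X[neg_i]))
        near = neg_i[np.argsort(d)[: max(1, len(neg_i) // 2)]]
        grow = np.concatenate([near, idx[y[idx] == 1]])
        X_merged = np.vstack([X_bal, X[grow]])
        y_merged = np.concatenate([y_bal, y[grow]])
        X_bal, y_bal = SMOTE(k_neighbors=3, random_state=seed).fit_resample(X_merged, y_merged)
        clf = SVC(kernel="rbf", gamma="scale").fit(X_bal, y_bal)

    # Step 5: return the classifier trained on the final balanced dataset.
    return clf
```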

4. Experiment and Analysis

In this section, we evaluate the performance of our proposed hybrid sampling approach on real datasets. In the following, we first describe several evaluation measures for the class imbalance problem and then compare the F-measure and G-mean of our method with those of the other methods.

4.1. Evaluation Measures

In general, the performance of a classifier is evaluated based on its overall accuracy on an independent test dataset. However, the overall classification accuracy on an imbalanced dataset is mainly dominated by the majority class. Therefore, accuracy is not an appropriate evaluation measure for imbalanced data. Researchers use different metrics to evaluate the performance of imbalanced data classification methods. These metrics include the accuracy rate, F-measure, geometric mean (G-mean), and AUC [36].

The result of classification can be categorized into four cases as follows. TP (true positive) is the number of actual positives that were correctly classified as positives. FP (false positive) is the number of actual negatives that were incorrectly classified as positives. TN (true negative) is the number of actual negatives that were correctly classified as negatives. FN (false negative) is the number of actual positives that were incorrectly classified as negatives.

Accuracy is the most widely used evaluation metric for assessing classification performance and guiding classifier modeling. The overall accuracy is defined as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{6}$$

G-mean is the geometric mean of the accuracies measured separately on each class; it is commonly used when the performance on both classes is expected to be high simultaneously. G-mean is defined as
$$G\text{-}\mathrm{mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}}, \tag{7}$$
where Sensitivity and Specificity are defined as follows. Sensitivity, also called the TP rate (TPrate) or the recall (Recall), shows the performance on the positive class:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}. \tag{8}$$
Specificity, also called the TN rate (TNrate), shows the performance on the negative class:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}. \tag{9}$$

F-measure is often used in the fields of information retrieval and machine learning for measuring search, document classification, and query classification performance. F-measure considers both the Precision and the Recall to compute the score. Generally, for a classifier, if the Precision is high, then the Recall tends to be low; that is, the two criteria trade off against each other. Precision and Recall are combined to form the F-measure criterion, which is shown in expression (10):
$$F\text{-}\mathrm{measure} = \frac{\left(1 + \beta^{2}\right) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}}, \tag{10}$$
where $\beta$ is set to 1 in this paper. The Precision for the minority class is the percentage of samples predicted as the minority class by the classifier that are classified correctly. It is defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{11}$$

AUC is the area under the receiver operating characteristic (ROC) curve. ROC consists of plotting the true positive rate as a function of the false positive rate along all possible threshold values for the classifier. An ROC curve depicts relative trade-offs between true positives and false positives across a range of thresholds of a classifier. However, it is difficult to compare several classification models through curves. Therefore, it is common for results to be reported with respect to AUC. AUC can be interpreted as the expected proportion of positive samples ranked before a uniformly drawn random negative sample [36].
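For illustration, the sketch below computes accuracy, G-mean, F-measure (with beta = 1), and AUC from a confusion matrix using scikit-learn; the toy labels and scores are made up purely for demonstration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])            # 1 = minority (positive) class
y_pred  = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])
y_score = np.array([.9, .8, .4, .3, .2, .1, .2, .6, .1, .3])   # classifier scores used for AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)                  # expression (6)
sensitivity = tp / (tp + fn)                                   # recall / TP rate, expression (8)
specificity = tn / (tn + fp)                                   # TN rate, expression (9)
precision   = tp / (tp + fp)                                   # expression (11)
g_mean      = np.sqrt(sensitivity * specificity)               # expression (7)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)  # expression (10), beta = 1
auc         = roc_auc_score(y_true, y_score)

print(accuracy, g_mean, f_measure, auc)
```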

In the following, we use the two criteria discussed above, F-measure and G-mean, to evaluate the performance of our approach by comparing it with the other methods.

4.2. Experimental Results and Analysis

In this subsection, we compare our proposed approach for addressing the class imbalance problem with several existing techniques. The experiments use 6 datasets with different degrees of imbalance from KEEL [37]: Cmc2, Glass7, Abalone7, Vowel, Yeast, and Letter4. Information about these datasets is summarized in Table 1. They vary widely in their numbers of classes, attributes, and samples and in their imbalance ratios. When more than two classes exist in a dataset, the target class is considered positive and all the other classes are considered negative. For each dataset, the number of samples, the number of attributes, the size of each class (number of positives and number of negatives), and the imbalance ratio are listed. The class imbalance ratio is calculated as the ratio of the size of the majority class to the size of the minority class.

We compare the performance of 4 methods: undersampling (Under), SMOTE [11], EasyEnsemble [7], and our proposed method. For undersampling, we use a random sampling method. SMOTE is used with five neighbours. EasyEnsemble uses the C4.5 decision tree as its base classifier. In all our experiments, we perform 10-fold cross validation.
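As a rough sketch of this evaluation protocol, the following code runs stratified 10-fold cross validation for a SMOTE + SVM pipeline built with imbalanced-learn on a synthetic imbalanced dataset; the dataset, the pipeline, and all parameter values are illustrative stand-ins and do not reproduce the compared methods or settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.metrics import geometric_mean_score

# Synthetic imbalanced data standing in for a KEEL dataset (IR around 9).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE is applied only inside the training folds, never to the test fold.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("svm", SVC(kernel="rbf", gamma="scale"))])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = {"f_measure": "f1", "g_mean": make_scorer(geometric_mean_score)}
scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
print(scores["test_f_measure"].mean(), scores["test_g_mean"].mean())
```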

In our experiments, F-measure and G-mean are used as metrics. Table 2 shows the average F-measure obtained by the 4 methods. The results indicate that our proposed approach achieves a higher F-measure than the other compared methods on the Cmc2, Abalone7, Vowel, Yeast, and Letter4 datasets, while EasyEnsemble outperforms the other compared methods on the Glass7 dataset. These results indicate that our proposed approach can further improve the F-measure in imbalanced learning.

Table 3 lists the average G-mean of the compared methods. The results show that our proposed approach achieves a higher G-mean than the other compared methods on most of the datasets, while EasyEnsemble achieves a slightly higher G-mean than our approach on the Glass7 dataset. This is consistent with our analysis of the F-measure results.

5. Conclusions

Resampling is an effective approach to the class imbalance problem. This paper proposes a hybrid sampling SVM approach that combines undersampling and oversampling techniques. The proposed approach generates a relatively balanced dataset without significant loss of information and without adding a great number of synthetic samples. Thus, the SVM classifier employed by our approach can effectively improve classification performance on the original imbalanced dataset. Experimental results show that the proposed approach outperforms existing oversampling and undersampling techniques.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.