Abstract

Imbalanced data classification is gaining importance in data mining and machine learning. The minority class recall rate requires special treatment in fields such as medical diagnosis, information security, industry, and computer vision. This paper proposes a new strategy and algorithm based on a cost-sensitive support vector machine to improve the minority class recall rate to 1 because the misclassification of even a few samples can cause serious losses in some physical problems. In the proposed method, the modification employs a margin compensation to make the margin lopsided, enabling decision boundary drift. When the boundary reaches a certain position, the minority class samples will be more generalized to achieve the requirement of a recall rate of 1. In the experiments, the effects of different parameters on the performance of the algorithm were analyzed, and the optimal parameters for a recall rate of 1 were determined. The experimental results reveal that, for the imbalanced data classification problem, the traditional definite cost classification scheme and the models classified using the area under the receiver operating characteristic curve criterion rarely produce results such as a recall rate of 1. The new strategy can yield a minority recall of 1 for imbalanced data as the loss of the majority class is acceptable; moreover, it improves the G-means index. The proposed algorithm provides superior performance in minority recall compared to the conventional methods. The proposed method has important practical significance in credit card fraud, medical diagnosis, and other areas.

1. Introduction

Imbalanced data classification is gaining importance in data mining and machine learning [1–3]. An imbalance in a particular dataset occurs when the number of instances in one class (the minority class) is significantly smaller than that in the other class (the majority class). The minority class is generally of interest in data classification; hence, it is considered a positive (+) class, whereas the majority class is considered a negative (−) class [4, 5]. For instance, in the medical field, as there is more interest in the disease samples than in the health samples, the disease samples comprise the minority class. Traditional machine learning models mainly consider the accuracy of the two categories as equally important, i.e., they do not consider the accuracy of some categories to be more important than that of other categories, particularly when the sample size of the minority category is small [6, 7]. Although these models have been successfully applied to various balanced data classification problems, their performance decreases substantially when applied to imbalanced datasets [8–10] owing to the lack of sufficient training data in the positive class, which are required to perform an accurate classification of its instances. To address this issue, the positive class of a dataset has attracted increased attention [11], and traditional classifiers have undergone numerous improvements in order to be able to handle imbalanced data applications.

In certain applications, the tolerance to errors in the positive class is extremely low. For instance, in the case of industrial system fault diagnosis [12], the classifier must deal with an imbalanced dataset, i.e., the number of available healthy class instances outnumbers the faulty class ones. Accordingly, it is necessary to develop a classifier that accounts for this type of imbalanced data distribution and warns of all possible failures, even if there may be many false-warning occurrences. Credit card fraud detection is a well-known classification problem [13]. In order to target a specific customer segment, banks use data mining algorithms to classify customers as buyers and nonbuyers. In this context, if a model correctly detects a potential customer for a campaign, there will be a particular profit related to gaining that customer; if a potential buyer is not identified, the profits that would be gained from him/her might be lost. However, if a potential nonbuyer is identified as a buyer, credit card fraud would occur, causing the bank potential massive losses. Similarly, a 100% recall model reduces some of the profits but does not risk credit card fraud occurrences. Furthermore, failing to diagnose a cancerous lesion is unacceptable and can have devastating effects on a patient, although this situation rarely occurs [14]. In general, the classifier is only used as an aid to manual diagnosis, which means that the classifier can only diagnose patients at risk of cancer and provide an artificial judgment. There are disastrous consequences if the classifier misses any patient who may have cancer. In this study, we developed a cost-sensitive support vector machine (SVM) model to increase the recall of a positive sample from an actual background to 100%. Based on this model, we proposed a medium algorithm strategy to increase the positive-class recall rate to 1. 
We introduced different penalty factors, namely, C+ and C−, for the positive and negative SVM slack variables during the training process and adjusted the classification boundary by altering the positive-class margin. This approach ensured that the positive-class samples would be more generalized to achieve the target recall rate of 1 when the boundary reached a certain position. For parameter selection, we employed the grid search approach and selected a model with a recall of 100% and a higher specificity for the negative class. The SVM model was adopted to address the data imbalance and exhibited an acceptable performance [15–21].

Our study makes a significant contribution to the field because we were able to confirm that (1) increasing the recall rate to 100% is a feasible classification indicator; (2) the decision boundary could be altered successfully by correcting the positive margin; and (3) the G-means increases when the recall rate is increased to 1 in some datasets. The experimental results demonstrated that these advantages improve the positive-class classification performance to a greater extent than those achieved in previous studies.

The remainder of this paper is organized as follows. Section 2 surveys existing approaches to imbalanced data classification. In Section 3, the cost-sensitive SVM model is briefly reviewed. In Section 4, the proposed cost-sensitive model is introduced to improve the recall rate of the positive class, and the tests performed using the new method on actual data are described. Section 5 presents and discusses the results. Finally, Section 6 summarizes the advantages and disadvantages of the proposed method and suggests future research directions.

2. Related Work

In this section, we will briefly discuss different imbalanced dataset classification problems. The existing classification methods for imbalanced data can be roughly divided into two categories [22, 23]: data-level and algorithm-level methods [24–27]. We will first discuss the most effective approaches and then discuss the advantages and limitations of these approaches.

Data-level approaches, which are also known as sampling methods [28], typically involve data preprocessing. These approaches rebalance highly skewed class distributions using various resampling methods, such as oversampling of the positive instances and undersampling of the negative instances, and at times, both methods are combined as well [29]. The simplest way to balance a dataset is by undersampling (randomly or selectively) the majority class, while keeping the original data of the minority class. However, this method results in loss of information of the majority class [30]. Another approach that can be used is oversampling in which the minority class instances are randomly duplicated to rebalance class distribution. Although oversampling does not result in loss of information of the majority class, it can cause overfitting. To solve this issue, Chawla et al. [31] proposed a method called Synthetic Minority Oversampling Technique (SMOTE) to generate new instances by linear interpolation between closely lying minority class samples. SMOTE generates new minority samples by interpolating between k-nearest minority class neighbors and has a better classification effect than random oversampling. However, the samples generated through this method may cause an overlap between the two categories.
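The interpolation step described above can be illustrated with a minimal sketch. It is a simplified, SMOTE-like procedure rather than the full algorithm of Chawla et al. [31]; the function name `smote_like_samples`, the toy points, and the default `k = 3` are assumptions made only for illustration.

```python
import math
import random

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class samples on the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=5)
```

Because every synthetic point is a convex combination of two minority samples, it can fall inside a region occupied by the majority class, which is the overlap problem noted above.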

In contrast, using algorithm-level approaches [32, 33], researchers have been able to introduce cost-sensitive learning to reduce the degree of imbalance by assigning a higher learning cost to positive-class samples [34–36]. Algorithm-level approaches directly modify the learning procedure to improve the sensitivity of the classifier toward minority classes. One such crucial approach to class-imbalanced learning was proposed by Veropoulos et al. [37], who used different penalty constants for different classes to assign higher costs to errors in classifying positive-class instances than those in classifying negative-class instances. However, this method does not consider the distance between the two types of samples and the classification hyperplane.

Some studies successfully applied the aforementioned methods in several different fields such as cancer diagnosis [38], sentiment analysis, and text classification [39]. In this study, we improved Veropoulos’s method, modified the distance between the positive samples and the classification hyperplane, and developed an SVM classification algorithm with special classification purposes.

3. Cost-Sensitive Support Vector Machine

The goal of classification is to map feature vectors $x \in X$ to class labels $y \in \{-1, +1\}$ [40]. As in previous studies [41, 42], a training set $D = \{(x_i, y_i)\}_{i=1}^{N}$ is given, where $x_i$ is an instance with an n-tuple of attribute values that belong to a certain instance space $X$ and $y_i \in \{-1, +1\}$ is a label.

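A standard classification problem of a linear SVM can be expressed as

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \tag{1}$$

subject to

$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N, \tag{2}$$

where $C > 0$ is the penalty parameter.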

The standard SVM model for the design of classification algorithms minimizes the probability of error, assuming that all misclassifications have the same cost. To control the classification recall, the penalty-regularized model proposed by Veropoulos et al. [37] was closely inspected. The key idea of this model is to introduce uneven loss functions to reweight the penalties of the samples in the imbalanced classes [7] and reduce the bias of the classification boundary toward the negative class. By predetermining the class labels,

$$I_+ = \{ i \mid y_i = +1 \}, \quad I_- = \{ i \mid y_i = -1 \}, \tag{3}$$

where $I_+$ and $I_-$ denote the index set for the positive and negative classes, respectively. When $I_+$ and $I_-$ are categorized, different costs are assigned to the two classes. The standard SVM can be expanded to

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C_+ \sum_{i \in I_+} \xi_i + C_- \sum_{i \in I_-} \xi_i \tag{4}$$

subject to

$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N, \tag{5}$$

where $C_+$ is the cost of a false negative and $C_-$ is the cost of a false positive. $\xi_i$ are slack variables.

In the regularization model proposed by Veropoulos et al. [37], the weight vector, $w$, is a d-dimensional vector normal to the decision boundary; the bias, $b$, is a scalar for offsetting the decision boundary; and the slack variables, $\xi_i$, measuring the losses are used to urge samples to satisfy the boundary constraints in the optimization. Thus, the cost value for the positive class is typically higher than that for the negative class. The dual problem of the primal model is written using Lagrange multipliers as follows:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j \tag{6}$$

subject to

$$\sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C_+ \ (i \in I_+), \quad 0 \le \alpha_i \le C_- \ (i \in I_-). \tag{7}$$

4. Increasing Minority Recall Model

The key objective of this study was to identify a misclassification cost value with a special purpose using a specific method, assuming that the costs of all types of misclassifications are not equal and that the true costs of misclassification cannot be determined. The goals are to increase the recall rate of the positive class of all datasets to 100% for physical problems and to improve the accuracy of the negative class as much as possible. When positive recall is increased to 1, the accuracy of the negative class may be affected but does not decrease significantly and is within an acceptable range.

4.1. Strategies to Improve Recall Rates for Minority Samples

Presently, in cost-sensitive learning, the cost-sensitive factor is often determined by a random interval or by using the sample number ratio between categories as the misclassification cost [43]. However, we developed a class of imbalanced datasets whose data structure enables us to search for misclassification costs with a “special purpose,” which is increasing the positive recall rate to 1 because misclassification of positive samples can cause massive losses in physical problems. Modifying the loss function forces the classification algorithms to be biased toward the positive classes, and the classification boundary leans toward the negative class. The key idea is to adjust the margin of the positive class to cause the classification boundary to shift. Because the theoretical threshold of 0 is used as the judgment threshold of the sign function, as long as the 0 point is on the left side of the classification boundary, as many positive classes as possible will be included. Using the grid search method, the theoretical threshold and classification boundary are adjusted, resulting in a recall rate of 1.

When the data distribution is as shown in Figure 1, the use of this method will be limited because there is too much overlap between classes. The decision boundary adjusted by our method classifies all positive-class samples in the imbalanced dataset correctly; as a side effect, it assigns the negative-class samples in the overlapping area to the positive class.

4.2. Proposed Support Vector Machine Model

The SVM minimizes the hinge loss function $L(y, f(x)) = \max(0, 1 - y f(x))$, where $f(x) = w^T x + b$.

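Veropoulos et al. [37] extended the loss function in the biased support vector machine (B-SVM) classification as follows:

$$L_+(f(x)) = C_+ \max(0, 1 - f(x)), \quad y = +1, \tag{8}$$

$$L_-(f(x)) = C_- \max(0, 1 + f(x)), \quad y = -1. \tag{9}$$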

Equations (8) and (9) assign different cost values to instances in the positive and negative classes, respectively. The misclassification costs of samples in the positive class are generally set to outweigh those of samples in the negative class. As our objective was to increase the recall rate of the positive-class samples to 1, we put a constraint on the positive-class margin and extended the loss function as follows:

$$L_+(f(x)) = C_+ \max(0, 1 - k f(x)), \quad y = +1, \tag{10}$$

$$L_-(f(x)) = C_- \max(0, 1 + f(x)), \quad y = -1. \tag{11}$$

In Figure 2, $C_+ \cdot k$ controls the slope of the positive-class loss, $k$ controls its intersection with the abscissa axis, and the intersection is at $1/k$. $C_-$ controls the slope of the negative-class loss, and the intersection is at 1. If the loss is 0, the classification confidence must be sufficiently high. For the positive class, the loss is 0 when the degree of confidence is greater than $1/k$.
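The slopes and intersection points described above can be checked numerically with a short sketch. It assumes that the positive-class loss has the form C+ · max(0, 1 − k·f(x)) and the negative-class loss the form C− · max(0, 1 + f(x)), which is the reading suggested by the slope C+ · k and the zero-loss point 1/k; the function names are hypothetical.

```python
def positive_loss(f, c_pos, k):
    """Assumed positive-class loss: zero once the decision value f reaches 1/k,
    otherwise growing linearly with slope c_pos * k."""
    return c_pos * max(0.0, 1.0 - k * f)

def negative_loss(f, c_neg):
    """Assumed negative-class loss: zero once f is at or below -1,
    otherwise growing linearly with slope c_neg."""
    return c_neg * max(0.0, 1.0 + f)

# With k = 2, a positive sample needs a decision value of only 0.5 for zero loss.
loss_at_margin = positive_loss(0.5, c_pos=3.0, k=2.0)
loss_on_boundary = positive_loss(0.0, c_pos=3.0, k=2.0)
```

Under this reading, changing k moves the hinge point of the positive class only, while the negative-class hinge point stays fixed.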

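By replacing the original loss with the loss functions shown in (10) and (11), the original SVM can be extended to

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C_+ \sum_{i \in I_+} \xi_i + C_- \sum_{i \in I_-} \xi_i \tag{12}$$

subject to

$$k (w^T x_i + b) \ge 1 - \xi_i \ (i \in I_+), \quad -(w^T x_i + b) \ge 1 - \xi_i \ (i \in I_-), \quad \xi_i \ge 0. \tag{13}$$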

Here, $C_+$ and $C_-$ are two types of costs, and the positive margin can be changed by adjusting the value of $k$. In addition to the adjustable penalty, the motivation of this study is to give the loss functions of the imbalanced classes different hinge points. A biased decision boundary caused by imbalanced classes can be recovered with the help of a scalable margin. Herein, we develop the cost-sensitive model for solving the SVM class-imbalanced problem that has both an adjustable penalty and a scalable margin.

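To solve the newly created problem, the Lagrangian function is introduced. The Lagrangian of the primal model can be written using Lagrange multipliers as follows:

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C_+ \sum_{i \in I_+} \xi_i + C_- \sum_{i \in I_-} \xi_i - \sum_{i \in I_+} \alpha_i \left[ k (w^T x_i + b) - 1 + \xi_i \right] - \sum_{i \in I_-} \alpha_i \left[ -(w^T x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i,$$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$.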

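Therefore, the dual optimization model of (12) is defined as

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \left[ k^2 \sum_{i, j \in I_+} \alpha_i \alpha_j x_i^T x_j - 2k \sum_{i \in I_+} \sum_{j \in I_-} \alpha_i \alpha_j x_i^T x_j + \sum_{i, j \in I_-} \alpha_i \alpha_j x_i^T x_j \right]$$

subject to

$$k \sum_{i \in I_+} \alpha_i - \sum_{i \in I_-} \alpha_i = 0, \quad 0 \le \alpha_i \le C_+ \ (i \in I_+), \quad 0 \le \alpha_i \le C_- \ (i \in I_-).$$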
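As a reading aid, the step from the Lagrangian to the dual can be sketched through the stationarity conditions. The sketch assumes that the positive-class constraint of (12) reads $k(w^T x_i + b) \ge 1 - \xi_i$ and the negative-class constraint reads $-(w^T x_i + b) \ge 1 - \xi_i$, with $\alpha_i$ the multipliers of these constraints and $\mu_i$ the multipliers of $\xi_i \ge 0$; the symbols $I_+$ and $I_-$ denote the positive and negative index sets.

```latex
\begin{aligned}
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = k \sum_{i \in I_+} \alpha_i x_i - \sum_{i \in I_-} \alpha_i x_i, \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; k \sum_{i \in I_+} \alpha_i - \sum_{i \in I_-} \alpha_i = 0, \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; \alpha_i + \mu_i = C_+ \ (i \in I_+), \qquad \alpha_i + \mu_i = C_- \ (i \in I_-).
\end{aligned}
```

Substituting the first condition back into the Lagrangian leaves $\sum_i \alpha_i - \frac{1}{2}\|w\|^2$, and the third condition together with $\mu_i \ge 0$ gives the box constraints on $\alpha_i$.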

Our objective is to solve the dual problem (Algorithm 1).

Algorithm 1: Positive margin adjustment using SVM.
Given: a sequence of N examples XTrain and XValidation
Output: G #Output combination classifier
Variables:
α #Lagrange multipliers (initial alpha), checked against the Karush–Kuhn–Tucker (KKT) conditions
g #G-means value
Cp, Cn, k #Positive cost, Negative cost, Positive margin calibration variable
T #the selected running iterations
Function:
S #classifier model
RG-means (G) #Obtain the Recall G-means values from G
Begin
Initialize
α = 0
g = 0
T = 1
Set the 3D grid search range of Cp, Cn, k:
(a) Select optimization variables α1 and α2 and solve the optimization problem using the sequential minimal optimization (SMO) algorithm to obtain α1_new and α2_new, and update α to α_new.
(b) If the KKT conditions in (16)–(19) are satisfied within the allowable range of precision, ε, proceed to the next step; otherwise, return to process (a).
(c) Get α*.
(d) Finally, α* is obtained, and w* and b* are calculated from α*.
Construct a classifier model S = w*·x + b*
(e) G = sign (S)
(f) gT = RG-means (G(XValidation)) at a condition recall of 1
(g) If gT > g, then
Return GT
End
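The search loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the SMO solver is replaced by plain subgradient descent on a linear primal with the positive-class constraint k(w·x + b) ≥ 1 − ξ, the kernel is linear rather than RBF, plain G-means is used as the selection score, and all function names, toy data, and grid values are hypothetical.

```python
import itertools

def decision(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear(X, y, c_pos, c_neg, k, epochs=200, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 plus the asymmetric hinge losses
    (a stand-in for the SMO solver named in Algorithm 1)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of 0.5*||w||^2 is w
        for x, label in zip(X, y):
            f = decision(w, b, x)
            if label == 1 and k * f < 1:      # positive hinge is active
                scale = -c_pos * k
            elif label == -1 and -f < 1:      # negative hinge is active
                scale = c_neg
            else:
                continue
            grad_w = [g + scale * xj for g, xj in zip(grad_w, x)]
            grad_b += scale
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def recall_and_gmeans(w, b, X, y):
    pred = [1 if decision(w, b, x) >= 0 else -1 for x in X]
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, y))
    tn = sum(p == -1 and t == -1 for p, t in zip(pred, y))
    recall = tp / sum(t == 1 for t in y)
    specificity = tn / sum(t == -1 for t in y)
    return recall, (recall * specificity) ** 0.5

def grid_search(X_tr, y_tr, X_val, y_val, cps, cns, ks):
    """Keep the model with validation recall 1 and the highest G-means."""
    best = None  # (g_means, (cp, cn, k), (w, b))
    for cp, cn, k in itertools.product(cps, cns, ks):
        w, b = train_linear(X_tr, y_tr, cp, cn, k)
        recall, g = recall_and_gmeans(w, b, X_val, y_val)
        if recall == 1.0 and (best is None or g > best[0]):
            best = (g, (cp, cn, k), (w, b))
    return best

# Hypothetical one-dimensional toy data: two positives, six negatives.
X_train = [[2.0], [3.0], [-1.0], [-2.0], [-3.0], [-1.5], [-2.5], [-3.5]]
y_train = [1, 1, -1, -1, -1, -1, -1, -1]
X_val = [[2.5], [-1.2], [-2.8], [-0.8]]
y_val = [1, -1, -1, -1]
best = grid_search(X_train, y_train, X_val, y_val, [1.0, 10.0], [1.0], [1.0, 2.0])
```

`grid_search` returns `None` when no grid point reaches a validation recall of 1, which corresponds to the heavily overlapping case in which the method is limited.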
4.3. Experiment
4.3.1. Performance Evaluation

In a classification problem, evaluation measures play a key role in assessing the performance of the classification model. The overall prediction accuracy is used to evaluate the classification of a balanced dataset; however, it is not an effective metric for an imbalanced dataset because it does not consider the prediction accuracy of either class. This lack of consideration is mainly due to the fact that the negative sample size is sometimes much larger than the positive sample size. In such cases, with an imbalance of 99 to 1, a classifier that classifies everything as negative will be 99% accurate, but it will be completely useless as a classifier. Therefore, more attention should be paid to the positive class. The current classification indicators are based on the confusion matrix presented in Table 1.

In the confusion matrix, true positive (TP) is the number of positive-class instances that have been correctly classified, true negative (TN) is the number of negative-class instances that have been correctly classified, false positive (FP) is the number of negative instances that have been incorrectly classified as positive, and false negative (FN) is the number of positive instances that have been incorrectly classified as negative.

To select the classification index of the positive class, we directly selected the recall (recall = TP/(TP + FN)) and specificity (specificity = TN/(TN + FP)), ensuring both a recall rate of 1 and high specificity.

To balance the effects of recall and specificity on the classification results, an evaluation index, G-means, can be constructed as the geometric mean of recall and specificity:

$$G\text{-means} = \sqrt{\text{recall} \times \text{specificity}}$$

Although the number of positive samples may be very small, the omission and misjudgment of positive samples are given sufficient consideration by this index. Even if the overall classification accuracy is excellent, the G-means value may be very low. The G-means indicator is effective for the classification evaluation of imbalanced data; however, it attaches equal importance to recall and specificity. As more attention was paid to recall in the present study, G-means was extended to recall G-means, which increases the emphasis on recall. Ultimately, the proposed method performed the best with more emphasis on recall.
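The indicators above can be computed directly from the confusion matrix, as in the following minimal sketch. The function names are hypothetical, and the recall G-means is taken here as G-means multiplied by recall, following the description of the new indicator in Section 5. The example reproduces the 99-to-1 case discussed earlier, in which a classifier that labels everything as negative is 99% accurate but useless.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, TN, FP) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp, fn, tn, fp

def imbalance_metrics(y_true, y_pred):
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g_means = math.sqrt(recall * specificity)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "specificity": specificity,
        "g_means": g_means,
        "recall_g_means": recall * g_means,  # assumed form: G-means x recall
    }

# One positive among 100 samples; the classifier predicts negative for all.
y_true = [1] + [-1] * 99
y_pred = [-1] * 100
m = imbalance_metrics(y_true, y_pred)
```

In this example the accuracy is high while recall, G-means, and recall G-means are all zero, which is why accuracy is not used as the evaluation index.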

4.3.2. Experiments

We used ten class-imbalanced datasets with various positive ratios and compared our algorithm with the B-SVM [37], cost-sensitive support vector machine (CS-SVM) [44], and BP neural network algorithms [45], as well as two special classification costs. These experiments proved that it is feasible to adjust the positive-class margin to achieve a positive recall rate of 1.

4.3.3. Open Dataset

As shown in Table 2, we compared the classification differences between the proposed method and other methods under different classification standards. We selected ten real-world imbalanced datasets from the UCI machine learning data repository [46]. Among them, ecoli1, ecoli2, glass6, car1v3, car1v4, glass5, segment1, and glass6 were constructed using a multiclassification problem.

4.3.4. Experimental Design

We compared the proposed method with several SVM-based methods and a neural network, including the B-SVM, CS-SVM, and BP approaches, as well as two special classification costs (no cost and prorated cost). For the prorated cost, we set the misclassification cost based on the following equation:

$$\frac{C_+}{C_-} = \frac{N_-}{N_+},$$

where $N_+$ and $N_-$ are the numbers of positive and negative samples, respectively.

The Gaussian RBF kernel is used for all SVM algorithms, where $\sigma$ is the bandwidth parameter that must be predetermined. We attempted to select the best value through a grid search over the range between 0.01 and 20. For different datasets, while searching for parameters such as C and k, different grid widths were selected to accelerate parameter selection. For both positive and negative class costs, the range of [0.01, 100] was selected, and the search range of k was fixed at [0.01, 10]. For the BP neural network, the logistic function was selected as the activation function. The learning rate was selected within the range of [0.000001, 1], the number of hidden layers was selected from 1 to 3, and the search range of the number of neurons in each hidden layer was [2, 1000]. Five-fold cross-validation was used in each set of parameter experiments to determine the set of parameters with the best generalization performance. We chose the recall, specificity, G-means, and area under the receiver operating characteristic curve (AuROC) instead of global evaluation indicators to evaluate the performance of the classification method for imbalanced datasets. No cost and prorated cost were classified using a judgment threshold of 0, whereas the B-SVM, CS-SVM, and BP approaches used AuROC as the classification standard for the recall, specificity, and G-means.

5. Results and Discussion

To demonstrate visually the ability of the proposed method to adjust the classification boundary, we selected two datasets (yeast5 and breast cancer d) and projected them into a two-dimensional space. First, to demonstrate the influence of the value of k on the decision boundary for the two datasets, we used different k values, as shown in Figure 3. In Figure 3, from left to right, the values of k are as follows: k = 1, k = 3, k = 2.416, k = 1, k = 3, and k = 2. The values were selected after a broad survey, and the recall at the adjusted k value is 1 for both datasets. For the breast cancer d dataset, the classification effect of the adjusted k value is clearly better than that of the other two, arbitrarily selected k values; hence, we will not explain this dataset in detail. However, for the yeast5 dataset, the other two k values also reach a recall of 1. The preliminary observation is that the classification effect when k = 1 is similar to that when k = 2; therefore, we compared their respective G-means. When k = 1, G-means = 0.936; when k = 3, G-means = 0.900; and when k = 2, G-means = 0.964. The comparison revealed that the adjusted k value not only increased the recall to 1 but also did not reduce the G-means value. Therefore, changing the margin of the positive class can adjust the position of the classification boundary so that the positive classes can be properly classified by the boundary.

In addition, the receiver operating characteristic (ROC), through which the threshold is determined, is a widely used evaluation index for classification problems. For the classification problem, we obtained a set of predicted values and classified the data by traversing the predicted values and using each of them in turn as the threshold. Predicted values less than the threshold are classified as negative, and predicted values greater than the threshold are classified as positive. Therefore, for each threshold, we can determine a unique pair of true positive rate (TPR) and false positive rate (FPR). The threshold determination criterion of the ROC curve is the maximum of TPR − FPR: the threshold at which TPR − FPR is the largest is taken as the threshold of the ROC curve.
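The threshold selection rule described above can be sketched as follows. The function name and the toy scores are hypothetical, and a predicted value equal to the threshold is treated as positive, which is an assumption because the text does not specify the tie case.

```python
def roc_threshold(scores, labels):
    """Return (threshold, TPR, FPR) for the threshold maximising TPR - FPR.
    A sample is predicted positive when its score is >= the threshold."""
    n_pos = sum(label == 1 for label in labels)
    n_neg = len(labels) - n_pos
    best = None
    for thr in sorted(set(scores)):  # traverse the predicted values
        tp = sum(s >= thr and label == 1 for s, label in zip(scores, labels))
        fp = sum(s >= thr and label != 1 for s, label in zip(scores, labels))
        tpr, fpr = tp / n_pos, fp / n_neg
        if best is None or tpr - fpr > best[1] - best[2]:
            best = (thr, tpr, fpr)
    return best

# Hypothetical decision values: one positive scores lower than two negatives.
scores = [-1.5, -0.9, -0.6, -0.2, 0.4, 0.8, 1.1]
labels = [-1, -1, 1, -1, -1, 1, 1]
threshold, tpr, fpr = roc_threshold(scores, labels)
```

On this toy data the criterion selects the threshold 0.8, which removes every false positive but misses the positive sample scored at −0.6, so the recall under the ROC criterion stays below 1.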

Frequently, the classification threshold under the ROC criterion cannot increase the recall to 1, as in the experiments on four datasets whose results are presented in Figure 4. The experiments show that the threshold converges to a certain value as k increases. As shown in Figure 4, the classification ROC threshold of the four datasets decreases gradually with increasing k and exhibits an obvious trend of asymptotic stability. Based on the experimental results, an appropriate threshold can be chosen so that the recall requirement of 1 can again be met by adjusting k. We completed five cross-validation experiments to reduce randomness, and the results indicate that our scheme is widely applicable.

Table 3 summarizes the recall and specificity obtained with different methods. In Tables 4–6, the best results are highlighted (bold font). To examine the significance of the positive examples in the imbalanced datasets, we studied the recall of the positive examples, as presented in Table 4. In the datasets shown, the average recall rate of the proposed method is the largest, effectively increasing the recall rate to 1 in each case. The G-means index is shown in Table 5. It can be observed from Table 5 that although the average G-means value of our method is not the highest, this value is not reduced much compared to that in the other three ideal classification cases. Moreover, the G-means value of our proposed method is higher than that of the other five cases for some datasets. It can be observed that when the recall rate of the positive class reaches 1, the accuracy of the negative class does not decrease significantly and is within the acceptable range. These results demonstrate that the proposed method outperforms the other methods for the positive class of imbalanced datasets and that there may be a certain degree of improvement in the G-means index.

Figure 5 shows the combined G-means indicators obtained using several methods. Because we consider the recall rate to be extremely important, G-means multiplied by recall is used as the new indicator, which increases the weight of recall in G-means so that the new indicator places more emphasis on recall. The G-means indices that attach more importance to recall are presented in Table 6. When the new indicator is adopted, our proposed method has the highest average value.

Figure 6 clearly depicts the recall G-means evaluation indicator under the comprehensive influence of the classification effects of the positive and negative classes. These results verify that the proposed method performs well on all the datasets.

For the statistical analysis, we implemented Student's t-test to verify whether significant differences exist between the proposed method and the other methods in the experiment. The t-value in Student's t-test is calculated from $\bar{x}$, the sample mean of the data; $s$, the standard deviation of the data; and $n$, the sample size. In this case, the sample size is set to 10. As a case study, we compare the proposed method with other methods. We calculate the t-value using the recall and G-means data listed in Tables 4 and 5. Let $x_1$ be the sample mean obtained by the proposed method and $x_2$ be the sample mean of each of the other three methods considered for the comparison. The null hypothesis is $H_0: x_1 = x_2$, and the alternative hypothesis is $H_1: x_1 > x_2$. The same is true for the G-means test. Three t-tests were conducted for the three models listed in Table 4, and the results were compared. For the recall of the BP, CS-SVM, and B-SVM models, the t-values obtained are 2.258, 2.631, and 2.671, respectively. According to Student's t-distribution table, the critical t-value is 1.813 at the probability threshold of 0.05. The calculated t-values 2.258, 2.631, and 2.671 are all greater than the critical value 1.813; thus, at the 0.05 level of significance, the null hypothesis is rejected in favor of the alternative hypothesis. The recall value obtained by the proposed method is greater than that of the other methods. For the t-test of the G-means of the BP, CS-SVM, and B-SVM models, the t-values obtained are 1.320, −0.689, and −0.483, respectively. The calculated t-values 1.320, −0.689, and −0.483 are all lower than the critical value 1.813; thus, the null hypothesis cannot be rejected. Therefore, at the 0.05 level of significance, we conclude that no significant difference exists in the G-means indicators obtained by the several models.

Based on the above conclusions, at the 0.05 level of significance, the recall of the proposed method is significantly greater than that of other methods, and no significant difference exists in the G-means.

6. Conclusions

In this paper, a cost-sensitive SVM algorithm based on an imbalanced margin was proposed for the classification of imbalanced data. This method was based on the theory proposed by Veropoulos et al. [37], and its feasibility was verified using both theoretical and experimental results. The recall rate of small classes was improved by adjusting the SVM positive classification margin. The proposed method was also compared with other traditional methods. The experimental results demonstrate that a small-class recall rate of 1 can be achieved using the proposed method. However, the proposed approach still has some disadvantages. Specifically, some accuracy of the negative class is lost in some datasets, but in many cases, the performance improves compared with that of the traditional methods. When the classification evaluation criteria are changed (i.e., more emphasis is placed on the positive classes), the average evaluation index of the proposed method is the highest. Such classification results are of great significance in fields such as finance, medicine, engineering, and astronomy. In future work, we will test the experimental setup employed in this study using different machine learning models and attempt to apply this method to practical problems. Additionally, we will extend the proposed method to multiclass classification problems by adopting a one-versus-all approach.

Data Availability

The datasets used to support the findings of this study have been deposited in the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank Editage (http://www.editage.com) for English language editing. This work was supported by Heilongjiang Province Statistical Science Project (no. 2020B06).