Abstract

Class imbalance problems often degrade the classification performance of most standard classifiers. Many methods have been developed to address these problems, such as cost-sensitive learning methods, the synthetic minority oversampling technique (SMOTE), and random oversampling (ROS). However, the existing methods still suffer from problems such as the possible loss of useful information and overfitting. To solve these problems, we propose an adaptive ensemble method whose ensemble rule adapts itself according to the average Euclidean distance between the test data and the training data, where the average distance is calculated by the k-nearest neighbors (KNN) algorithm. Simulation results confirm that the proposed method performs better than existing ensemble methods.

1. Introduction

Imbalanced data refers to a data set with great differences in the number of instances in each class. Imbalanced data arises in many real-world domains and plays a key role in civilian and government applications, such as text classification [1], facial age estimation [2], speech recognition [3], and governmental decision-making support systems [4]. The study of imbalanced data is of great significance in fields such as credit fraud detection, data mining, and illegal account intrusion detection. Hence, more and more researchers pay attention to class imbalance issues, because traditional classification algorithms applied to imbalanced data are not well suited to classifying minority classes. The imbalance problem has also attracted attention in related areas such as machine learning and data mining [4].

The present study focuses on the binary class imbalance problem, where the data set is divided into a majority class and a minority class. In a balanced data set the classes contain equal numbers of instances, whereas in an imbalanced data set the class sizes differ significantly. Binary class imbalanced and balanced data are illustrated in Figures 1(a) and 1(b). Traditional classification algorithms, such as naive Bayes [5], random forest [6], k-nearest neighbors (KNN) [7], and RIPPER [8], aim to generate models that optimize the overall classification accuracy, but they neglect the minority class. To address this problem, many methods have been proposed for binary class imbalanced data at the data level and at the algorithm level. At the data level, the main idea is to transform imbalanced data into balanced data, mainly by sampling or by creating new examples, as in SMOTE and ROS, while algorithm-level solutions primarily include ensemble learning methods [9] and cost-sensitive analysis. Generally speaking, these methods improve the accuracy on the minority classes. However, these traditional imbalanced data classification methods still have some drawbacks when handling binary class imbalanced data [10]. For example, boosting- and bagging-based ensemble methods may lose valuable information during the iteration process owing to the use of sampling, which may in turn cause overfitting. Moreover, it is hard to obtain the optimal misclassification cost in cost-sensitive learning methods, and different misclassification costs lead to different classification results, so the classification results are not stable.

To overcome the above problems, we propose an adaptive ensemble method that improves on an existing ensemble method [10]. Our main idea is to transform the imbalanced binary problem into multiple balanced problems, which neither reduces the number of majority class instances nor increases the number of minority class instances. We then build multiple base classifiers to deal with these balanced problems and, lastly, use an adaptive ensemble rule to combine the base classification results obtained from the base classifiers. Common ensemble rules, including the Max Rule, Majority Vote Rule, Product Rule, Min Rule, and Sum Rule, were proposed in [11], and several novel ensemble rules, including the MaxDistance Rule, MinDistance Rule, ProDistance Rule, MajDistance Rule, and SumDistance Rule, were put forward in [10]. The test results in [10] indicate that their methods perform better than many conventional imbalanced data processing methods over standard imbalanced data sets. Their experiments also showed that SplitBal + MaxDistance performs better than the other combinations. Throughout this paper, SplitBal + MaxDistance is referred to as SMD.

We make two improvements to SMD and refer to the resulting method as SplitBal + MaxDistanceandAvePr (SMDA), which shares the same process as SMD except for the ensemble rule. Using base classification algorithms including naive Bayes, random forest, logistic regression, and SVC [12], the proposed method is empirically evaluated over 38 highly imbalanced data sets. The numerical results show that our method is superior to SMD.

The rest of this paper is organized as follows. Section 2 introduces the works related to our research. Section 3 shows the proposed method. Section 4 reports our experimental procedure, describes details on the setup of experiments, and analyzes the processed data results. Finally, in Section 5, we summarize the study and draw the conclusion.

2. Related Works

Over the past decades, the imbalanced data problem has always been a difficult problem in data mining. Other data characteristics, such as data shift [13] and class overlapping [14], can also influence the performance of conventional classification algorithms on imbalanced problems. In this paper, however, we focus on the imbalance between classes.

So far, many measures have been proposed to solve the binary class imbalance problem [10-14, 16-21]. These measures can be broadly categorized into the data level and the algorithm level, as shown in Table 1. Algorithm-level measures adapt the learning algorithm to the imbalanced classes, while data-level measures preprocess the data to turn it from imbalanced into balanced. Our method belongs to the algorithm level, so in this section we introduce some methods at this level.

The algorithm level includes cost-sensitive learning, ensemble learning, and recognition-based learning. Cost-sensitive learning approaches achieve the lowest classification error by adjusting the class misclassification costs. MetaCost [15] is one such algorithm; it wraps a given classification algorithm in a cost-sensitive procedure to make it cost-sensitive. Ensemble learning reduces variance and bias by integrating the results of many classification algorithms on imbalanced data. Representatively, boosting adaptively focuses on the samples that are misclassified, so it can achieve good performance on imbalance problems, while bagging improves classification performance by aggregating the base classifiers. Autoassociation, RIPPER, and recognition-based learning provide discrimination models created from the examples of the target class alone, and they have been shown to be effective in dealing with high-dimensional and complicated binary imbalanced data.
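As a hedged illustration of the cost-sensitive idea (a minimal sketch, not the MetaCost procedure itself), misclassification costs can be approximated in scikit-learn through class weights; the particular weight values and the choice of logistic regression below are illustrative assumptions:

from sklearn.linear_model import LogisticRegression

# Penalize misclassifying the minority (positive) class ten times more heavily
# than the majority class; the 10:1 cost ratio is an illustrative assumption.
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0}, max_iter=1000)
# clf.fit(X_train, y_train)                      # X_train, y_train are placeholders
# y_score = clf.predict_proba(X_test)[:, 1]      # positive-class scores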

However, these ensemble methods may have some unavoidable drawbacks, such as changing the spatial distribution of the raw data or causing overfitting due to the sampling methods. In addition, these ensemble algorithms may lose the connection between the test data and the training data: test data tends to belong to the class whose instances are spatially closer to it, and this distance information is ignored. Our method avoids some of these weaknesses by transforming the imbalanced problem into several balanced ones, so it differs from the existing methods for handling imbalance problems. Furthermore, our method takes the distance between data points into account in the ensemble rule, because, in theory, the closer two points are, the more similar they are, and our ensemble rule is adaptive, which distinguishes it from SMD.

3. Our Proposed Method

Our proposed method includes three parts: data balancing, modeling, and classifying. Figure 2 describes the framework of the proposed method. For data balancing, we first divide the majority class data set into several parts, each of which is equal in size to the minority class. Each part is then combined with the minority class to form a new balanced data set, so that multiple balanced data sets are obtained. For modeling, each new balanced data set is used to create a base classifier with a given base classification algorithm. For classifying, the base classification results are finally fed into an adaptive ensemble classifier to classify the test data. In the modeling component, we directly apply a base algorithm to every balanced data set. In the following, we introduce the data balancing and classifying procedures.

3.1. Data Balancing

Existing measures for balancing imbalanced data usually lead to loss of information as well as overfitting. Our goal is therefore to transform the imbalanced data set into multiple balanced data sets without introducing noise or discarding raw data. In an imbalanced data set, the majority class contains more instances than the minority class. Considering the similarity of instances within a class, we split the majority class data set into multiple subsets (SplitBal), each equal in size to the minority class. Each subset is then combined with the minority class data set to build a new balanced data set, so that multiple balanced data sets are obtained.
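A minimal Python sketch of the SplitBal idea follows; the function name split_bal, the shuffling of majority instances, and the 0/1 labeling of majority/minority examples are our own assumptions rather than details taken from [10]:

import numpy as np

def split_bal(X_maj, X_min, seed=0):
    """Split the majority class into subsets of (roughly) minority size and
    pair each subset with the full minority class to form balanced sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_maj))                  # shuffle majority instances
    n_bins = int(np.ceil(len(X_maj) / len(X_min)))     # number of balanced sets
    balanced_sets = []
    for chunk in np.array_split(idx, n_bins):          # near-equal subsets
        X = np.vstack([X_maj[chunk], X_min])
        y = np.hstack([np.zeros(len(chunk)), np.ones(len(X_min))])  # 0 = majority, 1 = minority
        balanced_sets.append((X, y))
    return balanced_sets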

3.2. Classifying

After modeling, we can build multiple classifiers with the balanced data sets obtained from data balancing and collect their classification results, which must then be combined. Like [9], we make the following assumptions: N balanced binary class data sets are generated, and there are two class labels, c1 and c2. We can therefore obtain N base classifiers with a given base algorithm. The i-th classifier (i = 1, 2, ..., N) classifies the test data as c1 with probability pi(c1) and as c2 with probability pi(c2). Moreover, P(c1) and P(c2) represent the ultimate ensemble results for classes c1 and c2, respectively. Five ensemble rules, together with their descriptions and details, are given in [10]. But these ensemble rules only adopt the classification results while overlooking the connection between the test data and the training data: test data tends to belong to the class whose instances are, on average, closer to it in Euclidean distance. A multilabel classifier that uses neighbor distance was presented in [22]. Considering the average Euclidean distance between the test data and the training data, five novel ensemble rules were proposed in [9]. In these ensemble rules, d(cj) denotes the average Euclidean distance between the new data and the training data with class label cj. Figure 3 illustrates how d(cj) is obtained with KNN. The details of the MaxDistance Rule are shown in Table 2.
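The sketch below shows, under our own naming assumptions (avg_knn_distance, max_distance_rule, and the parameter k), how d(cj) can be computed as the average Euclidean distance from a test point to its k nearest training neighbors of class cj, and how a MaxDistance-style score divides the largest base-classifier probability by d(cj) plus the fixed value; it follows the spirit of Table 2 rather than being a verbatim implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def avg_knn_distance(x, X_class, k=5):
    """Average Euclidean distance from test point x to its k nearest
    training neighbors carrying one given class label."""
    k = min(k, len(X_class))
    nn = NearestNeighbors(n_neighbors=k).fit(X_class)
    dist, _ = nn.kneighbors(x.reshape(1, -1))
    return float(dist.mean())

def max_distance_rule(probs, d, fixed=1.0):
    """MaxDistance-style score for one class: the largest base-classifier
    probability discounted by the average distance d plus a fixed value
    that keeps the denominator away from zero (set to 1 in [10])."""
    return max(probs) / (d + fixed)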

However, in these distance-based rules a fixed value has to be added to the average distance in the denominator in order to prevent the denominator from being 0. The experiments in [10] show that this value can be defined anywhere from 0 to 1, so they add 1 to the distance. We argue, however, that the value added to the average distance in the denominator could also be set to 0, and that it should adapt within the range from 0 to 1 for different classification algorithms. The reasons are as follows. First, the average distance cannot be 0, because, as Figure 3 shows, it is impossible for the new data to coincide with all of the training data. Second, when different base classification algorithms are used, the value added in the denominator should change accordingly. We therefore define the added value as a, which ranges from 0 to 1. From the MaxDistance Rule, we can also see that the effect of the base classification probability is weak, even though it is regarded as the most important decision element in most traditional algorithms. In [22], the EMLA rule (the average of the base classification results) always performs better than the other ensemble rules, so we combine the EMLA with the distance terms d(c1) and d(c2), as shown in Table 3. Finally, the classification results P(c1) and P(c2) are obtained with the ensemble rules in Tables 2 and 3. If P(c1) ≥ P(c2), the test data is classified as c1; otherwise it is classified as c2.
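The sketch below gives one plausible reading of the adaptive rule, where the average of the base probabilities (the "AvePr" part) is discounted by the average KNN distance plus the adaptive value a; the function names and the exact functional form are our assumptions and may differ from Table 3:

import numpy as np

def smda_score(probs, d, a):
    """Tentative SMDA-style score: average base probability ("AvePr")
    divided by the average KNN distance plus the adaptive value a in [0, 1].
    This is our reading of Table 3, not a verbatim reproduction."""
    return np.mean(probs) / (d + a)

def classify(probs_c1, probs_c2, d_c1, d_c2, a):
    """Assign the test data to the class with the larger ensemble score."""
    return "c1" if smda_score(probs_c1, d_c1, a) >= smda_score(probs_c2, d_c2, a) else "c2"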

4. Numerical Simulation

In this paper, we adopt 38 public imbalanced data sets from the Keel data set repository [23]. The details of these 38 data sets are shown in Table 4, including the imbalance ratio, the number of attributes (ATT), the total number of instances, and the number of minority (positive) class instances. For more detailed information about the employed data sets, interested readers are referred to http://sci2s.ugr.es/keel/imbalanced.php.

We use a 5-fold cross-validation strategy in the following experiments. Four different base classification algorithms, naive Bayes, random forest, logistic regression, and SVC, were selected as base classifiers. We use AUC [24] as our evaluation metric, which has advantages over G-Mean and F-Measure [25]. In our experiments, the AUC on every data set is computed repeatedly and then averaged.
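A minimal sketch of this evaluation protocol, assuming numpy arrays X and y and a user-supplied build_and_predict callback (a placeholder, not part of the original method) that trains on each training fold and returns positive-class scores for the test fold:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_auc(build_and_predict, X, y, n_splits=5, seed=0):
    """Stratified 5-fold cross-validation, averaging the AUC over folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in skf.split(X, y):
        scores = build_and_predict(X[tr], y[tr], X[te])   # positive-class scores
        aucs.append(roc_auc_score(y[te], scores))
    return sum(aucs) / len(aucs)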

Our study consists of two experiments. The first experiment determines the added value a in our ensemble rule. The second experiment compares the proposed SMDA method with the SMD method when handling imbalanced binary problems using different base classification algorithms.

Experiment 1. We first use the data set yeast3 (shown in Table 5) to test the AUC of our method with different values of a in our ensemble rules. We then choose the value that yields the best AUC. In the experiment, the candidate values are 0, 0.2, 0.4, 0.6, 0.8, and 1.0. From Figure 4, we can see that a should be set to 1, 0, 1, and 0 when the base classification algorithms are naive Bayes, random forest, logistic regression, and SVC, respectively.
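A small sketch of this selection step, where evaluate_auc is a hypothetical callback that runs the SMDA pipeline on yeast3 with a given value of a and returns the cross-validated AUC:

def select_added_value(evaluate_auc, candidates=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Pick the added value a that maximizes the cross-validated AUC."""
    return max(candidates, key=evaluate_auc)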

Experiment 2. Performance is evaluated by comparing SMDA and SMD. For every imbalanced data set, the detailed AUC values of both methods with the four different base classification algorithms are shown in Table 6. The last row gives the average AUC values of the two methods with each classification algorithm. From Figure 5, we can observe that 33 AUC values of SMDA are greater than or equal to those of SMD when using logistic regression. In addition, 25 AUC values of SMDA are greater than those of SMD when using SVC, and they are greater than or equal to those of SMD when using random forest. It is also noticed that 30 AUC values of SMDA are greater than or equal to those of SMD when using naive Bayes. In Figure 6, we can see that the average AUC values of our method are greater than those of SMD overall. Therefore, we can conclude that SMDA performs better than SMD on the data sets mentioned above.

5. Conclusion

An adaptive ensemble method based on spatial characteristics for dealing with binary class imbalance problems has been presented in this paper. Different from the existing methods mentioned in this paper, our method uses an adaptive ensemble rule to deal with the imbalanced binary problem. Furthermore, our method neither alters the raw data distribution nor suffers from unexpected errors or data loss.

Our method applies random splitting to the majority class instances to transform the imbalanced binary class data into multiple balanced binary class data sets. After that, we use a base classification algorithm to build multiple base classifiers. Finally, we use the proposed adaptive ensemble rule to combine the classification results obtained from the base classifiers. The experimental results show that (i) the value a added to the distance in our method is adaptive, changing with the classification algorithm and ranging from 0 to 1, and (ii) our ensemble rule SMDA performs better than SMD, so the proposed method currently performs better than the existing methods mentioned in this paper.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Lei Wang and Lei Zhao conceived and designed the experiments; Lei Zhao and Guan Gui performed the experiments; Baoyu Zheng and Ruochen Huang analyzed the data and also gave comprehensive comments and suggestions; Lei Wang wrote the paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grants nos. 61671253 and 61401069), the Priority Academic Development Program of Jiangsu Higher Education Institutions, China, the Major Projects of the Natural Science Foundation of the Jiangsu Higher Education Institutions (16KJA510004), the Open Research Fund of the State Key Laboratory of Integrated Services Networks, Xidian University (ISN17-04), and the Open Research Fund of National Local Joint Engineering Research Center for Communication and Network Technology, Nanjing University of Posts and Telecommunications (TXKY17005).