Abstract

Learning from imbalanced data is one of the most challenging tasks in machine learning. Recently, ensemble learning has arisen as an effective solution to class imbalance problems. The combination of bagging and boosting with data resampling preprocessing, namely, the simple yet accurate exploratory undersampling, has become the most popular approach for imbalanced data classification. In this paper, we propose a novel selective ensemble construction method based on exploratory undersampling, RotEasy, which reduces storage requirements and improves computational efficiency through ensemble pruning. Our methodology aims to enhance the diversity between individual classifiers through feature extraction and diversity regularized ensemble pruning. We made a comprehensive comparison between our method and several state-of-the-art imbalanced learning methods. Experimental results on 20 real-world imbalanced data sets show that RotEasy achieves a significant increase in performance, as confirmed by a nonparametric statistical test and various evaluation criteria.

1. Introduction

Recently, classification with imbalanced data sets has emerged as one of the most challenging tasks in the data mining community. Class imbalance occurs when examples of one class are severely outnumbered by those of the other classes. When data are imbalanced, traditional data mining algorithms tend to favor the overrepresented (majority or negative) class, resulting in unacceptably low recognition rates for the underrepresented (minority or positive) class. However, the underrepresented minority class usually represents the concept of greater interest, so the classification accuracy of the minority class matters more than that of the majority class. For instance, the recognition goal of medical diagnosis is to provide higher identification accuracy for rare diseases. Like most existing imbalanced learning methods in the literature, we focus on two-class imbalanced classification problems in the current study.

Class imbalance problems have appeared in many real-world applications, such as fraud detection [1], anomaly detection [2], medical diagnosis [3], and DNA sequence analysis [4]. On account of the prevalence of potential applications, a large number of techniques have been developed to deal with class imbalance problems; interested readers can refer to several review papers [5–7]. These proposals can be divided into three categories, depending on the way they work.
(i) External approaches at the data level: these methods resample the data in order to decrease the effect of the imbalanced class distribution. They can be broadly categorized into two groups: undersampling the majority class and oversampling the minority class [8, 9]. They have the advantage of being independent of the classifier used, so they are considered resampling preprocessing techniques.
(ii) Internal approaches at the algorithmic level: these approaches adapt the decision threshold to impose a bias toward the minority class, or adjust the misclassification costs for each class in the learning process [10–12]. They are more dependent on the problem and the classifier used.
(iii) Combined approaches based on data preprocessing and ensemble learning, most commonly boosting and bagging: these usually apply data preprocessing techniques before ensemble learning.

The third group has arisen as a popular family of methods for imbalanced data classification, mainly due to its ability to significantly improve the performance of a single classifier. In general, three kinds of ensemble patterns are integrated with data preprocessing techniques: boosting-based ensembles, bagging-based ensembles, and hybrid ensembles. Methods in the first, boosting-based category alter the weight distribution to bias it toward the minority class when training the next classifier; they include SMOTEBoost [13], RUSBoost [14], and RAMOBoost [15]. In the second, bagging-based category, the main difference between methods lies in how the instances of each class are taken into account when they are randomly drawn in each bootstrap sample. Proposals include UnderBagging [16] and SMOTEBagging [17].

The main characteristic of the third category is hierarchical ensemble learning, combining both bagging and boosting with a resampling preprocessing technique. The simplest method in this group is exploratory undersampling, proposed by Liu et al. [18] and also known as EasyEnsemble. It uses bagging as the main ensemble learning framework, and each bag member is itself an AdaBoost ensemble classifier. Hence, it combines the merits of boosting and bagging and strengthens the diversity of the ensemble. Empirical studies confirm that EasyEnsemble is highly effective in dealing with imbalanced data classification tasks.

It is widely recognized that diversity among individual classifiers is pivotal to the success of an ensemble learning system. Rodriguez et al. [19] proposed a novel extension of bagging, rotation forest, which promotes diversity within the ensemble through feature extraction based on principal component analysis (PCA). Moreover, many ensemble pruning techniques have been developed to select more diverse subensembles. For example, Li et al. [20] proposed a diversity regularized ensemble pruning method, DREP, which greatly improves the generalization capability of ensemble classifiers.

Motivated by the above analysis, we propose a novel ensemble construction technique, RotEasy, designed to enhance the diversity between component classifiers. The main idea of RotEasy is to inherit the advantages of EasyEnsemble and rotation forest by integrating them. We conducted a comprehensive suite of experiments on 20 real-world imbalanced data sets, which provides a complete perspective on the performance of the proposed algorithm. Experimental results indicate that our approach significantly outperforms the compared state-of-the-art imbalanced learning methods.

The remainder of this paper is organized as follows. Section 2 presents some related learning algorithms in order to facilitate later discussions. In Section 3, we describe in detail the proposed methodology and its rationale. Section 4 introduces the experimental framework, including the experimental data sets, the compared methods, and the performance evaluation criteria used. In Section 5, we show and discuss the experimental results. Finally, conclusions and some future work are outlined in Section 6.

2. Related Learning Algorithms

In order to facilitate later discussions, we give a brief introduction to exploratory undersampling, rotation forest, and the DREP ensemble pruning method.

2.1. Exploratory Undersampling

Undersampling is an efficient method for handling class imbalance problems, which uses only a subset of the majority class. Since many majority examples are ignored, the training set becomes more balanced and the training process becomes faster. However, potentially useful information contained in the ignored majority examples is lost. Liu et al. [18] proposed exploratory undersampling, also known as EasyEnsemble, to further exploit these ignored examples while keeping the fast training speed.

Given a minority set $\mathcal{P}$ and a majority set $\mathcal{N}$, with $|\mathcal{P}| < |\mathcal{N}|$, EasyEnsemble independently samples several subsets $\mathcal{N}_1, \mathcal{N}_2, \ldots, \mathcal{N}_T$ from $\mathcal{N}$, where $|\mathcal{N}_i| = |\mathcal{P}|$. For each majority subset $\mathcal{N}_i$ combined with the minority set $\mathcal{P}$, AdaBoost [22] is used to train a base classifier $H_i$. All generated base classifiers are fused by weighted voting for the final decision. The pseudocode for EasyEnsemble is shown in Algorithm 1.

(i)   Input: A minority training set $\mathcal{P}$ and a majority training set $\mathcal{N}$, $|\mathcal{P}| < |\mathcal{N}|$. $T$: the number of
subsets undersampled from $\mathcal{N}$; $s_i$: the number of iterations in AdaBoost learning.
(ii)  Training Phase:
(iii) For $i = 1$ to $T$ do
  (1) Randomly sample a subset $\mathcal{N}_i$ from $\mathcal{N}$, $|\mathcal{N}_i| = |\mathcal{P}|$.
  (2) Learn an ensemble classifier $H_i$ using $\mathcal{P}$ and $\mathcal{N}_i$. $H_i$ is an AdaBoost ensemble with
   $s_i$ weak classifiers $h_{i,j}$, corresponding weights $\alpha_{i,j}$ and threshold $\theta_i$:
         $H_i(x) = \operatorname{sgn}\bigl(\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \theta_i\bigr)$.
(iv) Endfor
(v)   Output: The final ensemble:
         $H(x) = \operatorname{sgn}\bigl(\sum_{i=1}^{T} \sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \sum_{i=1}^{T} \theta_i\bigr)$.
Here, $H(x) = +1$ means that $x$ is predicted as the positive class. Conversely, it means that $x$
belongs to the negative class.

EasyEnsemble generates $T$ balanced subproblems, in which the $i$th subproblem is to learn an AdaBoost ensemble $H_i$, so it looks like an “ensemble of ensembles.” It is well known that boosting mainly reduces bias, while bagging mainly reduces variance. EasyEnsemble thus benefits from the good qualities of boosting and of a bagging-like strategy with balanced class distribution.

Experimental results in [18] show that EasyEnsemble has higher AUC, F-measure, and G-mean values than many existing imbalanced learning methods. Moreover, EasyEnsemble has approximately the same training time as undersampling, which is significantly faster than that of other algorithms.
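
To make the training loop concrete, the following is a minimal Python sketch of the EasyEnsemble procedure, built on scikit-learn's AdaBoostClassifier (the `estimator` keyword assumes a recent scikit-learn version). The function names, the decision-stump weak learner, and the score-averaging fusion are our simplifications, not the authors' exact implementation, which also learns the thresholds $\theta_i$:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble_fit(X_min, X_maj, T=4, n_rounds=10, seed=0):
    """Train T AdaBoost ensembles, each on the full minority set
    plus an equally sized random subset of the majority set."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(T):
        # Undersample the majority class down to the minority size.
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X = np.vstack([X_min, X_maj[idx]])
        y = np.hstack([np.ones(len(X_min)), -np.ones(len(X_min))])
        clf = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=1),
            n_estimators=n_rounds)
        members.append(clf.fit(X, y))
    return members

def easy_ensemble_predict(members, X):
    # Fuse the inner ensembles by summing their signed scores.
    scores = sum(m.decision_function(X) for m in members)
    return np.where(scores >= 0, 1, -1)
```

Because each inner ensemble sees a balanced sample, every majority example has a chance of being used by some member, which is the sense in which the ignored examples are further exploited.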

2.2. Rotation Forest

Bagging consists of training different classifiers on multiple bootstrapped replicas of the original training data. The only factor encouraging diversity between individual classifiers is the proportion of different samples in the training data, so bagging tends to generate ensembles of low diversity. Hence, Rodriguez et al. [19] proposed a novel extension of bagging, rotation forest, which promotes diversity within the ensemble through feature extraction based on principal component analysis (PCA).

Each iteration of the rotation forest algorithm consists of randomly splitting the feature set into $K$ subsets, running PCA-based feature extraction separately on each subset, and then reassembling a new extracted feature set while keeping all the principal components. A decision tree classifier is trained on the transformed data set. Different splits of the feature set lead to different rotations, so diverse classifiers are obtained. On the other hand, the information about the scatter of the data is completely preserved in the new space of extracted features. In this way, accurate and more diverse classifiers are built.

In the study of Rodriguez et al. [19], an analysis with kappa-error diagrams showed that rotation forest has a diversity-accuracy pattern similar to that of bagging, but is slightly more diverse. Hence, rotation forest promotes diversity within the ensemble through feature extraction. The pseudocode of rotation forest is listed in Algorithm 2.

(i)   Input: $X$: the objects in the training data set (an $N \times n$ matrix).
     $Y$: the class labels of the training set (an $N \times 1$ matrix).
     $L$: number of classifiers in the ensemble.
     $K$: number of feature subsets.
     $\Omega = \{\omega_1, \ldots, \omega_c\}$: the set of class labels.
(ii)  Training Phase:
(iii) For $i = 1$ to $L$ do
   (1) Calculate the rotation matrix $R_i^a$:
    (a) Randomly split the feature set $F$ into $K$ subsets $F_{i,j}$, $j = 1, \ldots, K$.
    (b) For $j = 1$ to $K$ do
      Let $X_{i,j}$ be the data set for the features in $F_{i,j}$.
      Select a bootstrap sample $X'_{i,j}$ of 75% of the number of objects in $X_{i,j}$.
      Apply PCA on $X'_{i,j}$ and store the component coefficients in a matrix $C_{i,j}$.
    (c) Endfor
    (d) Arrange the $C_{i,j}$ into a block diagonal matrix $R_i$.
    (e) Construct $R_i^a$ by rearranging the columns of $R_i$ to match the order of features in $F$.
  (2) Build the classifier $D_i$ using $X R_i^a$ as the training set.
(iv) Endfor
(v)  Output: For a given $x$, calculate its class label assigned by the ensemble classifier:
        $H(x) = \arg\max_{\omega \in \Omega} \sum_{i=1}^{L} I\bigl(D_i(x R_i^a) = \omega\bigr)$,
where $I(\cdot)$ is an indicator function.
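
The heart of the algorithm is step (1), the rotation matrix. Below is a minimal sketch of that step under our own naming (build_rotation_matrix), assuming numpy and scikit-learn's PCA, and omitting the class-elimination refinement used in some rotation forest variants:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_rotation_matrix(X, K, rng):
    """Build one rotation matrix R_i^a as in Algorithm 2: split the
    features into K subsets, run PCA on a 75% bootstrap sample of
    each subset, and assemble the loadings into a block-diagonal
    matrix aligned with the original feature order."""
    n_features = X.shape[1]
    perm = rng.permutation(n_features)      # random feature split
    subsets = np.array_split(perm, K)
    R = np.zeros((n_features, n_features))
    for features in subsets:
        n_boot = int(0.75 * X.shape[0])      # 75% bootstrap sample
        rows = rng.choice(X.shape[0], size=n_boot, replace=True)
        pca = PCA()  # keep all components; assumes n_boot >= len(features)
        pca.fit(X[np.ix_(rows, features)])
        # Place the loadings so that row/column indices match the
        # positions of these features in the original feature set.
        R[np.ix_(features, features)] = pca.components_.T
    return R
```

Each classifier $D_i$ is then trained on the rotated data X @ R; repeating the procedure with different random feature splits yields the diverse members of the forest.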

2.3. DREP Ensemble Pruning

With the goal of reducing storage requirements and improving computational efficiency, ensemble pruning deals with the problem of reducing ensemble size. Furthermore, theoretical and empirical studies have shown that ensemble pruning can also improve the generalization performance over that of the complete ensemble.

Guided by a theoretical analysis of the effect of diversity on generalization performance, Li et al. [20] proposed the Diversity Regularized Ensemble Pruning (DREP) method, a greedy forward ensemble pruning method with explicit diversity regularization. The pseudocode of the DREP method is presented in Algorithm 3.

(i) Input: $H = \{h_1, \ldots, h_n\}$: ensemble to be pruned; $V$:
validation data set; $\rho \in (0, 1)$: the tradeoff parameter.
(ii) Output: pruned ensemble $H^*$.
  (1)  Initialize $H^* \leftarrow \emptyset$.
  (2)  $h \leftarrow$ the classifier in $H$ with the lowest error on $V$.
  (3)  $H^* \leftarrow \{h\}$ and $H \leftarrow H \setminus \{h\}$.
  (4)  repeat
  (5)   for each $h' \in H$ do
  (6)   compute $d_{h'} \leftarrow \mathrm{diff}(h', H^*)$
  (7)   endfor
  (8)   sort the classifiers $h' \in H$ in the ascending order of $d_{h'}$.
  (9)   $\Gamma \leftarrow$ the first $\lceil \rho \cdot |H| \rceil$ classifiers in the sorted list.
  (10)  $h \leftarrow$ the classifier in $\Gamma$ whose inclusion most reduces the error of $H^*$ on $V$.
  (11)  $H^* \leftarrow H^* \cup \{h\}$ and $H \leftarrow H \setminus \{h\}$
  (12) until the error of $H^*$ on $V$ cannot be reduced.

In Algorithm 3, the diversity is measured by the pairwise difference between classifiers whose outputs lie in $\{-1, +1\}$, defined on the validation set $V$ as $\mathrm{diff}(h, h') = \frac{1}{|V|} \sum_{x \in V} h(x)\,h'(x)$, where smaller values indicate larger diversity. Starting with the classifier with the lowest error on the validation set $V$, the DREP method iteratively selects the best classifier based on both empirical error and diversity. Concretely, at each step it first sorts the candidate classifiers in the ascending order of their differences with the current subensemble, and then, from the front part of the sorted list, it selects the classifier which most reduces the empirical error on the validation set. These two criteria are balanced by the parameter $\rho$, that is, the fraction of classifiers considered when minimizing the empirical error. Obviously, a large $\rho$ puts more emphasis on the empirical error, while a small $\rho$ pays more attention to diversity. It can thus be expected that the obtained ensemble will have both large diversity and small empirical error.

Experimental results show that, with the help of diversity regularization, DREP achieves significantly better generalization performance with smaller ensemble sizes than other ensemble pruning methods.
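
As an illustration, here is a compact sketch of the DREP loop under the $\{-1, +1\}$-output convention; drep_prune is our name, majority-vote ties and other edge cases are handled simplistically, and the agreement-based diff follows the reconstruction given above:

```python
import numpy as np

def drep_prune(preds, y, rho):
    """Greedy diversity-regularized pruning (Algorithm 3).
    preds: (n_classifiers, n_samples) array of +/-1 predictions
    on the validation set; y: +/-1 true labels."""
    def error(subset):
        vote = np.sign(preds[subset].sum(axis=0))
        return np.mean(vote != y)

    remaining = set(range(len(preds)))
    # Start with the individually most accurate classifier.
    best = min(remaining, key=lambda i: np.mean(preds[i] != y))
    chosen, _ = [best], remaining.discard(best)
    while remaining:
        # diff(h, H*): mean agreement with the current subensemble
        # vote; smaller values indicate larger diversity.
        vote = np.sign(preds[chosen].sum(axis=0))
        cand = sorted(remaining, key=lambda i: np.mean(preds[i] * vote))
        front = cand[:max(1, int(np.ceil(rho * len(cand))))]
        nxt = min(front, key=lambda i: error(chosen + [i]))
        if error(chosen + [nxt]) >= error(chosen):
            break  # empirical error can no longer be reduced
        chosen.append(nxt)
        remaining.discard(nxt)
    return chosen
```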

3. RotEasy: A New Selective Ensemble Algorithm Based on EasyEnsemble and Rotation Forest

Based on the above analysis, we propose a novel selective ensemble construction technique, RotEasy, which integrates feature extraction and ensemble pruning with EasyEnsemble to further improve ensemble diversity.

The main steps of RotEasy can be summarized as follows. Firstly, a subset of size $|\mathcal{P}|$ is undersampled from the majority class. Secondly, we construct an inner-layer ensemble by integrating rotation forest and AdaBoost. Lastly, the DREP method is used to prune the learned ensemble with the aim of enhancing its diversity. The pseudocode of the RotEasy method is listed in Algorithm 4.

(i)  Input: $\mathcal{N}$: the majority set; $\mathcal{P}$: the minority set; $T$: the number of subsets undersampled from $\mathcal{N}$;
    $S$: the number of inner-layer ensemble iterations; $V$: validation data set; $\rho$: tradeoff parameter;
    $\Omega$: the set of class labels.
(ii)  For $i = 1$ to $T$ do
  (a) Randomly undersample a subset $\mathcal{N}_i$ from $\mathcal{N}$, $|\mathcal{N}_i| = |\mathcal{P}|$.
  (b) Learn the inner-layer ensemble $H_i$:
    (1) Set $T_i = \mathcal{P} \cup \mathcal{N}_i$ with $m = |T_i|$, the weak classifier $h_{i,j}$, and the initial weight
     distribution on the training set as $D_1(k) = 1/m$, $k = 1, \ldots, m$.
    (2) for $j = 1$ to $S$ do
    (3) Calculate the rotation matrix $R_{i,j}^a$ using $T_i$, based on Algorithm 2.
    (4) Get the sampling subset $T_{i,j}$ from $T_i$ using the weight distribution $D_j$.
    (5) Learn $h_{i,j}$ by providing the transformed subset $T_{i,j} R_{i,j}^a$ as the input of the base classifier.
    (6) Calculate the training error of $h_{i,j}$ over $T_i$: $\epsilon_{i,j} = \sum_{k=1}^{m} D_j(k)\, I\bigl(h_{i,j}(x_k R_{i,j}^a) \neq y_k\bigr)$.
    (7) Set the weight $\alpha_{i,j}$: $\alpha_{i,j} = \frac{1}{2} \ln\bigl((1 - \epsilon_{i,j})/\epsilon_{i,j}\bigr)$.
    (8) Update $D_j$ over $T_i$: $D_{j+1}(k) = D_j(k) \exp\bigl(-\alpha_{i,j} y_k h_{i,j}(x_k R_{i,j}^a)\bigr) / Z_j$,
       where $Z_j$ is the normalization constant: $Z_j = \sum_{k=1}^{m} D_j(k) \exp\bigl(-\alpha_{i,j} y_k h_{i,j}(x_k R_{i,j}^a)\bigr)$.
    (9) Endfor
(iii) Endfor
(iv)  Pruning: Apply the DREP method on the validation set $V$ to prune the ensemble $\{h_{i,j}\}$.
Denote the pruned ensemble members as $h_1, \ldots, h_p$, their corresponding normalized weights $\alpha_1, \ldots, \alpha_p$,
and rotation matrices $R_1^a, \ldots, R_p^a$.
(v)  Classification Phase: For a given $x$, calculate its class label as follows:
        $H(x) = \operatorname{sgn}\bigl(\sum_{l=1}^{p} \alpha_l h_l(x R_l^a)\bigr)$.

It should be pointed out that some parameters in RotEasy need to be specified in advance. The values of $T$ and $S$ are set in the same manner as in EasyEnsemble. As for the validation set $V$, we randomly split the training set into two parts of approximately the same size: one part is used to train the ensemble members, and the other is used to prune the ensemble. The best value of the parameter $\rho$ can be found by a line-search strategy. In fact, the performance of RotEasy is very robust to the variation of $\rho$, which will be confirmed in the later experimental analysis.
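
The parameter-selection protocol just described can be sketched as follows; fit_predict is a hypothetical callable standing in for training RotEasy at a given $\rho$ and scoring the validation half, and the AUC criterion is one possible choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def select_rho(X, y, fit_predict, grid=None, seed=0):
    """Line search for the DREP tradeoff parameter rho.
    fit_predict(X_tr, y_tr, X_val, y_val, rho) is assumed to train
    RotEasy with the given rho and return scores for X_val."""
    if grid is None:
        grid = np.arange(0.05, 1.0, 0.05)   # step of 0.05
    # Half of the training data trains the members; the other
    # half drives the pruning and the evaluation of rho.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    scores = [roc_auc_score(y_val, fit_predict(X_tr, y_tr, X_val, y_val, r))
              for r in grid]
    return grid[int(np.argmax(scores))]
```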

4. Experimental Framework

In this section, we present the experimental framework used to examine the performance of our proposed RotEasy method and compare it with some state-of-the-art imbalanced learning methods.

4.1. Experimental Data Sets

To evaluate the effectiveness of the proposed method, extensive experiments were carried out on 20 public imbalanced data sets from the UCI repository. In order to ensure a thorough performance assessment, the chosen data sets vary in sample size, class distribution, and imbalance ratio.

Table 1 summarizes the properties of the data sets: the number of examples, the number of attributes, the sample sizes of the minority and majority classes, and the imbalance ratio, that is, the sample size of the majority class divided by that of the minority class. The data sets are sorted by imbalance ratio in ascending order. Several multiclass data sets were converted into two-class problems by keeping one class as the positive class and joining the remaining classes into the negative class.

4.2. Benchmark Methods

Regarding ensemble-based imbalanced learning algorithms, we compare our RotEasy approach with several competitive algorithms, including RUSBoost [14], SMOTEBoost [13], UnderBagging [16], SMOTEBagging [17], AdaCost [10], RAMOBoost [15], rotation forest [19], and EasyEnsemble [18].

In our experiments, we use the classification and regression tree (CART) as the base classifier in all compared methods, because it is sensitive to changes in the training samples and can still be very accurate. We set the total number of base classifiers in each ensemble to be 100. The benchmark methods and their parameters are described as follows; a sketch of the SMOTE interpolation step used by several of them is given at the end of this subsection.
(1) CART. It is implemented by the “classregtree” function with default parameter values in MATLAB.
(2) RUSBoost (ab. RUSB). A majority subset $\mathcal{N}_i$ is sampled (without replacement) from $\mathcal{N}$, with $|\mathcal{N}_i| = |\mathcal{P}|$. Then, AdaBoost is used to train an ensemble classifier using $\mathcal{P}$ and $\mathcal{N}_i$.
(3) SMOTEBoost (ab. SMOB). It first uses SMOTE to generate new minority class examples, so that both classes contribute to the training data with $|\mathcal{N}|$ instances. Then AdaBoost is used to train the ensemble classifiers using the new minority samples and the majority samples.
(4) UnderBagging (ab. UNBag). It removes instances from the majority class by random undersampling (without replacement) in each bagging member. Both classes contribute to each iteration with $|\mathcal{P}|$ instances.
(5) SMOTEBagging (ab. SMBag). Both classes contribute to each bag with $|\mathcal{N}|$ instances. In each bag, a SMOTE resampling rate ($a\%$) is set, ranging from 10% in the first iteration to 100% in the last; this ratio defines the number of positive instances randomly resampled (with replacement) from the original positive class, and the remaining positive instances are generated by the SMOTE algorithm.
(6) AdaCost (ab. AdaC). The cost factors of the positive and negative instances are set according to the study of Yin et al. [23].
(7) RAMOBoost (ab. RAMO). Following the suggestions of [15], the numbers of nearest neighbors used for adjusting the minority sampling probability and for generating the synthetic instances are set accordingly, and the scaling coefficient is set to be 0.3.
(8) Rotation forest (ab. RotF). The feature set is randomly split into $K$ subsets, and PCA is applied to a bootstrap sample of each subset.
(9) EasyEnsemble (ab. Easy). It first randomly undersamples (without replacement) the majority class in each outer-layer iteration; then AdaBoost is used to train the inner-layer ensemble classifier. The numbers of sampled subsets and of AdaBoost iterations are chosen so that the total ensemble size is 100.
(10) Unpruned RotEasy (ab. RotE-un). The inner-layer ensemble is constructed by integrating rotation forest and AdaBoost, with the same numbers of undersampled subsets and inner iterations.
(11) Our proposed method (ab. RotEasy). The same construction as RotE-un, after which the DREP method is applied on the validation subset to prune the ensemble.

For RotE-un and RotEasy, we randomly split the training data into two halves: one as the training set and the other as the validation set. The parameter $\rho$ is selected by a line search with an interval of 0.05.
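
For reference, the core interpolation step of SMOTE, on which SMOB and SMBag rely, can be sketched as follows; smote_sample is our name, and k = 5 is shown only as a common default, not necessarily the value used in these experiments:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each seed point and one of its k nearest minority
    neighbors (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)        # idx[:, 0] is the point itself
    seeds = rng.integers(0, len(X_min), size=n_new)
    neighbors = idx[seeds, rng.integers(1, k + 1, size=n_new)]
    gaps = rng.random((n_new, 1))        # interpolation coefficients
    return X_min[seeds] + gaps * (X_min[neighbors] - X_min[seeds])
```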

4.3. Evaluation Measures

The evaluation criterion plays a crucial role in both the guidance of classifier modeling and the assessment of classification performance. Traditionally, total accuracy is the most commonly used empirical metric. However, accuracy is no longer a proper measure in the class imbalance problem, since the positive class makes little contribution to the overall accuracy.

For the two-class problem we consider here, the confusion matrix records the results of correctly and incorrectly classified examples of each class. It is shown in Table 2.

Specifically, we obtain the following performance evaluation metrics from the confusion matrix:
True positive rate: the percentage of positive instances correctly classified, $TP_{rate} = TP/(TP + FN)$, also known as Recall;
True negative rate: the percentage of negative instances correctly classified, $TN_{rate} = TN/(TN + FP)$;
False positive rate: the percentage of negative instances misclassified, $FP_{rate} = FP/(TN + FP)$;
False negative rate: the percentage of positive instances misclassified, $FN_{rate} = FN/(TP + FN)$;
F-measure: the harmonic mean of Recall and Precision, where $Precision = TP/(TP + FP)$: $F\text{-}measure = 2 \cdot Recall \cdot Precision/(Recall + Precision)$;
G-mean: the geometric mean of $TP_{rate}$ and $TN_{rate}$: $G\text{-}mean = \sqrt{TP_{rate} \times TN_{rate}}$;
AUC: the area under the receiver operating characteristic (ROC) curve. AUC provides a single measure of classification performance for evaluating which model is better on average.
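
These measures are straightforward to compute from the confusion matrix; a small sketch follows, assuming the minority class is labeled 1 and a continuous score is available for the AUC:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Compute the evaluation measures of Section 4.3 for a
    two-class problem with the minority (positive) class labeled 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred,
                                      labels=[0, 1]).ravel()
    tp_rate = tp / (tp + fn)             # recall / sensitivity
    tn_rate = tn / (tn + fp)             # specificity
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tp_rate / (precision + tp_rate)
                 if precision + tp_rate else 0.0)
    g_mean = np.sqrt(tp_rate * tn_rate)
    auc = roc_auc_score(y_true, y_score)
    return {"TPrate": tp_rate, "TNrate": tn_rate,
            "F-measure": f_measure, "G-mean": g_mean, "AUC": auc}
```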

5. Experimental Results and Analysis

This section shows the experimental results and the associated statistical analysis for the comparison with standard imbalanced learning algorithms. All reported results are obtained by ten trials of stratified 10-fold cross-validation. That is, the data set is split into 10 folds, each containing 10% of the data. For each fold, each algorithm is trained on the examples of the remaining folds, and the performance measured on the held-out fold is taken as the result. For each data set, we report the mean of the 100 resulting performance values as the final result.
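
This protocol corresponds to repeated stratified cross-validation; a minimal sketch follows, where fit_score is a hypothetical callable that trains one algorithm and returns its performance on the held-out fold:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def cross_validate(X, y, fit_score, seed=0):
    """Ten trials of stratified 10-fold cross-validation;
    fit_score(X_tr, y_tr, X_te, y_te) is assumed to train a model
    and return one performance value (e.g., AUC) on the test fold."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10,
                                 random_state=seed)
    results = [fit_score(X[tr], y[tr], X[te], y[te])
               for tr, te in cv.split(X, y)]
    return np.mean(results)  # mean of the 100 fold-level results
```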

Firstly, we investigated the sensitivity of the proposed RotEasy algorithm with respect to the variation of the hyperparameter $\rho$.

5.1. Sensitivity of the Hyperparameter $\rho$

In the DREP ensemble pruning method, the parameter $\rho$ trades off ensemble diversity against empirical error. We should therefore first examine the influence of $\rho$ on the algorithm's performance. To do so, we considered a range of values of $\rho$ with an increment of 0.05 in this study.

Figure 1 shows the curves of the performance results as a function of the parameter $\rho$ on several training data sets, based on the AUC, G-mean, and F-measure evaluation metrics, respectively.

As seen in Figure 1, the performance of RotEasy varies only by a small margin with the change of $\rho$. Thus, the proposed RotEasy algorithm is insensitive to the variation of $\rho$, and it is reasonable to fix $\rho$ to 0.5 in the subsequent experiments.

5.2. Performance Comparison

In this section, we compare our proposed RotEasy against the previously presented state-of-the-art methods. Before going through further analysis, we first show the AUC, G-mean, and F-measure values of all methods on each data set in Tables 3, 4, and 5, respectively. We also draw box plots of the test results for all methods on the “Scrapie” data set in Figure 2. In this figure, the numbers on the horizontal axis indicate the corresponding algorithms introduced in Section 4.2. The relative performance of all methods can be clearly seen from these box plots.

It is obvious from Tables 3–5 that RotEasy always obtains the highest average values of AUC, G-mean, and F-measure, outperforming all other methods by a large margin. Furthermore, EasyEnsemble and the unpruned RotEasy (RotE-un) achieve better performance than the other benchmark methods, but RotEasy still outperforms them by a clear margin and is the best algorithm.

Moreover, we also investigated the computational efficiency of the newly proposed RotEasy algorithm, by recording the running time of all algorithms and the pruned ensemble size of RotEasy on all data sets. These results are listed in Table 6. From the last column of Table 6, we can see that the size of the pruned ensemble drops from 100 to around 30. This greatly improves the computational efficiency of RotEasy in the prediction stage, particularly for large-scale classification problems.

The average running time of RotEasy is among the shortest of all methods, comparable to that of EasyEnsemble and UnderBagging. The RAMOBoost algorithm has the longest running time. The other algorithms can be ranked from fast to slow as RUSB, AdaC, CART, RotF, SMOB, and SMBag.

5.3. Statistical Tests

In order to show whether the newly proposed method offers a significant improvement over the other methods, we give the comparison statistical support. A popular way to compare overall performances is to count the number of problems on which each algorithm is the winner. Some authors use these counts in inferential statistics with a form of two-tailed binomial test, also known as the sign test.

Here, we employ the sign test as utilized by Webb [24] to compare the relative performance of all considered algorithms. In the following description, “row” indicates the mean performance of the algorithm with which a row is labeled, while “col” indicates that of the algorithm with which a column is labeled. The first row represents the mean performance across all data sets. One set of rows represents the geometric mean of the performance ratio col/row. Another reports the win-tie-loss statistic, where the three values refer to the numbers of data sets for which col > row, col = row, and col < row, respectively. A third gives the $p$ values of a two-tailed sign test based on the win-tie-loss record. If the $p$ value is smaller than the given significance level, the difference between the two algorithms is significant; otherwise, it is not.
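
The sign test itself reduces to a binomial test on the win-loss record; a small sketch follows (discarding ties, which is one common convention and an assumption on our part):

```python
from scipy.stats import binomtest

def sign_test(wins, ties, losses):
    """Two-tailed sign test on a win-tie-loss record. Ties are
    discarded; the p value is the probability of a split at least
    this uneven under the null hypothesis of equal performance."""
    n = wins + losses
    return binomtest(wins, n=n, p=0.5, alternative='two-sided').pvalue

# Example: 16 wins, 1 tie, 3 losses over 20 data sets gives
# p = sign_test(16, 1, 3)  # ~0.0044, significant at the 0.05 level
```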

Tables 7, 8, and 9 show all pairwise comparisons of the considered algorithms based on the AUC, G-mean, and F-measure metrics, respectively. The results show that RotEasy obtains the best performance among the compared algorithms: it not only achieves the highest mean performance, but also always gains the largest win records, as seen in the last columns of Tables 7–9.

In terms of the three evaluation measures used, the top three algorithms are ranked in the same order: RotEasy, unpruned RotEasy, and EasyEnsemble. The other algorithms are approximately ranked from better to worse as SMOB, RUSB, UNBag, RAMO, RotF, AdaC, SMBag, and CART. This result is consistent with the findings of previous studies [6, 7, 18].

6. Conclusions and Future Work

In this paper, we presented a new ensemble construction method, RotEasy, which combines the principles of EasyEnsemble, rotation forest, and diversity regularized ensemble pruning. EasyEnsemble uses bagging as the main ensemble learning framework, with each bagging member composed of an AdaBoost ensemble classifier; it combines the merits of the boosting and bagging strategies and is among the most advanced approaches to handling class imbalance problems. The main innovation of RotEasy is to use the more diverse AdaBoost-based rotation forest as the inner-layer ensemble in place of the AdaBoost of EasyEnsemble, and then to further enhance diversity through the DREP ensemble pruning method.

To verify the superiority of the proposed RotEasy approach, we established empirical comparisons with several state-of-the-art imbalanced learning algorithms, including RUSBoost, SMOTEBoost, UnderBagging, SMOTEBagging, AdaCost, RAMOBoost, rotation forest, and EasyEnsemble. Experimental results on 20 real-world imbalanced data sets show that RotEasy outperforms the other imbalanced learning methods in terms of AUC, G-mean, and F-measure, owing to its ability to strengthen diversity. The improvement over the other methods was also confirmed by the nonparametric sign test.

Based on the present work, several interesting research directions deserve further investigation: (1) integrating recent evolutionary undersampling [21, 25] into the proposed ensemble framework, instead of plain random undersampling; (2) generalizing the technique to multiclass imbalanced learning problems [26–28], as only binary imbalanced classification was considered in the current experiments; (3) extending the study to semisupervised learning for imbalanced classification problems [29, 30].

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) under Grant no. 2013CB329404, the Major Research Project of the National Natural Science Foundation of China under Grant no. 91230101, the National Natural Science Foundation of China under Grants nos. 61075006 and 11201367, and the Key Project of the National Natural Science Foundation of China under Grant no. 11131006.