Advancements in Mathematical Methods for Pattern Recognition and its Applications
Research Article | Open Access
Two-Stage Bagging Pruning for Reducing the Ensemble Size and Improving the Classification Performance
Ensemble methods, such as the traditional bagging algorithm, can usually improve the performance of a single classifier. However, they usually require large storage space and relatively time-consuming predictions. Many approaches have been developed to reduce the ensemble size and improve the classification performance by pruning the traditional bagging algorithm. In this article, we propose a two-stage strategy to prune the traditional bagging algorithm by combining two simple approaches: accuracy-based pruning (AP) and distance-based pruning (DP). These two methods, as well as their two combinations, “AP+DP” and “DP+AP”, which form the two-stage pruning strategy, were all examined. Compared with the single pruning methods, the two-stage pruning methods further reduce the ensemble size and improve the classification performance. The “AP+DP” method generally performs better than the “DP+AP” method with four base classifiers: decision tree, Gaussian naive Bayes, K-nearest neighbor, and logistic regression. Moreover, compared with the traditional bagging, the two-stage method “AP+DP” improved the classification accuracy by 0.88%, 4.06%, 1.26%, and 0.96%, respectively, averaged over 28 datasets under the four base classifiers. “AP+DP” also outperformed three existing algorithms, Brag, Nice, and TB, on 8 common datasets. In summary, the proposed two-stage pruning methods are simple and promising approaches that can both reduce the ensemble size and improve the classification accuracy.
Aiming at improving predictive performance, ensemble methods, with bagging and boosting [2, 3] as representatives, are in general constructed as a linear combination of a set of fitted models instead of a single fit of a base classifier or learner [4, 5]. It is well known that an ensemble is usually much more accurate than a single (weaker) learner [1, 6, 7]. Numerous fitted models are generated, often with a large ensemble size, to make the classification error as small as possible. As a result, the ensemble potentially requires large storage space, and its predictions are often relatively time-consuming in practical applications. These drawbacks can be addressed by removing a part of the base classifiers (learners or models) from the original ensemble without loss of predictive performance, which is called ensemble pruning [9–13]. An obvious benefit of ensemble pruning is a relatively small ensemble, which not only reduces the storage space and improves the computational efficiency but can also generalize better than the original one.
The traditional bagging algorithm (also known as bootstrap aggregating), one of the simplest representative ensemble methods, is composed of two key ingredients: bootstrap and aggregation. Specifically, a number of data subsets for training base learners are independently generated from the original training dataset using bootstrap sampling with replacement. Then, the bagging algorithm aggregates the outputs of all base learners using a voting strategy for classification tasks. Although different sampling strategies have been proposed, for instance, neighborhood sampling in bagging, they always lead to a large space requirement for storing the base learners and a high computational cost for predictions. In the past decade, therefore, several studies have turned to bagging pruning to reduce the ensemble size while retaining or improving the classification performance. For example, Hothorn and Lausen (2003) proposed a double-bagging method to deal with the problems of variable and model selection bias. This approach combined linear discriminant analysis and classification trees to generate ensemble machines. Furthermore, Zhang et al. (2009) extended their work by using boosting to prune the double-bagging ensembles. Zhang and Ishikawa (2007) used a hybrid real-coded genetic algorithm to prune the bagging ensemble. Hernández-Lobato et al. (2011) adopted either semidefinite programming or ordered aggregation strategies to identify an optimal subset of regressors in a regression bagging ensemble. Xie et al. (2012) introduced an ensemble pruning method called MAD-Bagging, which uses the margin-distribution-based classification loss as the optimization objective. Chung and Kim (2015) suggested a PL-bagging method that employs positive Lasso to assign weights to base learners in the combination step. More recently, Galar et al. (2016) designed an ordering-based ensemble pruning method for imbalanced datasets. Zhang et al. (2017) introduced a novel ensemble pruning technique called PST2E to obtain smaller but stronger variable selection ensembles. Jiang et al. (2017) proposed a novel forest pruning strategy to enhance ensemble generalization ability and reduce ensemble size. Onan et al. (2017) proposed a hybrid ensemble pruning approach based on consensus clustering and a multiobjective evolutionary algorithm. Guo et al. (2018) presented a margin- and diversity-based ordering ensemble pruning method. Although these pruning methods can improve the performance of the traditional bagging, the majority of them are relatively complicated and not intuitive for practical use. Furthermore, none of them selects models (learners) according to the specific characteristics of an unknown sample.
In this work, we propose a two-stage bagging pruning approach composed of two independent methods: accuracy-based pruning (AP) and distance-based pruning (DP). These two methods can be combined in either order to form the two-stage strategy. The former, i.e., the AP procedure, uses a rule similar to that of nice bagging and trimmed bagging: it excludes the worst classifiers and aggregates the rest. Specifically, among all models built in the traditional bagging, the base models with the highest prediction accuracy (or the lowest error rates) on their out-of-bag samples are selected and retained. For the latter, i.e., the DP procedure, we use the specificity of a test sample to select a part of the fitted models in the ensemble. This specificity is simply measured as the Euclidean distance between the test sample and the center of the out-of-bag samples corresponding to each model in the traditional bagging. The models closer to the test sample (with smaller distance values) are collected to form the final ensemble for label prediction. Unlike other existing pruning methods, we adopt these two simple and intuitive rules to implement the two-stage bagging pruning strategy, aiming at a novel ensemble method with reduced ensemble size and higher prediction performance.
The remainder of this paper is organized as follows. Section 2 briefly introduces the traditional bagging algorithm and the measures used to evaluate the classification performance. Section 3 describes our proposed algorithms for bagging pruning. Section 4 reports the experimental results and analysis on twenty-eight real datasets. The conclusion is drawn in Section 5.
In this section, we first introduce the traditional bagging algorithm as well as some basic concepts including accuracy, relative improvement, and cross validation for classification task.
2.1. Traditional Bagging Algorithm
Ensemble learning refers to a combination of several relatively weak classifiers to produce a stronger classifier, which can ensure the diversity of weak classifiers and improve the generalization ability. Bagging is one of the basic algorithms for ensemble learning, and it can usually realize the advantage of an ensemble model effectively [28, 29].
The traditional bagging algorithm is composed of two key ingredients, i.e., bootstrap and aggregation. First, a number of subsets are randomly and independently sampled from the original training set using the bootstrap sampling strategy with replacement. Second, the bagging algorithm aggregates the outputs of all base models using a voting strategy for classification tasks. The traditional bagging algorithm is briefly described in Algorithm 1. Suppose that the training set for a C-class classification problem is given as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $(x_i, y_i)$ represents a sample encoded by the $m$-dimensional feature vector $x_i$ with class label $y_i$, and $N$ is the number of samples in the training set. In addition, assume that $T$ is the original ensemble size, which equals the number of sampled subsets as well as the number of base classifiers, $L$ is the base classifier, $E$ represents the ensemble model built with the bagging algorithm, and $\mathrm{Bootstrap}(D)$ returns a bootstrapped subset generated from the original training set $D$.
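The two ingredients described above, bootstrap sampling and majority voting, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's Algorithm 1: the names `NearestCentroid`, `bootstrap_indices`, `fit_bagging`, and `predict_bagging` are hypothetical, and the tiny nearest-centroid learner merely stands in for a real base classifier. The sketch also records each model's out-of-bag indices, since the pruning rules rely on them.

```python
import random
from collections import Counter


class NearestCentroid:
    """Tiny stand-in base classifier (hypothetical; not one of the article's four)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        # One centroid per class seen in this bootstrap sample.
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; also return the out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def fit_bagging(X, y, T, seed=0):
    """Train T base models, one per bootstrap sample, and keep their OOB indices."""
    rng = random.Random(seed)
    models, oob_sets = [], []
    for _ in range(T):
        in_bag, oob = bootstrap_indices(len(X), rng)
        model = NearestCentroid().fit([X[i] for i in in_bag], [y[i] for i in in_bag])
        models.append(model)
        oob_sets.append(oob)
    return models, oob_sets


def predict_bagging(models, x):
    """Aggregate the base models' predictions by majority vote."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]]
y = [0, 0, 0, 1, 1, 1]
models, oob_sets = fit_bagging(X, y, T=25)
print(predict_bagging(models, [0.1, 0.1]), predict_bagging(models, [5.0, 5.0]))
```

In practice the stand-in learner would be replaced by a real base classifier such as a decision tree; only the sampling and voting structure is the point here.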
2.2. Performance Evaluation
To evaluate the prediction performance of the proposed pruning methods, we adopted two measures, accuracy and relative improvement, to assess the classification results.
2.2.1. Accuracy
When a model trained on a training set is applied to predict a test set, the following measure, called accuracy, is used to assess the overall classification performance on the test set:
$$\mathrm{Acc} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}}.$$
2.2.2. Relative Improvement
In this work, we proposed four types of pruning algorithms to reduce the ensemble size and improve the classification performance of the original bagging. We also compared our methods with three other variations of the bagging algorithm in Section 4.4. To obtain a consistent comparison among these variations or pruned bagging methods, we used the same measure as Croux et al. (2007), called relative improvement. It is defined in terms of the error rate (ER) as follows:
$$\text{Relative improvement} = \frac{\mathrm{ER}_{\mathrm{bagging}} - \mathrm{ER}_{\mathrm{pruned}}}{\mathrm{ER}_{\mathrm{bagging}}},$$
where $\mathrm{ER}_{\mathrm{bagging}}$ is the error rate of the traditional bagging, which is equal to $1 - \mathrm{Acc}$, and $\mathrm{ER}_{\mathrm{pruned}}$ is the error rate of the pruned bagging or bagging variation under comparison. Sometimes, the performance improvement can also be computed as the relative accuracy improvement of the pruned bagging with respect to the traditional bagging for comparison and evaluation:
$$\text{Accuracy improvement} = \frac{\mathrm{Acc}_{\mathrm{pruned}} - \mathrm{Acc}_{\mathrm{bagging}}}{\mathrm{Acc}_{\mathrm{bagging}}}.$$
Consequently, in this work, relative improvement refers to the definition in terms of error rate, and accuracy improvement means the relative increase in accuracy. Both are reported as percentages.
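The three measures of this section can be written as small functions. This is a minimal sketch: the function names are hypothetical, the accuracy values in the example are made up for illustration, and the accuracy improvement is taken as a relative increase, following the sentence above.

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def relative_improvement(acc_bagging, acc_pruned):
    """Relative reduction of the error rate (ER = 1 - Acc) versus bagging.

    Undefined when bagging has zero error (division by zero).
    """
    er_bagging = 1.0 - acc_bagging
    er_pruned = 1.0 - acc_pruned
    return (er_bagging - er_pruned) / er_bagging


def accuracy_improvement(acc_bagging, acc_pruned):
    """Relative increase of accuracy versus bagging."""
    return (acc_pruned - acc_bagging) / acc_bagging


# Illustrative (made-up) numbers: bagging at 80% accuracy, pruned ensemble at 84%.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 3 of 4 correct
print(relative_improvement(0.80, 0.84))        # error drops from 0.20 to 0.16
print(accuracy_improvement(0.80, 0.84))
```

The same change in accuracy gives a larger relative improvement than accuracy improvement whenever the bagging accuracy exceeds 50%, because the error rate in the denominator is then smaller than the accuracy.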
2.2.3. Cross Validation
To avoid overfitting in the computational experiments, we used cross validation to verify the performance of the classifiers and the proposed pruning algorithms. Cross validation [30, 31] is a procedure that divides the training dataset into several subsets and has three categories: hold-out, K-fold cross validation, and leave-one-out cross validation. However, hold-out is not entirely convincing [33, 34], and leave-one-out cross validation is time-consuming for large-scale datasets. Thus, in this work, we adopted K-fold cross validation to evaluate the classification performance of the proposed bagging pruning methods. K-fold cross validation divides the original dataset into K subsets of equal size. One subset is then used for testing, and all the remaining subsets are combined into a training dataset. This procedure is repeated for each subset, and the classification performance is calculated in each fold. The average accuracy over the K folds is finally reported as the classification performance of the proposed method. In this work, fivefold cross validation was applied to all computational experiments.
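The K-fold procedure described above can be sketched as follows. The helper names (`k_fold_indices`, `cross_val_accuracy`, `majority_fit_predict`) are hypothetical, the shuffling seed is an assumption, and the majority-class predictor is only a placeholder for a real classifier or pruned ensemble.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_val_accuracy(fit_predict, X, y, k=5, seed=0):
    """Average test accuracy over k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / k


def majority_fit_predict(X_train, y_train, X_test):
    """Placeholder model: always predicts the most frequent training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * len(X_test)


X_demo = [[float(i)] for i in range(20)]
y_demo = [0] * 15 + [1] * 5
print(cross_val_accuracy(majority_fit_predict, X_demo, y_demo, k=5))
```

With 20 samples and five equal folds, the majority-class placeholder scores exactly the share of the majority class, which makes the averaging step easy to check by hand.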
3. Two-Stage Pruning Algorithms for Bagging
We present two-stage pruning methods that follow certain rules to reduce the ensemble size of the traditional bagging algorithm. The proposed two-stage pruning methods are composed of two individual pruning procedures with different rules. The first is accuracy based, denoted the AP stage, and the second is distance based, denoted the DP stage. These two pruning approaches can be combined in two forms, i.e., “AP+DP” and “DP+AP”. The form “AP+DP” means that the traditional bagging is first pruned using the accuracy-based AP method, and the distance-based DP pruning is then applied to further reduce the subset of base models produced by the AP stage; “DP+AP” applies the two stages in the opposite order. The flow diagrams of the two-stage pruning methods “AP+DP” and “DP+AP” are depicted in Figures 1(a) and 1(b), respectively. In this section, we describe the algorithms for all of these pruning methods, including AP, DP, “AP+DP”, and “DP+AP”.
3.1. Accuracy-Based Pruning Method (AP)
The AP procedure adopts a reduction rule similar to that of nice bagging (Nice) and trimmed bagging (TB), in which only good or “nice” bootstrap versions of the base models, validated on out-of-bag samples, are aggregated. Specifically, those base models generated in the traditional bagging that performed better than the rest according to a certain decile value $t_a$ were retained to form the final reduced set of base models. The main difference among AP, Nice, and TB is the threshold strategy used to select the “nice” base models. The AP procedure is described in detail in Algorithm 2. Briefly, we first collect the subset of out-of-bag samples for each base model $h_i$ ($i = 1, \ldots, T$, where $T$ is the original ensemble size) in the traditional bagging, named $O_i$. Then the accuracy $a_i$ of each base model $h_i$ tested on the subset $O_i$ is calculated. The decile value $t_a$ can be viewed as a parameter that takes integer values in $\{0, 1, \ldots, 9\}$. If $a_i$ is less than a threshold $\theta_a$, which is calculated as the $t_a$-th decile value of the set $\{a_1, \ldots, a_T\}$, the base model $h_i$ is removed from the original bagging ensemble. For example, $\theta_a$ equals the 30th percentile when $t_a = 3$. For a given parameter $t_a$, $T \cdot t_a / 10$ base models are excluded from the original ensemble, and the size of the reduced classifier set is equal to $T(1 - t_a/10)$.
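The accuracy-based rule can be sketched as a threshold on the out-of-bag accuracies. This is an illustration under assumptions rather than the article's Algorithm 2: the names `percentile` and `ap_prune` are hypothetical, the decile is computed with linear interpolation between sorted values (the article does not say which percentile convention it uses), and the accuracies in the example are made up.

```python
def percentile(values, q):
    """q-th percentile (0..100) using linear interpolation between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def ap_prune(oob_accuracies, t_a):
    """Return the indices of base models kept by accuracy-based pruning.

    A model is dropped when its out-of-bag accuracy is below the t_a-th decile.
    """
    if t_a not in range(10):
        raise ValueError("t_a must be an integer from 0 to 9")
    threshold = percentile(oob_accuracies, 10 * t_a)
    return [i for i, a in enumerate(oob_accuracies) if a >= threshold]


# Made-up out-of-bag accuracies for an ensemble of ten base models.
oob_acc = [0.55, 0.50, 0.58, 0.52, 0.59, 0.51, 0.57, 0.53, 0.56, 0.54]
kept = ap_prune(oob_acc, t_a=3)
print(kept)  # the three least accurate models (indices 1, 3, 5) are dropped
```

With ten distinct accuracies and the decile parameter set to 3, seven models remain, matching the 30% reduction described above. Tied accuracies can make the retained set slightly larger than that.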
3.2. Distance-Based Pruning Method (DP)
The DP method is based on the distance from the test sample to the center of the out-of-bag sample set $O_i$ associated with base model $h_i$. The procedure is presented in detail in Algorithm 3. Briefly, we first compute the center of each out-of-bag sample set as follows:
$$c_i = \frac{1}{n_i} \sum_{x_j \in O_i} x_j,$$
where $n_i$ is the size of the out-of-bag sample set $O_i$. For any new test sample $x$, the Euclidean distance from the test sample to each center $c_i$ is calculated as
$$d_i = \lVert x - c_i \rVert_2.$$
As in the AP procedure, the selection of base models is carried out according to a decile parameter $t_d$. If $d_i$ is larger than a threshold $\theta_d$, which is calculated as the $t_d$-th decile of the set $\{d_1, \ldots, d_T\}$, the base model $h_i$ is excluded from the original bagging ensemble; otherwise, it is retained.
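The distance-based rule can be sketched in the same style. Again this is an illustration under assumptions, not the article's Algorithm 3: the names `centroid`, `percentile`, and `dp_prune` are hypothetical, the decile uses linear interpolation, and the out-of-bag sets in the example are made up so that the distances are easy to verify.

```python
import math


def percentile(values, q):
    """q-th percentile (0..100) using linear interpolation between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def dp_prune(oob_sets, x, t_d):
    """Return the indices of base models kept for test sample x.

    oob_sets[i] holds the out-of-bag feature vectors of base model i. A model is
    dropped when its out-of-bag center lies farther from x than the t_d-th decile
    of all center distances.
    """
    if t_d not in range(1, 11):
        raise ValueError("t_d must be an integer from 1 to 10")
    dists = [math.dist(x, centroid(s)) for s in oob_sets]
    threshold = percentile(dists, 10 * t_d)
    return [i for i, d in enumerate(dists) if d <= threshold]


# Made-up out-of-bag sets whose centers sit at (1, 0), (2, 0), ..., (10, 0).
oob_sets = [[[float(k), -1.0], [float(k), 1.0]] for k in range(1, 11)]
print(dp_prune(oob_sets, [0.0, 0.0], t_d=2))  # keeps the two nearest models
```

Unlike the accuracy-based rule, the retained set here depends on the test sample, so the selection has to be repeated for every sample to be classified.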
3.3. Two-Stage Pruning on the Bagging Algorithm
The two individual pruning methods above, the AP and DP procedures, can be carried out in combination, which we call two-stage pruning. There are two ways of combining the AP and DP procedures. One combination first applies the AP stage to prune the traditional bagging algorithm and then applies the DP stage to further prune the reduced set of base models generated by the AP procedure; it is denoted by “AP+DP”. The other is similar, but the AP and DP methods are applied in the opposite order; it is named “DP+AP”. The algorithms for “AP+DP” and “DP+AP” are described in Algorithms 4 and 5, respectively. Additionally, the number of base models in the reduced set generated by the first stage, whether it is AP or DP, is denoted by $T'$, and the corresponding index set of these base models with respect to the original set is named $I'$.
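The two combinations can be sketched as a composition of the two stages, each operating on the index set left by the previous one. This is an illustration, not the article's Algorithms 4 and 5: all function names are hypothetical, the decile uses linear interpolation, and the out-of-bag accuracies and centers are made up.

```python
import math


def percentile(values, q):
    """q-th percentile (0..100) using linear interpolation between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def ap_stage(indices, oob_acc, t_a):
    """Keep the models in `indices` whose OOB accuracy reaches the t_a-th decile."""
    thr = percentile([oob_acc[i] for i in indices], 10 * t_a)
    return [i for i in indices if oob_acc[i] >= thr]


def dp_stage(indices, oob_centers, x, t_d):
    """Keep the models in `indices` whose OOB center is within the t_d-th decile."""
    dist = {i: math.dist(x, oob_centers[i]) for i in indices}
    thr = percentile(list(dist.values()), 10 * t_d)
    return [i for i in indices if dist[i] <= thr]


def ap_then_dp(oob_acc, oob_centers, x, t_a, t_d):
    first = ap_stage(list(range(len(oob_acc))), oob_acc, t_a)  # index set after stage 1
    return dp_stage(first, oob_centers, x, t_d)


def dp_then_ap(oob_acc, oob_centers, x, t_a, t_d):
    first = dp_stage(list(range(len(oob_acc))), oob_centers, x, t_d)
    return ap_stage(first, oob_acc, t_a)


# Made-up ensemble of 20 models: accuracy grows with the index, and so does the
# distance of the out-of-bag center from the test sample at the origin.
oob_acc = [0.50 + 0.01 * i for i in range(20)]
oob_centers = [[float(i), 0.0] for i in range(20)]
x = [0.0, 0.0]
print(ap_then_dp(oob_acc, oob_centers, x, t_a=5, t_d=2))
print(dp_then_ap(oob_acc, oob_centers, x, t_a=5, t_d=2))
```

In this toy setting the two orders retain different models, which illustrates why the combinations are evaluated separately. In the sketch, the AP stage of the AP+DP order does not depend on the test sample and could be computed once and reused, whereas the DP+AP order repeats both stages for every test sample.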
4. Analysis of Experimental Results
To evaluate the proposed bagging pruning methods, including the AP, DP, “AP+DP”, and “DP+AP” procedures, we collected 28 real datasets from the UCI Machine Learning Repository and carried out computational experiments comparing four types of base classifiers. These datasets are listed in Table 1 with brief descriptions of their names and the numbers of instances, classes, and variables (features). The four types of base classifiers are decision tree (DT), Gaussian naive Bayes (GNB), K-nearest neighbors (KNN), and logistic regression (LR), all of which are implemented in the machine learning platform scikit-learn. In addition, we adopted fivefold cross validation on every dataset for the proposed pruning methods. For simplicity and consistency, we set the original ensemble size to 200 in the traditional bagging algorithm; i.e., 200 subsets were randomly generated from the training dataset using the bootstrap sampling strategy.
Note. #Ins., #C, and #V mean the number of instances, the number of classes, and the number of variables for the dataset, respectively.
4.1. Optimization of AP and DP Procedures
The parameter $t_a$ or $t_d$ that yields the highest accuracy may vary with the dataset. On each dataset, we adopted grid search to optimize the parameters $t_a$ and $t_d$ of the AP and DP procedures, respectively. The value of $t_a$ in the AP procedure ranges from 0 to 9 with a step size of 1, and the parameter $t_d$ in the DP procedure takes integer values from 1 to 10 with a step size of 1. All possible values of the parameters $t_a$ and $t_d$ were examined, with special attention to the cases in which the highest accuracy values were achieved.
In the AP or DP procedure, given a base classifier on the same dataset, different values of the parameter $t_a$ or $t_d$ may result in different ensemble sizes as well as different accuracy values. For a given base classifier, we examined the value of $t_a$ or $t_d$ at which the highest accuracy was achieved. As shown in Figure 2 for a given $t_a$ and Figure 3 for a given $t_d$, we counted the number of datasets on which the best accuracy was achieved in the AP and DP procedures, respectively.
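The grid search over a single decile parameter amounts to scoring every candidate value and keeping the best one. A minimal sketch follows, in which `best_decile_parameter` is a hypothetical helper and `toy_cv_accuracy` is a made-up stand-in for the cross-validated accuracy of a pruned ensemble.

```python
def best_decile_parameter(score, candidates):
    """Return (best_value, best_score); ties go to the first candidate tried."""
    best_value, best_score = None, float("-inf")
    for t in candidates:
        s = score(t)
        if s > best_score:
            best_value, best_score = t, s
    return best_value, best_score


def toy_cv_accuracy(t):
    """Made-up accuracy curve that peaks at t = 7 (for illustration only)."""
    return 0.90 - 0.005 * (t - 7) ** 2


# AP candidates run from 0 to 9, DP candidates from 1 to 10, both in steps of 1.
print(best_decile_parameter(toy_cv_accuracy, range(0, 10)))
print(best_decile_parameter(toy_cv_accuracy, range(1, 11)))
```

For the two-stage methods the same loop would run over all pairs of the two parameters, which gives 100 combinations per dataset and base classifier.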
It can be observed that the AP procedure behaved best when the parameter $t_a$ was equal to 9. Specifically, for $t_a = 9$, there are 5, 21, 7, and 14 datasets on which the best classification was achieved when the base classifiers are DT, GNB, KNN, and LR, respectively. For any integer $t_a$ less than 9, the number of datasets on which the highest accuracy was achieved is smaller than for $t_a = 9$. Setting $t_a$ to 9 means that 90% of all base models in the traditional bagging algorithm are trimmed off. The empirical results imply that the accuracy-based pruning (AP) method tends to reduce the ensemble size by a large amount, especially for the GNB and LR classifiers. Therefore, we can conclude that AP pruning under any of the four base classifiers is an efficient method to reduce the ensemble size of the traditional bagging.
Similarly, we counted the number of datasets on which the highest accuracy was reached for each value of the parameter $t_d$ in the DP procedure. In general, the best classification performance was achieved at different parameter values for different datasets, and the four base classifiers DT, GNB, KNN, and LR showed distinct distributions of the numbers of datasets on which the DP procedure performed best. As can be observed from Figure 3, when the base classifier is DT and $t_d$ is set to 3, there are 6 datasets on which the classification performed best; when the base classifier is GNB and $t_d$ equals 2, the best classification is achieved on 9 datasets; when using KNN with $t_d$ equal to 2 or 6, the DP procedure performs best on seven datasets; and when the base learner is LR and $t_d$ is set to 2, the best classification is achieved on 6 datasets. In each of these cases, the number of datasets is the largest among all possible values of the parameter $t_d$. In the DP procedure, more base classifiers are excluded when a smaller $t_d$ is taken. The empirical results show that the DP pruning method also tends to reduce the ensemble size by a large amount, although the reduction is less pronounced than with the AP procedure.
4.2. Result Analysis for Two-Stage Pruning Methods
As mentioned above, we further combined the AP and DP procedures into two strategies for two-stage pruning and examined their classification performance on 28 datasets by varying the parameters $t_a$ of AP and $t_d$ of DP. The first two-stage pruning method is “AP+DP”. The computational experiments were carried out according to Algorithm 4 by simultaneously varying $t_a$ and $t_d$. We also counted the numbers of datasets on which the accuracy was optimized for each pair of parameters $t_a$ and $t_d$. The distributions for the four base classifiers (i.e., DT, GNB, KNN, and LR) are shown in Figure 4, where the number of datasets corresponding to a given pair of $t_a$ and $t_d$ is represented by the size of a colored circle, with each color denoting a positive integer. GNB tends to be the most efficient base classifier for reducing the ensemble size when compared with the other three base classifiers (DT, KNN, and LR). Specifically, with GNB as the base classifier, it is somewhat surprising that on 11 out of 28 datasets the best accuracy was achieved with the parameters $t_a = 9$ and $t_d = 1$. In these cases, 90% of the base models were excluded by the AP procedure, and a further 90% of the reduced set of base models were trimmed off in the DP stage. The second most efficient base classifier for reducing the ensemble size is LR, since the majority of datasets were counted at one particular setting of $t_a$ and $t_d$. The other two types of classifiers, DT and KNN, showed a relatively weaker ability to reduce the original ensemble size, since their numbers of datasets are spread much more widely over the values of $t_a$ and $t_d$.
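As a worked check of the ensemble sizes involved, assume the original ensemble size of 200 used in the experiments, deciles that split the models evenly (no ties), and an AP decile parameter of 9 followed by a DP decile parameter of 1, which is the setting implied by the two successive 90% reductions:

```latex
\begin{align*}
\text{after AP:}\quad & 200 \times \left(1 - \tfrac{9}{10}\right) = 20 \text{ base models},\\
\text{after DP:}\quad & 20 \times \tfrac{1}{10} = 2 \text{ base models}.
\end{align*}
```

Under these assumptions, only about 1% of the original base models would vote on each test sample.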
The second two-stage pruning experiment is “DP+AP”, performed according to Algorithm 5. As for the first two-stage method “AP+DP”, the distributions of the numbers of datasets over the parameters $t_a$ and $t_d$ at which the classification performance was optimized are plotted for the four base classifiers in Figure 5. The distributions generated by the “DP+AP” pruning method are more diverse for all four base classifiers than those of “AP+DP”. However, GNB again exhibited the most apparent tendency to reduce the ensemble size by a large amount. As a result, both “AP+DP” and “DP+AP” are generally effective in reducing the size of the original ensemble, although the two methods differ in the extent of the reduction.
4.3. Performance Comparison among Single Base Classifier, Bagging, and the Proposed Pruning Methods
We compared the experimental results obtained by a single classifier, the corresponding bagging, and the proposed pruning methods AP, DP, “AP+DP”, and “DP+AP”. The accuracy values based on fivefold cross validation were calculated on 28 datasets and are listed in Tables 2, 3, 4, and 5 for DT, GNB, KNN, and LR, respectively. For the pruning methods AP, DP, “AP+DP”, and “DP+AP”, we only report the accuracy values obtained with the optimized parameters $t_a$ and $t_d$, which are listed in parentheses. In addition, the highest accuracy value among the single classifier, bagging, and the proposed pruning methods on each dataset is highlighted in bold font.
In general, the traditional bagging performed better than the corresponding single classifier. However, bagging sometimes brings no improvement, or even a decrease, in classification performance compared with the underlying single classifier. For example, as shown in Table 3, the accuracy of bagging using GNB as the base classifier on the Ger dataset is 72.00%, which is lower than the 72.70% of the single GNB classifier. Such cases are marked with underlined italic fonts in Tables 2–5. Several authors, e.g., Dietterich and Croux et al., also pointed out that there is no guarantee that bagging will improve the performance of any base classifier. Nevertheless, in all such cases at least one of the proposed pruning methods performed better than both the single classifier and the bagging. This implies that our proposed pruning methods were effective on all 28 datasets.
As shown in Tables 2–5, for the base classifiers DT, GNB, KNN, and LR, there are, respectively, 20 (71.43%), 26 (92.86%), 21 (75%), and 24 (85.71%) out of 28 datasets on which the single AP stage improved the classification performance compared with the traditional bagging. The accuracy of AP pruning increased by 0.39%, 3.05%, 0.61%, and 0.65% on average over all 28 datasets when the base classifier was DT, GNB, KNN, and LR, respectively. Similarly, the DP procedure improved the classification performance of the traditional bagging on 25 (89.29%), 20 (71.43%), 21 (75%), and 15 (53.57%) out of the 28 datasets for the base classifiers DT, GNB, KNN, and LR, respectively. Moreover, the average accuracy improvement of DP pruning over the traditional bagging using DT, GNB, KNN, and LR as the base classifier is 0.65%, 0.39%, 0.58%, and 0.19%, respectively. Comparing AP and DP on the basis of the relative improvement, AP is much more powerful than DP, although with DT as the base classifier the improvement from the AP procedure is lower than that from the DP procedure.
Moreover, with the AP procedure as the first pruning stage, the two-stage pruning method “AP+DP” increased the classification accuracy on 27 (96.43%), 28 (100%), 28 (100%), and 27 (96.43%) out of 28 datasets for the base classifiers DT, GNB, KNN, and LR, respectively, when compared with the traditional bagging. The average accuracy improvements of the “AP+DP” method with respect to the traditional bagging using DT, GNB, KNN, and LR are 0.88%, 4.06%, 1.26%, and 0.96%, which exceed the improvements of both the AP and DP methods alone. If the first stage is DP, the two-stage pruning approach “DP+AP” improved the classification performance on 26 (92.86%), 26 (92.86%), 26 (92.86%), and 20 (71.43%) out of 28 datasets for the base classifiers DT, GNB, KNN, and LR, respectively, when compared with the traditional bagging. The corresponding average accuracy improvements are 0.90%, 3.52%, 1.08%, and 0.57%, respectively. In addition, there are 9 (32.14%), 19 (67.86%), 15 (53.57%), and 15 (53.57%) out of 28 datasets on which “AP+DP” performed better than “DP+AP”, and the average accuracy improvements of “AP+DP” over “DP+AP” are -0.01%, 0.54%, 0.17%, and 0.39%, respectively, for DT, GNB, KNN, and LR. As a result, the classification performance of “AP+DP” using DT as the base learner is overall very close to that of “DP+AP”, and “AP+DP” performed better than “DP+AP” on the majority of the 28 datasets when using the other three methods, GNB, KNN, and LR, as the base classifiers.
From the above, we can conclude that the proposed pruning methods are able to improve the classification performance when compared with the traditional bagging. Furthermore, the two-stage pruning methods are more powerful than the single pruning approaches. Although “DP+AP” using DT as the base classifier performed very slightly better than “AP+DP”, the computation performed by “AP+DP” is much faster than that of “DP+AP”. Therefore, within the current bagging pruning framework, we recommend using “AP+DP” to prune the traditional bagging for reducing the ensemble size and improving the classification performance for the base classifiers DT, GNB, KNN, and LR.
4.4. Comparison with Other Existing Bagging Algorithms
We compared the proposed pruning methods with three other variations of the bagging algorithm: Brag (Bootstrap robust aggregating), Nice (Nice Bagging), and TB (Trimmed Bagging). Brag is not actually a bagging pruning method; it calculates the median of the outcomes of all the bootstrapped classifiers instead of computing an average as the traditional bagging does. Nice averages over the outcomes of the bootstrapped classifiers that performed better than the initial base classifier, while TB excludes the 25% “worst” classifiers and aggregates the rest. Nice and TB, like the AP method presented in this work, are bagging pruning methods that use a similar rule of excluding “bad” classifiers validated on the out-of-bag samples. The relative improvement of the different bagging variations with respect to the traditional bagging under the four base classifiers DT, GNB, KNN, and LR is listed in Tables 6, 7, 8, and 9, respectively. In these tables, we only show the available results (“NA” means not available, as shown in Tables 6 and 9) for Brag, Nice, and TB on 8 datasets; their relative improvement values listed in Tables 6 and 9 were all taken from the work by Croux et al. However, that work did not evaluate bagging methods that used GNB or KNN as the base classifier.
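The differences among the three variations can be sketched side by side. This is an illustration of the one-sentence descriptions above, not of the original algorithms: the function names are hypothetical, `outcomes` stands for the per-classifier scores for one test sample (for example predicted class-1 probabilities), the error rates are assumed to be measured on out-of-bag samples, and all numbers are made up.

```python
from statistics import mean, median


def bagging_score(outcomes):
    """Traditional bagging: average the outcomes of all bootstrapped classifiers."""
    return mean(outcomes)


def brag_score(outcomes):
    """Brag: take the median of the outcomes instead of the average."""
    return median(outcomes)


def nice_keep(oob_errors, base_error):
    """Nice: keep classifiers whose error is lower than the initial base classifier's."""
    return [i for i, e in enumerate(oob_errors) if e < base_error]


def trimmed_keep(oob_errors, trim=0.25):
    """TB: drop the given fraction of classifiers with the highest error."""
    n_drop = int(len(oob_errors) * trim)
    order = sorted(range(len(oob_errors)), key=lambda i: oob_errors[i])
    return sorted(order[:len(order) - n_drop])


outcomes = [0.1, 0.2, 0.9, 0.3]        # made-up scores for one test sample
oob_errors = [0.10, 0.30, 0.20, 0.40]  # made-up out-of-bag error rates
print(bagging_score(outcomes), brag_score(outcomes))
print(nice_keep(oob_errors, base_error=0.25), trimmed_keep(oob_errors))
```

The accuracy-based AP rule differs from these mainly in its threshold: it uses a tunable decile of the out-of-bag accuracies rather than the base classifier's error or a fixed 25% cut.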