#### Abstract

Ensemble pruning is a technique to increase ensemble accuracy and reduce ensemble size by choosing a subset of ensemble members to form a subensemble for prediction. Many ensemble pruning algorithms based on a directed hill climbing search policy have recently been proposed. The key to the success of these algorithms is constructing an effective measure to guide the search process. In this paper, we study the importance of individual classifiers with respect to an ensemble using the margin theory proposed by Schapire et al. and conclude that ensemble pruning via a directed hill climbing strategy should focus more on examples with small absolute margins as well as on classifiers that correctly classify more examples. Based on this principle, we propose a novel measure, called the margin-based measure, to explicitly evaluate the importance of individual classifiers. Our experiments show that using the proposed measure to prune an ensemble leads to significantly better accuracy than other state-of-the-art measures.

#### 1. Introduction

Ensembles of multiple learning machines have been a very popular research topic in machine learning and data mining during the last decade. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying examples with unknown classes. Theoretical and empirical results show that an ensemble has the potential to increase classification accuracy beyond the level reached by an individual classifier alone [1]. Dietterich stated, “A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse” [2].

Many approaches have been proposed to create ensemble members with both high accuracy and high diversity. They can be grouped into three main categories: manipulating the data set [3, 4], manipulating the features [5–8], and manipulating the algorithms [9]. Bagging [3] and boosting [4], the most widely used and successful ensemble learning methods, fall into the first category. Bagging learns individual classifiers on data sets obtained by randomly sampling from the original training set; this random perturbation yields classifiers with high accuracy and sufficient diversity. Unlike bagging, boosting is an iterative learning process: in each iteration, it adjusts the distribution of the training set so that classifiers focus more on examples that are hard to classify correctly. Feature-manipulation approaches build the individual classifiers on diverse feature spaces obtained by selecting subsets of the original features or by generating new features from them. For example, random forests [5, 6] learn each tree on a feature subset obtained by randomly sampling from the original features, and COPEN [8] learns the base classifiers on new feature spaces mapped from the original feature space using pairwise constraints projection. Individual classifiers can also be built by manipulating algorithms: adjusting the model structure or the parameter settings yields diverse classifiers. For example, the negative correlation method explicitly constrains the parameters of individual neural networks to differ by means of a regularization term [9].

Ensemble methods have been successfully applied to many fields such as remote sensing [10], time series prediction [11], and imbalanced learning [12]. However, an obvious problem with ensemble learning methods is that they tend to train a very large number of classifiers, which require large storage resources and considerable computation to calculate the outputs of the individual learners. Besides, it is not always true that the larger the ensemble, the better its performance. In fact, Zhou et al. [13] proved that the generalization performance of a subset of an ensemble may be even better than that of the ensemble consisting of all the given individual learners. These reasons motivated ensemble pruning, also called ensemble selection or ensemble thinning, which selects a subset of ensemble members to form a subensemble that consumes fewer resources and responds faster, with accuracy similar to or better than that of the original ensemble [14–22].

Given an ensemble with $T$ members, searching for the best subset of ensemble members by enumerating all subensemble candidates is computationally infeasible because the search space has exponential size $2^T$; the problem is NP-complete [23]. Several efficient methods that are based on a directed hill climbing search in the space of subsets report good predictive performance [15, 16, 18, 24–27]. These methods start with an empty (or full) initial ensemble and search the space of different ensembles by iteratively expanding (or contracting) the initial ensemble by a single model. The search is guided by an evaluation measure that is based on either the predictive performance or the diversity of the alternative subsets. The evaluation measure is the main component of a directed hill climbing algorithm, and it differentiates the methods that fall into this category.

In this paper, we apply the concepts of example margins proposed by Schapire et al. [28] to analyse the importance of individual classifiers with respect to an ensemble and conclude that ensemble pruning via directed hill climbing strategy should focus more on examples with small absolute margins as well as classifiers that correctly classify more examples. Based on the gained insight, a criterion called margin-based measure is proposed to supervise the search process of ensemble pruning via directed hill climbing strategy. Our experiments show that using the proposed measure to prune an ensemble leads to significantly better accuracy results compared to other state-of-the-art measures.

The paper is structured as follows. Section 2 briefly describes ensemble pruning via directed hill climbing search. Section 3 analyses the importance of individual classifiers using margin theory. Section 4 proposes the margin-based measure for evaluating the importance of individual classifiers. Section 5 reports the experimental settings and results, and Section 6 concludes the paper.

#### 2. Related Work

Directed hill climbing ensemble pruning (DHCEP) attempts to find the globally best subset of classifiers by taking local greedy decisions for changing the current subset [17, 28, 29]. An example of the search space for an ensemble of four models is presented in Figure 1.

The direction of search and the measure used for evaluating the search are two important parameters that differentiate one DHCEP method from the other. The following sections discuss the different options for instantiating these parameters and the particular choices of existing methods.

##### 2.1. Direction of Search

Based on the direction of search we have two main categories of DHCEP methods: (a) forward selection and (b) backward elimination (see Figure 1).

In forward selection, the current classifier subset $S$ is initialized to the empty set. The algorithm then iteratively adds to $S$ the classifier $h \in H \setminus S$ that optimizes an evaluation function, where $H$ denotes the original ensemble. This function evaluates the addition of classifier $h$ to the current subset $S$ based on the pruning set (labeled data). This approach has been used in [14, 25, 26] and in reduce-error pruning methods [30, 31].
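The forward selection loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `forward_select` and `vote_accuracy`, the $-1/+1$ label encoding, and the use of majority-vote accuracy as the evaluation function are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Predictions = Sequence[int]  # one classifier's predicted labels on the pruning set


def forward_select(
    pool: List[Predictions],
    labels: Sequence[int],
    evaluate: Callable[[Predictions, List[Predictions], Sequence[int]], float],
    size: int,
) -> List[int]:
    """Greedy forward selection: grow the subensemble one classifier at a time."""
    selected: List[int] = []
    remaining = list(range(len(pool)))
    while remaining and len(selected) < size:
        current = [pool[i] for i in selected]
        # Pick the candidate that scores highest against the current subset.
        best = max(remaining, key=lambda i: evaluate(pool[i], current, labels))
        selected.append(best)
        remaining.remove(best)
    return selected


def vote_accuracy(candidate, current, labels):
    """Hypothetical evaluation function: majority-vote accuracy after adding
    the candidate (labels are -1/+1; a tied vote counts as an error)."""
    members = list(current) + [candidate]
    hits = 0
    for i, y in enumerate(labels):
        if sum(m[i] for m in members) * y > 0:
            hits += 1
    return hits / len(labels)
```

Any of the measures discussed in Section 2.2 could be passed as `evaluate` in place of `vote_accuracy`; only the scoring function changes, not the search loop.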

In backward elimination, the current classifier subset $S$ is initialized to the complete ensemble $H$, and the algorithm iteratively removes from $S$ the classifier $h \in S$ that optimizes the evaluation function. This function evaluates the removal of classifier $h$ from the current subset $S$ based on the pruning set. This approach has been used in the AID thinning and concurrency thinning algorithms [15].

In both cases, the traversal requires the evaluation of $T(T+1)/2$ subsets, leading to a time complexity of $O(T^2 g(T, N))$, where the term $g(T, N)$ concerns the complexity of the evaluation function, which is linear with respect to $N$ (the size of the pruning set) and ranges from constant to quadratic with respect to $T$ (the size of $H$), as we will see in the following sections.

##### 2.2. Evaluation Measure

Evaluation measures are the main component that differentiates DHCEP methods. They can be grouped into two major categories: those based on performance and those based on diversity.

The goal of performance-based measures is to find the model that maximizes the performance of the ensemble produced by adding (or removing) a model to (or from) the current ensemble. Their calculation depends on the method used for ensemble combination, which is usually voting. Accuracy was used as an evaluation measure by Margineantu and Dietterich [30] and by Fan et al. [25], while Caruana et al. [26] experimented with several measures, including accuracy, root mean squared error, mean cross-entropy, lift, precision/recall break-even point, precision/recall $F$-score, average precision, and ROC area. Another measure is benefit, which is based on a cost model and has been used by Fan et al. [25]. The calculation of performance-based metrics requires the decision of the ensemble on all examples of the pruning set. Therefore, the complexity of these measures is $O(|S| \cdot N)$, where $|S|$ is the size of the current ensemble and $N$ is the size of the pruning set. However, this complexity can be reduced to $O(N)$ if the predictions of the current ensemble are updated incrementally each time a classifier is added to (or removed from) it.
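The incremental update mentioned above can be illustrated with a small sketch. It assumes $-1/+1$ labels and simple majority voting; the class name `VoteTracker` and its methods are hypothetical. Keeping a running vote total per pruning example means that scoring a candidate costs one pass over the pruning set, independent of the subensemble size.

```python
from typing import List, Sequence


class VoteTracker:
    """Running vote totals of a subensemble on the pruning set (labels are -1/+1)."""

    def __init__(self, labels: Sequence[int]) -> None:
        self.labels = list(labels)
        self.votes: List[int] = [0] * len(self.labels)

    def add(self, preds: Sequence[int]) -> None:
        """Add one classifier's predictions to the totals in O(N)."""
        for i, p in enumerate(preds):
            self.votes[i] += p

    def remove(self, preds: Sequence[int]) -> None:
        """Remove one classifier's predictions from the totals in O(N)."""
        for i, p in enumerate(preds):
            self.votes[i] -= p

    def accuracy_if_added(self, preds: Sequence[int]) -> float:
        """Majority-vote accuracy if `preds` joined the subensemble, in O(N).
        A tied vote counts as an error."""
        hits = sum(
            1
            for v, p, y in zip(self.votes, preds, self.labels)
            if (v + p) * y > 0
        )
        return hits / len(self.labels)
```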

Ensemble diversity, that is, the difference among the individual learners, is a fundamental issue in ensemble methods. Intuitively, it is easy to understand that, in order to gain from a combination, individual learners must be different, and otherwise there would be no performance improvement if identical individual learners were combined.

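Let $h$ be a classifier and let $S$ be a subensemble. Partalas et al. [16, 18, 29] identify that the predictions of $h$ and $S$ on an instance $x_i$ with label $y_i$ fall into four cases: $e_{tf}$: $h(x_i) = y_i \wedge S(x_i) \neq y_i$; $e_{ft}$: $h(x_i) \neq y_i \wedge S(x_i) = y_i$; $e_{tt}$: $h(x_i) = y_i \wedge S(x_i) = y_i$; and $e_{ff}$: $h(x_i) \neq y_i \wedge S(x_i) \neq y_i$. They concluded that considering the four cases is crucial when designing an ensemble diversity measure. Many diversity measures consider some or all of the four cases, for example, complementariness [14] and concurrency [15]. The complementariness of $h$ with respect to $S$ and a pruning set $D_{pr}$ is calculated as

$$COM_{D_{pr}}(h, S) = \sum_{x_i \in D_{pr}} I(x_i \in e_{tf}), \tag{1}$$

where $I(\text{true}) = 1$, $I(\text{false}) = 0$. The complementariness is exactly the number of examples that are correctly classified by $h$ and incorrectly classified by $S$. The concurrency is defined as

$$CON_{D_{pr}}(h, S) = \sum_{x_i \in D_{pr}} \left( 2 I(x_i \in e_{tf}) + I(x_i \in e_{tt}) - 2 I(x_i \in e_{ff}) \right), \tag{2}$$

which is similar to the complementariness, with the difference that it considers two more cases and weights them.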
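A short sketch of the two measures, assuming the usual definitions in which complementariness counts the examples where the candidate is right and the subensemble is wrong, and concurrency weights the cases as $+2$ (candidate right, subensemble wrong), $+1$ (both right), and $-2$ (both wrong). The helper names are hypothetical, and the subensemble's decision on each example is passed in as a precomputed label.

```python
from typing import Sequence


def case(h_pred, s_pred, y) -> str:
    """Two letters: is the candidate correct (t/f), is the subensemble correct (t/f)."""
    return ("t" if h_pred == y else "f") + ("t" if s_pred == y else "f")


def complementariness(h_preds: Sequence, s_preds: Sequence, labels: Sequence) -> int:
    """Number of examples the candidate gets right and the subensemble gets wrong."""
    return sum(1 for h, s, y in zip(h_preds, s_preds, labels) if case(h, s, y) == "tf")


def concurrency(h_preds: Sequence, s_preds: Sequence, labels: Sequence) -> int:
    """Weighted count over the cases tf (+2), tt (+1), ff (-2); ft contributes 0."""
    weights = {"tf": 2, "tt": 1, "ff": -2, "ft": 0}
    return sum(weights[case(h, s, y)] for h, s, y in zip(h_preds, s_preds, labels))
```

For instance, with one example in each of the four cases, complementariness is 1 and concurrency is $2 + 1 + 0 - 2 = 1$.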

Unlike complementariness and concurrency, Partalas et al. [18] introduce a new metric called uncertainty weighted accuracy (UWA), which considers all four cases given above. UWA is defined as

$$UWA_{D_{pr}}(h, S) = \sum_{x_i \in D_{pr}} \left( NT_i \, I(x_i \in e_{tf}) - NF_i \, I(x_i \in e_{ft}) + NF_i \, I(x_i \in e_{tt}) - NT_i \, I(x_i \in e_{ff}) \right), \tag{3}$$

where $NT_i$ is the proportion of classifiers in the current ensemble $S$ that correctly predict $x_i$ and $NF_i = 1 - NT_i$ is the proportion of classifiers that incorrectly predict $x_i$. In addition to considering all four cases, UWA takes into account the strength of the decision of the current ensemble.
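The following sketch computes UWA under the assumption that it takes the four-term form in which each case is weighted by the proportion of subensemble members that are right ($NT_i$) or wrong ($NF_i$) on the example. The function name `uwa` is hypothetical, and the subensemble decision is taken by plurality vote; ties are resolved by `Counter.most_common`, which is a simplification for the example.

```python
from collections import Counter
from typing import List, Sequence


def uwa(h_preds: Sequence, member_preds: List[Sequence], labels: Sequence) -> float:
    """Uncertainty weighted accuracy of a candidate against a non-empty subensemble."""
    size = len(member_preds)
    total = 0.0
    for i, y in enumerate(labels):
        votes = [m[i] for m in member_preds]
        nt = sum(1 for v in votes if v == y) / size  # share of correct members
        nf = 1.0 - nt                                # share of incorrect members
        ensemble_pred = Counter(votes).most_common(1)[0][0]
        h_ok = h_preds[i] == y
        s_ok = ensemble_pred == y
        if h_ok and not s_ok:      # candidate right, ensemble wrong
            total += nt
        elif not h_ok and s_ok:    # candidate wrong, ensemble right
            total -= nf
        elif h_ok and s_ok:        # both right
            total += nf
        else:                      # both wrong
            total -= nt
    return total
```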

In this paper, we design a new measure for ensemble pruning via directed hill climbing by considering the margins of examples. More details are discussed in the following sections.

#### 3. Importance Assessment for Individual Classifiers

As one of the best off-the-shelf algorithms, AdaBoost demonstrates high generalization performance. To analyse this phenomenon theoretically, Schapire et al. [28] proposed the concept of the margin of an example. Let $D = \{(x_i, y_i)\}_{i=1}^{m}$ be the training set, where each example $x_i$ is associated with a label $y_i \in \{-1, +1\}$. Suppose that $H = \{h_1, \ldots, h_T\}$ is an ensemble with $T$ classifiers and that each member $h_t$ maps each example $x_i$ to a label; namely, $h_t(x_i) \in \{-1, +1\}$. Then the margin of $x_i$ is defined as

$$margin(x_i) = \frac{y_i \sum_{t=1}^{T} w_t h_t(x_i)}{\sum_{t=1}^{T} w_t}, \tag{4}$$

where $w_t \geq 0$ is the weight of the classifier $h_t$. Without loss of generality, normalizing $w_t$, $t = 1, \ldots, T$, such that $\sum_{t=1}^{T} w_t = 1$, (4) can be written as

$$margin(x_i) = y_i \sum_{t=1}^{T} w_t h_t(x_i). \tag{5}$$

From (5), the margin is a value in $[-1, 1]$; $x_i$ is on the border if $margin(x_i) = 0$; the absolute value of the margin is the confidence of the ensemble prediction on $x_i$; and $margin(x_i) > 0$ (or $margin(x_i) < 0$) indicates that the ensemble correctly (or incorrectly) classifies $x_i$.

Based on this concept, they proved that, for any $\delta > 0$ and $\theta > 0$, with probability at least $1 - \delta$ the generalization error $\epsilon$ is upper bounded by

$$\epsilon \leq P_{D}\left[margin(x) \leq \theta\right] + O\left(\frac{1}{\sqrt{m}} \left(\frac{d \log^2(m/d)}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right), \tag{6}$$

where $d$ is the complexity (VC dimension) of the base classifier and $m$ is the size of the training set. To further support the margin theory, Gao and Zhou [32] proposed the $k$th margin theory. Specifically, for any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of a training set of size $m$, the generalization error is upper bounded by the expression in (7), whose auxiliary term is defined in (8) [32].

From (6) and (7), when other variables are fixed, the larger the margin over the training examples, the better the generalization performance. Individual classifiers that correctly classify examples are therefore more important than those that classify them incorrectly, since the former help to increase the margins of the examples. In addition, we argue that it is more important to increase the margin of examples at the boundary (margin equal to zero), since adding a classifier to (or removing one from) the ensemble can make the ensemble classify these examples correctly. The proposed measure for ensemble pruning should therefore focus more on correct classifiers and on the examples lying near the boundary, and the importance of individual classifiers can be ordered as shown in Figure 2.

Based on this order of importance of individual classifiers, the margin-based measure is proposed in Section 4.
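As a small illustration of the margin concept for the two-class case, the sketch below computes the margin of one example from the members' $-1/+1$ predictions and their nonnegative weights, dividing by the total weight so that the result lies in $[-1, 1]$. The function name `margin` and its signature are choices made for this example.

```python
from typing import Sequence


def margin(preds: Sequence[int], weights: Sequence[float], y: int) -> float:
    """Margin of one example: weighted correct votes minus weighted incorrect
    votes, normalized by the total weight (labels and predictions are -1/+1)."""
    total = sum(weights)
    return y * sum(w * p for w, p in zip(weights, preds)) / total
```

A positive value means the weighted vote is correct, a negative value means it is wrong, and a value of zero places the example exactly on the decision boundary.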

#### 4. Margin-Based Measure

In this section, we propose a heuristic measure for evaluating the importance of individual classifiers based on the insight gained in Section 3: ensemble pruning via a directed hill climbing strategy should focus more on examples with small absolute margins as well as on classifiers that correctly classify more examples. Several methods use a different approach to calculate diversity during the search.

##### 4.1. Measure for Two-Class Problem

For simplicity of presentation, this section focuses on forward ensemble pruning: given an ensemble subset $S$, which is initialized to be empty, we iteratively add into $S$ a classifier $h \in H \setminus S$. Here, the symbols are the same as in Section 3. Assuming that ensembles use simple majority voting to obtain predictions, the margin of an example $x_i$ with respect to the ensemble $S$ is

$$margin(x_i) = \frac{1}{|S|} \sum_{h_t \in S} y_i h_t(x_i), \tag{9}$$

where $|S|$ is the size of the ensemble $S$. From (9), $1/|S|$ is the weight of each classifier $h_t \in S$, and $y_i h_t(x_i)$ is the margin contribution of $h_t$ on the example $x_i$.

The proposed measure, the margin-based measure (MM), of classifier $h$ with respect to ensemble $S$ and the pruning set $D_{pr}$ is defined in (10) in terms of $MM_{x_i}(h, S)$, the margin-based measure of $h$ with respect to the subensemble $S$ and the current example $x_i$. This per-example term is defined in (11), where the constant parameter $\alpha$ prevents the denominator from being zero. Since $margin(x_i) \in [-1, 1]$, the absolute margin $|margin(x_i)|$ lies in $[0, 1]$, and therefore the weight assigned to $x_i$ stays bounded for any $\alpha > 0$. In (11), $y_i h(x_i)$ is exactly the margin contribution of $h$ on the example $x_i$, and the remaining factor is the weight of $x_i$.

The rationale of the proposed measure is as follows:

(i) If the individual classifier $h$ correctly classifies the example $x_i$, $h$ increases the margin of $x_i$ by the amount given in (12) and thus favors correctly classifying $x_i$; namely, $MM_{x_i}(h, S) > 0$ (refer to (10)). If $h$ incorrectly classifies $x_i$, the prediction of $h$ reduces the margin of $x_i$ by exactly the amount given in (13) and is thus harmful to correctly classifying $x_i$; namely, $MM_{x_i}(h, S) < 0$ (refer to (10)).

(ii) From the discussion in Section 3, $|margin(x_i)|$ reflects the confidence with which $S$ correctly (or incorrectly) classifies the example $x_i$. If $|margin(x_i)|$ is very small (e.g., equal to 0), namely, $S$ classifies $x_i$ with low confidence, adding $h$ into $S$ may change the prediction of $S$ on the example $x_i$, and therefore the weight of $x_i$ is large. On the other hand, if $|margin(x_i)|$ is very large (e.g., equal to 1), namely, $S$ classifies $x_i$ with high confidence, adding $h$ into $S$ cannot change the prediction of $S$ on the example $x_i$, and therefore the weight of $x_i$ is small.

The time complexity of calculating (10) or (16) is $O(|S| \cdot N)$, which can be reduced to $O(N)$ by incrementally updating the margins of the examples each time a classifier is added to or removed from $S$, where $N$ is the number of examples in the pruning set. Therefore, the time complexity of ensemble pruning via a directed hill climbing strategy based on the proposed measure is no more than $O(T^2 N)$, where $T$ is the size of the original ensemble learned from the training set.

In this way, the proposed measure focuses more on correct classifiers and the examples lying near the boundaries, which coincides with the conclusions in Section 3.
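The behaviour described in this section can be illustrated with a sketch. The exact formula of the measure is not reproduced in the text, so the form used here is an assumption chosen to match the description: each example contributes $y_i h(x_i) / (|margin(x_i)| + \alpha)$, and the contributions are summed over the pruning set. The function name `margin_based_measure` and the default value of `alpha` are likewise hypothetical.

```python
from typing import List, Sequence


def margin_based_measure(
    h_preds: Sequence[int],
    member_preds: List[Sequence[int]],
    labels: Sequence[int],
    alpha: float = 0.1,
) -> float:
    """Assumed form: sum over examples of y*h(x) / (|margin(x)| + alpha),
    with the margin taken under simple majority voting (-1/+1 labels)."""
    size = len(member_preds)
    score = 0.0
    for i, y in enumerate(labels):
        # Margin of the current subensemble; an empty subensemble has margin 0.
        m = sum(y * p[i] for p in member_preds) / size if size else 0.0
        score += y * h_preds[i] / (abs(m) + alpha)
    return score
```

Under this assumed form, a candidate that is right on a boundary example and wrong on a confidently classified one scores higher than a candidate with the opposite pattern, which is the ordering argued for in Section 3.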

##### 4.2. Measure for Multiclass Problem

For multiclass classification problem, (11) should be extended so that the proposed measure defined by (10) can deal with the problem.

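Let each member $h_t$ of $S$ map an example $x_i$ to a label $y \in \{1, \ldots, L\}$; namely, $h_t(x_i) \in \{1, \ldots, L\}$, and let

$$v_l^{(i)} = \sum_{h_t \in S} I(h_t(x_i) = l), \qquad v_{max}^{(i)} = \max_{l} v_l^{(i)}, \tag{14}$$

where $v_l^{(i)}$ is the number of votes on the $l$th label of example $x_i$ of an ensemble combined by majority voting; $v_{max}^{(i)}$ is the number of majority votes on the example $x_i$; $v_{sec}^{(i)}$ is the second largest number of votes on the example $x_i$; and $v_{y_i}^{(i)}$ is the number of votes on label $y_i$.

From [28], for multiclass problems, the margin of an example $x_i$ is defined as the difference between the number of correct votes and the maximum number of votes received by any incorrect label; namely,

$$margin(x_i) = v_{y_i}^{(i)} - \max_{l \neq y_i} v_l^{(i)}. \tag{15}$$

Equivalently, the margin is $v_{max}^{(i)} - v_{sec}^{(i)}$ when the ensemble classifies $x_i$ correctly and $v_{y_i}^{(i)} - v_{max}^{(i)}$ otherwise. Combining (11) and (15) results in (16), where $D_{tt}$ (or $D_{ft}$) is the set of examples that are correctly (or incorrectly) classified by the current classifier $h$ and correctly classified by the ensemble $S$; similarly, $D_{tf}$ (or $D_{ff}$) is the set of examples that are correctly (or incorrectly) classified by $h$ and incorrectly classified by $S$. Formally,

$$\begin{aligned} D_{tt} &= \{x_i \mid h(x_i) = y_i \wedge S(x_i) = y_i\}, & D_{ft} &= \{x_i \mid h(x_i) \neq y_i \wedge S(x_i) = y_i\}, \\ D_{tf} &= \{x_i \mid h(x_i) = y_i \wedge S(x_i) \neq y_i\}, & D_{ff} &= \{x_i \mid h(x_i) \neq y_i \wedge S(x_i) \neq y_i\}. \end{aligned} \tag{17}$$

In this way, $MM_{x_i}(h, S)$ and thus $MM_{D_{pr}}(h, S)$ (the proposed measure) focus more on correct classifiers and the examples lying near the boundaries, which coincides with the conclusions in Section 3.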
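The multiclass margin described above (correct votes minus the largest number of votes for any incorrect label) can be computed directly from the members' votes. This is a minimal sketch with a hypothetical function name; it returns the unnormalized vote difference, and dividing by the number of voters would rescale it to $[-1, 1]$.

```python
from collections import Counter
from typing import Hashable, Sequence


def multiclass_margin(votes: Sequence[Hashable], y: Hashable) -> int:
    """Votes for the true label minus the largest vote count of any other label."""
    counts = Counter(votes)
    correct = counts.get(y, 0)
    best_wrong = max((c for label, c in counts.items() if label != y), default=0)
    return correct - best_wrong
```

With votes `["a", "a", "b", "c"]`, the margin is $2 - 1 = 1$ if the true label is `"a"` and $1 - 2 = -1$ if it is `"b"`.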

##### 4.3. Discussion

Unlike other measures where each classifier is independently evaluated, the proposed margin-based measure uses a more global evaluation. Indeed, this criterion involves instance margin values that result from a majority voting of the whole ensemble. Thus, the proposed measure is not only based on individual properties of ensemble members (e.g., accuracy of individual learners). It also takes into account some form of complementarity of classifiers.

From (11), our margin-based measure considers both the correctness of the predictions of the current classifier and the confidence of the ensemble's prediction. The measure therefore deliberately favors classifiers that perform better on low-margin samples. It is thus a boosting-like strategy that aims to increase the performance on low-margin instances, so our selection strategy leads to a subset of classifiers with a potentially improved capability to classify complex data in general and border data in particular. Consequently, it tends to select a subset of learners that handle minority classes well.

From (16), our measure considers the diversity among ensemble members. The measure therefore accounts not only for the correctness of classifiers but also for the diversity of ensemble members, which is why using it to prune an ensemble leads to significantly better accuracy.

#### 5. Experiments

This section first introduces the experimental setting and the characteristics of the data sets used in this paper and then reports the comparison of measures for guiding ensemble pruning.

##### 5.1. Data Sets and Experimental Setup

We randomly selected 18 data sets from the UCI repository [33]. Each data set was randomly divided into three subsets of equal size: one used as the training set, one as the testing set, and the other as the pruning set. Rotating these three roles among the subsets gives six possible assignments, so we conducted six trials for each split of a data set. We repeated the experiments 50 times and thus conducted a total of 300 trials on each data set. The details of these data sets are summarized in Table 1, where #Insts, #Attrs, and #Cls are the size, the number of attributes, and the number of classes of the corresponding data sets, respectively.
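The trial count can be reproduced with a short sketch. It assumes that the six trials per split arise from the $3! = 6$ ways of assigning the training, pruning, and testing roles to the three subsets; the function name `six_trials` and the seeding scheme are hypothetical.

```python
import random
from itertools import permutations
from typing import Dict, List


def six_trials(n_examples: int, seed: int = 0) -> List[Dict[str, List[int]]]:
    """Split example indices into three equal parts and enumerate every
    assignment of the train/prune/test roles to those parts."""
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)
    third = n_examples // 3
    parts = [idx[:third], idx[third:2 * third], idx[2 * third:]]
    return [
        {"train": parts[a], "prune": parts[b], "test": parts[c]}
        for a, b, c in permutations(range(3))
    ]
```

Calling `six_trials` with 50 different seeds yields the 300 trials per data set described above.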

We evaluated the performance of the proposed margin-based measure (MM) using forward ensemble selection, with complementariness (COM) [14], concurrency (CON) [15], and uncertainty weighted accuracy (UWA) [18] as the compared measures. In each trial, a bagging [3] ensemble with 200 base classifiers was trained, where the base classifier was J48, a Java implementation of C4.5 [34] from Weka [35]. For simplicity, we use MM, COM, CON, and UWA to denote the pruning algorithms guided by the corresponding measures.

##### 5.2. Accuracy Performance versus the Size of Subensemble

The goal of this experiment was to evaluate the performance of MM by comparing it with UWA, CON, and COM. The experimental results on the 18 tested data sets fall into three cases: MM outperforms UWA, CON, and COM; MM performs comparably to one or more of them and outperforms the others; and MM is outperformed by one or more of them. The first case contains 13 data sets, the second case contains two data sets, and the last case contains three. Figures 3, 4, and 5 show representative results from the three cases.

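**Figure 3.** Accuracy versus subensemble size for the first case: (a) Audiology, (b) Autos, (c) Car, (d) Glass, (e) Segment, (f) Wine.

**Figure 4.** Accuracy versus subensemble size for the second case: (a) Flags, (b) Irish.

**Figure 5.** Accuracy versus subensemble size for the third case: (a) Horse colic, (b) Labor.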

Figure 3 reports the accuracy curves of the four compared measures for six representative data sets that fall into the first case. Results in the figure are reported as average accuracy curves with regard to the number of classifiers, where the horizontal axis is the size of the subensembles, growing gradually from 5 to 200 with step 1, and the vertical axis is the average accuracy over 300 trials. For clarity, the standard deviations are not shown in the figure. The accuracy curves for data sets “audiology,” “autos,” “car,” “glass,” “segment,” and “wine” are reported in Figures 3(a), 3(b), 3(c), 3(d), 3(e), and 3(f), respectively. Figure 3(a) shows that, as the number of aggregated classifiers increases, the accuracy curves of the subensembles selected by MM, UWA, CON, and COM increase rapidly, reach a maximum in the intermediate steps of aggregation that is higher than the accuracy of the whole original ensemble, and then drop until the accuracy is the same as that of the whole ensemble. The remaining five data sets, “autos,” “car,” “glass,” “segment,” and “wine” (shown in Figures 3(b), 3(c), 3(d), 3(e), and 3(f), resp.), have accuracy curves similar to those of “audiology.”

##### 5.3. Summary of Experimental Results

Table 2 summarizes the accuracy of the 300 trials for each data set, where the value in each pair of parentheses is the rank of the compared method and the last row is the average rank. The rank of an algorithm is defined as follows: on one data set, the best performing algorithm gets the rank of 1.0, the second best one gets the rank of 2.0, and so on. In the case of ties, average ranks are assigned [36, 37]. The experimental results in Section 5.2 empirically show that MM, UWA, CON, and COM generally reach maximum accuracy when the size of the subensembles is between 20 and 40 (using forward selection for ensemble pruning). Therefore, the subensembles formed by MM with 30 of the original ensemble members are compared with subensembles of the same size formed by UWA, CON, and COM.
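The ranking rule with ties can be sketched as follows. The function name `average_ranks` is hypothetical, and the accuracies in the usage note are illustrative values rather than entries of Table 2.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank 1.0 goes to the highest score; tied scores share the average
    of the rank positions they occupy."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks
```

For example, `average_ranks([0.9, 0.8, 0.9, 0.7])` gives `[1.5, 3.0, 1.5, 4.0]`: the two methods tied for first place share ranks 1 and 2.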

As shown in Table 2, MM outperforms bagging on all the 18 data sets, which indicates that MM efficiently performs ensemble pruning by achieving better predictive accuracies with small subensembles. Table 2 also shows that MM ranks first on 14 out of the 18 data sets and its average rank is 1.33, followed by CON with an average rank of 2.75, COM with an average rank of 2.91, UWA (3.17), and bagging (4.83).

As aforementioned, the backward elimination is another directed hill climbing strategy for ensemble pruning. From experimental results, we observe that performance based on backward elimination strategy is similar to that based on forward selection strategy, and therefore we only present the mean accuracy and ranks of MM, UWA, CON, COM with 30 base classifiers, and bagging (the original ensemble). The corresponding results are illustrated in Table 3. From the table, MM ranks first on 12 data sets and its average rank is 1.42, followed by CON with an average rank of 2.69, COM (2.83), UWA (3.22), and bagging (4.78).

Table 4 shows a summary of the comparisons among the methods, where the pruning methods with “-F” use forward selection to prune the ensemble and the pruning methods with “-B” use backward elimination. The size of each subensemble selected by these ensemble pruning methods is 30. The entry $a_{ij}$ displays the number of times the method of column $j$ has a better result than the method of row $i$. The number in parentheses shows how many of these differences are statistically significant according to pairwise $t$-tests at the 95% confidence level. For example, MM-F has been better than CON-F with pruned trees in 16 of the 18 comparisons and worse in 2. The number in parentheses shows that, in 14 cases, the difference in favor of MM-F has been statistically significant; hence, the value in row 3, column 1 of the table is 16 (14).

Table 5 shows the ranking of the compared methods according to the significant differences between their performances, using pairwise $t$-tests at the 95% confidence level. Here, we use all pairwise comparisons as summarized in Table 4. For example, the sum of the numbers in parentheses in the column corresponding to MM-F in Table 4 is 94 (significant wins), and the sum of the numbers in parentheses in the row corresponding to MM-F is 10 (significant losses). Their difference, $94 - 10 = 84$, is the nondominance ranking score of MM-F in Table 5.
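The score computation can be sketched as follows, assuming a square matrix whose entry in row $i$ and column $j$ counts the significant wins of method $j$ over method $i$, as in the description of Table 4. The function name `dominance_scores` is hypothetical, and the matrix in the usage note is a toy example, not the content of Table 4.

```python
from typing import List, Sequence


def dominance_scores(sig_wins: Sequence[Sequence[int]]) -> List[int]:
    """Score of method k = significant wins (column sum) minus
    significant losses (row sum)."""
    n = len(sig_wins)
    scores = []
    for k in range(n):
        wins = sum(sig_wins[i][k] for i in range(n))
        losses = sum(sig_wins[k][j] for j in range(n))
        scores.append(wins - losses)
    return scores
```

With the toy matrix `[[0, 2, 1], [5, 0, 3], [7, 4, 0]]`, the scores are `[9, -2, -7]`; they always sum to zero because every significant win of one method is a significant loss of another.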

Tables 4 and 5 demonstrate the significant advantage of MM compared with the best benchmark classifier ensemble methods: CON, COM, and bagging. Besides, compared with ensemble pruning methods using backward elimination, the ones with forward selection show better performance.

#### 6. Conclusion

In this paper, we analysed the importance of individual classifiers with respect to the whole ensemble using margin theory and concluded that ensemble pruning via a directed hill climbing strategy should focus more on correct classifiers and on the examples lying near the boundary. Based on the derived general principles, we proposed a criterion called the margin-based measure to explicitly evaluate the importance of individual classifiers. Experimental comparisons on 18 UCI data sets showed that the proposed measure outperforms other state-of-the-art measures and the original ensemble.

The proposed metric in this paper can apply not only to ensemble pruning based on directed hill climbing search but also to other ensemble pruning methods. Therefore, more experiments will be conducted to evaluate the performance of the proposed measure.

#### Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

#### Acknowledgments

This work is in part supported by the National Natural Science Foundation of China (Grants no. 61472370, no. 61202207, no. 61501393, no. 61402393, and no. 61572417), Project of Science and Technology Department of Henan Province (no. 162102210310, no. 152102210129, and no. 142400410486), and Science and Technology Research key Project of the Education Department of Henan Province (Grant no. 15A520026).