Abstract

A forest is an ensemble whose members are decision trees. This paper proposes a novel strategy for pruning a forest in order to enhance its generalization ability and reduce its size. Unlike conventional ensemble pruning approaches, the proposed method evaluates the importance of tree branches with respect to the whole ensemble using a newly proposed metric called importance gain. The importance of a branch is defined by considering both ensemble accuracy and the diversity of ensemble members, so the metric reasonably estimates how much the ensemble accuracy can improve when a branch is pruned. Our experiments show that the proposed method can significantly reduce ensemble size and improve ensemble accuracy, no matter whether the ensembles are constructed by an algorithm such as bagging or obtained by an ensemble selection algorithm, and no matter whether each decision tree is pruned or unpruned.

1. Introduction

Ensemble learning is a very important research topic in machine learning and data mining. The basic heuristic is to create a set of learners and aggregate their predictions when classifying examples. Many approaches such as bagging [1], boosting [2], and COPEN [3] have been proposed to create ensembles, and the key to the success of these approaches is that the base learners are accurate and diverse [4].

Ensemble methods have been applied to many applications such as image detection [5–7] and the imbalanced learning problem [8]. However, an important drawback of ensemble learning approaches is that they tend to train unnecessarily large ensembles. Large ensembles require a large amount of memory to store the base learners and a long response time for prediction. Besides, a large ensemble may reduce generalization ability instead of improving performance [9]. Therefore, much research has been carried out to tackle this problem, mainly focusing on ensemble selection, that is, selecting a subset of ensemble members for prediction, such as ordering-based ensemble selection methods [10–12] and greedy-heuristic-based ensemble selection methods [13–21]. The research results indicate that a well-designed ensemble selection method can reduce ensemble size and improve ensemble accuracy.

Besides ensemble selection, if the ensemble members are decision trees, we can prune an ensemble through the following two approaches: (1) pruning individual members separately and then combining the pruned members for prediction, and (2) repeatedly pruning individual members while considering the overall performance of the ensemble. For the first strategy, many decision tree pruning methods, such as those used in CART [22] and C4.5 [23], have been studied. Although pruning can simplify model structure, whether pruning can improve model accuracy is still a controversial topic in machine learning [24]. The second strategy coincides with the expectation of improving model generalization ability globally. However, this strategy has not been extensively studied. This paper focuses on it and names it forest pruning (FP).

The major task of forest pruning is to define an effective metric that evaluates the importance of a branch with respect to the whole forest. Traditional metrics cannot be applied to forest pruning, since they only consider the influence on a single decision tree when a branch is pruned. Therefore, we need a new metric for pruning forests. Our contributions in this paper are as follows:
(i) We introduce a new ensemble pruning strategy to prune decision-tree-based ensembles;
(ii) we propose a novel metric to measure the improvement of forest performance when a certain node grows into a subtree;
(iii) we present a new ensemble pruning algorithm that uses the proposed metric to prune a decision-tree-based ensemble, where the ensemble can be learned by a certain algorithm or obtained by some ensemble selection method, and each decision tree can be pruned or unpruned.

Experimental results show that the proposed method can significantly reduce the ensemble size and improve its accuracy. This result indicates that the metric proposed in this paper reasonably measures the influence on ensemble accuracy when a certain node grows into a subtree.

The rest of this paper is structured as follows. Section 2 provides a survey of ensembles of decision trees; Section 3 presents the formal description of forest pruning and motivates this study with an example; Section 4 introduces the new forest pruning algorithm; Section 5 reports and analyzes the experimental results; and Section 6 concludes the paper with brief remarks and future work.

2. Forests

A forest is an ensemble whose members are learned by a decision tree learning method. Two kinds of approaches are often used to train a forest: traditional ensemble approaches and methods specially designed for forests.

Bagging [1] and boosting [2] are the two most often used traditional methods for building forests. Bagging takes bootstrap samples of objects and trains a tree on each sample. The classifier votes are combined by majority voting. In some implementations, classifiers produce estimates of the posterior probabilities for the classes; these probabilities are averaged across the classifiers and the most probable class is assigned, which is called “average” or “mean” aggregation of the outputs. Bagging with average aggregation is implemented in Weka and is used in the experiments in this paper. Since each individual classifier is trained on a bootstrap sample, the data distribution seen during training is similar to the original distribution. Thus, the individual classifiers in a bagging ensemble have relatively high classification accuracy. The factor encouraging diversity between these classifiers is the proportion of different examples in the training set. Boosting is a family of methods, and AdaBoost is its most prominent member. The idea is to boost the performance of a “weak” classifier (which can be a decision tree) by using it within an ensemble structure. The classifiers in the ensemble are added one at a time so that each subsequent classifier is trained on data which have been “hard” for the previous ensemble members. A set of weights is maintained across the objects in the data set so that objects that have been difficult to classify acquire more weight, forcing subsequent classifiers to focus on them.
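As a concrete illustration of bagging with average aggregation, the following minimal Python sketch (our own, not the Weka implementation used in the paper; it assumes X and y are NumPy arrays with integer labels 0..n_classes-1) trains trees on bootstrap samples and averages their class-probability estimates:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_forest(X, y, n_trees=30, seed=0):
    """Bagging: train one decision tree per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap sample (with replacement)
        forest.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return forest

def predict_average(forest, X, n_classes):
    """'Average' aggregation: mean of the trees' class-probability estimates."""
    probs = np.zeros((len(X), n_classes))
    for tree in forest:
        # Align each tree's predict_proba columns to the global label set,
        # since a bootstrap sample may miss some classes.
        probs[:, tree.classes_.astype(int)] += tree.predict_proba(X)
    probs /= len(forest)
    return probs.argmax(axis=1), probs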

Random forest [25] and rotation forest [26] are two important approaches specially designed for building forests. Random forest is a variant of bagging. The forest is again built on bootstrap samples. The difference lies in the construction of the decision tree: the feature used to split a node is selected as the best feature among a small set of randomly chosen features, whose size is a parameter of the algorithm. This small alteration appeared to be a winning heuristic in that diversity was introduced without greatly compromising the accuracy of the individual classifiers. Rotation forest randomly splits the feature set into a number of subsets (the number of subsets is a parameter of the algorithm), and Principal Component Analysis (PCA) [27] is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, axis rotations take place to form the new features, and rotation forest builds each tree on the whole training set in the new space defined by the rotated features.
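For reference, in scikit-learn's RandomForestClassifier the size of the random feature subset considered at each split is exposed via the max_features parameter; a brief sketch with arbitrary parameter values (our own illustration, not the implementation evaluated in this paper):

from sklearn.ensemble import RandomForestClassifier

# Bootstrap samples plus a random subset of 3 candidate features at each split.
rf = RandomForestClassifier(n_estimators=100, max_features=3, bootstrap=True)
# rf.fit(X_train, y_train); proba = rf.predict_proba(X_test)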

3. Problem Description and Motivation

3.1. Problem Description

Let $D$ be a data set, and let $F = \{T_1, T_2, \ldots, T_m\}$ be an ensemble with $m$ decision trees learned from $D$. Denote by $v$ a node in tree $T_i$ and by $D_v$ the set of the examples reaching $v$ from the root of $T_i$, $1 \le i \le m$. Suppose each node $v$ contains a vector $P_v = (p_v^1, p_v^2, \ldots, p_v^K)$, where $p_v^k$ is the proportion of the examples in $D_v$ associated with label $k$ and $K$ is the number of class labels. If $v$ is a leaf and $x \in D_v$, the prediction of $T_i$ on $x$ is $T_i(x) = \arg\max_k p_v^k$. Similarly, for each example $x$ to be classified, ensemble $F$ returns a vector $P(x) = (p^1(x), p^2(x), \ldots, p^K(x))$ indicating that $x$ belongs to label $k$ with probability $p^k(x)$, where
$$p^k(x) = \frac{1}{m} \sum_{i=1}^{m} p_{v_i(x)}^k,$$
and $v_i(x)$ denotes the leaf of $T_i$ that $x$ reaches. The prediction of $F$ on $x$ is $F(x) = \arg\max_k p^k(x)$.

Now, our problem is the following: given a forest $F$ with $m$ decision trees, how to prune each tree so as to reduce $F$'s size and improve its accuracy, where $F$ is either constructed by some ensemble learning algorithm or obtained by some ensemble selection method.

3.2. Motivation

First, let us look at an example which shows the possibility that forest pruning can improve ensemble accuracy.

Example 1. Let $F = \{T_1, T_2, \ldots, T_{10}\}$ be a forest with ten decision trees, where $T_1$ is shown in Figure 1. Suppose the class-proportion vectors of node $v$ of $T_1$ and of its two leaf children are as given in Figure 1. Let ten examples $x_1, x_2, \ldots, x_{10}$ reach node $v$, where some of them are associated with label 1 and the others with label 2, and assume they are distributed between the two leaf children as indicated in the figure.
Obviously, if only $T_1$ is considered, we cannot prune the children of node $v$, since treating $v$ as a leaf would lead to more examples being incorrectly classified by $T_1$.
Assume that $F$'s predictions on $x_1, \ldots, x_{10}$ are given as probability vectors, where each entry is the probability of an example being associated with a label, and that under these predictions one of the ten examples is incorrectly classified by $F$. Update $T_1$ to $T_1'$ by pruning $v$'s children and update $F$ to $F'$ accordingly. A simple calculation of the averaged probability vectors then shows that $F'$ correctly classifies all ten examples.

This example shows that if a single decision tree is considered in isolation, it perhaps should not be pruned any further. However, for the forest as a whole, it is still possible to prune some branches of that decision tree, and doing so will probably improve the ensemble accuracy instead of reducing it.
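The mechanism behind Example 1 can be reproduced with a toy calculation. The sketch below uses purely hypothetical probability vectors (not the values of Figure 1) for a single example with true label 1: nine trees mildly favor the correct class, the tenth tree's leaf below $v$ is confidently wrong, and replacing that leaf's vector by node $v$'s own vector flips the averaged ensemble vote:

import numpy as np

# Hypothetical 2-class setting, one example whose true label index is 0.
other_trees = np.tile([0.51, 0.49], (9, 1))   # nine trees, mildly correct
t1_leaf     = np.array([0.05, 0.95])          # T_1 before pruning: leaf below v, confidently wrong
t1_node_v   = np.array([0.55, 0.45])          # T_1 after pruning: node v treated as a leaf

before = np.vstack([other_trees, t1_leaf]).mean(axis=0)     # ensemble vector with original T_1
after  = np.vstack([other_trees, t1_node_v]).mean(axis=0)   # ensemble vector with pruned T_1
print(before.argmax(), after.argmax())                      # 1 (wrong) -> 0 (correct)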

Although the example above is artificially constructed, similar cases can be seen everywhere when we study ensembles further. It is this observation that motivates us to study forest pruning methods. However, more effort is needed to turn this possibility into feasibility. Further discussion of this problem is presented in the next section.

4. Forest Pruning Based on Branch Importance

4.1. The Proposed Metric and Algorithm Idea

To avoid getting trapped in details too early, we assume for now that $Imp(v, x)$ has been defined, which is the importance of node $v$ when forest $F$ classifies example $x$. If $x$ does not reach $v$, then $Imp(v, x) = 0$; otherwise, the details of the definition of $Imp(v, x)$ are presented in Section 4.2.

Let $T_i \in F$ be a tree and let $v$ be a node of $T_i$. The importance of $v$ with respect to forest $F$ is defined as
$$Imp(v) = \sum_{x \in D_v^{pr}} Imp(v, x),$$
where $D^{pr}$ is a pruning set and $D_v^{pr}$ is the set of the examples in $D^{pr}$ reaching node $v$ from the root of $T_i$. $Imp(v)$ reflects the impact of node $v$ on $F$'s accuracy.

Let $L(B_v)$ be the set of leaf nodes of $B_v$, the branch (subtree) with $v$ as its root. The contribution of $B_v$ to $F$ is defined as
$$Con(B_v) = \sum_{u \in L(B_v)} Imp(u),$$
which is the sum of the importance of the leaves in $B_v$.

Let $v$ be a nonterminal node. The importance gain of $v$ with respect to $F$ is defined as the importance difference between $B_v$ and node $v$, that is,
$$IG(v) = Con(B_v) - Imp(v).$$
$IG(v)$ can be considered as the importance gain of $v$, and its value reflects how much improvement of the ensemble accuracy is achieved when $v$ grows into a subtree. If $IG(v) > 0$, then this expansion helps to improve $F$'s accuracy; otherwise, it does not help or even reduces $F$'s accuracy.
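To make the relationship between $Imp(v)$, $Con(B_v)$, and $IG(v)$ concrete, here is a small Python sketch built on a minimal node structure of our own (the field names imp and con are assumptions, chosen to match the variables used in Algorithm 1 below), using the importance-gain definition as reconstructed above:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    imp: float = 0.0   # Imp(v): importance summed over pruning-set examples reaching v
    con: float = 0.0   # Con(B_v): filled in by the pruning procedure (Section 4.3)

def leaves(v: Node) -> List[Node]:
    """Leaf nodes of the branch B_v rooted at v."""
    if not v.children:
        return [v]
    return [leaf for c in v.children for leaf in leaves(c)]

def contribution(v: Node) -> float:
    """Con(B_v): sum of the importance of the leaves of B_v."""
    return sum(leaf.imp for leaf in leaves(v))

def importance_gain(v: Node) -> float:
    """IG(v) = Con(B_v) - Imp(v): what is gained by letting v grow into B_v."""
    return contribution(v) - v.imp

# Hypothetical branch: node v with two leaf children.
v = Node(children=[Node(imp=0.3), Node(imp=-0.1)], imp=0.05)
print(importance_gain(v))   # ~0.15 > 0, so expanding v helps the ensemble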

The idea of the proposed method for pruning an ensemble of decision trees is as follows. For each nonterminal node $v$ in each tree $T_i$, calculate its importance gain $IG(v)$ on the pruning set. If $IG(v)$ is smaller than a threshold, prune $B_v$ and treat $v$ as a leaf. This procedure continues until no decision tree can be pruned further.

Before presenting the specific details of the proposed algorithm, we introduce how to calculate $Imp(v, x)$ in the next subsection.

4.2. Calculation

Let $h$ be a classifier and let $F$ be an ensemble. Partalas et al. [28, 29] identified that the predictions of $h$ and $F$ on an example $x$ with true label $y$ can be categorized into four cases: $e_{tf}$: $h(x) = y$ and $F(x) \ne y$; $e_{tt}$: $h(x) = y$ and $F(x) = y$; $e_{ft}$: $h(x) \ne y$ and $F(x) = y$; $e_{ff}$: $h(x) \ne y$ and $F(x) \ne y$. They concluded that considering all four cases is crucial for designing ensemble diversity metrics.

Based on the four cases above, Lu et al. [11] introduced a metric to evaluate the contribution of the $i$th classifier to $F$ when $F$ classifies the $j$th instance. Partalas et al. [28, 29] introduced a measure called Uncertainty Weighted Accuracy (UWA) to evaluate a classifier's contribution when $F$ classifies example $x$.

Similar to the discussion above, we define four cases for a node $v$ and the ensemble $F$ on an example $x_j$ whose true label is $y_j$:
$$e_{tf}: v(x_j) = y_j,\ F(x_j) \ne y_j;\quad e_{tt}: v(x_j) = y_j,\ F(x_j) = y_j;\quad e_{ft}: v(x_j) \ne y_j,\ F(x_j) = y_j;\quad e_{ff}: v(x_j) \ne y_j,\ F(x_j) \ne y_j. \qquad (8)$$
In the following discussion, we assume that example $x_j$ reaches node $v$. Let $s$ and $t$ be the subscripts of the largest and the second largest elements in $P(x_j)$, respectively. Obviously, $s$ is the label of $x_j$ predicted by ensemble $F$. Similarly, let $s_v$ be the subscript of the largest element in $P_v$. If $v$ is a leaf node, then $s_v$ is the label of $x_j$ predicted by decision tree $T_i$; otherwise, $s_v$ is the label of $x_j$ predicted by $T_i'$, where $T_i'$ is the decision tree obtained from $T_i$ by pruning $B_v$. For simplicity, we call $s_v$ the label of $x_j$ predicted by node $v$ and say that node $v$ correctly classifies $x_j$ if $s_v = y_j$.

We define $Imp(v, x_j)$ for each of the four cases in formula (8), respectively. If the case is $e_{tf}$ or $e_{tt}$, then $Imp(v, x_j) \ge 0$, since $v$ correctly classifies $x_j$; otherwise, $Imp(v, x_j) \le 0$, since $v$ incorrectly classifies $x_j$.

For case $e_{tf}$, $Imp(v, x_j)$ is defined as
$$Imp(v, x_j) = w_{tf}(x_j)\,\frac{p_v^{y_j} - p_v^{s}}{m}, \qquad (9)$$
where $m$ is the number of base classifiers in $F$ and $w_{tf}(x_j) > 0$ is a weight computed from $F$'s prediction probabilities on $x_j$. Here $v$ correctly classifies $x_j$ while $F$ does not, so $p_v^{y_j} \ge p_v^{s}$ and thus $Imp(v, x_j) \ge 0$. Since $p_v^{y_j}/m$ is the contribution of node $v$ to $p^{y_j}(x_j)$, the probability that $F$ correctly predicts $x_j$ as belonging to class $y_j$, while $p_v^{s}/m$ is the contribution of node $v$ to $p^{s}(x_j)$, the probability that $F$ incorrectly predicts $x_j$ as belonging to class $s$, $(p_v^{y_j} - p_v^{s})/m$ can be considered as the net contribution of node $v$ when $F$ classifies $x_j$. $w_{tf}(x_j)$ is the weight of $v$'s net contribution, which reflects the importance of node $v$ for classifying $x_j$ correctly. A small constant is included in the weight to avoid it being zero or too small.

For case $e_{tt}$, $Imp(v, x_j)$ is defined as
$$Imp(v, x_j) = w_{tt}(x_j)\,\frac{p_v^{y_j} - p_v^{t}}{m}. \qquad (10)$$
Here $s = y_j$. In this case, both $v$ and $F$ correctly classify $x_j$. We treat $(p_v^{y_j} - p_v^{t})/m$ as the net contribution of node $v$ to $p^{y_j}(x_j)$ and $p^{t}(x_j)$, and $w_{tt}(x_j)$ as the weight of $v$'s net contribution.

For case $e_{ft}$, $Imp(v, x_j)$ is defined as
$$Imp(v, x_j) = w_{ft}(x_j)\,\frac{p_v^{y_j} - p_v^{s_v}}{m}. \qquad (11)$$
It is easy to prove that $Imp(v, x_j) \le 0$. This case is the opposite of the first case. Here we treat $(p_v^{y_j} - p_v^{s_v})/m$ as the net contribution of node $v$ to $p^{y_j}(x_j)$ and $p^{s_v}(x_j)$, and $w_{ft}(x_j)$ as the weight of $v$'s net contribution.

For case $e_{ff}$, $Imp(v, x_j)$ is defined as
$$Imp(v, x_j) = w_{ff}(x_j)\,\frac{p_v^{y_j} - p_v^{s_v}}{m}, \qquad (12)$$
where $s_v$ is the label of $x_j$ predicted by node $v$, and $Imp(v, x_j) \le 0$. In this case, both $v$ and $F$ incorrectly classify $x_j$, namely, $s_v \ne y_j$ and $s \ne y_j$. We treat $(p_v^{y_j} - p_v^{s_v})/m$ as the net contribution of node $v$ to $p^{y_j}(x_j)$ and $p^{s_v}(x_j)$, and $w_{ff}(x_j)$ as the weight of $v$'s net contribution.
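The case analysis above can be summarized schematically in code. The sketch below is our own reading, not a faithful transcription of equations (9)–(12): the weight factors are passed in as an opaque callable because their exact forms are not reproduced here.

import numpy as np

def node_importance(p_v, P_x, y, m, weight):
    """
    Schematic Imp(v, x) for one example x that reaches node v.
    p_v    : class-proportion vector stored at node v
    P_x    : ensemble probability vector P(x)
    y      : index of the true label of x
    m      : number of trees in the forest
    weight : callable(case, P_x, y) returning the positive weight of the net contribution
    """
    s   = int(np.argmax(P_x))              # label predicted by the ensemble F
    s_v = int(np.argmax(p_v))              # label predicted by node v
    t   = int(np.argsort(P_x)[::-1][1])    # second most probable label under F

    if s_v == y and s != y:                # e_tf: node right, ensemble wrong
        return weight("tf", P_x, y) * (p_v[y] - p_v[s]) / m
    if s_v == y and s == y:                # e_tt: both right
        return weight("tt", P_x, y) * (p_v[y] - p_v[t]) / m
    if s_v != y and s == y:                # e_ft: node wrong, ensemble right
        return weight("ft", P_x, y) * (p_v[y] - p_v[s_v]) / m   # <= 0
    return weight("ff", P_x, y) * (p_v[y] - p_v[s_v]) / m       # e_ff: both wrong, <= 0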

4.3. Algorithm

The specific details of forest pruning (FP) are shown in Algorithm 1, where $D^{pr}$ is a pruning set containing $n$ instances, $p^k(x_j)$ is the probability that ensemble $F$ predicts $x_j$ as associated with label $k$, $p_i^k(x_j)$ is the probability that the current tree $T_i$ predicts $x_j$ as associated with label $k$, $v.imp$ is a variable associated with node $v$ that stores $v$'s importance, and $v.con$ is a variable associated with node $v$ that stores the contribution of $B_v$.

Input: pruning set D^pr, forest F = {T_1, T_2, ..., T_m}, where each T_i is a decision tree.
Output: pruned forest F.
Method:
for each x_j in D^pr do
  Evaluate p^k(x_j), 1 <= k <= K;
for each T_i in F do
  for each node v in T_i do
    v.imp = 0;
  for each x_j in D^pr do
    Evaluate p_i^k(x_j), 1 <= k <= K;
    Let path be the path along which x_j travels from the root of T_i;
    for each node v on path do
      v.imp = v.imp + Imp(v, x_j);
  PruningTree(root(T_i));
  for each x_j in D^pr do
    Update p^k(x_j), 1 <= k <= K, using the prediction of the pruned T_i;

Procedure PruningTree(v)
if v is not a leaf then
  v.con = 0;
  for each child c of v do
    PruningTree(c);
    if c is a leaf then v.con = v.con + c.imp;
    else v.con = v.con + c.con;
  IG = v.con - v.imp;
  if IG < threshold then
    Prune subtree B_v and set v to be a leaf;

FP first calculates the probability vector of $F$'s prediction on each instance in the pruning set. Then it iteratively deals with each decision tree $T_i$: it calculates the importance of each node $v$, where $Imp(v, x_j)$ is computed using one of equations (9)~(12) according to the four cases in equation (8); it calls PruningTree to recursively prune $T_i$; and, since forest $F$ has changed after pruning $T_i$, it adjusts $F$'s prediction on each instance. This per-tree loop can be repeated several times until no decision tree can be pruned further. Experimental results show that forest performance is stable after this iteration is executed two times.

The recursive procedure PruningTree adopts a bottom-up fashion to prune the subtree with $v$ as its root. After pruning $B_v$, $v.con$ saves the sum of the importance of the leaf nodes in $B_v$, so $v.con$ equals the sum of the importance of the leaves of the subtree rooted at $v$. The essence of calling PruningTree on $T_i$'s root is to traverse $T_i$. If the current node $v$ is a nonleaf, the procedure accumulates the importance of the leaves of $B_v$ into $v.con$, calculates $v$'s importance gain $IG(v) = v.con - v.imp$, and decides whether to prune based on the difference between $IG(v)$ and the threshold value.
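A compact Python rendering of this bottom-up procedure, under the same assumptions as the sketch in Section 4.1 (our Node structure with imp and con fields) and with the pruning threshold exposed as a parameter theta:

def pruning_tree(v: Node, theta: float = 0.0) -> None:
    """Bottom-up pruning: turn v into a leaf when the importance gain of B_v is below theta."""
    if not v.children:
        return
    v.con = 0.0
    for c in v.children:
        pruning_tree(c, theta)
        # A child that is (or has become) a leaf contributes its imp; otherwise its con.
        v.con += c.imp if not c.children else c.con
    if v.con - v.imp < theta:      # importance gain IG(v)
        v.children = []            # prune B_v and treat v as a leaf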

4.4. Discussion

Suppose the pruning set $D^{pr}$ contains $n$ instances, forest $F$ contains $m$ decision trees, and $d$ is the depth of the deepest decision tree in $F$. Let $|T_i|$ be the number of nodes in decision tree $T_i$, and let $N = \sum_{i=1}^{m} |T_i|$. The running time of FP is dominated by the loop over the trees. For each tree $T_i$, initializing the node importances traverses $T_i$, which can be done in $O(|T_i|)$ time; the loop over the pruning set follows a root-to-leaf path of $T_i$ for each instance in $D^{pr}$, which has complexity $O(nd)$; the main operation of PruningTree(root($T_i$)) is a complete traversal of $T_i$, whose running time is $O(|T_i|)$; and the final loop scans a list of length $n$ in $O(n)$ time. Summing over all trees, the running time of FP is $O(N + mnd)$; since $|T_i|$ is typically no larger than the order of $nd$, the running time of FP is $O(mnd)$. Therefore, FP is a very efficient forest pruning algorithm.
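For instance, with hypothetical values $m = 30$ trees, $n = 300$ pruning instances, and maximum depth $d = 20$, the dominant path-searching term is on the order of $m \cdot n \cdot d = 30 \times 300 \times 20 = 1.8 \times 10^5$ node visits, which is negligible on modern hardware.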

Unlike traditional metrics such as those used by CART [22] and C4.5 [23], the proposed measure uses a global evaluation. Indeed, the measure involves the prediction values that result from the majority voting of the whole ensemble. Thus, the proposed measure is based not only on the individual prediction properties of ensemble members but also on the complementarity of the classifiers.

From equations (9), (10), (11), and (12), the proposed measure takes into account both the correctness of the current classifier's predictions and the predictions of the ensemble, and it deliberately favors classifiers that perform better on the samples on which the ensemble does not work well. Besides, the measure considers not only the correctness of classifiers but also the diversity of ensemble members. Therefore, using the proposed measure to prune an ensemble leads to significantly better accuracy.

5. Experiments

5.1. Experimental Setup

Nineteen data sets, whose details are shown in Table 1, are randomly selected from the UCI repository [30], where #Size, #Attrs, and #Cls are the size, the number of attributes, and the number of classes of each data set, respectively. We design four experiments to study the performance of the proposed method (forest pruning, FP):
(i) The first experiment studies FP's performance versus the number of times FP is run. Four data sets, namely autos, balance-scale, German-credit, and pima, are selected as representatives, and each data set is randomly divided into three subsets of equal size, where one is used as the training set, one as the pruning set, and the other as the testing set. We repeat 50 independent trials on each data set; therefore a total of 300 trials are conducted.
(ii) The second experiment evaluates FP's performance versus forest size (the number of base classifiers). The data set setup is the same as in the first experiment.
(iii) The third experiment evaluates FP's performance on pruning ensembles constructed by bagging [1] and random forest [25]. Here, tenfold cross-validation is employed: each data set is divided into ten folds [31, 32]; for each fold, the other nine folds are used to train the model and the current fold is used to test it. We repeat the tenfold cross-validation 10 times, and thus 100 models are constructed on each data set. Here, the training set is also used as the pruning set. Besides, algorithm ranks are used to further compare the algorithms [31–33]: on a data set, the best performing algorithm gets rank 1.0, the second best gets rank 2.0, and so on; in case of ties, average ranks are assigned (a brief illustration follows this list).
(iv) The last experiment evaluates FP's performance on pruning the subensembles obtained by an ensemble selection method. EPIC [11] is selected as the candidate ensemble selection method. The original ensemble is a library of 200 base classifiers, and the size of the subensembles is 30. The data set setup is the same as in the third experiment.
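To illustrate the ranking convention mentioned in item (iii), the following small sketch (accuracy values are made up) computes per-data-set ranks, with ties receiving the average rank, using scipy.stats.rankdata:

import numpy as np
from scipy.stats import rankdata

# Hypothetical accuracies of 3 algorithms (columns) on 4 data sets (rows).
acc = np.array([
    [0.85, 0.83, 0.85],
    [0.78, 0.80, 0.79],
    [0.91, 0.91, 0.90],
    [0.66, 0.70, 0.68],
])
# Rank 1.0 = best accuracy on a data set; tied algorithms share the average rank.
ranks = np.array([rankdata(-row, method="average") for row in acc])
print(ranks.mean(axis=0))   # average rank of each algorithm across the data sets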

In the experiments, bagging is used to train the original ensembles, and the base classifier is J48, a Java implementation of C4.5 [23] from Weka [34]. In the third experiment, random forest is also used to build forests. In the last three experiments, FP is run two times.

5.2. Experimental Results

The first experiment investigates the relationship between the performance of the proposed method (FP) and the number of times FP is run. In each trial, we first use bagging to learn 30 unpruned decision trees as a forest and then iteratively run the per-tree pruning loop of FP to trim the forest. Further details of the experimental setup are given in Section 5.1. The corresponding results are shown in Figure 2, where the top four subfigures show how the number of forest nodes varies as the number of iterations increases, and the bottom four show the corresponding variation of ensemble accuracy. Figure 2 shows that FP significantly reduces forest size (almost 40%~60% of the original ensemble) and significantly improves accuracy. However, the performance of FP is almost stable after two iterations. Therefore, we set the number of iterations to 2 in the following experiments.

The second experiment investigates the performance of FP on pruning forests of different scales. The number of decision trees grows gradually from 10 to 200. Further details of the experimental setup are given in Section 5.1. The experimental results are shown in Figure 3, where the top four subfigures compare the node counts of pruned and unpruned ensembles as the number of decision trees grows, and the bottom four compare ensemble accuracy. As shown in Figure 3, for each data set, the proportion of forest nodes pruned by FP remains stable and the accuracy improvement achieved by FP is also basically unchanged, no matter how many decision trees are constructed.

The third experiment evaluates the performance of FP on pruning ensembles constructed by ensemble learning methods. The setup details are given in Section 5.1. Tables 2, 3, 4, and 5 show the experimental results of the compared methods, where Table 2 reports the mean accuracies and the ranks of the algorithms, Table 3 reports the average ranks using the nonparametric Friedman test [32] (using the STAC Web Platform [33]), Table 4 reports the comparison results using the post hoc Bonferroni-Dunn test (using the STAC Web Platform [33]) at the 0.05 significance level, and Table 5 reports the mean node numbers and standard deviations. Standard deviations are not provided in Table 2 for clarity. The column "FP" of Table 2 gives the results of the pruned forests, and "bagging" and "random forest" give the results of the unpruned forests constructed by bagging and random forest, respectively. In Tables 3 and 4, Alg1, Alg2, Alg3, Alg4, Alg5, and Alg6 indicate FP pruning bagging with unpruned C4.5, bagging with unpruned C4.5, FP pruning bagging with pruned C4.5, bagging with pruned C4.5, FP pruning random forest, and random forest, respectively. From Table 2, FP significantly improves ensemble accuracy on most of the 19 data sets, no matter whether the individual classifiers are pruned or unpruned and no matter whether the ensemble is constructed by bagging or random forest. Besides, Table 2 shows that FP always ranks among the best three methods on these data sets. Tables 3 and 4 confirm the results in Table 2: Table 3 shows that the average rank of FP is much smaller than those of the other methods, and Table 4 shows that FP performs significantly better than the other methods. Table 5 shows that the forests pruned by FP are significantly smaller than those built by bagging and random forest, no matter whether the individual classifiers are pruned or not.

The last experiment evaluates the performance of FP on pruning subensembles selected by the ensemble selection method EPIC. Table 6 shows the results on the 19 data sets, where the left and right parts report accuracy and ensemble size, respectively. As shown in Table 6, FP can further significantly improve the accuracy of the subensembles selected by EPIC and reduce their size.

6. Conclusion

An ensemble whose members are decision trees is also called a forest. This paper proposes a novel ensemble pruning method called forest pruning (FP). FP prunes tree branches based on a proposed metric called branch importance, which indicates the importance of a branch (or a node) with respect to the whole ensemble. In this way, FP reduces ensemble size while improving ensemble accuracy.

The experimental results on 19 data sets show that FP significantly reduces forest size and improves accuracy on most of the data sets, no matter whether the forests are ensembles constructed by some learning algorithm or subensembles selected by some ensemble selection method, and no matter whether each forest member is a pruned or an unpruned decision tree.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is in part supported by the National Natural Science Foundation of China (Grant nos. 61501393 and 61402393), in part by Project of Science and Technology Department of Henan Province (nos. 162102210310, 172102210454, and 152102210129), in part by Academics Propulsion Technology Transfer projects of Xi’an Science and Technology Bureau [CXY1516], and in part by Nanhu Scholars Program for Young Scholars of XYNU.