Margin-Based Pareto Ensemble Pruning: An Ensemble Pruning Algorithm That Learns to Search Optimized Ensembles
The ensemble pruning system is an effective machine learning framework that combines several learners as experts to classify a test set. Generally, ensemble pruning systems aim to define a region of competence based on the validation set to select the most competent ensembles from the ensemble pool with respect to the test set. However, the size of the ensemble pool is usually fixed, and the performance of an ensemble pool heavily depends on the definition of the region of competence. In this paper, a dynamic pruning framework called margin-based Pareto ensemble pruning is proposed for ensemble pruning systems. The framework explores the optimized ensemble pool size during the overproduction stage and finetunes the experts during the pruning stage. The Pareto optimization algorithm is used to explore the size of the overproduction ensemble pool that can result in better performance. Considering the information entropy of the learners in the indecision region, the marginal criterion for each learner in the ensemble pool is calculated using margin criterion pruning, which prunes the experts with respect to the test set. The effectiveness of the proposed method for classification tasks is assessed using datasets. The results show that margin-based Pareto ensemble pruning can achieve smaller ensemble sizes and better classification performance in most datasets when compared with state-of-the-art models.
Recent publications have widely applied multiple classifier systems (MCSs) in fields such as digital recognition, facial recognition, acoustic recognition, credit scoring, imbalance classification, recommender systems, software bug detection, and environmental data analysis. Unlike deep learning frameworks, MCSs have been shown to learn well on both small- and large-scale datasets. The advantage of an MCS is that the ensemble pool provides more decision guidelines than a single learner. However, not all decision guidelines are useful for classifying the targets, and an MCS cannot determine which learners are most suitable for the incoming dataset.
As a modified case of MCS, an ensemble pruning system (EPS) [11–20] is a popular machine learning model that selects base learners from an ensemble pool to construct the expert. Previous studies have demonstrated that EPS can achieve superior performance to MCS because the competence level of each base learner is calculated using the validation set, and learners with low competitiveness that are unlikely to improve the performance of the ensemble pool are pruned when the test set is processed. In general, EPS models can be separated into three categories: static pruning, dynamic classifier pruning, and dynamic ensemble pruning. In static pruning, experts are directly selected from the ensemble pool using the training set. In dynamic classifier pruning [14, 15], only the most competent learner is chosen from the ensemble pool once a test sample emerges. In dynamic ensemble pruning [16–20], subsets of the ensemble pool are selected as experts to deal with the test samples. From our viewpoint, the dynamic classifier pruning model can be seen as a special case of the dynamic ensemble pruning model. The dynamic ensemble pruning model has four stages: overproduction, region of competence definition, selection, and integration. During overproduction, every base learner is trained independently on the training set to compose the ensemble pool. During region of competence definition, a criterion is defined to explore the correlation between the ensemble pool and the targets using the validation set. During selection, according to the region of competence, learners with low competitiveness are pruned with respect to the test set. During integration, the outputs of the remaining learners are aggregated. To the best of our knowledge, different EPS models use different definitions of the region of competence criteria to select classifiers.
Antosik and Kurzynski estimated the source competence of every learner using the minimum difference minimization criterion between the outputs and the targets with respect to the validation set. Lobato et al. used a Bayesian computation framework to approximate the confidence level of the remaining learners by querying the selected expert. The size of the expert is increased until the confidence level of the remaining learners is above the predefined level. Li and Zhou generalized the EPS problem as a quadratic programming problem with a sparse solution. The results in this publication show that dynamic ensemble pruning can achieve good generalization performance with little expertise. KNORA-Eliminate selects the expert that can correctly classify all samples in the region of competence. The region of competence for KNORA-Eliminate is reduced until the expert achieves 100% accuracy in the region of competence. META-DES provides an alternative framework that views dynamic ensemble pruning as a meta-learning procedure. The meta-features are constructed according to the correlation among the probability outputs, inputs, and targets.
Regarding dynamic ensemble pruning, the size of the ensemble pool has been fixed in previous publications [16–20], and these methods do not search for the optimal ensemble pool during the overproduction stage. According to Davis et al., the selection of the expert is known to be a nondeterministic polynomial-time hard (NP-hard) problem. Additionally, most pruning criteria estimate the competence of the learners but neglect the case in which the query sample is located on the indecision boundary and the learners make different suggestions for samples in the region of competence. Such criteria are insufficient to correctly select the expert from the ensemble pool. To address these problems, we first regard the calculation of the pool size during the overproduction stage as an optimization problem. Two targets, i.e., ensemble pool size and learning performance for dynamic ensemble pruning models, are globally searched by a bi-objective programming formulation. The ensemble pool size calculated by the programming method does not utilize the local information about the region of competence for the validation set. Thus, the obtained ensemble pool size may be suboptimal, and a local exploration method between the base learners in the ensemble pool with the suboptimal size needs to be considered. Accordingly, margin-based Pareto ensemble pruning (MBPEP) is proposed, consisting of Pareto optimization and a margin criterion pruning mechanism for global and local search of the optimal ensemble pool size. In this paper, the following two contributions are made: the Pareto optimization algorithm is applied to calculate the optimal size of the ensemble pool during the overproduction stage for dynamic ensemble pruning in MBPEP, and the margin criterion pruning (MCP) mechanism is contained within MBPEP so that the learners in the indecision region can detect the different classes in the region.
Subsequently, the margin criterion for each class in the indecision region is further calculated to prune the size of the expert. Classifiers in the indecision region that contain a lot of information will be pruned according to the MCP. Unlike some dynamic ensemble pruning methods [19, 20] based on various pruning criteria, the local exploration technique in MBPEP is preselected for these pruning criteria; it can be directly inserted into them and provide more robust learning performance than the original dynamic ensemble pruning methods.
The remainder of this paper is organized as follows. Section 2.1 describes the proposed MBPEP framework. Section 2.2 discusses the Pareto optimization that estimates the optimal size of the ensemble pool. Section 2.3 introduces the MCP method to prune the ensemble pool. In Section 3, a series of experiments is executed to evaluate the performance of the MBPEP framework. The study’s conclusions are given in Section 4.
In this section, the methodology of the MBPEP framework is discussed, and the basic architecture of MBPEP is described in Section 2.1. Key ensemble exploration details for ensemble pool, Pareto optimization, and MCP are described in Sections 2.2 and 2.3.
2.1. Margin-Based Pareto Ensemble Pruning (MBPEP) Framework for Dynamic Ensemble Pruning
Like the general dynamic ensemble models [16–20], the MBPEP-based dynamic ensemble pruning framework can be separated into four stages. During the overproduction stage, $T$ is the initial ensemble pool size, and each learner in the ensemble pool is learned independently on the training set. To keep the ensemble pool diverse and informative during this stage, the base learners are trained on different training sets. For example, bagging, boosting, and clustering approaches that generate different distributions of training sets can be used to train every learner.
Bagging is used in this paper, and classification and regression trees (CARTs) are used as base learners. Subsequently, the optimal overproduction pool size is estimated using the Pareto optimization algorithm. The global exploration for the initial base learners is calculated, and the learners that result in suboptimal performance are selected during this stage. During the region of competence definition stage, the neighbors in the validation set are learned using various criteria. Based on the region of competence definition, the samples similar to an unknown query instance and the competence level of each base learner can be estimated. The common techniques for defining the region of competence include minimum difference minimization, k-nearest neighbors (KNN), K-means, and the competence map method. During the pruning stage, some learners are extracted to construct the expert with respect to the test set. In contrast, traditional dynamic ensemble pruning models rely only on pruning techniques such as KNORA-Eliminate and META-DES to prune learners with low competitiveness in the ensemble pool, which is insufficient. In MBPEP, the expert is estimated through two steps for the test set. First, for each query sample, MBPEP checks whether the learners in the ensembles make the same suggestions when they recognize the samples in the region of competence of that query sample. When every learner in the ensembles makes the same suggestion, the region is defined as a safe region; in this condition, the MCP method does not need to be activated, and the ensemble pool is passed down to the various pruning techniques. However, when the learners in the ensembles make different suggestions for samples in the region of competence of the query sample, the region is defined as the indecision region. Second, MCP is applied within the indecision region, and the size of the expert can be further pruned.
The MCP operator, which does not interfere with the general pruning criteria [19, 20], can be used as a preselector to improve exploration of the local information between the base learners in the ensemble pool. During the integration stage, the results that are calculated by the expert are aggregated. The strategies used for aggregation can be separated into three types: nontrainable methods, in which the weights of the outputs do not need to be learned; trainable methods, in which the outputs from the experts in the pruning stage are used as input features to be trained by another learner; and dynamic weighting methods, in which the weights are determined by the estimated competence levels of every expert. Well-known integration methods include majority voting and oracle. In this paper, majority voting is used as the aggregation strategy. The architecture of the MBPEP-based dynamic ensemble pruning framework is shown in Figure 1. Notably, the performance of dynamic ensemble pruning is sensitive to the definition of the region of competence. Since the region of competence is not the key contribution of MBPEP, different criteria are used in this paper to define the region of competence.
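The majority voting step of the integration stage can be illustrated with a minimal sketch. The function name `majority_vote`, the example labels, and the tie-breaking rule (the label seen first wins) are assumptions made for illustration; the paper does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of selected learners.

    `predictions` holds one label per learner in the expert.
    Assumption: ties go to the label that appears first in the list.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    # most_common keeps first-seen order among equal counts (Python 3.7+)
    return counts.most_common(1)[0][0]

# Hypothetical outputs of five selected learners for one test sample
expert_outputs = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(expert_outputs))  # cat
```

Because this rule is nontrainable, no weights have to be learned after pruning, which is consistent with the choice described above.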
2.2. Pareto Optimization Evolutionary Algorithm
In recent years, several studies [11–20] have demonstrated that ensemble models achieve better results than any single base learner. However, methods of building ensembles that can achieve great performance with respect to the test set are still required. In this paper, this problem is addressed by MBPEP during the overproduction stage. As described in Section 2.1 and shown in Figure 1, the overproduced learners are trained to compose the ensemble pool during this stage. In general, pool size and performance are conflicting objectives: a pool with few base learners produces few decision boundaries and may not achieve strong performance. In previous studies [16–20], the ensemble pool size is fixed during the overproduction stage, and calculating an ensemble pool size that is suitable for a given classification task is challenging. Our objective is to estimate an optimal ensemble pool size that achieves better performance than a fixed ensemble pool size. The MBPEP computation framework uses Pareto optimization to address the learning performance and the ensemble pool size jointly.
In this study, the initial ensemble pool is $H = \{h_1, \dots, h_T\}$ with size $T$, the inputs are $X = \{x_1, \dots, x_N\}$, and the targets are $Y = \{y_1, \dots, y_N\}$ with $y_i \in \{1, \dots, C\}$, where $N$ is the number of instances and $C$ is the number of classes. The corresponding optimized pool size after calculation is defined as $T'$. The optimized ensemble pool is represented by a binary vector $s \in \{0, 1\}^T$ with $T$ dimensions. $s_j = 1$ denotes that the learner $h_j$ in $H$ is selected; otherwise, $s_j = 0$. The Pareto optimization algorithm converts the calculation of the optimal ensemble pool size into a subset selection problem. Suppose that the outputs of the different ensemble pools can be denoted as $f_{i,j,c}$ ($i = 1, \dots, N$; $j = 1, \dots, T$; and $c = 1, \dots, C$), where $i$, $j$, and $c$ denote the indexes of instances, ensembles, and classes, respectively. Following these symbols, the classification error can be denoted as follows:

$$E(s) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\arg\max_{c} \sum_{j=1}^{T} s_j f_{i,j,c} \neq y_i\right]. \quad (1)$$

The classification performance and ensemble pool size are estimated using the Pareto optimization bi-objective programming formulation, as follows:

$$\arg\min_{s \in \{0, 1\}^T} \left(E(s), |s|\right), \quad (2)$$

where $|s|$ denotes $\sum_{j=1}^{T} s_j$. The operator $|\cdot|$ denotes that the binary vector is summed up to calculate the ensemble pool size. Unlike other single-objective algorithms [27, 28], the concept of “domination,” which continuously disturbs the ensemble size and measures the difference between two candidate solutions $s$ and $s'$ during the global searching iteration process, is introduced in Pareto optimization.
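The bi-objective search over binary selection vectors can be sketched as follows. This is a minimal illustration in the spirit of Pareto optimization for subset selection, not the authors' implementation. The names `evaluate`, `dominates`, and `pareto_search`, the bit-flip probability of one over the pool size, the full-pool starting point, the handling of the empty pool, and the toy predictions are all assumptions.

```python
import random

def evaluate(selection, preds, labels):
    """Return (validation error, pool size) for a binary selection vector."""
    chosen = [p for bit, p in zip(selection, preds) if bit]
    if not chosen:
        return (1.0, 0)  # assumption: an empty pool counts as always wrong
    wrong = 0
    for i, target in enumerate(labels):
        votes = [p[i] for p in chosen]
        # majority vote; ties go to the smallest label for determinism
        guess = max(sorted(set(votes)), key=votes.count)
        wrong += guess != target
    return (wrong / len(labels), len(chosen))

def dominates(a, b):
    """True if objective pair a is no worse than b in both and not equal to b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_search(preds, labels, iterations=2000, seed=0):
    """Maintain an archive of mutually non-dominated selection vectors."""
    rng = random.Random(seed)
    pool_size = len(preds)
    full = tuple([1] * pool_size)  # assumption: start from the full pool
    archive = {full: evaluate(full, preds, labels)}
    for _ in range(iterations):
        parent = rng.choice(sorted(archive))
        # flip each bit independently with probability 1 / pool_size
        child = tuple(bit ^ (rng.random() < 1.0 / pool_size) for bit in parent)
        if child in archive:
            continue
        score = evaluate(child, preds, labels)
        # discard the child if an archived solution is at least as good
        if any(old == score or dominates(old, score) for old in archive.values()):
            continue
        # otherwise drop every archived solution the child dominates
        archive = {s: f for s, f in archive.items() if not dominates(score, f)}
        archive[child] = score
    return archive

# Hypothetical validation labels and per-learner predictions
labels = [0, 1, 0, 1, 1, 0]
preds = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1],
]
archive = pareto_search(preds, labels)
best = min(archive, key=lambda s: archive[s])  # lowest error, then smallest size
print(best, archive[best])
```

The archive holds the trade-off front between error and pool size; picking the entry with the lowest error is only one possible way to choose a final pool from it.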
2.3. Margin Criterion Pruning (MCP)
As mentioned in Section 2.1, the Pareto optimization method is used during the overproduction stage to calculate the optimal ensemble pool size in the dynamic ensemble pruning model. If the query sample is located on the indecision boundaries of the learners, then the predictions of the learners for the query sample might differ from those for the samples that belong to its corresponding region of competence. As with other heuristic algorithms, the global exploration between the base learners performed by Pareto optimization with respect to the two objectives (classification error and ensemble pool size) cannot deal with this condition. Pareto optimization provides a suboptimal ensemble pool size during the overproduction stage because the local information between the base learners and the targets is neglected in the solution space. Previous studies [16–20] use pruning criteria to estimate the competence of each learner for the query sample without considering the case in which the query sample is located on the indecision boundaries. In this paper, the MCP mechanism measures the local information between the base learners in the ensemble pool to quantify the information available when the query sample is located on the indecision boundaries of the learners. MCP is added during the pruning stage as a preselector that is executed together with the classical pruning criteria [16–20]. As a subset of the region of competence, the indecision region of the ensemble pool is defined and explored to estimate the local competence of each sublearner in the ensembles. A schematic diagram of the region of competence for four ensemble pool size conditions is shown in Figure 2.
The safe region and the indecision region are defined and shown in Figures 2(a) and 2(b) and Figures 2(c) and 2(d), respectively. As shown in Figure 2, two regions of competence that query samples (yellow triangle in Figure 2) located in the safe and indecision regions are determined according to whether the ensembles can correctly recognize the neighbors of the query sample. If these ensembles make the same suggestions for these neighbors, the region of competence becomes the safe region. When ensembles have different suggestions for these neighbors, the region of competence becomes the indecision region. According to Figures 2(b)–2(d), it can be seen that ensembles can reach 100% (), 71.4% (), and 57.1% (), respectively. In addition, MCP focuses on the indecision region, and the local competence of each learner that can distinguish the samples with different classes in the indecision region is estimated.
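The distinction between the safe region and the indecision region can be sketched as a simple agreement check over the neighbors of a query sample. The helper names `k_nearest` and `region_type`, the Euclidean KNN definition of the region of competence, and the toy data are assumptions for illustration. The paper allows several other region definitions.

```python
import math

def k_nearest(query, validation_points, k):
    """Indices of the k validation points closest to the query (Euclidean)."""
    order = sorted(range(len(validation_points)),
                   key=lambda i: math.dist(query, validation_points[i]))
    return order[:k]

def region_type(neighbor_ids, learner_preds):
    """Return 'safe' if all learners agree on every neighbor, else 'indecision'."""
    for i in neighbor_ids:
        if len({preds[i] for preds in learner_preds}) > 1:
            return "indecision"
    return "safe"

# Hypothetical validation points and the labels two learners assign to them
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
learner_preds = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]

print(region_type(k_nearest((0.2, 0.2), points, 3), learner_preds))  # safe
print(region_type(k_nearest((5.0, 5.6), points, 2), learner_preds))  # indecision
```

Under this reading, MCP would only be activated for query samples whose region is reported as an indecision region.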
Suppose that the learners selected by the global exploration of the Pareto optimization algorithm during the overproduction stage (Section 2.2) compose the ensemble pool, and that the ensembles make various suggestions for a query sample $x$ with its neighbors in the region of competence. Let $v_{\max}(x)$ denote the number of votes for the most popular label voted for by the ensembles, and let $v_{\mathrm{sec}}(x)$ denote the number of votes for the second most popular label. The marginal criterion is used by MCP to measure the amount of information for the ensembles in the indecision region. We define the difference between the numbers of votes for the most and second most popular labels as the marginal information for the neighbors of query sample $x$ as follows:

$$\mathrm{margin}(x) = v_{\max}(x) - v_{\mathrm{sec}}(x). \quad (3)$$
Entropy is used to compute the amount of information in the machine learning field. To measure the margins of the final hypotheses, the margin entropy criterion is defined to calculate the amount of information of samples X for in :where denotes the number of neighbors around query samples X. If a greater difference is calculated by equation (3) between the most and second most popular labels, then it can be inferred that these ensembles cannot detect samples that belong to different classes in the region of competence. In addition, in MBPEP, is always less than , and hence, the maximum value of is less than . According to equation (4), it can be seen that, if the calculated margin entropy of validation sample X is more than , then the learner can be used to compose the expert. Otherwise, when , the learner needs to be pruned. This process is called MCP. MCP performs a greedy search to define a subset of the ensemble pool that can be used as a preselector to determine a suitable learner that can recognize the different targets in the indecision region. The MCP algorithm is shown in Algorithm 1.
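The vote margin that MCP builds on can be sketched as follows. This covers only the quantity described in words above, namely the difference between the vote counts of the most and second most popular labels. The margin entropy criterion and its pruning threshold are not reproduced here. The function name `vote_margin` and the example votes are assumptions.

```python
from collections import Counter

def vote_margin(votes):
    """Difference between the two largest vote counts for one sample.

    `votes` holds one predicted label per learner in the pool.
    If every learner votes for the same label, the second count is zero.
    """
    ranked = Counter(votes).most_common(2)
    top = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return top - second

# Hypothetical votes of seven learners for three neighbors of a query sample
neighbor_votes = [
    ["a", "a", "a", "a", "a", "a", "a"],   # unanimous
    ["a", "a", "a", "a", "a", "b", "b"],   # clear majority
    ["a", "a", "a", "a", "b", "b", "b"],   # close vote
]
margins = [vote_margin(v) for v in neighbor_votes]
print(margins)  # [7, 3, 1]
```

A small margin indicates that the pool is split on a neighbor, which is the situation the indecision region is meant to capture.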
3. Results and Discussion
Sixteen UCI datasets are used as benchmarks in this paper. The characteristics of these datasets are shown in Table 1, and each of them is split into three parts: a training set, a validation set, and a test set. Classification and regression trees (CARTs) are used as base learners. First, the bagging method is applied to the training set: subsets are drawn from it by bootstrap sampling, and one subclassifier is trained on each subset. The ensemble pool size is preset to 100. Second, the MBPEP method is applied to prune the ensemble using the validation set. The optimized subclassifiers are aggregated into the final hypotheses using the majority voting method. Finally, the test set is used to measure the performance of the methods.
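The bootstrap step of this setup can be sketched with the standard library alone. The function name `bootstrap_indices`, the seed, and the training set size of 50 are assumptions. Fitting a CART on each index list is omitted because it would require an external library.

```python
import random

def bootstrap_indices(n_samples, n_learners=100, seed=0):
    """Draw one bootstrap sample of row indices per base learner.

    Each sample has n_samples indices drawn with replacement, so some rows
    repeat and others are left out, which diversifies the learners.
    """
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_learners)]

# Hypothetical training set with 50 rows and the pool size of 100 used here
samples = bootstrap_indices(n_samples=50, n_learners=100)
print(len(samples), len(samples[0]))  # 100 50
```

Each index list would then select the rows used to fit one base learner of the pool.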
3.2. Pruning Metrics
The bagging ensemble is a common architecture that uses learners to train different bootstrapped samples. In this paper, the full bagging method, in which all base learners are selected to construct the ensemble classifiers, is used as the baseline algorithm. The classification accuracy and ensemble size of MBPEP are compared with the optimization-based pruning method EA and four competence ordering-based pruning methods, i.e., reduced error pruning, kappa pruning, complementarity measure pruning, and margin distance minimization pruning. These pruning methods are described as follows:
(i) Optimization-based pruning regards aggregation as a programming problem that intends to select the optimized subset of base learners by minimizing the validation error. In this paper, EA is used as an optimization ensemble method.
(ii) Reduced error (RE) pruning sorts the subclassifiers and adds these subclassifiers one by one to find the lowest classification error of the final ensemble. Margineantu and Dietterich use the back fitting search method to approximate the generalization performance of RE pruning.
(iii) Kappa pruning uses the Kappa error as a statistical method that measures the diversity between a pair of sublearners. Kuncheva uses $\kappa$ statistics to measure the correlation between the outputs of two sublearners. The classifiers are iteratively added to the ensemble with the lowest $\kappa$ statistics.
(iv) Complementarity measure (CM) pruning is an ordered pruning ensemble that focuses on finding the most complementary learners to the ensemble for each iteration and incorporates them into the ensemble classifier. The complementarity pruning method enhances the performance of the classes with the most votes without harming the classes with the least votes.
(v) Margin distance minimization (MDM) pruning was discussed in Introduction.
(vi) Randomized reference classifier (RRC) pruning estimates the competence of each learner in the ensemble pool for the validation set using the corresponding probability outputs and several random variables with beta probability distribution.
(vii) KNORA-Eliminate was discussed in Introduction.
(viii) META-DES was discussed in Introduction.
3.3. Characteristics of MBPEP
To investigate the characteristics of MBPEP, the advantages of both Pareto optimization and MCP are measured. In this section, RRC, KNORA-Eliminate, full bagging, and META-DES are used for comparison purposes. Notably, the full bagging method serves as the baseline without pruning. The two contributions of MBPEP (Pareto optimization and MCP) are applied to the overproduction and pruning stages, respectively; hence, two experiments, in which MCP and MBPEP are embedded into dynamic ensemble pruning models, are executed to explore the characteristics of MBPEP.
To measure the advantages of MCP, from Figure 3, we can see that the performances of the dynamic ensemble pruning models (for the META and KNORA models) based on MCP are superior to those without MCP. In addition, all EPS models achieve better performance than full bagging. As an alternative technique for measuring performance, the $F_1$ value is an effective metric that has been widely used to measure the precision and sensitivity of results, and it is calculated as follows:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN},$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. For example, for class in Figure 3, the $F_1$ values for META-DES, KNORA-Eliminate, and RRC with MCP in Figure 3 are , , and . In addition, for class in Figure 3, the values are , , and , respectively. To explore the influence of MBPEP in dynamic ensemble models, optimal overproduction and learning performance are used as metrics in Figure 4. Different from Figure 3, META-MCP, KNORA-MCP, and RRC-MCP in Figure 4 are processed by Pareto optimization during the overproduction stage. META, KNORA, and RRC denote the ensemble pruning models that are processed by Pareto optimization but not by MCP. From the top panel of Figure 4, it can be seen that the META and KNORA-Eliminate models with MCP achieve superior learning performance to those without MCP. For RRC, RRC-MCP achieves comparable learning performance to that of RRC. From the bottom panel of Figure 4, it can be seen that META and RRC with MCP achieve a smaller overproduction pool size than those without MCP. For the KNORA-Eliminate model, KNORA-MCP achieves a comparable pool size to that without MCP.
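The per-class metric computed from true positives, false positives, and false negatives can be sketched as the standard F1 score, the harmonic mean of precision and sensitivity (recall). The function name `f1_score`, the zero-denominator convention, and the example counts are assumptions; the counts are not taken from Figure 3.

```python
def f1_score(tp, fp, fn):
    """Standard F1 score from per-class confusion counts.

    Equivalent to 2 * precision * recall / (precision + recall).
    Assumption: returns 0.0 when the class never occurs and is never predicted.
    """
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

# Hypothetical counts for one class
print(f1_score(tp=8, fp=2, fn=2))  # 0.8
```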
To measure the classification performance, test error and average optimized ensemble size are calculated using MBPEP and compared with the other ensemble methods. The results are shown in Tables 2 and 3. The test errors and optimal ensemble sizes of each model are not fixed for each experiment. For calculation with the benchmarked sets, each result is executed 30 times, and the average results are shown in Tables 2 and 3. The winners are bolded for MBPEP in Tables 2 and 3. The comparison methods are full bagging, RE pruning, Kappa pruning, CM pruning, MDM pruning, and EA pruning. RE pruning, Kappa pruning, CM pruning, and MDM pruning are dynamic ensemble pruning models that use competence ordering to estimate the competence of each learner; EA pruning is based on the optimized dynamic pruning model.
In Table 2, it can be seen that MBPEP achieves the lowest test error for 10 of 16 datasets (10/16). Meanwhile, full bagging, RE pruning, Kappa pruning, CM pruning, MDM pruning, and EA pruning achieve the lowest test error on 2, 3, 4, 3, 1, and 1 of the 16 datasets, respectively. Thus, MBPEP achieves the lowest test error more often than the other methods. However, according to Demsar, it is inappropriate to use a single criterion to evaluate the performance of an ensemble classifier. In this section, the pairwise significance of the numbers of direct wins between MBPEP and the other algorithms is validated using a sign test. Notably, a win is counted as 1 and a tie is counted as 0.5 for each dataset to compare MBPEP with the other algorithms in Tables 2 and 3. From Table 2, the numbers of direct wins are 13, 10.5, 11, 10, 11, and 11 when comparing MBPEP with the other methods.
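A sign test on the win counts can be sketched as a two-sided binomial test under the null hypothesis that wins and losses are equally likely. The function name `sign_test_p_value` and the rounding down of half wins from ties (a conservative choice) are assumptions. The paper does not state which convention or significance level it uses.

```python
import math

def sign_test_p_value(wins, n_datasets):
    """Two-sided sign-test p-value under H0: wins ~ Binomial(n_datasets, 0.5).

    Assumption: fractional win counts (ties counted as 0.5) are rounded
    down, which makes the test slightly conservative.
    """
    k = math.floor(max(wins, n_datasets - wins))
    tail = sum(math.comb(n_datasets, i)
               for i in range(k, n_datasets + 1)) / 2 ** n_datasets
    return min(1.0, 2.0 * tail)

# Direct wins of MBPEP over the six comparison methods, as listed for Table 2
direct_wins = [13, 10.5, 11, 10, 11, 11]
p_values = [sign_test_p_value(w, 16) for w in direct_wins]
print(round(p_values[0], 4))  # 0.0213 for 13 wins out of 16
```

Comparing each p-value with a chosen significance level then indicates which pairwise differences the win counts alone can support.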
We also measure the optimal ensemble size in Table 3. It can be easily found that the EA algorithm needs to query more sublearners during the aggregation process. This phenomenon has been explained by Zhou et al. . According to Table 3, MBPEP achieves the lowest ensemble size on 81.25% (13/16) of the datasets, while the other methods achieve the lowest ensemble size in less than 19% (3/16). The results from Tables 2 and 3 support the claim that the performance of MBPEP is superior to those of the competence ordering method and the optimization-based pruning method for most datasets. The reason for this result is that MBPEP can simultaneously minimize classification error and ensemble size, whereas competence ordering ensemble methods (RE pruning, Kappa pruning, CM pruning, and MDM pruning) focus on optimizing only one objective, such as diversity, but neglect the others.
3.5. Robustness of Classification
The robustness of the algorithms is an important indicator for measuring classifier performance, as it reflects the fault tolerance of the algorithms. In this section, the ensemble classifiers are constructed under different levels of Gaussian noise. The noise levels are determined by the variance of the Gaussian noise. The test errors of the ensemble classifiers are measured after the training sets are corrupted by noise at levels of 0.00, 0.02, 0.04, and 0.08. The results are measured using the 7 datasets described in Table 4. The average test errors are shown in Table 4, and each one is averaged over 50 runs. To better evaluate the robustness of MBPEP, the performances of full bagging and the other ensemble pruning methods are also measured. The winners are bolded for MBPEP in Table 4.
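The corruption of a training set with zero-mean Gaussian noise of a given variance can be sketched as follows. The function name `add_gaussian_noise`, the choice to perturb every feature value independently, and the toy rows are assumptions. The paper does not state whether features are scaled before noise is added.

```python
import random

def add_gaussian_noise(rows, variance, seed=0):
    """Return a copy of `rows` with zero-mean Gaussian noise added per feature.

    `variance` is the noise level; the standard deviation is its square root.
    """
    rng = random.Random(seed)
    sigma = variance ** 0.5
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]

# Hypothetical training rows, corrupted at the four noise levels used here
train = [[1.0, 2.0], [3.0, 4.0]]
noisy_sets = {level: add_gaussian_noise(train, level)
              for level in (0.00, 0.02, 0.04, 0.08)}
print(noisy_sets[0.00] == train)  # True: zero variance leaves the data unchanged
```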
In Table 4, it can be seen that the test errors of all methods increased with the intensity of the Gaussian noise. For example, MBPEP can achieve a 21.4% test error when there is no noise to corrupt the training set of the sonar dataset. However, the test error reaches 32.3% when the training set is corrupted by 0.08 Gaussian noise. According to Table 4, MBPEP achieves the lowest test error on 5, 5, 6, and 7 datasets of the 7 datasets under the different noise intensities. Specifically, when the variance intensity of the Gaussian noise is larger than 0.04, MBPEP performs better than the other methods. For example, the test error of MBPEP reaches 20.80% on the waveform dataset when there is no noise to corrupt the training set. It is not the best classification performance of all of the comparison methods, but when the variance of the noise is larger than 0.04, MBPEP achieves the lowest test errors. The number of winners for MBPEP increases with noise intensity, which demonstrates that MBPEP is robust with respect to Gaussian noise.
3.6. Application to the Pattern Recognition Task
MBPEP is applied to handwritten digital character pattern recognition. Digital character recognition is an important research field in computer vision. MNIST  is a standard digital character benchmark dataset that contains 60000 and 10000 grayscale images for training and testing, respectively. Each image is mapped to 10 classes that include the digits 0 to 9. Figure 5 shows samples of MNIST. To evaluate the generalization performance of the pruning ensemble method on MNIST, it was tested 30 times.
In this section, MBPEP and the five dynamic pruning methods that were mentioned in Section 3.2 are applied to the deep forest framework multigrained cascade forest (gc-Forest). To validate the efficiency of MBPEP, the average max process in gc-Forest is replaced by dynamic pruning. We call gc-Forest with the dynamic pruning process the modified version of gc-Forest. The two modules in gc-Forest, i.e., multigrained scanning and cascade forest, are retained. The modified version of gc-Forest is shown in Figure 6.
According to Figure 6, the overall learning procedure of the modified version of gc-Forest has the same construction as the original version of gc-Forest. The number of raw input features is scaled using the scanning method on the novel features (printed in blue blocks). The scaling process is determined using the scaling window sizes. The generated features are concatenated to the large feature vectors (printed in red blocks) that are prepared to enter the cascade forest. To encourage diversity and improve pruning efficiency for the construction of the cascade forest, each layer consists of 100 random forests in this section. The number of layers is self-adapted until the validation performance is below the error tolerance. We introduce a metric, namely, the improvement ratio of the test error, in this section. Specifically, the original version of gc-Forest is used as a baseline. The results are shown in Figure 7(a). In addition, the percentage of reduction for the number of optimized learners that has been compared with the original ensemble size is shown in Figure 7(b). The different colors in Figure 7 denote the different ensemble methods.
According to Figure 7(a), it can be seen that not all ensemble methods can improve the classification accuracy when compared with the original version of gc-Forest. The modified version of gc-Forest with MBPEP improves the classification accuracy by 0.4% over that of the original version. Meanwhile, the other methods can improve the classification accuracy by 0.2%. From Figure 7(b), the results show that MBPEP needs to store fewer sublearners during aggregation. MBPEP reduces the query quantity by approximately 64.4% when combining these sublearners into the final hypotheses. The other four ensemble methods need to query more sublearners, which increases the time consumption of the majority voting process. Thus, MBPEP requires less storage and offers higher computational efficiency than the other pruning methods.
Dynamic ensemble pruning is an important strategy for improving the performance of ensemble classifiers, in which a subset of the ensemble pool is used as the classification expert with respect to the incoming set. Because selecting the optimal ensemble size is an NP-hard optimization problem, the initial ensemble pool size is usually fixed during the overproduction stage of dynamic ensemble pruning models. Previous publications mainly address the pruning criteria used during the pruning stage, and few studies utilize the local margin information of the learners to select the experts in an ensemble. In this paper, MBPEP is proposed to address both issues.
First, MBPEP uses the Pareto optimization algorithm to globally search for a feasible ensemble pool size during the overproduction stage. The “domination” relation between the learning performance and the ensemble pool size is continuously estimated by Pareto optimization. Second, the margin entropy of every learner in the ensemble is used to locally determine the experts that can detect the classes in the indecision region. Several experiments are conducted to demonstrate the advantages of MBPEP over other state-of-the-art pruning methods (RE pruning, Kappa pruning, complementarity measure pruning, MDM pruning, RRC pruning, KNORA-Eliminate, META-DES, and the EA ensemble method). Finally, MBPEP is applied to the deep forest framework to conduct handwritten digit recognition tasks. In the modified version, the average max process of the original gc-Forest is replaced by MBPEP. This modified version of gc-Forest with MBPEP improves the classification accuracy while using fewer sublearners when combining the final hypotheses.
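The “domination” relation between learning performance and ensemble pool size can be illustrated with a generic bi-objective sketch. It assumes that both the validation error and the ensemble size are to be minimized; the candidate values, the type alias `Candidate`, and the helper names are hypothetical, and the code shows only the domination test, not the full search procedure of MBPEP.

```python
from typing import List, Tuple

# A candidate ensemble described by (validation error, ensemble size).
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b in both objectives
    and strictly better in at least one (both objectives are minimized)."""
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical candidates, for illustration only.
candidates = [(0.10, 50), (0.08, 80), (0.12, 30), (0.10, 60), (0.09, 80)]
front = pareto_front(candidates)
print(front)  # [(0.1, 50), (0.08, 80), (0.12, 30)]
```

Here (0.10, 60) is dropped because (0.10, 50) reaches the same error with fewer learners, and (0.09, 80) is dropped because (0.08, 80) reaches a lower error at the same size. The remaining candidates are mutually non-dominated trade-offs between error and size.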
In the future, we would like to focus on merging ensemble methods with deep learning frameworks. The combination of ensemble methods and deep learning architectures has become an active research direction in machine learning. For example, deep ensemble learning has been demonstrated to be a powerful tool for addressing various recognition tasks.
The datasets analysed during the current study are available in the UCI repository (https://archive.ics.uci.edu/ml/index.php) and the MNIST repository (http://yann.lecun.com/exdb/mnist/).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the National Science Foundation for Young Scientists of China (61803107) and the Science and Technology Planning Project of Guangzhou, China (Grant number 201803020025).
Z. Zhou and J. Feng, “Deep forest: towards an alternative to deep neural networks,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, VIC, Australia, August 2017.
P. R. L. Almeida, E. J. Silva, T. M. Celinski, A. S. Britto, L. E. S. Oliveira, and A. L. Koerich, “Music genre classification using dynamic selection of ensemble of classifiers,” in Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, pp. 2700–2705, Banff, AB, Canada, October 2017.
M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, “A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 463–484, 2012.
A. Lazarevic and Z. Obradovic, “Effective pruning of neural network classifiers,” in Proceedings of the 14th International Joint Conference on Neural Networks, vol. 1–4, pp. 796–801, Washington, DC, USA, July 2001.
B. Antosik and M. Kurzynski, “New measures of classifier competence – heuristics and application to the design of multiple classifier systems,” in Proceedings of 7th International Conference on Computer Recognition Systems, vol. 4, pp. 197–206, Wroclaw, Poland, 2011.
N. Li and Z. H. Zhou, “Selective ensemble under regularization framework,” in Proceedings of the 8th International Workshop on Multiple Classifier Systems, vol. 5519, pp. 293–303, Reykjavik, Iceland, June 2009.
A. H. R. Ko, R. Sabourin, and J. Britto, “From dynamic classifier selection to dynamic ensemble selection,” Pattern Recognition, vol. 47, no. 11, pp. 3665–3680, 2014.
R. M. O. Cruz, R. Sabourin, and G. D. C. Cavalcanti, “On meta-learning for dynamic ensemble selection,” in Proceedings of the International Workshop on Multiple Classifier Systems, pp. 157–166, Naples, Italy, June 2011.
C. Qian, J. Shi, Y. Yu, K. Tang, and Z. Zhou, “Parallel Pareto optimization for subset selection,” in Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 1939–1945, New York, NY, USA, July 2016.
L. Kuncheva, “Clustering and selection model for classifier combination,” in Proceedings of the 4th International Conference on Knowledge-Based Intelligent Information Engineering Systems and Allied Technologies, pp. 185–188, Brighton, UK, 2000.
P. Castro, G. P. Coelho, M. F. Caetano, and V. Zuben, “Designing ensembles of fuzzy classification systems: an immune-inspired approach,” in Proceedings of the 4th International Conference on Artificial Immune Systems, vol. 3627, pp. 469–482, Banff, AB, Canada, August 2005.
A. Kirshners, S. Parshutin, and H. Gorskis, “Entropy-based classifier enhancement to handle imbalanced class problem,” in Proceedings of the International Conference on Technology and Education, vol. 104, pp. 586–591, Balikpapan, Indonesia, 2017.
T. Back, Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, vol. 2, Oxford University Press, Oxford, UK, 1997.
D. Margineantu and T. G. Dietterich, “Pruning adaptive boosting,” in Proceedings of the 14th International Conference on Machine Learning, pp. 211–218, Nashville, TN, USA, July 1997.
G. Martínez-Muñoz and A. Suárez, “Aggregation ordering in bagging,” in Proceedings of 8th International Conference on Artificial Intelligence and Applications, vol. 1, pp. 258–263, Innsbruck, Austria, 2004.
J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, no. 1, pp. 1–30, 2006.