Computational Intelligence and Neuroscience
Volume 2017, Article ID 1930702, 12 pages
https://doi.org/10.1155/2017/1930702
Research Article

A Theoretical Analysis of Why Hybrid Ensembles Work

Kuo-Wei Hsu

Department of Computer Science, National Chengchi University, No. 64, Sec. 2, Zhi Nan Rd., Wen Shan District, Taipei City 11605, Taiwan

Correspondence should be addressed to Kuo-Wei Hsu; kwhsu@nccu.edu.tw

Received 8 August 2016; Revised 6 December 2016; Accepted 5 January 2017; Published 31 January 2017

Academic Editor: Jussi Tohka

Copyright © 2017 Kuo-Wei Hsu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Inspired by the group decision-making process, ensembles, or combinations of classifiers, have been found favorable in a wide variety of application domains. Some researchers propose using a mixture of two different types of classification algorithms to create a hybrid ensemble. Why does such an ensemble work? The question remains open. Following the concept of diversity, which is one of the fundamental elements of the success of ensembles, we conduct a theoretical analysis of why hybrid ensembles work, connecting the use of different algorithms to accuracy gain. We also conduct experiments on the classification performance of hybrid ensembles of classifiers created by the decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm and often used to create non-hybrid ensembles. Through this paper, we thus provide a complement to the theoretical foundation of creating and using hybrid ensembles.

1. Introduction

Ensemble learning is inspired by the human group decision-making process and has gained much attention [1–7]. It creates an ensemble, which is a group of classifiers, and combines the classifications made by these classifiers into an overall classification. The advantage of ensembles is not that the best combination of classifiers outperforms the best classifier but that a combination of classifiers is less likely to misclassify unseen data samples than a single classifier. Ensembles have shown satisfactory classification performance in a large-scale comparative study [8].

Ensemble learning has been applied in various application domains, such as image classification [9–14], fingerprint classification [15], weather forecasting [16], text categorization [17], image segmentation [18], visual tracking [19], protein fold pattern recognition [20], cancer classification [21], pedestrian recognition [22] or detection [23], prediction of software quality [24, 25], face recognition [26], email filtering [27], prediction of students’ performance [28], medical image analysis [29–32], churn prediction [33], sentiment analysis [34–37], steganalysis [38], prediction of air quality [39], and intrusion detection [40].

From one point of view, ensemble learning has become popular because every classification algorithm has its own limitations. From another point of view, if every classifier in an ensemble has expertise in classifying data samples that belong to some portion of the given data set, the overall classification combined from all (or some) classifiers in the ensemble will potentially be more reliable.

Compared to classifiers working individually, classifiers working together have a better potential for achieving higher accuracy [41]. Diverse classifiers working together have a better potential for achieving higher accuracy than non-diverse classifiers working together, as suggested in [42]. In creating an ensemble, every classifier that will be part of the ensemble is expected to be adequately accurate, while the correlation between the classifications made by any two classifiers in the ensemble is expected to be small.

Some researchers use a mixture of different types of classification algorithms to create hybrid ensembles. For example, with reference to ensembles composed of decision trees and artificial neural networks, Langdon et al. use such an ensemble in drug discovery [43]; Lu et al. discuss sampling methods along with these ensembles for active learning [44]; in [45], the author analyzes the hybrid ensembles for binary classification from the standpoint of bias-variance decomposition proposed in [46]. Furthermore, Salgado et al. use ensembles of artificial neural networks and support vector machines [47, 48] to predict daily electricity load [49]. Min and Cho use naïve Bayes classifiers and support vector machines for activity recognition [50]. Verikas et al. provide a survey of hybrid ensembles designed for bankruptcy prediction [51]. In addition, Verma and Hassan use integrations of clustering and classification algorithms to create hybrid ensembles [52].

Most researchers simply use hybrid ensembles without further investigation. We therefore aim for a better understanding of hybrid ensembles. Our goal is not only to show that the classification performance of hybrid ensembles can be comparable or even superior to that of non-hybrid ensembles but also to explain why hybrid ensembles work from the standpoint of diversity. Diversity among the classifiers in an ensemble plays a significant role in the success of the ensemble. One type of ensemble learning algorithm uses sampling methods to generate different data sets for training diverse classifiers, such as Bagging (Bootstrap Aggregating) [53], while another type is rather ad hoc and uses different classification algorithms to train the diverse classifiers that will be used to create an ensemble.

We create hybrid ensembles by integrating these two types of ensemble learning algorithms. The classification algorithms that we use to create hybrid ensembles are the C4.5 decision tree [48, 54] and naïve Bayes [48]. We provide an empirical comparison of these hybrid ensembles with others created by using Bagging. This paper is particularly relevant because very few papers empirically evaluate hybrid ensembles and at the same time analyze them theoretically.

The rest of this paper is structured as follows. We provide background information and present our analysis in Section 2. Next, we report and discuss experiment results in Section 3. Finally, we conclude in Section 4.

2. Materials and Methods

The goal of using several classifiers in an ensemble is to achieve better classification performance by combining classifications from these classifiers, each of which serves as an optimal model or an expert in a portion of the data set. These classifiers are expected to be uncorrelated and behave independently of each other; or, at least, they need to show different patterns of errors.

Diversity among the classifiers in an ensemble is related to the success of the ensemble, because it can compensate for the errors made by those classifiers individually [55]. In this section, we analyze diversity and classification performance, such as accuracy, of ensembles created by using a single algorithm (i.e., non-hybrid ensembles) and of hybrid ensembles created by using two different algorithms. The relationship between diversity and accuracy is not “straightforward” [56]. On one hand, combining classifications from classifiers of low diversity would not improve the overall accuracy, because these classifiers behave similarly on some portions of the data set; on the other hand, using highly diverse classifiers to create an ensemble does not guarantee high accuracy. Hsu and Srivastava show that using different classification algorithms in an ensemble would likely increase diversity and decrease correlation between the classifiers in the ensemble [57], and they build a connection between diversity and correlation, which can be indirectly connected to accuracy [58]. Referring to the analysis technique used in [59], we build a more direct connection between diversity and accuracy, and we further analyze the influence on accuracy of using different algorithms to create an ensemble; this analysis distinguishes this paper from the earlier ones.

In what follows, $x$ is a $d$-dimensional vector representing a data sample, and $y$ is a binary class label, $+1$ or $-1$. $C$ is a classifier and $C(x) = \hat{y}$, where $\hat{y}$ is the classification made by the classifier. $A$ is a classification algorithm, and $D$ is a set of data samples. $A(D)(x)$ means that a classifier is trained by applying $A$ on $D$ and is used to classify $x$; it returns a binary class label.

Equation (1) describes a general form of an ensemble in which the classifications from $n$ classifiers are combined through majority voting to make the overall classification (it can be modified to use weighted majority voting):
$$E(x) = \operatorname{sign}\left(\sum_{i=1}^{n} C_i(x)\right), \quad C_i = A_i(D_i). \tag{1}$$
In (1), $E$ is an ensemble of $n$ classifiers, $C_i$ is a classifier in the ensemble, $x$ is a data sample that needs to be classified, and $E(x)$ is the overall classification. For those ensemble learning algorithms that only use different data sets to train diverse classifiers, $A_i = A_j$ and $D_i \neq D_j$ for all $i \neq j$; Bagging is an example. For those that use different classification algorithms and different data sets to train diverse classifiers, $A_i \neq A_j$ for some (or, in a few cases, all) $i \neq j$ and $D_i \neq D_j$ for all $i \neq j$; this is the case of interest in this paper.

Given a data set $D$ and a set of classification algorithms $\mathcal{A}$, we create a hybrid ensemble of classifiers, each of which is trained by applying an algorithm selected in an alternating fashion from $\mathcal{A}$ on a set of data samples drawn from $D$ with bootstrap sampling. This process is shown in Algorithm 1. Instead of selecting algorithms in an alternating fashion, we can simply use random selection such that a classifier in the hybrid ensemble could be trained by using any one of the algorithms in $\mathcal{A}$ with equal probability. Extending this, we can incorporate prior knowledge and assign unequal probabilities to different algorithms.

Algorithm 1: Process to create a hybrid ensemble.

The input of the bootstrap sampling method is a data set $D$, and the output is a set of data samples drawn with replacement from $D$ and of the same size as $D$. We use bootstrap sampling when we need different data sets to train diverse classifiers in creating an ensemble. Diversity among the classifiers in such an ensemble comes entirely from the differences among the data sets used to train the classifiers, and therefore bootstrap sampling is the single generator of diversity in such an ensemble. We train diverse classifiers by additionally using different classification algorithms, and by doing this we obtain an additional generator of diversity.
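As a concrete illustration, the following Python sketch mirrors the process in Algorithm 1 and the majority vote in (1). It is a minimal sketch, not the paper's WEKA implementation: the names (HybridEnsemble, bootstrap_sample) are ours, the inputs are assumed to be NumPy arrays with labels in {-1, +1}, and scikit-learn's DecisionTreeClassifier and GaussianNB stand in for the C4.5 and naïve Bayes algorithms used in the paper.

```python
# Minimal sketch of Algorithm 1: draw bootstrap samples and alternate between
# two classification algorithms to train the classifiers of a hybrid ensemble.
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def bootstrap_sample(X, y, rng):
    """Draw |D| data samples with replacement from the data set D = (X, y)."""
    idx = rng.randint(0, len(X), size=len(X))
    return X[idx], y[idx]


class HybridEnsemble:
    """Illustrative hybrid ensemble; labels are assumed to be in {-1, +1}."""

    def __init__(self, algorithms, n_classifiers=10, random_state=0):
        self.algorithms = algorithms          # e.g., [DecisionTreeClassifier(), GaussianNB()]
        self.n_classifiers = n_classifiers
        self.random_state = random_state
        self.classifiers_ = []

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        for i in range(self.n_classifiers):
            # Select the algorithm in an alternating fashion; random selection
            # with equal (or prior-informed) probabilities would also fit the text.
            algorithm = clone(self.algorithms[i % len(self.algorithms)])
            Xb, yb = bootstrap_sample(X, y, rng)
            self.classifiers_.append(algorithm.fit(Xb, yb))
        return self

    def predict(self, X):
        # Majority voting over class labels in {-1, +1}, as in (1); ties go to +1.
        votes = np.stack([c.predict(X) for c in self.classifiers_])
        return np.where(votes.sum(axis=0) >= 0, 1, -1)
```

For example, HybridEnsemble([DecisionTreeClassifier(), GaussianNB()], n_classifiers=10).fit(X, y) would train the kind of DT + NB ensemble considered later in the experiments.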

We now present our analysis of why hybrid ensembles work, based on the analysis technique used in [59]. We start from the following definitions, in which $A(D_i)(x)$ denotes the classification of a testing data sample $x$ given by the classifier trained by applying $A$ on $D_i$.

Definition 1. $D$ is a given data set and $S = \{D_1, D_2, \ldots\}$ is a set of data sets generated with bootstrap sampling from $D$. The $i$th element $D_i$ in $S$ is of the same size as $D$; that is, $|D_i| = |D|$. $D_i$ and $D_j$ are used as training data sets, where $i \neq j$. With respect to $D_k$ used as a testing data set, where $k \neq i$ and $k \neq j$, the diversity of a classification algorithm $A$, denoted $\mathrm{Div}_A$, is defined as an expectation of disagreement between the classifiers trained by applying $A$ on $D_i$ and $D_j$, as given in (2):
$$\mathrm{Div}_A = \mathbb{E}_{x \in D_k}\big[\mathbb{I}\big(A(D_i)(x) \neq A(D_j)(x)\big)\big]. \tag{2}$$

Definition 1 describes how unstable a classification algorithm could possibly be on a given data set. Definition 2 describes the degree of being unstable (or the instability) of a classification algorithm.

Definition 2. Based on Definition 1, a classification algorithm $A$ is $(\alpha, \beta)$-unstable with respect to $\mathrm{Div}_A$ if the following holds, where the probability is taken over the bootstrap samples:
$$\Pr\big[\mathrm{Div}_A \geq \alpha\big] \geq \beta. \tag{3}$$
Here, the disagreement between classifiers comes from using different training data sets, and it is a type of diversity. In (3), $\beta$ is the lower bound of the probability that we can observe disagreement of at least $\alpha$; the larger the $\alpha$, the larger the diversity. When $\alpha$ is treated as a constant, a larger value of $\beta$ means a more unstable classification algorithm.
We use the process given below to estimate the instability of an algorithm on a data set. The idea is to use two sampled data sets to train two classifiers and then measure the difference in classifications made by the two classifiers on another sampled data set. Because these data sets are from the same given data set, difference in classifications mainly comes from the algorithm used to train the classifiers. If the algorithm is more unstable and sensitive to the changes in the data sets used for training, the trained classifiers would make more different classifications on a data set, part of which has been seen by both classifiers. The larger the value returned by the process, the more unstable the algorithm.
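A hedged sketch of this estimation process (corresponding to Algorithm 2) is given below, assuming that the disagreement rate is simply the fraction of differing predictions on a third bootstrap sample; the function name and the use of scikit-learn estimators are ours, not the paper's implementation.

```python
# Sketch of the instability-estimation process (Algorithm 2): beta is estimated
# as the fraction of trials in which two classifiers trained with the same
# algorithm on different bootstrap samples disagree on at least a fraction
# alpha of a third bootstrap sample.
import numpy as np
from sklearn.base import clone


def estimate_instability(algorithm, X, y, alpha=0.05, n_trials=100, seed=0):
    rng = np.random.RandomState(seed)
    n = len(X)
    hits = 0
    for _ in range(n_trials):
        # Three independent bootstrap samples of the given data set.
        i1, i2, i3 = (rng.randint(0, n, size=n) for _ in range(3))
        c1 = clone(algorithm).fit(X[i1], y[i1])
        c2 = clone(algorithm).fit(X[i2], y[i2])
        disagreement = np.mean(c1.predict(X[i3]) != c2.predict(X[i3]))
        hits += int(disagreement >= alpha)
    return hits / n_trials  # estimate of beta for the given alpha
```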

Definition 3. $D$ is a given data set and $S = \{D_1, D_2, \ldots\}$ is a set of data sets generated with bootstrap sampling from $D$. The $i$th element $D_i$ in $S$ is of the same size as $D$; that is, $|D_i| = |D|$. $D_i$ is used as a training data set. With respect to $D_j$ used as a testing data set, where $j \neq i$, the diversity of two classification algorithms $A_1$ and $A_2$, denoted $\mathrm{Div}_{A_1, A_2}$, is defined as an expectation of disagreement between the classifiers trained by applying $A_1$ and $A_2$ on $D_i$, as given in (4):
$$\mathrm{Div}_{A_1, A_2} = \mathbb{E}_{x \in D_j}\big[\mathbb{I}\big(A_1(D_i)(x) \neq A_2(D_i)(x)\big)\big]. \tag{4}$$

Definition 3 describes how different two classification algorithms could possibly be on a given data set. Definition 4 describes the degree of being different (or the differentiability) of two classification algorithms.

Definition 4. Based on Definition 3, two classification algorithms $A_1$ and $A_2$ are $(\delta, \gamma)$-differentiable with respect to $\mathrm{Div}_{A_1, A_2}$ if the following holds, where the probability is again taken over the bootstrap samples:
$$\Pr\big[\mathrm{Div}_{A_1, A_2} \geq \delta\big] \geq \gamma. \tag{5}$$
Here, the disagreement between classifiers comes from using different classification algorithms, and it too is a type of diversity. In (5), $\gamma$ is the lower bound of the probability that we can observe disagreement of at least $\delta$; the larger the $\delta$, the larger the diversity. Similarly, when $\delta$ is treated as a constant, a larger value of $\gamma$ means that the two classification algorithms are more different.
We use the process given below to estimate the differentiability of a pair of algorithms on a data set. The idea is to use two algorithms and a sampled data set to train two classifiers and then measure the difference in classifications made by the two classifiers on another sampled data set. If the two algorithms are more differentiable, the classifiers trained with them would make more different classifications on a data set, part of which has been seen by both classifiers. The larger the value returned by the process, the more differentiable the two algorithms.
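A matching sketch of this process (corresponding to Algorithm 3) follows, under the same assumptions as the instability sketch above; again, the function name is illustrative rather than the paper's implementation.

```python
# Sketch of the differentiability-estimation process (Algorithm 3): gamma is
# estimated as the fraction of trials in which classifiers trained with two
# different algorithms on the same bootstrap sample disagree on at least a
# fraction delta of another bootstrap sample.
import numpy as np
from sklearn.base import clone


def estimate_differentiability(algo_1, algo_2, X, y, delta=0.05, n_trials=100, seed=0):
    rng = np.random.RandomState(seed)
    n = len(X)
    hits = 0
    for _ in range(n_trials):
        # One bootstrap sample for training, another for testing.
        i_train = rng.randint(0, n, size=n)
        i_test = rng.randint(0, n, size=n)
        c1 = clone(algo_1).fit(X[i_train], y[i_train])
        c2 = clone(algo_2).fit(X[i_train], y[i_train])
        disagreement = np.mean(c1.predict(X[i_test]) != c2.predict(X[i_test]))
        hits += int(disagreement >= delta)
    return hits / n_trials  # estimate of gamma for the given delta
```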

We show a connection between diversity and accuracy gain for using a hybrid ensemble through the proposition given below.

Proposition 1. If two classification algorithms $A_1$ and $A_2$ are $(\delta, \gamma)$-differentiable and $A_1$ is $(\alpha, \beta)$-unstable, then (6) holds, where $\Delta\mathrm{Acc}$ is the difference between the expected accuracy of a hybrid ensemble with $A_1$ and $A_2$ and the expected accuracy of a non-hybrid ensemble with only $A_1$, and $\mathrm{Acc}_2$ is the accuracy of a classifier trained by using $A_2$ (analogously, $\mathrm{Acc}_1$ denotes the accuracy of a classifier trained by using $A_1$).

Regarding (6), $A_1$ is the classification algorithm used to train most classifiers in both ensembles, and $A_2$ is the classification algorithm used to train a classifier in the hybrid ensemble. When $\alpha$ and $\delta$ are treated as constants, the lower bound of the probability that we can observe an accuracy gain depends on how unstable $A_1$ is and how different $A_1$ and $A_2$ are. If $\beta$ is larger, meaning that $A_1$ is more unstable, and $\gamma$ is larger, meaning that $A_1$ and $A_2$ are more different, then the lower bound of the probability is larger; if $\beta$ is smaller, meaning that $A_1$ is more stable, and $\gamma$ is smaller, meaning that $A_1$ and $A_2$ are more similar, then the lower bound of the probability is smaller, although this does not necessarily mean that the probability of observing an accuracy gain is smaller. As a result, if the algorithm used to train most classifiers in a hybrid ensemble is unstable and is different from the other algorithm, then it is more probable that an accuracy gain will be observed. Furthermore, the lower bound of the accuracy gain depends on how unstable $A_1$ is, how different $A_1$ and $A_2$ are, and how accurate the classifiers trained with $A_1$ and $A_2$ are. If the classifier trained with $A_2$ is more accurate, $\mathrm{Acc}_2$ is larger and the lower bound of the accuracy gain is larger. As a result, it is beneficial to replace a classifier (or some classifiers) in a non-hybrid ensemble with a classifier (or some classifiers) trained with a different yet accurate classification algorithm. Nevertheless, it may not be beneficial to replace all of them, because doing so would lower the diversity among the classifiers.

Proof. $D_1$, $D_2$, and $D_3$ are three data sets generated with bootstrap sampling from a given data set. Classifiers $C_1$, $C_2$, and $C_3$ are trained by applying a classification algorithm $A_1$ on $D_1$, $D_2$, and $D_3$, respectively; that is, $C_1 = A_1(D_1)$, $C_2 = A_1(D_2)$, and $C_3 = A_1(D_3)$. The ensemble composed of $C_1$, $C_2$, and $C_3$ is a non-hybrid ensemble, because it is created with only one classification algorithm. Classifier $C_4$ is trained by applying another classification algorithm $A_2$ on $D_3$; that is, $C_4 = A_2(D_3)$. The ensemble composed of $C_1$, $C_2$, and $C_4$ is a hybrid ensemble, because it is created with two different classification algorithms.
Initially, we represent the expected accuracy of the non-hybrid ensemble in (7), which is based on majority voting; in what follows, $y$ is the true class label of a data sample. Similarly, we represent the expected accuracy of the hybrid ensemble in (8). The difference between the expected accuracy of the hybrid ensemble and that of the non-hybrid ensemble, that is, the difference between (8) and (7), is denoted by $\Delta\mathrm{Acc}$ and is given in (9). Next, (9) is rewritten as (10) by rearranging its components: the terms that involve only the classifiers shared by the two ensembles cancel, and the remaining terms are grouped so that (11) is obtained. In (11), the first component is related to $\mathrm{Div}_{A_1, A_2}$ and the second component is related to accuracy. As the two components are independent, (11) is rewritten as (12). Next, by referring to Definition 4, the second component in (12) is associated with $\mathrm{Acc}_2$, which is given in (13). Because $A_1$ and $A_2$ are $(\delta, \gamma)$-differentiable, the diversity-related component is larger than or equal to $\delta$ with a probability of at least $\gamma$. Combining this with $A_1$ being $(\alpha, \beta)$-unstable, (14) is obtained. Finally, (15) is obtained from (14), and the proof is complete.

We treat $\alpha$ and $\delta$ as constants, as in the processes shown in Algorithms 2 and 3. Given an ensemble of classifiers trained by using $A_1$, we replace some classifiers with classifiers trained by using $A_2$, which generally provides a higher value of accuracy (so that $\mathrm{Acc}_2$ is larger than $\mathrm{Acc}_1$); the lower bound of the accuracy gain will then be positive, meaning that we would probably obtain a hybrid ensemble that could achieve better classification performance. If $A_1$ is more unstable, $\beta$ will be larger. If $A_2$ is more different from $A_1$, $\gamma$ will be larger. The larger $\beta$ and the larger $\gamma$, the larger the lower bound of the probability that we would observe better accuracy.

Algorithm 2: Process to estimate instability.
Algorithm 3: Process to estimate differentiability.

3. Results and Discussion

3.1. Data Sets

In the experiments, we use 20 data sets from the UCI Machine Learning Repository [60], supported by the School of Information and Computer Science, University of California, Irvine, and from the Department of Statistics, Carnegie Mellon University. We use public data sets to allow the reproduction of the experiments.

The data sets used in the experiments are from various application domains. Their characteristics are summarized in Table 1, where the first column is the serial number, the second column is the name of a data set, the third column is the number of data samples, the fourth column is the number of data samples that belong to the minority class, the fifth column is the number of nominal attributes, the sixth column is the number of numeric attributes, and the numbers in parentheses indicate the numbers of attributes with missing values. The proportion of minority samples affects the classification performance of a classifier or an ensemble, and so does the proportion of attributes with missing values. For a data set that is usually used in regression analysis, we apply discretization to its target attribute and divide the continuous values of the attribute into two intervals, each of which corresponds to a class label; such a data set is given the suffix “binary.”
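One possible way to perform such a discretization is sketched below; the paper does not state the cut point, so the use of the median as the threshold is an assumption made purely for illustration.

```python
# Illustrative binarization of a continuous target attribute into two class
# labels in {-1, +1}; the median cut point is an assumption, since the paper
# only states that the values are divided into two intervals.
import numpy as np


def binarize_target(target_values):
    threshold = np.median(target_values)
    return np.where(target_values > threshold, 1, -1)
```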

Table 1: Characteristics of the data sets used in experiments.
3.2. Settings

We consider C4.5 decision tree and naïve Bayes classification algorithms. The former is denoted by DT, and the latter is denoted by NB. We consider single classifiers trained by using DT or NB, non-hybrid ensembles trained with DT or NB, and hybrid ensembles trained with DT and NB.

We implement the process given in Algorithm 1 by using WEKA [61], and we compare the classification performance of the ensembles it creates to that given by Bagging. The ensembles created by using Bagging with DT or NB are non-hybrid ensembles with only DT or only NB. We set the number of classifiers in an ensemble to 10 for all ensemble learning algorithms. We use 20 × 10-fold cross-validation for classification performance evaluation; that is, for each combination of an ensemble learning algorithm and a data set, we run 10-fold cross-validation 20 times randomly and independently. In addition, we investigate instability and differentiability.
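The experiments themselves are run in WEKA; as a rough analogue of the evaluation setup (ensembles of 10 classifiers, 20 × 10-fold cross-validation), the scikit-learn sketch below could be used, with DecisionTreeClassifier and GaussianNB approximating WEKA's C4.5 and naïve Bayes (an assumption) and HybridEnsemble referring to the earlier sketch. Note that it pools the scores over all folds rather than per run, which is a simplification.

```python
# Rough analogue of the evaluation protocol: repeated 10-fold cross-validation
# for Bagging with DT, Bagging with NB, and the hybrid ensemble sketched above.
# X and y are assumed to be NumPy arrays with labels in {-1, +1}.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(model_factory, X, y, n_splits=10, n_repeats=20, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return np.mean(scores), np.std(scores)

# Example usage with 10 base classifiers per ensemble:
# evaluate(lambda: BaggingClassifier(DecisionTreeClassifier(), n_estimators=10), X, y)
# evaluate(lambda: BaggingClassifier(GaussianNB(), n_estimators=10), X, y)
# evaluate(lambda: HybridEnsemble([DecisionTreeClassifier(), GaussianNB()], 10), X, y)
```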

3.3. Instability

We use the process described earlier to estimate the instability of an algorithm on a data set. We set α to 0.05 (by statistical convention) and the number of trials to 100. We report the results in Table 2. The content of a cell indicates the value of β, that is, the probability that the disagreement rate is larger than or equal to α = 0.05. The disagreements are given by two classifiers trained by applying the algorithm corresponding to the column on two data sets sampled from the data set corresponding to the row. The probability is calculated with reference to the 100 trials. The larger the value of β, the more unstable the algorithm on the data set. DT is more unstable than NB on 13 out of these 20 data sets. This is consistent with the general view that the decision tree algorithm is unstable (or that it is a classification algorithm with high variance) and therefore suitable for creating ensembles. The difference in instability values is larger than or equal to 0.5 on 4 data sets: heart-c-binary, heart-h-binary, hprice-binary, and sonar. The results clearly show that instability results not only from the nature of a data set but also from the nature of a classification algorithm.

Table 2: Results for instability.
3.4. Differentiability

We use the process described earlier to estimate the differentiability of two algorithms on a data set. We set δ to 0.05 (by statistical convention) and the number of trials to 100. We report the results in Table 3. The content of a cell indicates the value of γ, that is, the probability that the disagreement rate is larger than or equal to δ = 0.05. The disagreements are given by two classifiers trained by applying DT and NB separately on a data set sampled from the data set corresponding to the row. The probability is calculated with reference to the 100 trials. The larger the value of γ, the more different the two algorithms on the data set. According to Table 3, DT and NB are different, or behave differently, on 19 out of these 20 data sets; they are not sufficiently different on the data set breast-w. The decision tree and naïve Bayes algorithms are fundamentally different: for example, the former makes no assumptions about the data set, while the latter assumes statistical independence between attributes; the former discretizes numeric attributes, while the latter can apply density estimation to numeric attributes; the former uses special treatment for missing values, while the latter handles missing values naturally.

Table 3: Results for differentiability.
3.5. Performance

We report the values of accuracy in Table 4. The results support the idea that we could obtain better classification performance by using different classification algorithms to train the classifiers in an ensemble. In the table, HE denotes the hybrid ensemble, and the content of a cell indicates the mean and standard deviation calculated over 20 runs of 10-fold cross-validation for applying the algorithm corresponding to the column to the data set corresponding to the row. Generally speaking, compared to a single classifier, an ensemble shows more stable classification performance, especially when the underlying classification algorithm is an unstable one. We can see this from the fact that the standard deviations given by ensembles are lower than those given by single classifiers in most cases. For example, on the first data set, biomed, the standard deviation given by DT is 0.014, that given by Bagging DT is 0.009, and that given by the hybrid ensemble DT + NB is 0.007.

Table 4: Performance in accuracy.

In some cases, such as those where the data distributions are skewed, accuracy is not a good measure for classification performance evaluation, and F1-measure is a more comprehensive measure. In Table 5, we report the values of F1-measure, the harmonic mean of precision and recall, for the data samples that belong to the minority class. A higher value of F1-measure means better classification performance. F1-measure for the minority class is used to evaluate how well a classification algorithm performs on data samples that belong to the minority class, which are usually the targets in real-world machine learning applications. On highly unbalanced data sets, a classifier or an ensemble can achieve a low error rate simply by classifying all samples into the majority class, that is, by ignoring the minority samples. In the table, similarly, HE denotes the hybrid ensemble, and the content of a cell indicates the mean and standard deviation.
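For completeness, the minority-class F1-measure can be computed as in the following sketch; the helper name is ours, while treating the less frequent label as the positive class is exactly what the text describes.

```python
# Minority-class F1: the harmonic mean of precision and recall, computed with
# the less frequent class label treated as the positive class.
import numpy as np
from sklearn.metrics import f1_score


def minority_f1(y_true, y_pred):
    labels, counts = np.unique(y_true, return_counts=True)
    minority_label = labels[np.argmin(counts)]
    return f1_score(y_true, y_pred, pos_label=minority_label)
```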

Table 5: Performance in F1-measure.

To assess the significance of differences in classification performance achieved by two algorithms, we perform the Wilcoxon test, a non-parametric test, whose use is recommended by Demšar [62]. We report results for the test for accuracy and F1-measure in Tables 6 and 7, respectively. In the tables, B-DT means Bagging DT, B-NB means Bagging NB, and similarly HE is for the hybrid ensemble. In the tables, the content in a cell indicates the number of data sets on which the algorithm in the corresponding row significantly wins (outperforms) and loses to (is outperformed by) the algorithm in the corresponding column. For example, in Table 6, the first cell shows that decision tree algorithm wins in accuracy naïve Bayes algorithm on 10 data sets and it loses on 9 data sets, while the difference in accuracy is not significant on 1 data set; in Table 7, the first cell shows that decision tree algorithm wins in F1-measure naïve Bayes algorithm in 18 data sets and it does not lose on a data set, while the difference in F1-measure is not significant on 2 data sets.
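A per-data-set comparison of this kind could be reproduced as sketched below, using scipy.stats.wilcoxon on paired scores; the assumption that the pairing unit is the per-run cross-validation score and that 0.05 is the significance level is ours, since the paper does not state these details explicitly.

```python
# Per-data-set significance comparison of two algorithms: a Wilcoxon
# signed-rank test over paired scores (e.g., the 20 cross-validation runs),
# counting significant wins and losses across data sets.
import numpy as np
from scipy.stats import wilcoxon


def win_loss_counts(scores_a, scores_b, level=0.05):
    """scores_a, scores_b: lists of per-run score arrays, one pair per data set;
    each pair is assumed not to be identical everywhere."""
    wins = losses = 0
    for a, b in zip(scores_a, scores_b):
        _, p_value = wilcoxon(a, b)
        if p_value < level:
            if np.mean(a) > np.mean(b):
                wins += 1
            else:
                losses += 1
    return wins, losses
```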

Table 6: Results for the Wilcoxon test for accuracy (row versus column).
Table 7: Results for the Wilcoxon test for F1-measure (row versus column).

We can see from Table 6 that, in terms of accuracy, the number of data sets on which HE DT + NB outperforms B-DT is slightly smaller than the number on which HE DT + NB is outperformed by B-DT, while the number of data sets on which HE DT + NB outperforms B-NB is much larger than the number on which HE DT + NB is outperformed by B-NB. We can see from Table 7 that, in terms of F1-measure, the results are similar, but HE DT + NB outperforms less often. At first glance, the results seem to argue against creating and using hybrid ensembles. Rather, they show that hybrid ensembles outperform non-hybrid ensembles not deterministically but probabilistically, which is also shown by our analysis. Moreover, in this paper, we intend to explain why and when hybrid ensembles are better than non-hybrid ensembles.

We propose using different classification algorithms to train more diverse classifiers in order to create better ensembles. The basic idea is to use a combination of classifiers to naturally reduce variance and to use a stronger algorithm to explicitly increase classification performance. The ensemble creation process that we propose is distinctive because it uses fundamentally different classification algorithms to create a hybrid ensemble. For example, it uses DT, which often has high variance (related to high instability), together with NB, which often has low variance (related to low instability, or high stability). Using such a combination of classification algorithms goes against the generally accepted view that one should use only classification algorithms with high variance in an ensemble such as Bagging. We evaluate the proposed process by using a varied collection of public data sets and two metrics. The experiment results reveal that the proposed process can achieve better performance when compared to Bagging.

3.6. Discussion

There are 8 data sets on which the hybrid ensemble DT + NB outperforms both Bagging DT and Bagging NB: biomed, breast-w, credit-g, diabetes, heart-c-binary, heart-statlog, hepatitis, and schizo. According to Table 4, the naïve Bayes algorithm performs better in accuracy than the decision tree algorithm does on these data sets. Therefore, we consider that $A_1$ is DT and $A_2$ is NB, which is newly introduced into the ensemble and used as another generator of diversity. For these data sets, we summarize the lower bounds for the accuracy gain and for the probability that the gain is obtained in Table 8. In the table, the second column (Acc1) and the third column (Acc2) are from the second and third columns in Table 4, respectively; the fourth column is from the second column in Table 2; the fifth column is from the second column in Table 3; the sixth and seventh columns are the lower bounds for the accuracy gain and for the probability that the gain is obtained, respectively, and both are calculated by (6), or (15), given that α is 0.05 and δ is 0.05; the eighth and ninth columns are from the fourth and sixth columns in Table 4, respectively; the tenth column is the accuracy difference between HE DT + NB and B-DT. According to Table 8, on all these data sets, the accuracy difference, that is, the actual gain given by the hybrid ensemble DT + NB against Bagging DT, is larger than the lower bound. All the probabilities are high, except the one for data set number 3, breast-w, which means that the hybrid ensemble DT + NB would very probably outperform Bagging DT (and this is indeed the case). Nevertheless, the lower bound is loose (although it is still the only one presented so far), and finding a tighter bound is part of the future work.

Table 8: Lower bounds for accuracy gain and the probability.

4. Conclusion

Ensemble learning trains classifiers and then combines their classifications to make an overall classification. Many researchers use ensembles of classifiers created by using a single classification algorithm in various applications. These are non-hybrid ensembles, and why they work is becoming clearer. Some researchers propose the use of a mixture of two different types of classification algorithms in the creation of a hybrid ensemble. In this paper, we investigate why hybrid ensembles work, which is still somewhat unclear. We present a theoretical analysis from the standpoint of diversity, which plays a significant role in ensemble learning and is one of the fundamental elements of the success of ensembles; this is the most distinguishing characteristic of this paper. We also report and discuss experiment results obtained from hybrid ensembles of classifiers created by the decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm and often used to create non-hybrid ensembles. These are two fundamentally different classification algorithms, and therefore it is interesting to see that hybrid ensembles created with them together can achieve similar or even better classification performance compared to non-hybrid ensembles created with them individually. In short, we contribute a complement to the theoretical foundation of creating and using hybrid ensembles. The hybrid ensemble performs better not in a deterministic but in a probabilistic manner. In a hybrid ensemble where two classification algorithms are used, if the two are different and one is unstable while the other is more accurate, then there is a higher probability that we can obtain a higher accuracy gain; the gain is measured against an ensemble where only one of the two classification algorithms is used.

Competing Interests

The author declares that there are no conflicts of interest regarding the publication of this paper.

References

1. R. Ranawana and V. Palade, “Multi-classifier systems: review and a roadmap for developers,” International Journal of Hybrid Intelligent Systems, vol. 3, no. 1, pp. 35–61, 2006.
2. R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, 2006.
3. R. Polikar, “Bootstrap-inspired techniques in computation intelligence,” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 59–72, 2007.
4. G. Brown, “Ensemble learning,” in Encyclopedia of Machine Learning, pp. 312–320, Springer, New York, NY, USA, 2010.
5. C. Zhang and Y. Ma, Eds., Ensemble Machine Learning, Springer, New York, NY, USA, 2012.
6. Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, Chapman & Hall/CRC, 2012.
7. M. Woźniak, M. Graña, and E. Corchado, “A survey of multiple classifier systems as hybrid systems,” Information Fusion, vol. 16, no. 1, pp. 3–17, 2014.
8. M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
9. G. Giacinto and F. Roli, “Design of effective neural network ensembles for image classification purposes,” Image and Vision Computing, vol. 19, no. 9-10, pp. 699–707, 2001.
10. K.-S. Goh, E. Chang, and K.-T. Cheng, “SVM binary classifier ensembles for image classification,” in Proceedings of the 10th International Conference on Information and Knowledge Management (ACM CIKM '01), pp. 395–402, November 2001.
11. M. Pal, “Ensemble of support vector machines for land cover classification,” International Journal of Remote Sensing, vol. 29, no. 10, pp. 3043–3049, 2008.
12. A. Merentitis, C. Debes, and R. Heremans, “Ensemble learning in hyperspectral image classification: toward selecting a favorable bias-variance tradeoff,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 4, pp. 1089–1102, 2014.
13. A. Samat, P. Du, S. Liu, J. Li, and L. Cheng, “E2LMs: ensemble extreme learning machines for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 4, pp. 1060–1069, 2014.
14. M. Han and B. Liu, “Ensemble of extreme learning machine for remote sensing image classification,” Neurocomputing, vol. 149, pp. 65–70, 2015.
15. R. Cappelli, D. Maio, and D. Maltoni, “A multi-classifier approach to fingerprint classification,” Pattern Analysis and Applications, vol. 5, no. 2, pp. 136–144, 2002.
16. I. Maqsood, M. R. Khan, and A. Abraham, “An ensemble of neural networks for weather forecasting,” Neural Computing and Applications, vol. 13, no. 2, pp. 112–122, 2004.
17. Y.-S. Dong and K.-S. Han, “A comparison of several ensemble methods for text categorization,” in Proceedings of the IEEE International Conference on Services Computing (SCC '04), pp. 419–422, September 2004.
18. T. Rohlfing and C. R. Maurer Jr., “Multi-classifier framework for atlas-based image segmentation,” Pattern Recognition Letters, vol. 26, no. 13, pp. 2070–2079, 2005.
19. S. Avidan, “Ensemble tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 261–271, 2007.
20. H.-B. Shen and K.-C. Chou, “Ensemble classifier for protein fold pattern recognition,” Bioinformatics, vol. 22, no. 14, pp. 1717–1722, 2006.
21. S. B. Cho and H.-H. Won, “Cancer classification using ensemble of neural networks with multiple significant gene subsets,” Applied Intelligence, vol. 26, no. 3, pp. 243–250, 2007.
22. D. Gray and H. Tao, “View point invariant pedestrian recognition with an ensemble of localized features,” in Proceedings of the 10th European Conference on Computer Vision: Part I, pp. 262–275, Marseille, France, 2008.
23. S. Paisitkriangkrai, C. Shen, and A. van den Hengel, “Pedestrian detection with spatially pooled features and structured ensemble learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1243–1257, 2016.
24. H. I. Aljamaan and M. O. Elish, “An empirical study of bagging and boosting ensembles for identifying faulty classes in object-oriented software,” in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pp. 187–194, IEEE, Nashville, Tenn, USA, April 2009.
25. I. H. Laradji, M. Alshayeb, and L. Ghouti, “Software defect prediction using ensemble learning on selected features,” Information and Software Technology, vol. 58, pp. 388–402, 2015.
26. Y. Su, S. Shan, X. Chen, and W. Gao, “Hierarchical ensemble of global and local classifiers for face recognition,” IEEE Transactions on Image Processing, vol. 18, no. 8, pp. 1885–1896, 2009.
27. I. Katakis, G. Tsoumakas, and I. Vlahavas, “Tracking recurring contexts using ensemble classifiers: an application to email filtering,” Knowledge and Information Systems, vol. 22, no. 3, pp. 371–391, 2010.
28. S. Kotsiantis, K. Patriarcheas, and M. Xenos, “A combinational incremental ensemble of classifiers as a technique for predicting students' performance in distance education,” Knowledge-Based Systems, vol. 23, no. 6, pp. 529–535, 2010.
29. A. Takemura, A. Shimizu, and K. Hamamoto, “Discrimination of breast tumors in ultrasonic images using an ensemble classifier based on the adaboost algorithm with feature selection,” IEEE Transactions on Medical Imaging, vol. 29, no. 3, pp. 598–609, 2010.
30. B. C. Ko, J. W. Gim, and J. Y. Nam, “Cell image classification based on ensemble features and random forest,” Electronics Letters, vol. 47, no. 11, pp. 638–639, 2011.
31. M. M. Fraz, P. Remagnino, A. Hoppe et al., “An ensemble classification-based approach applied to retinal blood vessel segmentation,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 9, pp. 2538–2548, 2012.
32. S. Mohapatra, D. Patra, and S. Satpathy, “An ensemble classifier system for early diagnosis of acute lymphoblastic leukemia in blood microscopic images,” Neural Computing and Applications, vol. 24, no. 7-8, pp. 1887–1904, 2014.
33. Z. Borbora, J. Srivastava, K.-W. Hsu, and D. Williams, “Churn prediction in MMORPGs using player motivation theories and ensemble approach,” in Proceedings of the IEEE 3rd International Conference on Privacy, Security, Risk and Trust and IEEE 3rd International Conference on Social Computing, pp. 157–164, Boston, Mass, USA, October 2011.
34. R. Xia, C. Zong, and S. Li, “Ensemble of feature sets and classification algorithms for sentiment classification,” Information Sciences, vol. 181, no. 6, pp. 1138–1152, 2011.
35. E. Fersini, E. Messina, and F. A. Pozzi, “Sentiment analysis: bayesian ensemble learning,” Decision Support Systems, vol. 68, pp. 26–38, 2014.
36. G. Wang, J. Sun, J. Ma, K. Xu, and J. Gu, “Sentiment classification: the contribution of ensemble learning,” Decision Support Systems, vol. 57, pp. 77–93, 2014.
37. M. Hagen, M. Potthast, M. Büchner, and B. Stein, “Twitter sentiment detection via ensemble classification using averaged confidence scores,” in Proceedings of the 37th European Conference on IR Research, pp. 741–754, Vienna, Austria, March 2015.
38. J. Kodovský, J. Fridrich, and V. Holub, “Ensemble classifiers for steganalysis of digital media,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 432–444, 2012.
39. K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.
40. M. Govindarajan, “Hybrid intrusion detection using ensemble of classification methods,” International Journal of Computer Network and Information Security, vol. 6, no. 2, pp. 45–53, 2014.
41. H. T. X. Doan and G. M. Foody, “Increasing soft classification accuracy through the use of an ensemble of classifiers,” International Journal of Remote Sensing, vol. 28, no. 20, pp. 4609–4623, 2007.
42. L. I. Kuncheva, M. Skurichina, and R. P. W. Duin, “An experimental study on diversity for bagging and boosting with linear classifiers,” Information Fusion, vol. 3, no. 4, pp. 245–258, 2002.
43. W. B. Langdon, S. J. Barrett, and B. F. Buxton, “Combining decision trees and neural networks for drug discovery,” in Proceedings of the 5th European Conference on Genetic Programming, pp. 60–70, Kinsale, Ireland, April 2002.
44. Z. Lu, X. Wu, and J. Bongard, “Adaptive informative sampling for active learning,” in Proceedings of the 10th SIAM International Conference on Data Mining (SDM '10), pp. 894–905, May 2010.
45. K.-W. Hsu, “Hybrid ensembles of decision trees and artificial neural networks,” in Proceedings of the 1st IEEE International Conference on Computational Intelligence and Cybernetics (CyberneticsCom '12), pp. 25–29, Bali, Indonesia, July 2012.
46. P. Domingos, “A unified bias-variance decomposition for zero-one and squared loss,” in Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, pp. 564–569, 2000.
47. C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
48. X. Wu, V. Kumar, J. R. Quinlan et al., “Top 10 algorithms in data mining,” Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
49. R. M. Salgado, J. J. F. Pereira, T. Ohishi, R. Ballini, C. A. M. Lima, and F. J. Von Zuben, “A hybrid ensemble model applied to the short-term load forecasting problem,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '06), pp. 2627–2634, IEEE, Vancouver, Canada, July 2006.
50. J.-K. Min and S.-B. Cho, “Activity recognition based on wearable sensors using selection/fusion hybrid ensemble,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC '11), pp. 1319–1324, October 2011.
51. A. Verikas, Z. Kalsyte, M. Bacauskiene, and A. Gelzinis, “Hybrid and ensemble-based soft computing techniques in bankruptcy prediction: a survey,” Soft Computing, vol. 14, no. 9, pp. 995–1010, 2010.
52. B. Verma and S. Z. Hassan, “Hybrid ensemble approach for classification,” Applied Intelligence, vol. 34, no. 2, pp. 258–278, 2011.
53. L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
54. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, Calif, USA, 1993.
55. D. Opitz and R. Maclin, “Popular ensemble methods: an empirical study,” Journal of Artificial Intelligence Research, vol. 11, pp. 169–198, 1999.
56. L. I. Kuncheva, “Using diversity measures for generating error-correcting output codes in classifier ensembles,” Pattern Recognition Letters, vol. 26, no. 1, pp. 83–90, 2005.
57. K.-W. Hsu and J. Srivastava, “Diversity in combinations of heterogeneous classifiers,” in Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 923–932, Bangkok, Thailand, April 2009.
58. K.-W. Hsu and J. Srivastava, “Relationship between diversity and correlation in multi-classifier systems,” in Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Part II, pp. 500–506, Hyderabad, India, June 2010.
59. K.-W. Hsu and J. Srivastava, “Improving bagging performance through multi-algorithm ensembles,” Frontiers in Computer Science, vol. 6, no. 5, pp. 498–512, 2012.
60. M. Lichman, UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, Calif, USA, 2013, http://archive.ics.uci.edu/ml.
61. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
62. J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.