Abstract

Inspired by the group decision-making process, ensembles, or combinations of classifiers, have been found favorable in a wide variety of application domains. Some researchers propose mixing two different types of classification algorithms to create a hybrid ensemble. Why does such an ensemble work? The question remains open. Following the concept of diversity, which is one of the fundamental elements of the success of ensembles, we conduct a theoretical analysis of why hybrid ensembles work, connecting the use of different algorithms to accuracy gain. We also conduct experiments on the classification performance of hybrid ensembles of classifiers created by the decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm often used to create non-hybrid ensembles. Through this paper, we thus provide a complement to the theoretical foundation of creating and using hybrid ensembles.

1. Introduction

Ensemble learning is inspired by the human group decision-making process and has gained much attention [17]. It creates an ensemble, a group of classifiers, and combines the classifications made by these classifiers into an overall classification. The advantage of ensembles is not that the best combination of classifiers outperforms the best classifier but that a combination of classifiers is less likely to misclassify unseen data samples than a single classifier. Ensembles have shown satisfactory classification performance in a large-scale comparative study [8].

Ensemble learning has been applied in various application domains, such as image classification [9–14], fingerprint classification [15], weather forecasting [16], text categorization [17], image segmentation [18], visual tracking [19], protein fold pattern recognition [20], cancer classification [21], pedestrian recognition [22] or detection [23], prediction of software quality [24, 25], face recognition [26], email filtering [27], prediction of students’ performance [28], medical image analysis [29–32], churn prediction [33], sentiment analysis [34–37], steganalysis [38], prediction of air quality [39], and intrusion detection [40].

From one point of view, ensemble learning has become popular because every classification algorithm has its own limitations. From another point of view, if every classifier in an ensemble has expertise in classifying data samples that belong to some portion of the given data set, the overall classification combined from all (or some) classifiers in the ensemble will potentially be more reliable.

Compared to classifiers working individually, classifiers working together have a better potential for achieving higher accuracy [41]. Diverse classifiers working together, in turn, have a better potential than non-diverse classifiers working together, as suggested in [42]. In creating an ensemble, every classifier that will be part of the ensemble is expected to be adequately accurate, while the correlation between the classifications made by any two classifiers in the ensemble is expected to be small.

Some researchers use a mixture of different types of classification algorithms to create hybrid ensembles. For example, with reference to ensembles composed of decision trees and artificial neural networks, Langdon et al. use such an ensemble in drug discovery [43]; Lu et al. discuss sampling methods along with these ensembles for active learning [44]; in [45], the author analyzes the hybrid ensembles for binary classification from the standpoint of bias-variance decomposition proposed in [46]. Furthermore, Salgado et al. use ensembles of artificial neural networks and support vector machines [47, 48] to predict daily electricity load [49]. Min and Cho use naïve Bayes classifiers and support vector machines for activity recognition [50]. Verikas et al. provide a survey of hybrid ensembles designed for bankruptcy prediction [51]. In addition, Verma and Hassan use integrations of clustering and classification algorithms to create hybrid ensembles [52].

Most researchers simply use hybrid ensembles without further investigation. In this paper, we therefore aim to develop a better understanding of hybrid ensembles. Our goal is not only to show that the classification performance of hybrid ensembles can be comparable or even superior to that of non-hybrid ensembles but also to explain why hybrid ensembles work from the standpoint of diversity. Diversity among the classifiers in an ensemble plays a significant role in the success of the ensemble. One type of ensemble learning algorithm uses sampling methods to generate different data sets for training diverse classifiers, such as Bagging (Bootstrap Aggregating) [53], while another type is rather ad hoc and uses different classification algorithms to train the diverse classifiers that make up an ensemble.

We create hybrid ensembles by integrating these two types of ensemble learning algorithms. The classification algorithms that we use to create hybrid ensembles are the C4.5 decision tree [48, 54] and naïve Bayes [48]. We provide an empirical comparison of these hybrid ensembles with others created by using Bagging. This paper is particularly relevant because very few papers both empirically evaluate hybrid ensembles and theoretically analyze them.

The rest of this paper is structured as follows. We provide background information and present our analysis in Section 2. Next, we report and discuss the experiment results in Section 3. Finally, we conclude in Section 4.

2. Materials and Methods

The goal of using several classifiers in an ensemble is to achieve better classification performance by combining classifications from these classifiers, each of which serves as an optimal model or an expert in a portion of the data set. These classifiers are expected to be uncorrelated and behave independently of each other; or, at least, they need to show different patterns of errors.

Diversity among classifiers in an ensemble is related to the success of the ensemble, because it can compensate for errors made by the individual classifiers [55]. In this section, we analyze the diversity and classification performance (e.g., accuracy) of ensembles created by using a single algorithm (i.e., non-hybrid ensembles) and of hybrid ensembles created by using two different algorithms. The relationship between diversity and accuracy is not “straightforward” [56]. On the one hand, combining classifications from classifiers of low diversity would not improve the overall accuracy, because these classifiers behave similarly on some portions of the data set; on the other hand, using highly diverse classifiers to create an ensemble does not guarantee high accuracy. Hsu and Srivastava show that using different classification algorithms in an ensemble is likely to increase diversity and decrease the correlation between classifiers in the ensemble [57], and they build a connection between diversity and correlation, which can be indirectly connected to accuracy [58]. Referring to the analysis technique used in [59], we build a more direct connection between diversity and accuracy, and we further analyze the influence on accuracy of using different algorithms to create an ensemble; this analysis distinguishes this paper from the earlier papers.

In what follows, a data sample is represented by a d-dimensional vector x, and its class label y is binary, positive or negative. C is a classifier, and C(x) is the classification that it makes for x. A is a classification algorithm, and D is a set of data samples. A(D)(x) means that a classifier is trained by applying A on D and is then used to classify x; it returns a binary class label.

What is described in (1) is the general form of an ensemble in which the classifications from the member classifiers are combined through majority voting to make the overall classification; it can be modified to use weighted majority voting. In (1), the ensemble consists of a number of classifiers, each classifier maps the data sample that needs to be classified to a classification, and the majority vote of these classifications is the overall classification. For those ensemble learning algorithms that only use different data sets to train diverse classifiers, the same classification algorithm is used for every classifier while the training data sets differ between all pairs of classifiers; Bagging is an example. For those that use different classification algorithms as well as different data sets, the algorithms differ between some (or, in a few cases, all) pairs of classifiers while the training data sets differ between all pairs; the latter case is of interest in this paper.
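
A plausible form of (1), written as the sign of the sum of the member classifications under the assumption of class labels +1 and -1 (the symbols E, C_i, and n are illustrative, not taken from the paper), is the following sketch:

\[
% assumed reconstruction of (1): majority voting over n member classifiers C_1, ..., C_n
E(x) \;=\; \operatorname{sign}\left( \sum_{i=1}^{n} C_i(x) \right),
\qquad C_i(x) \in \{+1, -1\}.
\]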

Given a data set and a set of classification algorithms, we create a hybrid ensemble of classifiers, each of which is trained by applying an algorithm selected in an alternating fashion from the set of algorithms on a set of data samples drawn from the given data set with bootstrap sampling. This process is shown in Algorithm 1. Instead of selecting algorithms in an alternating fashion, we can simply make a random selection such that each classifier in the hybrid ensemble is trained by one of the algorithms chosen with equal probability. Extending from this, we can incorporate prior knowledge and assign unequal probabilities to different algorithms.

Algorithm 1: Creating a hybrid ensemble of classifiers.
Input: A data set D, a set of classification algorithms A = {A1, ..., Am}, the number of classifiers n
Output: An ensemble E = {C1, ..., Cn}
Steps:
For i = 1 to n
  Use bootstrap sampling to sample D and generate Di, which is of the same size as D
  Select the ((i − 1) mod m + 1)th element in A as A*
  Train Ci by applying A* on Di
End For
Return E
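
A minimal Python sketch of Algorithm 1 follows, assuming scikit-learn is available and using DecisionTreeClassifier and GaussianNB as the two algorithms; the function names and the majority-voting helper are ours and only illustrate the process, not the paper's WEKA-based implementation:

import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def build_hybrid_ensemble(X, y, algorithms, n_classifiers=10, seed=0):
    """Algorithm 1 sketch: bootstrap sampling plus alternating algorithm selection."""
    rng = np.random.RandomState(seed)
    ensemble = []
    for i in range(n_classifiers):
        # D_i: a bootstrap sample of the same size as the original data set
        X_i, y_i = resample(X, y, replace=True, n_samples=len(X),
                            random_state=rng.randint(10**6))
        # select the algorithm in an alternating fashion
        algorithm = algorithms[i % len(algorithms)]
        ensemble.append(clone(algorithm).fit(X_i, y_i))
    return ensemble

def predict_majority(ensemble, X):
    """Unweighted majority vote, assuming class labels +1 and -1 as in the text."""
    votes = np.sum([clf.predict(X) for clf in ensemble], axis=0)
    return np.where(votes >= 0, 1, -1)

# Example usage (hypothetical arrays X, y, X_test):
# ensemble = build_hybrid_ensemble(X, y, [DecisionTreeClassifier(), GaussianNB()])
# y_pred = predict_majority(ensemble, X_test)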

The input of the bootstrap sampling method is a data set, and the output is a set of data samples drawn with replacement from it, of the same size as the input. We use bootstrap sampling when we need different data sets to train diverse classifiers in creating an ensemble. In such an ensemble, diversity among the classifiers comes entirely from the differences among the data sets used to train them, and bootstrap sampling is therefore the single generator of diversity. We additionally train diverse classifiers by using different classification algorithms, and by doing this we introduce a second generator of diversity.
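
The following small Python check (ours, not from the paper) illustrates why bootstrap sampling generates diversity: on average only about 63.2% of the original samples appear in any one bootstrap sample, so different bootstrap samples expose a classifier to noticeably different data.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
fractions = []
for _ in range(200):
    indices = rng.integers(0, n, size=n)           # one bootstrap sample: n draws with replacement
    fractions.append(len(np.unique(indices)) / n)  # fraction of distinct original samples included
print(round(np.mean(fractions), 3))                # close to 1 - 1/e, i.e., about 0.632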

We now present our analysis of why hybrid ensembles work, based on the analysis technique used in [59]. We start from the following definitions, in which A(Di)(x) denotes the classification of a testing data sample x given by the classifier trained by applying a classification algorithm A on a data set Di.

Definition 1. D is a given data set, and D1, D2, ... are data sets generated with bootstrap sampling from D, each of the same size as D. Di and Dj are used as training data sets, where i ≠ j. With respect to Dl used as a testing data set, where l ≠ i and l ≠ j, the diversity of a classification algorithm A is defined as the expectation of disagreement between the classifiers trained by applying A on Di and on Dj, as given in (2).
Definition 1 describes how unstable a classification algorithm could possibly be on a given data set. Definition 2 describes the degree of being unstable (the instability) of a classification algorithm.

Definition 2. Based on Definition 1, a classification algorithm A is (α, β)-unstable with respect to the diversity in Definition 1 if (3) holds.
Here, disagreement between classifiers comes from using different training data sets, and it is a type of diversity. In (3), β is the lower bound of the probability that we observe a disagreement of at least α; the larger the α, the larger the diversity. When α is treated as a constant, a larger value of β means a more unstable classification algorithm.
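
Plausible forms of (2) and (3), under the notation above (the name Div and the indicator function are our notation, introduced only for illustration), are the following sketches:

\[
% assumed reconstruction of (2): expected disagreement of A trained on D_i and on D_j,
% measured on the testing data set D_l; \mathbf{1}(\cdot) is the indicator function
\mathrm{Div}(A) \;=\; \mathbb{E}_{x \in D_l}\left[ \mathbf{1}\left( A(D_i)(x) \neq A(D_j)(x) \right) \right],
\]
\[
% assumed reconstruction of (3): (alpha, beta)-instability of A
\Pr\left[ \mathrm{Div}(A) \geq \alpha \right] \;\geq\; \beta.
\]
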
We use the process given below to estimate the instability of an algorithm on a data set. The idea is to use two sampled data sets to train two classifiers and then measure the difference in classifications made by the two classifiers on another sampled data set. Because these data sets are from the same given data set, difference in classifications mainly comes from the algorithm used to train the classifiers. If the algorithm is more unstable and sensitive to the changes in the data sets used for training, the trained classifiers would make more different classifications on a data set, part of which has been seen by both classifiers. The larger the value returned by the process, the more unstable the algorithm.

Definition 3. D is a given data set, and D1, D2, ... are data sets generated with bootstrap sampling from D, each of the same size as D. Di is used as a training data set. With respect to Dl used as a testing data set, where l ≠ i, the diversity of two classification algorithms A1 and A2 is defined as the expectation of disagreement between the classifiers trained by applying A1 and A2 on Di, as given in (4).
Definition 3 describes how differently two classification algorithms could possibly behave on a given data set. Definition 4 describes the degree of being different (the differentiability) of two classification algorithms.

Definition 4. Based on Definition 3, two classification algorithms A1 and A2 are (δ, γ)-differentiable with respect to the diversity in Definition 3 if (5) holds.
Here, disagreement between classifiers comes from using different classification algorithms, and it, too, is a type of diversity. In (5), γ is the lower bound of the probability that we observe a disagreement of at least δ; the larger the δ, the larger the diversity. Similarly, when δ is treated as a constant, a larger value of γ means that the two classification algorithms are more different.
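
Plausible forms of (4) and (5), under the same illustrative notation, are:

\[
% assumed reconstruction of (4): expected disagreement of A_1 and A_2 trained on the same D_i,
% measured on the testing data set D_l
\mathrm{Div}(A_1, A_2) \;=\; \mathbb{E}_{x \in D_l}\left[ \mathbf{1}\left( A_1(D_i)(x) \neq A_2(D_i)(x) \right) \right],
\]
\[
% assumed reconstruction of (5): (delta, gamma)-differentiability of A_1 and A_2
\Pr\left[ \mathrm{Div}(A_1, A_2) \geq \delta \right] \;\geq\; \gamma.
\]
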
We use the process given below to estimate the differentiability of a pair of algorithms on a data set. The idea is to use two algorithms and a sampled data set to train two classifiers and then measure the difference in classifications made by the two classifiers on another sampled data set. If the two algorithms are more differentiable, the classifiers trained with them would make more different classifications on a data set, part of which has been seen by both classifiers. The larger the value returned by the process, the more differentiable the two algorithms.

We show a connection between diversity and accuracy gain for using a hybrid ensemble through the proposition given below.

Proposition 1. If two classification algorithms A1 and A2 are (δ, γ)-differentiable and A1 is (α, β)-unstable, then (6) holds, where the accuracy gain is the difference between the expected accuracy of a hybrid ensemble with A1 and A2 and the expected accuracy of a non-hybrid ensemble with only A1, and Acc2 is the accuracy of the classifier trained by using A2.

Regarding (6), A1 is the classification algorithm used to train most classifiers in both ensembles, and A2 is the classification algorithm used to train a classifier in the hybrid ensemble. When α and δ are treated as constants, the lower bound of the probability that we observe an accuracy gain depends on how unstable A1 is and on how different A1 and A2 are. If β is larger, meaning that A1 is more unstable, and γ is larger, meaning that A1 and A2 are more different, then the lower bound of the probability is larger; if β is smaller, meaning that A1 is more stable, and γ is smaller, meaning that A1 and A2 are more similar, then the lower bound of the probability is smaller, although this does not necessarily mean that the probability of observing an accuracy gain is smaller. As a result, if the algorithm used to train most classifiers in a hybrid ensemble is unstable and different from the other algorithm, then it is more probable that an accuracy gain will be observed. Furthermore, the lower bound of the accuracy gain depends on how unstable A1 is, how different A1 and A2 are, and how accurate the classifiers trained with A1 and A2 are. If the classifier trained with A2 is more accurate, Acc2 is larger and the lower bound of the accuracy gain is larger. As a result, it is beneficial to replace a classifier (or some classifiers) in a non-hybrid ensemble with a classifier (or some classifiers) trained with a different yet accurate classification algorithm. Nevertheless, it may not be beneficial to replace them all, because doing so would lower the diversity among the classifiers.

Proof. D1, D2, and D3 are three data sets generated with bootstrap sampling from a given data set. Classifiers C1, C2, and C3 are trained by applying a classification algorithm A1 on D1, D2, and D3, respectively; that is, C1 = A1(D1), C2 = A1(D2), and C3 = A1(D3). The ensemble composed of C1, C2, and C3 is a non-hybrid ensemble, because only one classification algorithm is used. Classifier C4 is trained by applying another classification algorithm A2 on D3; that is, C4 = A2(D3). The ensemble composed of C1, C2, and C4 is a hybrid ensemble, because two different classification algorithms are used.
Initially, we represent the expected accuracy of the non-hybrid ensemble in (7), which is based on majority voting; in what follows, y is the class label of a data sample. Similarly, we represent the expected accuracy of the hybrid ensemble in (8). The difference between the expected accuracy of the hybrid ensemble and that of the non-hybrid ensemble, that is, the difference between (8) and (7), is given in (9). Next, (9) is rewritten as (10) by rearranging its components, and (10) is simplified by using relationships among the classifications made by C1, C2, C3, and C4, which yields (11). In (11), the first component is related to the diversity in Definition 3 and the second component is related to accuracy. As the two components are independent, (11) is rewritten as (12). Next, by referring to Definition 4, the diversity-related component in (12) is expressed as (13). Because A1 and A2 are (δ, γ)-differentiable, (13) is larger than or equal to δ with a probability of at least γ; combining this with A1 being (α, β)-unstable, (14) is obtained. Finally, (15) is obtained through (14), and the proof is complete.

We treat α and δ as constants, as in the processes shown in Algorithms 2 and 3. Given an ensemble of classifiers trained by using A1, we replace some classifiers with classifiers trained by using A2, which generally provides a higher value of accuracy (so that Acc2 is larger than Acc1); the lower bound of the accuracy gain will then be positive, meaning that we would probably obtain a hybrid ensemble that achieves better classification performance. If A1 is more unstable, β will be larger. If A2 is more different from A1, γ will be larger. The larger the β and the larger the γ, the larger the lower bound of the probability that we observe better accuracy.

Algorithm 2: Estimating the instability of a classification algorithm.
Input: A data set D, a classification algorithm A, α as in (3), the number of trials T
Output: A possible value of β as in (3)
Steps:
Set count = 0
For t = 1 to T
  Sample D and generate D1
  Train C1 by applying A on D1
  Sample D and generate D2
  Train C2 by applying A on D2
  Sample D and generate D3
  Use D3 to test C1 and C2
  If the disagreement rate ≥ α then count = count + 1
  End If
End For
Return count / T

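A minimal Python sketch of Algorithm 2, assuming scikit-learn; the function and variable names are ours. It estimates β as the fraction of trials in which two classifiers, trained with the same algorithm on two bootstrap samples, reach a disagreement rate of at least α on a third bootstrap sample:

import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def disagreement_rate(clf_a, clf_b, X):
    """Fraction of samples on which two classifiers disagree."""
    return np.mean(clf_a.predict(X) != clf_b.predict(X))

def estimate_instability(X, y, algorithm, alpha=0.05, trials=100, seed=0):
    """Algorithm 2 sketch: fraction of trials with disagreement rate >= alpha."""
    rng = np.random.RandomState(seed)
    hits = 0
    for _ in range(trials):
        X1, y1 = resample(X, y, random_state=rng.randint(10**6))  # D1
        X2, y2 = resample(X, y, random_state=rng.randint(10**6))  # D2
        X3, _ = resample(X, y, random_state=rng.randint(10**6))   # D3 (test)
        c1 = clone(algorithm).fit(X1, y1)
        c2 = clone(algorithm).fit(X2, y2)
        if disagreement_rate(c1, c2, X3) >= alpha:
            hits += 1
    return hits / trials  # estimated beta
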
Algorithm 3: Estimating the differentiability of two classification algorithms.
Input: A data set D, two classification algorithms A1 and A2, δ as in (5), the number of trials T
Output: A possible value of γ as in (5)
Steps:
Set count = 0
For t = 1 to T
  Use bootstrap sampling to sample D and generate D1
  Train C1 by applying A1 on D1
  Train C2 by applying A2 on D1
  Sample D and generate D2
  Use D2 to test C1 and C2
  If the disagreement rate ≥ δ then count = count + 1
  End If
End For
Return count / T
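
Algorithm 3 can be sketched analogously, reusing the imports and the disagreement_rate helper from the sketch of Algorithm 2 above; it estimates γ as the fraction of trials in which two classifiers trained with the two algorithms on the same bootstrap sample disagree at a rate of at least δ on another bootstrap sample:

def estimate_differentiability(X, y, algo_1, algo_2, delta=0.05, trials=100, seed=0):
    """Algorithm 3 sketch: fraction of trials with disagreement rate >= delta."""
    rng = np.random.RandomState(seed)
    hits = 0
    for _ in range(trials):
        X1, y1 = resample(X, y, random_state=rng.randint(10**6))  # D1 (train)
        X2, _ = resample(X, y, random_state=rng.randint(10**6))   # D2 (test)
        c1 = clone(algo_1).fit(X1, y1)
        c2 = clone(algo_2).fit(X1, y1)
        if disagreement_rate(c1, c2, X2) >= delta:
            hits += 1
    return hits / trials  # estimated gamma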

3. Results and Discussion

3.1. Data Sets

In the experiments, we use 20 data sets from the UCI Machine Learning Repository [60], supported by the School of Information and Computer Science, University of California, Irvine, and from the Department of Statistics, Carnegie Mellon University. We use public data sets to allow the experiments to be reproduced.

The data sets used in the experiments are from various application domains. Their characteristics are summarized in Table 1, where the first column is the serial number, the second column is the name of a data set, the third column is the number of data samples, the fourth column is the number of data samples that belong to the minority class, the fifth column is the number of nominal attributes, the sixth column is the number of numeric attributes, and the numbers in parentheses are the numbers of attributes with missing values. The proportion of minority samples affects the classification performance of a classifier or an ensemble, and so does the proportion of attributes with missing values. For a data set that is usually used in regression analysis, we apply discretization to its target attribute and divide the continuous values of the attribute into two appropriate intervals, each of which corresponds to a class label; such a data set is given the suffix “binary.”
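
As an illustration of the “binary” suffix, the following sketch binarizes a continuous target attribute; the choice of the median as the cut point is an assumption of ours, since the paper only states that the values are divided into two intervals:

import numpy as np

def binarize_target(values):
    """Split a continuous target attribute into two class labels at a cut point."""
    threshold = np.median(values)          # the median is an assumed cut point
    return np.where(np.asarray(values) > threshold, 1, -1)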

3.2. Settings

We consider C4.5 decision tree and naïve Bayes classification algorithms. The former is denoted by DT, and the latter is denoted by NB. We consider single classifiers trained by using DT or NB, non-hybrid ensembles trained with DT or NB, and hybrid ensembles trained with DT and NB.

We implement the process given in Algorithm 1 by using WEKA [61], and we compare the classification performance of the ensembles it creates to that of the ensembles created by Bagging. The ensembles created by using Bagging with DT or NB are non-hybrid ensembles with only DT or only NB. We set the number of classifiers in an ensemble to 10 for all ensemble learning algorithms. We use 20 × 10-fold cross-validation for classification performance evaluation; that is, for each combination of an ensemble learning algorithm and a data set, we run 10-fold cross-validation 20 times randomly and independently. In addition, we investigate instability and differentiability.
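
A sketch of the 20 × 10-fold cross-validation protocol in Python with scikit-learn (the paper's experiments are run in WEKA; this is only an illustration, the estimator passed in is assumed to follow the scikit-learn interface, and stratification is our choice):

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(estimator, X, y, metric="accuracy", seed=0):
    """Mean and standard deviation over 20 independent runs of 10-fold cross-validation."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=seed)
    scores = cross_val_score(estimator, X, y, scoring=metric, cv=cv)
    # splits are generated repeat by repeat, so average within each 10-fold run first
    run_scores = scores.reshape(20, 10).mean(axis=1)
    return run_scores.mean(), run_scores.std()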

3.3. Instability

We use the process described earlier (Algorithm 2) to estimate the instability of an algorithm on a data set. We set α to 0.05 (by statistical convention) and the number of trials to 100. We report the results in Table 2. The content of a cell indicates the value of β, that is, the probability that the disagreement rate is larger than or equal to α = 0.05. The disagreements are given by two classifiers trained by applying the algorithm corresponding to the column on two data sets sampled from the data set corresponding to the row. The probability is calculated with reference to the 100 trials. The larger the value of β, the more unstable the algorithm is on the data set. DT is more unstable than NB on 13 out of these 20 data sets. This is consistent with the general view that the decision tree algorithm is unstable (i.e., it is a classification algorithm with high variance) and therefore suitable for creating ensembles. The difference in instability values is larger than or equal to 0.5 on 4 data sets: heart-c-binary, heart-h-binary, hprice-binary, and sonar. The results clearly show that instability results not only from the nature of a data set but also from the nature of a classification algorithm.

3.4. Differentiability

We use the process described earlier (Algorithm 3) to estimate the differentiability of two algorithms on a data set. We set δ to 0.05 (by statistical convention) and the number of trials to 100. We report the results in Table 3. The content of a cell indicates the value of γ, that is, the probability that the disagreement rate is larger than or equal to δ = 0.05. The disagreements are given by two classifiers trained by applying DT and NB separately on a data set sampled from the data set corresponding to the row. The probability is calculated with reference to the 100 trials. The larger the value of γ, the more different the two algorithms are on the data set. According to Table 3, DT and NB are different, or behave differently, on 19 out of these 20 data sets; they are not sufficiently different on the data set breast-w. The decision tree and naïve Bayes algorithms are fundamentally different: for example, the former makes no assumptions about the data set, while the latter assumes statistical independence between attributes; the former discretizes numeric attributes, while the latter can apply density estimation to numeric attributes; the former uses a special treatment for missing values, while the latter handles missing values naturally.

3.5. Performance

We report the values of accuracy in Table 4. The results support the idea that better classification performance can be obtained by using different classification algorithms to train the classifiers in an ensemble. In the table, HE stands for the hybrid ensemble, and the content of a cell indicates the mean and standard deviation calculated over 20 runs of 10-fold cross-validation for applying the algorithm corresponding to the column to the data set corresponding to the row. Generally speaking, compared to a single classifier, an ensemble shows stabilized classification performance, especially when the underlying classification algorithm is unstable. We can see this from the fact that the standard deviations given by ensembles are lower than those given by single classifiers in most cases. For example, on the first data set, biomed, the standard deviation given by DT is 0.014, that given by Bagging DT is 0.009, and that given by the hybrid ensemble DT + NB is 0.007.

In some cases, such as those where the data distribution is skewed, accuracy is not a good measure for classification performance evaluation, and F1-measure is a more comprehensive measure. In Table 5, we report the values of F1-measure, the harmonic mean of precision and recall, for the data samples that belong to the minority class. A higher value of F1-measure means better classification performance. The F1-measure for the minority class is used to evaluate how well a classification algorithm performs on the data samples that belong to the minority class, which are usually the targets in real-world machine learning applications; on a highly unbalanced data set, a classifier or an ensemble can achieve a low error rate simply by classifying all samples into the majority class, that is, by ignoring the minority samples. In the table, HE again stands for the hybrid ensemble, and the content of a cell indicates the mean and standard deviation.
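
For reference, a minimal sketch of the minority-class F1-measure with scikit-learn; the function name is ours, and minority_label is whichever class is rarer in the data set at hand:

from sklearn.metrics import f1_score

def minority_f1(y_true, y_pred, minority_label):
    """F1-measure (harmonic mean of precision and recall) for the minority class only."""
    return f1_score(y_true, y_pred, pos_label=minority_label)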

To assess the significance of the differences in classification performance achieved by two algorithms, we perform the Wilcoxon test, a non-parametric test whose use is recommended by Demšar [62]. We report the test results for accuracy and F1-measure in Tables 6 and 7, respectively. In the tables, B-DT means Bagging DT, B-NB means Bagging NB, and HE again stands for the hybrid ensemble. The content of a cell indicates the number of data sets on which the algorithm in the corresponding row significantly wins against (outperforms) and loses to (is outperformed by) the algorithm in the corresponding column. For example, in Table 6, the first cell shows that the decision tree algorithm wins in accuracy against the naïve Bayes algorithm on 10 data sets and loses on 9 data sets, while the difference in accuracy is not significant on 1 data set; in Table 7, the first cell shows that the decision tree algorithm wins in F1-measure against the naïve Bayes algorithm on 18 data sets and does not lose on any data set, while the difference in F1-measure is not significant on 2 data sets.
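
As an illustration only (the function name and the choice of pairing are ours), such a test can be run with SciPy on the paired scores of two algorithms:

from scipy.stats import wilcoxon

def significantly_different(scores_a, scores_b, level=0.05):
    """Wilcoxon signed-rank test on paired scores of two algorithms;
    returns True if the difference is significant at the given level."""
    statistic, p_value = wilcoxon(scores_a, scores_b)
    return p_value < level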

We can see from Table 6 that, in terms of accuracy, the number of data sets on which HE DT + NB outperforms B-DT is slightly smaller than the number on which it is outperformed by B-DT, while the number of data sets on which HE DT + NB outperforms B-NB is much larger than the number on which it is outperformed by B-NB. We can see from Table 7 that, in terms of F1-measure, the results are similar, but HE DT + NB wins on fewer data sets. The results may seem to argue against creating and using hybrid ensembles. On the contrary, they show that hybrid ensembles outperform their non-hybrid counterparts not deterministically but probabilistically, which is also what our analysis shows. Moreover, in this paper, we intend to explain why and when hybrid ensembles are better than non-hybrid ensembles.

We propose using different classification algorithms to train more diverse classifiers in order to create better ensembles. The basic idea is to use a combination of classifiers to naturally reduce variance and to use a stronger algorithm to explicitly increase classification performance. The ensemble creation process that we propose is distinctive because it uses fundamentally different classification algorithms to create a hybrid ensemble. For example, it uses DT, which often has high variance (related to high instability), together with NB, which often has low variance (related to low instability, or high stability). Using such a combination of classification algorithms runs counter to the generally accepted view that one should only use classification algorithms with high variance in an ensemble such as Bagging. We evaluate the proposed process by using a varied collection of public data sets and two metrics. The experiment results reveal that the proposed process could achieve better performance than Bagging.

3.6. Discussion

There are 8 data sets on which the hybrid ensemble DT + NB outperforms both Bagging DT and Bagging NB: biomed, breast-w, credit-g, diabetes, heart-c-binary, heart-statlog, hepatitis, and schizo. According to Table 4, the naïve Bayes algorithm performs better in accuracy than the decision tree algorithm on these data sets. Therefore, we consider that A1 is DT and A2 is NB, which is newly introduced into the ensemble and used as another generator of diversity. For these data sets, we summarize the lower bounds for the accuracy gain and for the probability that the gain is obtained in Table 8. In the table, the second column (Acc1) and the third column (Acc2) are from the second column and the third column of Table 4, respectively; the fourth column is from the second column of Table 2; the fifth column is from the second column of Table 3; the sixth and seventh columns are the lower bounds for the accuracy gain and for the probability that the gain is obtained, respectively, and both are calculated by (6), or (15), given that α is 0.05 and δ is 0.05; the eighth and ninth columns are from the fourth and sixth columns of Table 4, respectively; and the tenth column is the accuracy difference between HE DT + NB and B-DT. According to Table 8, on all these data sets, the accuracy difference, or the actual gain given by the hybrid ensemble DT + NB over Bagging DT, is larger than the lower bound; all the probabilities are high, except the one for data set number 3, breast-w, which means that the hybrid ensemble DT + NB would very probably outperform Bagging DT (and this is indeed the case). Nevertheless, the lower bound is rather loose (though it is still the only one presented so far), and finding a tighter bound is part of the future work.

4. Conclusion

Ensemble learning trains classifiers and then combines their classifications to make an overall classification. Many researchers use ensembles of classifiers created by using a single classification algorithm in various applications. These are non-hybrid ensembles, and why they work is becoming clearer. Some researchers propose using a mixture of two different types of classification algorithms to create a hybrid ensemble. In this paper, we investigate why hybrid ensembles work, which is still somewhat unclear. We present a theoretical analysis from the standpoint of diversity, which plays a significant role in ensemble learning and is one of the fundamental elements of the success of ensembles; this is the most distinguishing characteristic of this paper. We also report and discuss experiment results obtained from hybrid ensembles of classifiers created by the decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm often used to create non-hybrid ensembles. These are two fundamentally different classification algorithms, and it is therefore interesting to see that hybrid ensembles created with them together can achieve classification performance similar to or even better than that of non-hybrid ensembles created with them individually. In short, we contribute a complement to the theoretical foundation of creating and using hybrid ensembles. A hybrid ensemble performs better not in a deterministic but in a probabilistic manner: in a hybrid ensemble where two classification algorithms are used, if the two are different and one is unstable while the other is more accurate, then there is a higher probability of obtaining a larger accuracy gain, where the gain is measured against an ensemble in which only one of the two classification algorithms is used.

Competing Interests

The author declares that there is no conflict of interests regarding the publication of this paper.