Abstract

In recent years, ensemble classification methods have been widely investigated in both industry and literature in the field of machine learning and artificial intelligence. The main advantage of this approach is to benefit from a set of classifiers instead of using a single classifier with the aim of improving the prediction performance, such as accuracy. Selecting the base classifiers and the method for combining them are the most challenging issues in the ensemble classifiers. In this paper, we propose a heterogeneous dynamic ensemble classifier (HDEC) which uses multiple classification algorithms. The main advantage of using heterogeneous algorithms is increasing the diversity among the base classifiers as it is a key point for an ensemble system to be successful. In this method, we first train many classifiers with the original data. Then, they are separated based on their strength in recognizing either positive or negative instances. For doing this, we consider the true positive rate and true negative rate, respectively. In the next step, the classifiers are categorized into two groups according to their efficiency in the mentioned measures. Finally, the outputs of the two groups are compared with each other to generate the final prediction. For evaluating the proposed approach, it has been applied to 12 datasets from the UCI and LIBSVM repositories and calculated two popular prediction performance metrics, including accuracy and geometric mean. The experimental results show the superiority of the proposed approach in comparison to other state-of-the-art methods.

1. Introduction

Classification is a type of supervised learning which is aimed at predicting the class of given data samples. There are many classification algorithms in the literature including decision trees, support vector machines [1], neural networks [2, 3], Bayesian networks [4], and fuzzy classifiers [512]. However, according to the “No Free Launch” theorem, there is no universally superior classification algorithm that can outperform other algorithms for all datasets [13]. A good solution to deal with this problem is the development and use of ensemble classification algorithms [14]. In ensemble learning, instead of using a single classification algorithm, a set of algorithms are considered and the final output is generated by combining the output of each classifier. Indeed, the main goal of an ensemble classifier is to benefit from the advantages of multiple classifiers and combine their outputs such that the quality of the final output improves [15]. The individual classifiers contained in an ensemble system are called base classifiers.

The main challenges with ensemble classifiers are (1) to select the base classifiers and (2) to combine the output of the base classifiers [16]. An essential key for designing a successful ensemble is to ensure that the base classifiers are sufficiently diverse [1719]. Two classifiers are diverse if their outputs do not correlate with each other [20]. There are different strategies for the construction of an ensemble system. In homogeneous ensembles, the base classifiers are from the same family and the diversity among the base classifiers is achieved by training them with different samples of the training dataset [21].

Conversely, in heterogeneous ensembles, different classification algorithms are considered for the base classifiers; therefore, diversity is achieved by using different algorithms [22]. Bagging [23] and Boosting [24] are the most commonly used homogeneous classifiers in the literature. In Bagging, each base classifier is trained with a bootstrapped replica of the training dataset and the final decision is generated by applying a majority voting on the decision of each base classifier. It is worth noting that other combination methods can also be found in the literature such as weighted voting and plurality voting [25]. On the other hand, the Boosting constructs the base classifiers in an iterative fashion, each compensating the weakness of its predecessors [26]. The constructed base classifiers are finally integrated using the weighted voting approach.

Focusing on increasing diversity, several papers have proposed heterogeneous ensemble classifiers [16, 2729]. Stack Generalization (briefly, Stacking) is one the most successful heterogeneous ensembles proposed in 1992 [30]. As shown in Figure 1, it uses a two-layer schema for training the model. In the first layer, the base classifiers are trained with different classification algorithms on the original dataset. For example, in the first layer, one classifier is trained with a Naïve Bayes algorithm while the other classifier uses a decision tree algorithm for training on the dataset. After training the base classifiers in the first layer, the output of them for the same original dataset is used as a new dataset to train the metaclassifier in the second layer. It means that the output of each classifier in the first level is considered as the inputs of the metaclassifier in the second level. It is worth noting that selecting the type of the base classifiers in the first level is an important issue in the Stacking, because it could affect the prediction performance of the Stacking. As mentioned before, an ideal subset of the base classifiers are those that are diverse and accurate.

The prediction performance of the Stacking is tightly dependent on the accuracy and diversity of the base classifiers in the first layer [22, 28, 31]. In this paper, we propose a novel approach for the smart selection of the base classifiers in order to improve the prediction performance of the final model. We categorized the base classifiers into two groups such that the classifiers within the first group can efficiently recognize the positive samples while the second group contains the classifiers suitable for the negative samples.

In this paper, we propose a heterogeneous dynamic ensemble classifier (HDEC) which uses multiple classification algorithms and is aimed at selecting the best classifiers for positive and negative instances. We categorize the classifiers using the true positive rate and true negative rate values to specify which classifiers are more accurate for recognizing the positive and negative instances. Then, for an unseen instance, the weight of each category is compared to generate the final output. Our major contributions are the following:(1)We design a novel ensemble classifier which uses heterogeneous classification algorithms in order to increase the diversity between the base classifiers and benefits from the advantages of different algorithms.(2)HDEC dynamically selects the base classifiers that should participate in the final decision-making process. In other words, depending on the test instances, it is possible that different classifiers are selected to generate the final output.(3)HDEC considers both TPR and TNR for detecting the suitable classifiers in recognizing the positive and negative instances, respectively. Indeed, unlike the existing methods, it does not rely on just the accuracy of classifiers and with using the TPR and TNR concepts it generates a more powerful ensemble classifier.

In order to evaluate the proposed approach, we have applied it on 12 datasets from the UCI and LIBSVM repositories. We use two popular performance measures including accuracy and geometric mean. The obtained results show the superiority of the HDEC in comparison to individual classifiers and state-of-the-art ensemble approaches.

The rest of this paper is organized as follows. Section 2 contains a brief review of previous related works. In Section 3, we present our approach, while Section 4 demonstrates the experimental results and shows the comparisons with different approaches. Finally, in Section 5, we conclude this study along with some future research directions.

Several papers in the literature have investigated the usage of the multiple classification algorithms as the base learners of the ensemble classifiers [16, 27, 28, 3235]. There are two general approaches for selecting the base classifiers of an ensemble classifier: static approaches and dynamic approaches. In static approaches, the base classifiers that should participate in the final decision-making process are selected during the training the ensemble classifier. Mendialdua et al. [36] proposed an approach to improve the accuracy of the Stack Generalization. The authors argued that although using multiple classification algorithms increases the diversity among the base classifiers, not all of them are suitable to be combined for generating the final output. As a result, they appended an extra step for the Stack Generalization to prune similar classifiers by applying the Estimation of the Bayesian Network Algorithm (EBNA) on the base classifiers. Coelho and Nascimento [27] used five different learning algorithms for the base classifiers of the bagging approach. They mapped the combination of the base classifiers to the chromosomes and by applying the genetic algorithm they find the best combination of the base classifiers that should be aggregated for generating the final output. Kadkhodaei and Eftekhari [37] designed a genetic algorithm to find the most diverse set of the base classifiers to improve the accuracy of the Stacking. A heterogeneous boosting-based classifier is proposed in [38]. The authors have considered a set of distinct learning algorithms and generated heterogeneous classifiers at each iteration of the boosting. In order to increase the diversity among the base classifiers, a pruning step eliminates similar base classifiers. Finally, the remaining classifiers are combined by applying the weighted voting method.

In contrast, in dynamic ensemble classifiers (DES), the decision of which classifiers should be combined for generating the final output is postponed until generalization phase [3945]. In other words, there will be not a fixed subset of classifiers which applies to any test instance. A dynamic ensemble classifier for credit scoring has been proposed in [46]. In this method, initially, the classifiers are selected based on their accuracy and relative costs of Type I and Type II errors for a validation set. Then, the selected classifiers are combined based on the classification results by using the soft probability. The authors have applied their approach to the four credit datasets from the UCI repository and reported the superiority of their algorithm in terms of the accuracy, area under ROC (receiver operating characteristic) curve (AUC), H-measure, and Partial Gini index (PG) in comparison to other state-of-the-art approaches. The authors of [47] argued that most misclassifications occur in areas near classes boundaries of a dataset. They called this area Region of Competence (RoC) and proposed an approach (FIRE-DES) to dynamically select the best base classifiers which are powerful in classifying the instance which falls into the RoC. They have applied their approach on 40 datasets and showed the validity of the FIRE-DES against other approaches.

3. HDEC: A Heterogeneous Dynamic Ensemble Classifier

In this section, we describe our proposed approach. Our main objective is to design a heterogeneous ensemble classifier such that the prediction performance improves.

Figure 2 depicts the general schema of the HDEC. The pseudocode of the training and generalization phase is also shown in Figures 3 and 4. In a nutshell, we use multiple learning algorithms and train many classifiers with them in order to separate the classifiers which are accurate in recognizing the positive samples and those that are efficient in recognizing the negative samples. The rationale behind separating the classifiers is that some classifiers which are efficient in recognizing the positive samples may not be suitable for classifying the negative samples. This is true for efficient classifiers in recognizing the negative samples. Afterward, we make two groups and for an unlabeled instance we dynamically specify which group is more confident for generating the final prediction of the instance.

As we can see in Figure 2, the proposed approach consists of five steps. Furthermore, a preprocessing step is required to make a balance dataset from the input instances. As mentioned before, the main idea behind the proposed approach is to use both true positive rate and true negative rate simultaneously for identifying which classifiers can better recognize the instances. In other words, the degree of trust in each classifier is determined by the obtained values of TPR and TNR.

Let us consider is the dataset where each represents a training sample such that , , and is the number of total samples. In the following, we elaborate on the steps of the proposed method. Since we use multiple learning algorithms, we can expect that a diverse set of classifiers are formed.

3.1. Generating Many Classifiers

Our proposed approach commences with considering many learning algorithms and training them by using the original dataset. Suppose that we consider L distinct learning algorithms. As a result, L classifiers will be generated:

In other words, the output of this step is a pool of heterogeneous and diverse classifiers. Note that in our proposed method the dataset should have an equal number of positive and negative instances, i.e., a balance dataset. Thus, we have used the SMOTE technique [48] to convert the imbalanced dataset into the balanced dataset.

3.2. Calculating the TPR and TNR

In order to specify which classifiers are efficient for classifying either positive or negative samples, we consider a validation set. Each classifier generates the predictions for the instances within the validation set.

Then, we evaluate the performance of the classifier for positive and negative samples by calculating the TPR and TNR according to (2) and (3), respectively. TPR or recall [49] is defined as the ratio of the positive samples (TP) that are correctly classified to the total number of positive samples in the validation set (P).

The more the positive samples are classified correctly, the higher the TPR value is. Accordingly, TNR or specificity represents the ratio of the number of negative samples correctly classified (TN) to the total number of negative samples (N) [49]:

3.3. Sorting the Classifiers with respect to the TPR and TNR

After calculating the TPR and TNR for the generated classifiers, we sort them with respect to both TPR and TNR in descending order. The reason for sorting the classifiers is that we want to specify which of them are more efficient for positive samples and which of them are more efficient for the negative samples. We have

such that

such that

3.4. Categorizing the Classifiers into Two Groups

After sorting the classifiers in the previous step, we select the top T of the sorted classifiers to make two groups:

ATPC (Accurate True Positive Classifiers): this group contains the classifiers which accurately recognize the positive samples. We have

ATNC (Accurate True Negative Classifiers): the classifiers within this group are efficient in recognizing the negative samples:

3.5. Combining the Classifiers

Once the classifiers are divided into two groups, there should be a strategy to combine the output of the classifiers. For an unlabeled sample, the classifiers within each group generate their prediction. For the ATPC group, we only consider the positive prediction and for the ATNC group we only consider the negative predictions (according to (8) and (9)). This is due to the fact that the classifiers in the ATPC group are just efficient in recognizing the positive samples and generating a negative prediction does not matter much. In contrast, if a classifier in the ATNC group generates a positive prediction for an unlabeled instance, we ignore it. Afterward, we compare the sum of TPR and TNR for the ATPC and ATNC groups, respectively. The final prediction is the class that generates the highest value:where I(·) is the identity function such that its output is one if the input parameter is true; otherwise it is zero. In (8), is the true positive rate value for the base classifier . This value is calculated in the second step of the proposed algorithm as described in Section 3.2. The expression is for ignoring the negative prediction by the classifiers in the ATPC group. In other words, if a classifier in the ATPC group makes a negative prediction for an unlabeled instance , the value of will be zero, because the input parameter is false. Consequently, the TPR value is ignored as it will be multiplied by the zero.

According to the above description, it makes sense that HDEC dynamically specifies which category is more confident for generating the final output for an unlabeled instance.

4. Experimental Study

In this section, we describe the experimental details and the analysis carried out to show the performance of the proposed method against state-of-the-art approaches.

4.1. Dataset

For evaluating the proposed method, 12 standard datasets from the UCI [50] and LIBSVM [51] repositories are selected. Table 1 shows the characteristics of the used datasets. The last column indicates the ratio of the majority class to the minority class. As it can be seen, the selected datasets belong to different categories and cover a wide range of problems (from 155 to 1284 instances, from two to 60 features, and from a variety of imbalance ratios). Since we categorize the classifiers into two groups, our approach is only applicable to the binary datasets. To carry out the experiments, we have used a 10-fold cross-validation strategy. That is, each dataset is divided into ten folds and we use nine folds for training the classifiers and the remaining fold as the test data. We repeat this process ten times and finally report the average value of the performance measures.

4.2. Base Classifiers

All of the experiments have been carried out by using the Weka package which is one of the most popular machine learning libraries. Table 2 presents the classification algorithms used as the base classifiers of the proposed method. As can be seen, they belong to the different families of the Weka classifiers.

4.3. Performance Measurements

To reach a reliable result on the proposed approach, we have used two popular performance indicator measures including accuracy and geometric mean. The accuracy is calculated from dividing the number of instances correctly classified by the total number of instances. It is formulated by [52]where TP is the number of positive instances that are correctly recognized and TN is the number of negative instances that are correctly classified.

However, the accuracy alone does not express which class has been better classified. Regarding this issue, we use geometric mean measure which considers the strength of a classifier in both positive and negative classes [52]:

4.4. Method

With the aim of evaluating the proposed approach, we have compared HDEC against different classifiers including (a) single classifiers presented in Table 2, (b) three state-of-the-art ensemble classifiers investigated in the literature including Bagging, Boosting, and Stacking, and (c) the approach introduced in [36].

4.5. Obtained Results

Table 3 shows the best average accuracy and geometric mean for the individual classifiers presented in Table 2 along with the corresponding classification algorithm. The standard deviations of the accuracies and geometric means of 10 simulations are also shown using the “±” sign corresponding to each percentage.

As it is clear, there is not a unique classification algorithm that is always superior to others. For example, while the decision tree generates the best accuracy for the BRC and COL datasets, the SVM has the highest accuracy for the DBT and HPT datasets.

Tables 4 and 5 show the values of the accuracy and geometric mean of different ensemble approaches, respectively. For doing this, we have set each classification algorithm in Table 2 as the base learner of Bagging and Boosting.

For Stack Generalization, we have set the classification algorithm as its metaclassifiers. The hyphen for the CSS column indicates that the authors of [36] did not applied their approach for the respective dataset. The best results are shown in bold fonts. The results are shown graphically in Figures 5 and 6. As can be seen, HDEC outperforms the other ensemble approaches in 10 out of 12 datasets in terms of the accuracy and 11 out of 12 datasets in terms of geometric mean. In other words, both accuracy and geometric mean have been improved as a result of applying the HDEC method.

We can also see that the geometric mean has been more improved than the accuracy. This is as a result of smart selection of the base classifiers for classifying the unseen instances. The positive instances are more likely to be classified correctly with the classifiers which are efficient in classifying the positive instances, i.e., ATPC groups. Accordingly, the same is true for negative instances. This causes the fact that both TPR and TNR values improve simultaneously and consequently the geometric mean value increases.

Table 6 shows the classifiers in each ATPC and ATNC group. It is observed that there are different classification algorithms which are suitable for classifying the positive and negative instances. For example, while KStar algorithm has a good strength for classifying the positive instances of the LVD dataset, it is suitable for neither classifying the positive instances nor classifying negative instances. It should be taken into account that a classification algorithm may have a good strength for classifying both positive and negative instances. As an exemplification, we can see that the Naïve Bayes classifier exists in both ATPC and ATNC groups which means that it can classify both positive and negative instances efficiently.

5. Conclusion

The aim of this study is to design a heterogeneous ensemble classifier in order to achieve higher prediction performance. For this purpose, we consider multiple classification algorithms and train each of them with the training dataset. Then, we separate the classifiers which are accurate in recognizing the positive samples and those that are efficient in recognizing the negative samples. This is done by calculating the TPR and TNR values of each classifier for the validation set. Afterwards, we select the classifiers which have the higher value for the TPR and TNR to form two groups ATPCS and ATNC classifiers, respectively. We then combine the output of the classifiers in each group to generate the final output.

In order to evaluate the prediction performance of the HDEC, we have applied it on 12 standard datasets from the UCI repository and compare our proposed method against three state-of-the-art ensemble approaches including Bagging, Boosting, and Stack Generalization. The obtained results show the superiority of HDEC in terms of the accuracy and geometric mean values.

It is worth noting that although HDEC shows better performance in many datasets, it is not our purpose to express that HDEC is always better than other ensemble approaches for any dataset. As we stated before, according to “No Free Lunch” theorem, there is not a unique classifier which always shows best performance for all datasets. The proposed approach needs a balanced dataset and for this purpose we included an extra preprocessing step (using SMOTE) to convert imbalanced datasets to the balanced datasets.

For future work, we plan to extend the proposed algorithm such that it does not need the balancing step and can handle the imbalanced datasets directly. Furthermore, it is suggested to develop an adoption model to investigate the usage of more classification algorithms as the base classifiers of the HDEC. We also plan to extend our work to be capable of classifying the multiclass datasets.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.