Research Article | Open Access
Improving Classification Performance through an Advanced Ensemble Based Heterogeneous Extreme Learning Machines
Extreme Learning Machine (ELM) is a fast learning algorithm for single-hidden-layer feedforward neural networks (SLFNs). It often has good generalization performance. However, it may overfit the training data when it has more hidden nodes than needed. To improve generalization performance, we use a heterogeneous ensemble approach. We propose an Advanced ELM Ensemble (AELME) for classification, which includes Regularized-ELM, $L_2$-norm-optimized ELM (ELML2), and Kernel-ELM. The ensemble is constructed by training a randomly chosen ELM classifier on a subset of the training data selected through random resampling. The proposed AELME is evolved using an objective function that increases both diversity and accuracy in the final ensemble. Finally, the class label of unseen data is predicted by majority vote. Splitting the training data into subsets and incorporating heterogeneous ELM classifiers result in higher prediction accuracy, better generalization, and fewer base classifiers compared to other models (Adaboost, Bagging, Dynamic ELM ensemble, data splitting ELM ensemble, and ELM ensemble). The validity of AELME is confirmed through classification on several real-world benchmark datasets.
Ensemble learning is a machine learning process that obtains better prediction performance by strategically combining the predictions from multiple learning algorithms. Ensembles are known to reduce the risk of selecting the wrong model by aggregating all the candidate models [2, 3].
Different techniques have been established to improve ensemble accuracy and stability. These techniques vary in how they treat the training data, the type of algorithms used, and the combination methods followed. Bagging, Boosting, and their variants, such as Adaboost, are some of the popular ensembling techniques.
Traditional neural network-based classifiers usually suffer from overfitting and local optimum issues, and improving their performance through different ensemble methods has remained an active research subject. Recently, however, Extreme Learning Machine (ELM) has gained popularity for solving classification problems. ELM is a learning algorithm for single-hidden-layer feedforward networks (SLFNs). Classic gradient-based learning algorithms only work for differentiable activation functions and are prone to issues such as local optima, improper learning rates, and overfitting. In contrast, ELM can deal with nondifferentiable activation functions and tends to reach the solution directly without such issues. Random initialization of the input-to-hidden-layer parameters in ELM avoids the tuning process for hidden layer parameters, which greatly shortens the learning time. Although ELM is fast and achieves good generalization performance, there is still considerable room for improvement. Several modifications of the base ELM algorithm have recently been introduced to improve accuracy and generalization, such as optimally pruned Extreme Learning Machine (OP-ELM) and Regularized-ELM [9–12].
Ensemble learning, by contrast, offers an inexpensive alternative for improving performance. Several approaches have been proposed to generate ELM-based ensembles, such as DELM, EnELM, and DSELME. Such ensembles of Extreme Learning Machine classifiers have achieved good performance for hyperspectral image classification and segmentation in a semisupervised and spatially regularized process. Bagging-ELM (B-ELM) is another ELM ensemble classifier, which leverages the bag of little bootstraps technique and has been found efficient for large-scale data classification. An online sequential-ELM (OS-ELM) based framework supports ensemble methods including Bagging, subspace partitioning, and cross-validation.
Diversity among the individual classifiers in the ensemble is essential when combining the predictions of several member classifiers. Different techniques are used to introduce diversity among member classifiers. For example, cross-validation [13, 14] is used to validate each ELM before adding it to the ensemble. Existing ELM ensembles in the literature use a homogeneous base classifier algorithm to train their members [13–15]. Motivated by the accuracy gains of enhanced ELM algorithms and of the ensemble approach, we propose a heterogeneous ensemble model that uses different ELM algorithms to train its members. More specifically, we adopt three types of ELM algorithms, namely, Regularized-ELM, ELML2, and Kernel-ELM. These ELM algorithms are briefly described in Section 2. These base classifiers enhance the standard ELM algorithm and are chosen for their better generalization, regularization, and resilience to outliers. A random resampling strategy is used to split the training data into subsets. Each member classifier is trained on a randomly chosen data subset with a randomly selected base ELM algorithm. The proposed ensemble algorithm evolves by monitoring the diversity and generalization performance of the updated ensemble during training. Majority voting is used to combine the predictions of the member classifiers in AELME. Ten real-world benchmark datasets (Iris, Climate, Credit, Wave, Satellite, Letter, Firm, Colon, Liver, and Vowel) are used for detailed performance analysis and comparison. Experimental results, as reported in this work, show that the proposed AELME approach gives better classification accuracy on the benchmark datasets. The remainder of this paper is organized as follows: the AELME ensemble algorithm and its implementation details are elaborated in Section 3. Performance analysis of the proposed AELME algorithm (see Algorithm 1) is reported in Section 4, by comparing its accuracy with the base classifiers and other ensemble methods, which include DELM, DSELME, and EnELM. Finally, the paper is concluded in Section 5.
2.1. The Base Classifiers
Three types of ELM classifiers, namely, ELML2, RELM, and KELM, are used as base classifiers to build the AELME ensemble. Here we briefly introduce the strengths of the selected base ELM classifiers. ELML2 is a regularized ELM algorithm, which has all the basic ELM advantages for regression, binary classification, and multiclass classification. Moreover, it introduces a Lagrange multiplier based constrained optimization method. Therefore, the resultant solution is more stable and has better generalization performance with different types of hidden nodes (feature mappings). KELM is an optimization-based Extreme Learning Machine, which links the minimal weight norm property of ELM to the maximal margin of Support Vector Machines (SVM) for classification. It is shown that, through standard optimization for ELM, a so-called support vector network with better generalization can be obtained using ELM kernels. Moreover, in comparison with standard SVM, Kernel-ELM is less sensitive to the user-specified parameters and has fewer optimization constraints. RELM is a constrained, optimization-based ELM algorithm for regression and multiclass classification. For better generalization, RELM makes a tradeoff between the structural risk (weight norm) and the empirical risk (least square error) by regulating their proportion during optimization. To achieve this balance, the empirical risk in the objective function is weighted by a regulating factor gamma. For more details, the reader can refer to [10, 11, 19].
2.2. ELM Theory
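According to ELM theory, the Extreme Learning Machine is built from random hidden nodes. Consider a training dataset $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i$ is the training data vector, $t_i$ is the target of the training data, and $L$ is the number of hidden nodes. Unlike other learning algorithms, ELM aims to reach the smallest training error together with the smallest norm of the output weights [10, 20]. The minimization goal is
$$\text{Minimize: } \|H\beta - T\|^2 \text{ and } \|\beta\|,$$
where $\beta$ is the output weight matrix and $H$ represents the hidden layer output matrix, with $g$ the activation function:
$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix},$$
and $T$ is the target of the input data:
$$T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}.$$
Three steps summarize the ELM training algorithm:
(1) Randomly assign input weights $w_j$ and biases $b_j$, $j = 1, \ldots, L$.
(2) Compute the hidden layer output matrix $H$.
(3) Compute the output weights:
$$\beta = H^{\dagger} T,$$
where $H^{\dagger}$ represents the Moore–Penrose (MP) generalized inverse of matrix $H$.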
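The MP inverse can be computed by orthogonal projection: $H^{\dagger} = (H^T H)^{-1} H^T$ if $H^T H$ is nonsingular, or $H^{\dagger} = H^T (H H^T)^{-1}$ if $H H^T$ is nonsingular, where $H$ is the hidden layer output matrix. Following ridge regression theory, a positive value $1/C$ is added to the diagonal of $H^T H$ or $H H^T$ in the calculation of the output weights $\beta$. The resulting solution is equivalent to the ELM optimization solution [7, 19]; it is more stable and has better generalization performance. So, to enhance the stability of ELM, $\beta$ and the output function $f(x)$ are computed, respectively, by
$$\beta = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T,$$
$$f(x) = h(x)\beta = h(x) H^T \left( \frac{I}{C} + H H^T \right)^{-1} T,$$
where $T$ is the target matrix, $h(x)$ is the hidden layer output for input $x$, and $I$ is the identity matrix.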
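The three training steps and the ridge-stabilized solution described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' MATLAB implementation: the sigmoid activation, the Gaussian initialization, and the names `train_elm` and `predict_elm` are assumptions made for the example.

```python
import numpy as np

def train_elm(X, T, n_hidden, C=1.0, rng=None):
    """Train a sigmoid ELM with ridge-regularized output weights (illustrative)."""
    rng = np.random.default_rng(rng)
    # Step 1: random input weights and biases; they are never tuned afterwards.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 2: hidden layer output matrix H with shape (N, L).
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Step 3: beta = H^T (I/C + H H^T)^{-1} T, solved without an explicit inverse.
    n_samples = X.shape[0]
    beta = H.T @ np.linalg.solve(np.eye(n_samples) / C + H @ H.T, T)
    return W, b, beta

def predict_elm(model, X):
    """Return one score per class; the argmax over columns is the predicted label."""
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Here `T` is assumed to be a one-hot target matrix for classification. Solving the linear system instead of forming the inverse is a numerical design choice of this sketch.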
Given the above-mentioned advantages of ELM, we propose using it in an ensemble to achieve better classification results. Naturally, there is nothing to gain by combining identical models that behave in precisely the same way. Consequently, the base classifiers must commit their errors on different instances, which is the informal meaning of the term diversity. We use three variants of ELM to improve diversity among the base classifiers. Overall, the proposed ensemble is designed to improve both accuracy and stability.
3. Advanced Ensemble for Classification Using Extreme Learning Machines
Unlike the design of a single classifier in traditional pattern recognition, ensemble learning constructs multiple diverse classifiers and combines their outputs to form a hybrid predictive model. Consequently, the overall classification performance of an ensemble classifier tends to be better than that of a single classifier. As ELM uses random weights, it often has a low misclassification rate. To improve classification performance, a number of multiclassifier systems based on ensemble learning have been proposed in [21, 22]. In this work, we split the training data, use three types of ELM algorithms as base learners to build a classifier on each split, and use majority voting to combine the outputs of all member classifiers in the ensemble pool. Different training parameters of the base ELM learning algorithms allow each member classifier to generate different decision boundaries. Hence the members make different errors, which reduces the overall error of the ensemble. The training data distribution affects the generalization of the learned classifier. For example, a training set may contain instances from a particular class such that the feature values of those instances are skewed towards a particular intraclass member. To address this issue, we divide the training dataset into different parts using random resampling, which tends to preserve the original data distribution. Consequently, classifiers with large diversity and different errors are produced. For example, if we divide the training data into 3 parts, then we have three training subsets: $S_1$, $S_2$, and $S_3$. A necessary and sufficient condition for the ensemble to outperform its base members is that the component learners are simultaneously accurate and diverse; therefore, a new member is added to AELME only if it increases both the diversity (in terms of disagreement) and the accuracy of the model. A general description of the model is shown as a flowchart in Figure 1.
The training dataset is divided randomly into $K$ equal-size subsets. If we have $N$ samples, then the size of each subset will be $N/K$. To maximize the diversity among the reconstructed training datasets, each new training set is obtained by resampling over some of the $K$ subsets. Then, each subset is used to train one of the three base classifier learners, which is selected randomly. The trained classifier is added to the ensemble and the process is repeated for all the remaining subsets. In the next iteration, if the diversity and accuracy of the current ensemble improve with the addition of classifier $c_i$ (ensemble member $i$), then it is retained in the updated ensemble; otherwise it is excluded. The final ensemble model is a mixture of classifiers trained on all subsets. Our model has three types of ELM algorithms, specifically Regularized-ELM, ELML2, and Kernel-ELM. Once the training is complete, labels for test data are obtained by applying majority voting to the outputs of the member classifiers in the evolved ensemble.
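The construction loop described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the data is shuffled and partitioned once rather than resampled, accuracy and disagreement are measured on a held-out validation set, disagreement is computed on predicted labels, and `trainers` is a hypothetical list of functions that each fit one base learner and return a prediction function.

```python
import random

def ensemble_predict(members, x):
    """Majority vote over member predictions (ties go to the smallest label)."""
    votes = [m(x) for m in members]
    return max(sorted(set(votes)), key=votes.count)

def accuracy(members, X, y):
    return sum(ensemble_predict(members, x) == t for x, t in zip(X, y)) / len(y)

def disagreement(members, X):
    """Average fraction of samples on which two members predict differently."""
    pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    if not pairs:
        return 0.0
    return sum(sum(a(x) != b(x) for x in X) / len(X) for a, b in pairs) / len(pairs)

def build_ensemble(X, y, X_val, y_val, trainers, n_subsets, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    size = len(idx) // n_subsets  # N / K samples per subset
    ensemble = []
    for k in range(n_subsets):
        part = idx[k * size:(k + 1) * size]
        fit = rng.choice(trainers)  # randomly selected base learner
        member = fit([X[i] for i in part], [y[i] for i in part])
        candidate = ensemble + [member]
        if not ensemble:
            ensemble = candidate  # the first member is always kept
        elif (accuracy(candidate, X_val, y_val) > accuracy(ensemble, X_val, y_val)
              and disagreement(candidate, X_val) > disagreement(ensemble, X_val)):
            ensemble = candidate  # keep only if both criteria improve
    return ensemble
```

In a full implementation, `trainers` would hold wrappers around the three base ELM variants; any callable with the same shape works for experimentation.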
3.2. Testing Stage
3.2.1. Majority Vote
The implementation procedure for the ensemble construction and training stage is described in the algorithm of AELME.
Given a testing instance $x$, an ensemble of $M$ predictors is created. For pattern $x$, we use majority voting to make the final decision. Suppose we have an $m$-class problem. If the $i$th ELM in the ensemble predicts the pattern as class $j$, we assign one vote to that class and zero otherwise. Once all the votes have been assigned, the class that receives the most votes from all predictors is the predicted class.
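The voting rule above can be written as a short function. This is a sketch only: the tie-breaking rule (smallest label wins) is an assumption, since the text does not say how ties are resolved, and `majority_vote` is an illustrative name.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class with the most votes; ties go to the smallest label."""
    counts = Counter(predictions)  # one vote per member classifier
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)
```

For example, `majority_vote([2, 1, 2, 0, 2])` selects class 2, because three of the five members vote for it.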
3.2.2. Weighted Sum
Given a testing instance $x$, an ensemble of $M$ predictors is created. For pattern $x$, we use a weighted sum to make the final decision. Suppose that there is an $m$-class problem, and we calculate the weighted sum over all classifiers for each class. The class that receives the maximum weighted sum from all predictors is the predicted class:
$$\hat{y}(x) = \arg\max_{j \in \{1, \ldots, m\}} \sum_{i=1}^{M} w_i \, d_{i,j}(x),$$
where $w_i$ is the weight of base learner $i$ and $d_{i,j}(x)$ is its prediction result for class $j$.
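A weighted-sum decision of this kind can be sketched as below. The sketch assumes that each member outputs a hard class label, so its weight is added to the score of the class it predicts; how the weights are obtained is not specified here, and `weighted_sum_vote` is an illustrative name.

```python
def weighted_sum_vote(predictions, weights, n_classes):
    """Pick the class whose supporting members have the largest total weight."""
    scores = [0.0] * n_classes
    for label, w in zip(predictions, weights):
        scores[label] += w  # member i adds its weight to the class it predicts
    # Ties go to the smallest class index.
    return max(range(n_classes), key=scores.__getitem__)
```

Unlike a plain majority vote, a single heavily weighted member can outvote several lightly weighted ones: with predictions `[0, 1, 1]` and weights `[0.9, 0.3, 0.3]`, class 0 wins.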
4. Simulation and Discussion
4.1. Simulation Settings
To test the performance of the model, we carry out simulation experiments on ten datasets from several domains, which differ in their characteristics, size, and input feature dimensions. The datasets come from the UCI machine learning repository, with one additional dataset from the LIBSVM collection. A brief description of the datasets is included in Table 1; more details on the problem domains of the datasets can be found on the web pages of those repositories. The simulations of the different algorithms on all the datasets are carried out in the MATLAB 8.1.0 environment running on an Intel® Core i5 2.4 GHz CPU with 4 GB RAM. To remove any bias from the results, we repeat the experiment 10 times and calculate the average accuracy over all runs. Training data is split into 2–8 equal-size subsets (according to the number of instances in the datasets) using random resampling. For a fair evaluation, we use the same number of subsets for all ensembles.
4.2. User-Specified Parameters
To achieve good generalization performance, the cost parameter $C$ and the kernel parameter $\gamma$ of the base ELM classifiers (Regularized-ELM, ELML2, and Kernel-ELM) need to be chosen appropriately. We use a grid search over $C$ and $\gamma$ to determine the optimal values. For each dataset, we try a range of different values of $C$ and of $\gamma$, and the number of hidden nodes $L$ is likewise selected from a fixed range. Optimal values of the selected parameters are shown in Table 2. To study the generalization performance of AELME with respect to the combination $(C, \gamma)$, we select a medium-size dataset (Wave). From Figure 2, it can be seen that changing the values of the $C$ and $\gamma$ parameters does not have a significant effect on the accuracy. So, the model appears to be relatively insensitive to the parameter combination $(C, \gamma)$.
We use a set of measures to evaluate the efficiency of the AELME model. We use accuracy as an indication of the correctness of the classification output. The standard deviation of the accuracy rates is used as an indication of ensemble stability; the lower the standard deviation, the more stable the method.
Training the ensemble on a set slightly larger or smaller than the original data should not significantly change the ensemble accuracy. We use the decrease or increase in average absolute error, averaged over all our datasets, assuming they represent a reasonable real-world distribution of datasets. The average relative error reduction measure is also used. For two algorithms A and B with errors $e_A$ and $e_B$, the decrease in relative error between A and B is $(e_A - e_B)/e_A$. The average relative error is the average (over all our datasets) of the relative error between the pair of algorithms compared. We compare our model with all other approaches. A negative value for the error implies that our model reduces error, while a positive value corresponds to an increase in error for our model. Time costs of Adaboost, Bagging, EnELM, DSELME, DELM, and AELME are also compared.
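As a worked illustration of the relative error measure, the sketch below assumes the relative change between algorithms A and B is $(e_A - e_B)/e_A$, with A being the proposed model. This form is an assumption chosen to match the stated sign convention (negative means the proposed model reduces error) and the reported reductions above 100%; the function names are illustrative.

```python
def relative_error_change(err_a, err_b):
    """Relative error of A against B; negative means A has the lower error."""
    return (err_a - err_b) / err_a

def average_relative_error(errs_a, errs_b):
    """Average the per-dataset relative error over all datasets."""
    pairs = list(zip(errs_a, errs_b))
    return sum(relative_error_change(a, b) for a, b in pairs) / len(pairs)
```

Under this assumption, an error of 0.1 for model A against 0.247 for model B gives a value of about -1.47, which would be reported as an error reduction of 147%.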
4.4. Diversity Measures
It is not straightforward to express the actual diversity among classifiers in an ensemble through a standard diversity measure. While there are some measures that approximate its value, there is no perfect one [27, 28]. Here, we use disagreement to measure diversity, and also the $Q$-Statistic, which is recommended in the literature.
The diversity within the whole ensemble is calculated by averaging the disagreement measure over all pairs of base classifiers:
$$Dis_{av} = \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} Dis_{i,j},$$
where $M$ is the number of base classifiers and $Dis_{i,j}$ is the disagreement between classifier $i$ and classifier $j$. This measure is based on the intuition that two diverse classifiers perform differently on the same training data. The disagreement measure is used to test the diversity within the whole set of base classifiers. Diversity increases with the value of the disagreement measure.
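The averaged disagreement can be computed as in the sketch below. It assumes the common definition in which the pairwise disagreement is the fraction of samples that exactly one of the two classifiers gets right, so each classifier is represented by a list of correctness flags (1 for correct, 0 for wrong). The function names are illustrative and this is not the toolbox used in the paper.

```python
from itertools import combinations

def pair_disagreement(correct_i, correct_j):
    """Fraction of samples on which exactly one of two classifiers is correct."""
    differing = sum(bool(a) != bool(b) for a, b in zip(correct_i, correct_j))
    return differing / len(correct_i)

def average_disagreement(all_correct):
    """Mean pairwise disagreement over all M(M-1)/2 classifier pairs."""
    pairs = list(combinations(all_correct, 2))
    return sum(pair_disagreement(p, q) for p, q in pairs) / len(pairs)
```

Dividing by the number of pairs is the same as multiplying the pairwise sum by $2/(M(M-1))$.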
Yule’s $Q$-Statistic measures the similarity between two classifiers ($c_i$ and $c_k$). It can be calculated as follows:
$$Q_{i,k} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}},$$
where $N^{11}$ and $N^{00}$ represent the number of samples that both classifiers classify correctly and wrongly, respectively. Similarly, $N^{10}$ and $N^{01}$ represent the number of samples on which only one of the two classifiers commits an error. The averaged value for more than two classifiers can then be calculated as follows:
$$Q_{av} = \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{k=i+1}^{M} Q_{i,k},$$
where $M$ is the number of classifiers and $Q \in [-1, 1]$. When $Q$ equals zero, the classifiers are independent. If $Q$ equals one, the classifiers are identical (dependent). A positive value of $Q$ means that the classifiers have classified the same inputs correctly, and a negative value of $Q$ means that the classifiers have committed errors on different inputs. Diversity increases as $Q$ decreases and vice versa. However, it is not easy to attain a large negative value of $Q$ for more than two classifiers. For calculations, we use the diversity measure toolbox (http://pages.bangor.ac.uk/~mas00a/ensemble_diversity.html).
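The $Q$-Statistic can be computed from per-sample correctness flags, as in the sketch below. It uses the standard form of Yule's $Q$ for a pair of classifiers, $(N^{11}N^{00} - N^{01}N^{10})/(N^{11}N^{00} + N^{01}N^{10})$, and averages it over all pairs. The function names and the choice to raise an error when the denominator is zero are assumptions of this example, not the behavior of the toolbox cited above.

```python
from itertools import combinations

def q_statistic(correct_i, correct_k):
    """Yule's Q for two classifiers given correctness flags (1 correct, 0 wrong)."""
    pairs = [(bool(a), bool(b)) for a, b in zip(correct_i, correct_k)]
    n11 = sum(a and b for a, b in pairs)          # both correct
    n00 = sum(not a and not b for a, b in pairs)  # both wrong
    n10 = sum(a and not b for a, b in pairs)      # only the first correct
    n01 = sum(not a and b for a, b in pairs)      # only the second correct
    den = n11 * n00 + n01 * n10
    if den == 0:
        raise ValueError("Q is undefined when the denominator is zero")
    return (n11 * n00 - n01 * n10) / den

def average_q(all_correct):
    """Mean Q over all pairs of classifiers."""
    pairs = list(combinations(all_correct, 2))
    return sum(q_statistic(p, q) for p, q in pairs) / len(pairs)
```

Two classifiers that are right and wrong on exactly the same samples give $Q = 1$, while two that are never right on the same sample give $Q = -1$.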
4.5. Statistical Tests
4.5.1. Wilcoxon Test
The Wilcoxon test is a nonparametric statistical test. Its purpose is to compare two models over several data samples, to measure the difference between them and to determine whether that difference is significant or the two models perform equally. It is insensitive to the sample size and outliers. Our null hypothesis is that “there is no difference between our model and the one to which it is being compared.” The alternative hypothesis is that our model performs significantly better than the compared model. We use a confidence level of 95% (the significance threshold is 0.05). Small values of $p$ (the $p$-value) cast doubt on the validity of the null hypothesis; a small $p$-value indicates that one approach is significantly better than the other. The procedure is as follows: (1) Find the performance (Pf) difference between the two compared algorithms. (2) Rank the absolute values of Pf in ascending order (the smallest value = 1, the second value = 2, and so on). If several values are equal, assign the average rank to all of them. (3) Compute the negative and the positive rank sums according to the sign of Pf. (4) Find the minimum of the two sums (the Wilcoxon statistic $T$). (5) Find the critical value of $T$ that corresponds to the number of datasets at the chosen significance level, and use it to examine whether the null hypothesis can be rejected.
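The ranking procedure described above can be sketched in plain Python. The sketch computes only the rank sums and the statistic; looking up the critical value is left out. Dropping zero differences before ranking is an assumption, since the text does not say how they are handled, and `wilcoxon_statistic` is an illustrative name.

```python
def wilcoxon_statistic(perf_a, perf_b):
    """Return (T, R_plus, R_minus) for paired performance scores."""
    # Performance differences; zero differences are dropped (assumption).
    diffs = [a - b for a, b in zip(perf_a, perf_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(r_plus, r_minus), r_plus, r_minus
```

The exact equality test used to detect ties is reliable for integer or rounded scores; raw floating-point accuracies may need rounding first.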
4.5.2. Friedman Test
The Friedman test is a nonparametric statistical test. Its purpose is to compare the performance of multiple models over several data samples, to measure the differences between them and to determine whether there is a significant difference or they perform equally. Our null hypothesis is that there is no difference between AELME and the other algorithms. The alternative hypothesis is that there is a significant difference between at least two of the compared models. The confidence level used is 95% (the significance threshold is 0.05).
4.6. Statistical Results
The $p$-values of the Wilcoxon test of AELME compared with all algorithms are shown in Table 5. The $p$-value is less than 0.05 in all cases, which implies that the null hypothesis is rejected and the alternative hypothesis is accepted. There is enough evidence that our model performs significantly better than the other models reported in this work. Moreover, the Friedman test statistic is 22.08, with a $p$-value of 0.0086. We reject the null hypothesis and accept the alternative hypothesis that there are differences between the compared models.
4.7. Q-Statistic Experiments
We use the Wave, Liver, and Satellite datasets for these experiments; we prespecify neither the individual accuracy nor the individual dependency. Wave is three-class data; we divide the training data into subsets of 4, 4, and 2 features (500 parts) and then train three classifiers, one on each subset of features. Liver is two-class data with 6 dimensions. Satellite is seven-class data with 10 dimensions. We split the data randomly into training/testing subsets. The training/test split is 470/230 for Wave, 230/115 for Liver, and 800/400 for Satellite. We generate 2500 ensembles for all data. The settings of our experiments are shown in Table 8. We observe a general trend of improvement in accuracy towards low values of $Q$. In all experiments, the maximum improvement corresponds to negative values of $Q$, as shown in Table 9. However, a range of improvements exists for these values, and the top improvements are dispersed across a wide spectrum of negative $Q$-values. Almost all ensembles in our experiments show an accuracy improvement over the single best classifier (Ensemble Accuracy − Maximum individual accuracy). Nevertheless, we cannot conclude that there is a strong relationship between accuracy and diversity for all ensembles, because it depends on the experiment settings. More dedicated, in-depth research is needed to investigate the relationship between accuracy and diversity.
4.8. Performance Analysis and Discussion
The classification experiments on the datasets are performed using the Bagging, Adaboost, Regularized-ELM (RELM), ELML2, Kernel-ELM (ELMK), EnELM, DELM, DSELME, and AELME algorithms. The average classification accuracy rates over ten runs, with their corresponding standard deviations, are shown in Table 4. The accuracy rates on the tested datasets show the strength of the model: our model achieves the highest accuracy rates in most cases. The base classifiers of our model (ELML2, Regularized-ELM, and Kernel-ELM) have lower accuracy rates than the ensemble model. From Table 4 we observe that the accuracy rates of the Bagging and Adaboost algorithms are lower than those of our model on almost all datasets, and that they have low accuracy rates on the Wave, Credit, and Firm datasets. Moreover, we use the weighted sum method over all base classifiers to test AELME on unseen data. Table 6 shows the accuracy rates using the weighted sum. We observe that the weighted sum method outperforms majority vote on most datasets. Stability is an important factor in whether the ensemble classifier can improve the classification accuracy. To analyze the stability of AELME, we repeat the experiments 10 times on the Climate dataset. Five sets of instances are selected in sequence, corresponding to the 1st to the 5th group, respectively. The standard deviation of the accuracy rates is calculated based on these 10 runs. Stable classifiers are less likely to overfit. To make use of the variations of the training set, the base classifier should be unstable; otherwise, the resultant ensemble will be a collection of almost identical classifiers. As shown in Table 7, our ensemble classifier is more stable than all the base classifiers. Disagreement is a measure of diversity. As shown in Table 3, it mostly increases as the size of the dataset increases. This demonstrates that diversity is increased between the base classifiers in AELME.
The mean absolute error (MAE) of our model is the lowest among all algorithms on all the datasets, as shown in Table 10. Our model achieves a relative error reduction compared to almost all other ensembles tested on all the datasets in this research. For example, for the Letter dataset, there is an error reduction of 147% compared to Bagging, 157% compared to Adaboost, 175% compared to EnELM, 54% compared to DSELME, and 171% compared to DELM. The average absolute error of our model is 0.2758, which is the smallest among all ensembles, as shown in Table 10. There is an average error reduction of 39% for DELM, 43% for DSELME, 35% for EnELM, 44% for Adaboost, and 59% for Bagging. We compare the time costs of the Bagging, Adaboost, and AELME algorithms. Five sets of instances are selected in sequence, corresponding to the 1st to the 5th group, respectively, as shown in Figure 3. For the Climate dataset, we observe that Adaboost is the most time-consuming algorithm, and the time cost of the AELME algorithm is lower than that of both Bagging and Adaboost. The average training time of all algorithms is compared by taking the average of ten runs on the ten datasets. Table 11 shows the average training time for the different algorithms. We observe that the training time of our algorithm on some datasets is higher than that of other ensembles due to the additional computations in our algorithm.