Mathematical Problems in Engineering

Volume 2015, Article ID 260970, 12 pages

http://dx.doi.org/10.1155/2015/260970

## A Robust AdaBoost.RT Based Ensemble Extreme Learning Machine

Department of Electromechanical Engineering, Faculty of Science and Technology, University of Macau, Macau

Received 21 August 2014; Revised 12 November 2014; Accepted 13 November 2014

Academic Editor: Yi Jin

Copyright © 2015 Pengbo Zhang and Zhixin Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Extreme learning machine (ELM) has been well recognized as an effective learning algorithm with extremely fast learning speed and high generalization performance. However, to deal with regression applications involving big data, the stability and accuracy of ELM need to be further enhanced. In this paper, a new hybrid machine learning method for regression problems, called robust AdaBoost.RT based ensemble ELM (RAE-ELM), is proposed. It combines ELM with a novel robust AdaBoost.RT algorithm to achieve better approximation accuracy than a single ELM network. The robust threshold for each weak learner adapts to that weak learner’s performance on the corresponding problem dataset, so RAE-ELM can output the final hypothesis as an optimally weighted ensemble of weak learners. In addition, ELM is a quick learner with high regression performance, which makes it a good candidate for the “weak” learner. We prove that the empirical error of RAE-ELM is within a significantly superior bound. Experimental verification shows that the proposed RAE-ELM outperforms other state-of-the-art algorithms on many real-world regression problems.

#### 1. Introduction

In the past decades, computational intelligence methodologies have been widely adopted and effectively utilized in various areas of scientific research and engineering applications [1, 2]. Recently, Huang et al. introduced an efficient learning algorithm, named extreme learning machine (ELM), for single-hidden layer feedforward neural networks (SLFNs) [3, 4]. Unlike conventional learning algorithms such as back-propagation (BP) methods [5] and support vector machines (SVMs) [6], ELM randomly generates the hidden neuron parameters (the input weights and the hidden layer biases) before seeing the training data and analytically determines the output weights without tuning the hidden layer of SLFNs. As the randomly generated hidden neuron parameters are independent of the training data, ELM can reach not only the smallest training error but also the smallest norm of output weights. ELM overcomes several limitations of conventional learning algorithms, such as local minima and slow learning speed, and exhibits very good generalization performance.

ELM has become a popular learning algorithm, and numerous variants have been investigated in order to further improve its generalization performance. Rong et al. [7] proposed an online sequential fuzzy extreme learning machine (OS-Fuzzy-ELM) for function approximation and classification problems. Cao et al. [8] combined the voting based extreme learning machine [9] with the online sequential extreme learning machine [10] into a new methodology, called voting based online sequential extreme learning machine (VOS-ELM). In addition, to solve two drawbacks of the basic ELM, namely, the overfitting problem and the unstable accuracy, Luo et al. [11] presented a novel algorithm, called sparse Bayesian extreme learning machine (SB-ELM), which estimates the marginal likelihood of the output weights and automatically prunes the redundant hidden nodes. Moreover, the theory of semisupervised learning [12] has been drawn upon to overcome the limitations of supervised learning algorithms.

Although ELM has good generalization performance for classification and regression problems, efficiently performing training and testing on big data remains challenging for ELM as well. As a single learning machine, ELM is quite stable compared to other learning algorithms, but its classification and regression performance may still vary slightly among different trials on a big dataset. Many researchers have sought ensemble methods that integrate a set of ELMs into a combined network structure and have verified that such ensembles perform better than an individual ELM. Lan et al. [13] proposed an ensemble of online sequential ELM (EOS-ELM), which comprises several OS-ELM networks. The mean of the OS-ELM networks’ outputs was used as the performance indicator of the ensemble network. Liu and Wang [14] presented an ensemble-based ELM (EN-ELM) algorithm, where a cross-validation scheme was used to create an ensemble of ELM classifiers for decision making. Besides, Xue et al. [15] proposed a genetic ensemble of extreme learning machine (GE-ELM), which adopted genetic algorithms (GAs) to produce a group of candidate networks first. According to a specific ranking strategy, some of the networks were then selected to form a new ensemble network. More recently, Wang et al. [16] presented a parallelized ELM ensemble method based on the $M^{3}$-network, called $M^{3}$-ELM. It improves computational efficiency through parallelism and solves imbalanced classification tasks through task decomposition.

To learn from data whose volume and variety grow exponentially while retaining high accuracy, traditional learning algorithms tend to suffer from overfitting. Hence, a robust and stable ensemble algorithm is of great importance. Dasarathy and Sheela [17] first introduced an ensemble system, whose idea is to partition the feature space using multiple classifiers. Furthermore, Hansen and Salamon [18] presented an ensemble of neural networks with a plurality consensus scheme that obtains far better classification performance than a single neural network. After that, ensemble-based algorithms have been widely explored [19–23]. Among them, Bagging and Boosting are the most prevalent methods for training neural network ensembles. The Bagging (short for Bootstrap Aggregation) algorithm randomly draws bootstrap samples from the original training set, and the diversity in bagging-based ensembles is ensured by the variations within the bootstrapped replicas on which each classifier is trained. By using relatively weak classifiers, the decision boundaries measurably vary with respect to relatively small perturbations in the training data. Boosting, an iterative method presented by Schapire [20] for generating a strong classifier, can achieve arbitrarily low training error from an ensemble of weak classifiers, each of which can barely do better than random guessing. Subsequently, a novel boosting algorithm, called adaptive boosting (AdaBoost), was presented by Schapire and Freund [21]. AdaBoost improves on traditional boosting methods in two respects. First, instances are drawn into successive subdatasets from an iteratively updated sample distribution over the same training dataset: AdaBoost replaces random subsamples with weighted versions of the same training dataset, which can be used repeatedly, so the training dataset need not be very large. Second, it defines the ensemble classifier through weighted majority voting of a set of weak classifiers, where the voting weights are based on the classifiers’ training errors.

However, many of the existing investigations on ensemble algorithms focus on classification problems. Ensemble algorithms for classification, unfortunately, cannot be directly applied to regression problems. Regression methods provide predicted results by analyzing historical data. Forecasting and prediction are important functional requirements for real-world applications, such as temperature prediction, inventory management, and position tracking in manufacturing execution systems. To solve regression problems, based on the AdaBoost algorithm for classification [24–26], Schapire and Freund [21] extended AdaBoost.M2 to AdaBoost.R. In addition, Drucker [27] proposed the AdaBoost.R2 algorithm, which is an ad hoc modification of AdaBoost.R. Besides, Avnimelech and Intrator [28] presented the notion of weak and strong learning and an appropriate equivalence theorem between them so as to improve boosting in regression problems. Moreover, Solomatine and Shrestha [29, 30] proposed a novel boosting algorithm called AdaBoost.RT. AdaBoost.RT projects regression problems into the binary classification domain, which can be processed by the AdaBoost algorithm, while filtering out those examples whose relative estimation error is larger than a preset threshold value.

The proposed hybrid algorithm, which combines the effective learner, ELM, with the promising ensemble method, the AdaBoost.RT algorithm, inherits their intrinsic properties and should achieve good generalization performance when dealing with big data. As with the development effort on general ensemble algorithms, the available ELM ensemble algorithms are mainly aimed at classification problems, while regression with ensemble algorithms has received relatively little attention. Tian and Mao [31] presented an ensemble ELM based on a modified AdaBoost.RT algorithm (modified Ada-ELM) in order to predict the temperature of molten steel in a ladle furnace. This hybrid learning algorithm combines the modified AdaBoost.RT with ELM; it retains the advantages of ELM and overcomes the limitation of the basic AdaBoost.RT through a self-adaptively modifiable threshold value. The threshold value $\phi$ need not be constant; instead, it is adjusted by a self-adaptive modification mechanism that follows the trend of the prediction error at each iteration. The variation range of the threshold value is set to be [0, 0.4], as suggested by Solomatine and Shrestha [29, 30]. However, the initial value of $\phi$ is manually fixed to be the mean of the variation range of the threshold value, that is, $\phi_0 = 0.2$, according to an empirical suggestion. When the error rate is smaller than that in the previous iteration, $\varepsilon_t < \varepsilon_{t-1}$, the value of $\phi$ will decrease, and vice versa. Hence, such an empirically based method is not fully self-adaptive over the whole threshold domain. Moreover, the manually fixed initial threshold is not related to the properties of the input dataset and the weak learners, which makes it hard for the ensemble ELM to reach a generally optimized learning effect.

This paper presents a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems, which combines ELM with the robust AdaBoost.RT algorithm. The robust AdaBoost.RT algorithm not only overcomes the limitation of the original AdaBoost.RT algorithm (original Ada-ELM) but also makes the threshold value adaptive to the input dataset and the ELM networks instead of presetting it. The main idea of RAE-ELM is as follows. The ELM algorithm is selected as the “weak” learning machine to build the hybrid ensemble model. A new robust AdaBoost.RT algorithm is proposed that uses error statistics to dynamically determine the regression threshold value, rather than relying on manual selection, which may only be ideal for very few regression cases. The mean and the standard deviation of the approximation errors are computed at each iteration. The *robust threshold* for each weak learner is defined to be a scaled standard deviation. Based on the concept of standard deviation, those individual training data with error exceeding the robust threshold are regarded as “flaws in this training process” and are rejected. The rejected data are processed in the later weak learners’ iterations.

We then analyze the convergence of the proposed robust AdaBoost.RT algorithm. It can be proved that the error of the final hypothesis output by the proposed ensemble algorithm is within a significantly superior bound. The proposed RAE-ELM can avoid overfitting because of the characteristics of ELM as the “weak” learner: ELM tends to reach its solutions straightforwardly, and the error rate of the regression outcome at each training process is much smaller than 0.5. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the robust AdaBoost.RT based ensemble module. The experimental results demonstrate that the proposed RAE-ELM has superior learning properties in terms of stability and accuracy for regression problems and has better generalization performance than other algorithms.

This paper is organized as follows. Section 2 gives a brief review of basic ELM. Section 3 introduces the original and the proposed robust AdaBoost.RT algorithm. The hybrid robust AdaBoost.RT ensemble ELM (RAE-ELM) algorithm is then presented in Section 4. The performance evaluation of RAE-ELM and its regression ability are verified using experiments in Section 5. Finally, the conclusion is drawn in the last section.

#### 2. Brief on ELM

Recently, Huang et al. [3, 4] proposed a novel learning algorithm, called extreme learning machine (ELM), for single-hidden layer feedforward neural networks (SLFNs) [32, 33]. ELM is based on the least-squares method: the input weights and the hidden layer biases are randomly assigned, and the output weights between the hidden nodes and the output layer are then analytically determined. Since the learning process in ELM takes place without iterative tuning, the ELM algorithm tends to reach its solutions straightforwardly without suffering from problems such as local minima, slow learning speed, and overfitting.
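As a minimal sketch of the training procedure just described, the following NumPy code draws the hidden-layer parameters at random and obtains the output weights from a single regularized least-squares solve. The class name `ELMRegressor`, the sigmoid activation, the uniform sampling range, and the values of `n_hidden` and the regularization parameter `C` are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np


class ELMRegressor:
    """Illustrative single-hidden-layer ELM for regression (assumed design)."""

    def __init__(self, n_hidden=50, C=1e3, seed=0):
        self.n_hidden = n_hidden  # number of hidden nodes (assumed value)
        self.C = C                # regularization parameter (assumed value)
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid feature mapping h(x) built from fixed random parameters.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_features = X.shape[1]
        # Input weights and biases are drawn at random and never tuned.
        self.W = self.rng.uniform(-1.0, 1.0, size=(n_features, self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        # Output weights from one linear solve: (I/C + H^T H) beta = H^T y.
        A = np.eye(self.n_hidden) / self.C + H.T @ H
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


# Usage on a toy one-dimensional regression problem.
rng = np.random.default_rng(1)
X_train = rng.uniform(-3.0, 3.0, size=(200, 1))
y_train = np.sin(X_train[:, 0])
model = ELMRegressor(n_hidden=40, C=1e4).fit(X_train, y_train)
train_rmse = np.sqrt(np.mean((model.predict(X_train) - y_train) ** 2))
print("training RMSE:", train_rmse)
```

Because no iterative tuning is involved, the whole fit reduces to one matrix factorization, which is what makes ELM attractive as a fast base learner in an ensemble.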

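From the standard optimization theory point of view, the objective of ELM in minimizing both the training errors and the output weights can be presented as [4]

$$\text{Minimize: } L_{P_{\text{ELM}}} = \frac{1}{2}\left\|\boldsymbol{\beta}\right\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\left\|\boldsymbol{\xi}_{i}\right\|^{2}, \qquad \text{Subject to: } \mathbf{h}(\mathbf{x}_{i})\boldsymbol{\beta} = \mathbf{t}_{i}^{T} - \boldsymbol{\xi}_{i}^{T},\quad i = 1, \ldots, N,$$

where $\boldsymbol{\xi}_{i} = [\xi_{i,1}, \ldots, \xi_{i,m}]^{T}$ is the training error vector of the $m$ output nodes with respect to the training sample $\mathbf{x}_{i}$. According to the KKT theorem, training ELM is equivalent to solving the dual optimization problem:

$$L_{D_{\text{ELM}}} = \frac{1}{2}\left\|\boldsymbol{\beta}\right\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\left\|\boldsymbol{\xi}_{i}\right\|^{2} - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\left(\mathbf{h}(\mathbf{x}_{i})\boldsymbol{\beta}_{j} - t_{i,j} + \xi_{i,j}\right),$$

where $\alpha_{i,j}$ are the Lagrange multipliers, $\boldsymbol{\beta}_{j}$ is the vector of the weights between the hidden layer and the $j$th output node, and $\boldsymbol{\beta} = [\boldsymbol{\beta}_{1}, \ldots, \boldsymbol{\beta}_{m}]$. $C$ is the regularization parameter representing the trade-off between the minimization of training errors and the maximization of the marginal distance.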

According to KKT theorem, we can obtain different solutions as follows.

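*(1) Kernel Case.* Consider

$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T},$$

where $\mathbf{H}$ is the hidden layer output matrix, whose $i$th row is $\mathbf{h}(\mathbf{x}_{i})$, and $\mathbf{T}$ is the matrix of training targets.

The ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}.$$

A kernel matrix for ELM is defined as follows:

$$\boldsymbol{\Omega}_{\text{ELM}} = \mathbf{H}\mathbf{H}^{T}: \quad \Omega_{\text{ELM}\,i,j} = \mathbf{h}(\mathbf{x}_{i}) \cdot \mathbf{h}(\mathbf{x}_{j}) = K(\mathbf{x}_{i}, \mathbf{x}_{j}).$$

Then, the ELM output function can be written as follows:

$$f(\mathbf{x}) = \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_{1}) & \cdots & K(\mathbf{x}, \mathbf{x}_{N}) \end{bmatrix}\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\text{ELM}}\right)^{-1}\mathbf{T}.$$

In this special case, a corresponding kernel $K(\mathbf{u}, \mathbf{v})$ is used in ELM, so the feature mapping **h**(**x**) need not be known. When the feature mapping **h**(**x**) is randomly generated, the corresponding kernel is called the ELM random kernel.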
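A kernel-based ELM of the kind described above might be sketched as follows. The sketch assumes the commonly used regularized solution in which the kernel matrix plus `I/C` is solved against the targets once. The Gaussian (RBF) kernel, the helper names `rbf_kernel`, `kernel_elm_fit`, and `kernel_elm_predict`, and the values of `C` and `gamma` are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2) for all row pairs."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_elm_fit(X, T, C=100.0, gamma=1.0):
    """Solve (I/C + Omega) coef = T, where Omega[i, j] = K(x_i, x_j)."""
    omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(X.shape[0]) / C + omega, T)


def kernel_elm_predict(X_new, X_train, coef, gamma=1.0):
    """Evaluate f(x) = [K(x, x_1), ..., K(x, x_N)] @ coef."""
    return rbf_kernel(X_new, X_train, gamma) @ coef


# Usage on a toy one-dimensional regression problem.
X_train = np.linspace(-3.0, 3.0, 60).reshape(-1, 1)
T_train = np.sin(X_train[:, 0])
coef = kernel_elm_fit(X_train, T_train, C=1e3, gamma=0.5)
pred = kernel_elm_predict(X_train, X_train, coef, gamma=0.5)
kernel_rmse = np.sqrt(np.mean((pred - T_train) ** 2))
print("training RMSE:", kernel_rmse)
```

Note that the feature mapping never appears explicitly: only kernel evaluations between samples are needed, at the cost of solving an N-by-N system.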

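*(2) Nonkernel Case.* Similarly, based on the KKT theorem, we have

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}.$$

In this case, the ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}.$$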

#### 3. The Proposed Robust AdaBoost.RT Algorithm

We first describe the original AdaBoost.RT algorithm for regression problem and then present a new robust AdaBoost.RT algorithm in this section. The corresponding analysis on the novel algorithm will also be given.

##### 3.1. The Original AdaBoost.RT Algorithm

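Solomatine and Shrestha proposed AdaBoost.RT [29, 30], a boosting algorithm for regression problems, where the letters R and T represent regression and threshold, respectively. The original AdaBoost.RT algorithm is described as follows.

Input the following:

(i) sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where output $y \in \mathbb{R}$,

(ii) weak learning algorithm (weak learner),

(iii) integer $T$ specifying number of iterations (machines),

(iv) threshold $\phi$ ($0 < \phi < 1$) for demarcating correct and incorrect predictions.

Initialize the following:

(i) machine number or iteration $t = 1$,

(ii) distribution $D_t(i) = 1/m$ for all $i$,

(iii) error rate $\varepsilon_t = 0$.

Learning steps (iterate while $t \le T$) are as follows.

*Step 1.* Call weak learner, providing it with distribution $D_t$.

*Step 2.* Build the regression model: $f_t(x) \rightarrow y$.

*Step 3.* Calculate absolute relative error (ARE) for each training example as

$$\text{ARE}_t(i) = \left|\frac{f_t(x_i) - y_i}{y_i}\right|.$$

*Step 4.* Calculate the error rate of $f_t(x)$: $\varepsilon_t = \sum_{i:\,\text{ARE}_t(i) > \phi} D_t(i)$.

*Step 5.* Set $\beta_t = \varepsilon_t^{\,n}$, where $n$ is power coefficient (e.g., linear, square, or cubic).

*Step 6.* Update distribution $D_t$ as follows: if $\text{ARE}_t(i) \le \phi$, then $D_{t+1}(i) = (D_t(i)/Z_t) \times \beta_t$; else, $D_{t+1}(i) = D_t(i)/Z_t$, where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution.

*Step 7.* Set $t = t + 1$.

Output the following: the final hypothesis:

$$f_{\text{fin}}(x) = \frac{\sum_{t}\left(\log\dfrac{1}{\beta_t}\right) f_t(x)}{\sum_{t}\log\dfrac{1}{\beta_t}}.$$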

The AdaBoost.RT algorithm projects regression problems into the binary classification domain. Based on the boosting regression estimators [28] and BEM [34], the AdaBoost.RT algorithm introduces the absolute relative error (ARE) to demarcate samples as either correct or incorrect predictions. If the ARE of any particular sample is greater than the threshold $\phi$, the predicted value for this sample is regarded as an incorrect prediction. Otherwise, it is marked as a correct prediction. This labeling is similar to the “misclassification” and “correct-classification” labeling used in classification problems. The algorithm assigns relatively large weights to those weak learners at the front of the learner list that reach a high correct prediction rate. The samples with incorrect predictions are handled as ad hoc cases by the subsequent weak learners. The outputs from each weak learner are combined into the final hypothesis using the corresponding computed weights.
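The AdaBoost.RT procedure described in this subsection can be sketched in NumPy as follows. This is a hedged illustration, not the reference implementation: the weak learner `fit_weighted_linear` (a weighted least-squares line) is a hypothetical stand-in, the distribution is passed to it as sample weights rather than by resampling, and the clipping of the error rate is a numerical safeguard added here so that the logarithm stays finite.

```python
import numpy as np


def fit_weighted_linear(X, y, w):
    """Hypothetical weak learner: weighted least-squares line with intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ coef


def adaboost_rt(X, y, T=10, phi=0.1, n=2, fit_weak=fit_weighted_linear):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                   # initial distribution D_1(i) = 1/m
    models, weights = [], []
    for _ in range(T):
        f_t = fit_weak(X, y, D)               # Steps 1-2: train on distribution
        are = np.abs((f_t(X) - y) / y)        # Step 3: ARE (requires y != 0)
        wrong = are > phi
        eps = D[wrong].sum()                  # Step 4: error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)  # safeguard added in this sketch
        beta = eps ** n                       # Step 5
        D = np.where(wrong, D, D * beta)      # Step 6: shrink correct samples
        D = D / D.sum()                       # normalization factor Z_t
        models.append(f_t)
        weights.append(np.log(1.0 / beta))
    weights = np.array(weights)

    def f_fin(Z):
        preds = np.stack([f(Z) for f in models])
        return weights @ preds / weights.sum()

    return f_fin


# Usage on a toy problem with strictly positive targets.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(150, 1))
y = 5.0 + X[:, 0] + 0.5 * np.sin(2.0 * X[:, 0])
f_fin = adaboost_rt(X, y, T=5, phi=0.05)
rmse = np.sqrt(np.mean((f_fin(X) - y) ** 2))
print("ensemble training RMSE:", rmse)
```

Because the ARE divides by the target value, this sketch only applies to data whose targets are bounded away from zero.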

AdaBoost.RT requires manual selection of the threshold $\phi$, which is a main factor sensitively affecting the performance of committee machines. If $\phi$ is too small, very few samples will be treated as correct predictions, and the remaining samples will easily get boosted. The subsequent learners then have to handle a large number of ad hoc samples, which makes the ensemble algorithm unstable. On the other hand, if $\phi$ is too large, say, greater than 0.4, most samples will be treated as correct predictions, and the algorithm fails to reject false samples. This causes low convergence efficiency and overfitting. The original AdaBoost.RT and its variant are both limited in how they set the threshold value, which is specified either as a manually chosen constant or as a variable changing in the vicinity of 0.2. Neither strategy takes the regression capability of the weak learner into account. In order to determine the value of $\phi$ effectively, a novel improvement of AdaBoost.RT is proposed in the following section.
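The sensitivity to the threshold can be illustrated numerically. The sketch below fixes a set of synthetic predictions with roughly 10% relative error (an assumption made purely for illustration) and counts how many samples would be labeled incorrect for several threshold values. The helper `incorrect_fraction` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(1.0, 10.0, size=1000)
# Synthetic predictions whose relative error is Gaussian with std 0.1 (assumed).
y_pred = y_true * (1.0 + rng.normal(0.0, 0.1, size=1000))
are = np.abs((y_pred - y_true) / y_true)


def incorrect_fraction(are_values, phi):
    """Fraction of samples whose absolute relative error exceeds phi."""
    return float(np.mean(are_values > phi))


fractions = {phi: incorrect_fraction(are, phi) for phi in (0.01, 0.1, 0.2, 0.4)}
for phi, frac in fractions.items():
    print(f"phi = {phi:<4}  incorrect fraction = {frac:.3f}")
```

Under this assumed error model, a very small threshold labels almost every sample as incorrect, while a threshold of 0.4 labels almost none, which mirrors the two failure modes discussed above.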

##### 3.2. The Proposed Robust AdaBoost.RT Algorithm

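To overcome the limitations of existing work on AdaBoost.RT, we embed statistics theory into the AdaBoost.RT algorithm. This overcomes the difficulty of optimally determining the initial threshold value and enables the intermediate threshold values to be dynamically self-adjustable according to the intrinsic properties of the input data samples. The proposed robust AdaBoost.RT algorithm is described as follows.

(1) Input the following: sequence of $m$ samples $(x_1, y_1), \ldots, (x_m, y_m)$, where output $y \in \mathbb{R}$; weak learning algorithm (weak learner); maximum number of iterations (machines) $T$.

(2) Initialize the following: iteration index $t = 1$; distribution $D_t(i) = 1/m$ for all $i$; the weight vector $w_i^{t} = D_t(i)$ for all $i$; error rate $\varepsilon_t = 0$.

(3) Iterate while $t \le T$ the following.

*Step 1.* Call weak learner, WL_{t}, providing it with distribution $\mathbf{p}^{t} = \mathbf{w}^{t}/Z_t$, where $Z_t$ is a normalization factor chosen such that $\mathbf{p}^{t}$ will be a distribution.

*Step 2.* Build the regression model: $f_t(x) \rightarrow y$.

*Step 3.* Calculate each error: $e_t(i) = f_t(x_i) - y_i$.

*Step 4.* Calculate the error rate: $\varepsilon_t = \sum_{i \in P} p_i^{t}$, where $P = \{\, i : |e_t(i) - \mu_t| > \lambda\sigma_t,\ i \in [1, m] \,\}$, $\mu_t$ stands for the expected value of the errors, and $\lambda\sigma_t$ is defined as the robust threshold ($\sigma_t$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). If $\varepsilon_t > 1/2$, then set $T = t - 1$ and abort loop.

*Step 5.* Set $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$.

*Step 6.* Calculate contribution of $f_t(x)$ to the final result: $\alpha_t = -\log(\beta_t)$.

*Step 7.* Update the weight vectors: if $i \notin P$, then $w_i^{t+1} = w_i^{t}\beta_t$; else, $w_i^{t+1} = w_i^{t}$.

*Step 8.* Set $t = t + 1$.

(4) Normalize $\alpha_1, \ldots, \alpha_T$ such that $\sum_{t=1}^{T}\alpha_t = 1$. Output the final hypothesis: $f_{\text{fin}}(x) = \sum_{t=1}^{T}\alpha_t f_t(x)$.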
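The robust variant can be sketched along the same lines. Everything below is an illustration under stated assumptions: the weak learner `fit_weighted_linear` is a hypothetical stand-in for ELM, the mean and standard deviation of the errors are taken unweighted over the training samples, samples whose deviation from the mean error exceeds the scaled standard deviation are the rejected ones, the loop aborts once the rejected mass exceeds one half, and the small floor on the error rate is a numerical safeguard added here.

```python
import numpy as np


def fit_weighted_linear(X, y, w):
    """Hypothetical weak learner: weighted least-squares line with intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ coef


def robust_adaboost_rt(X, y, T=10, lam=0.9, fit_weak=fit_weighted_linear):
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                  # initial weight vector, 1/m each
    models, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                      # normalized distribution
        f_t = fit_weak(X, y, p)              # train weak learner on p
        e = f_t(X) - y                       # per-sample errors
        mu, sigma = e.mean(), e.std()        # error statistics (unweighted)
        rejected = np.abs(e - mu) > lam * sigma
        eps = p[rejected].sum()              # error rate = rejected mass
        if eps > 0.5:
            break                            # abort, keep earlier learners
        eps = max(eps, 1e-10)                # safeguard added in this sketch
        beta = eps / (1.0 - eps)
        models.append(f_t)
        alphas.append(-np.log(beta))         # contribution of this learner
        w = np.where(rejected, w, w * beta)  # shrink accepted samples only
    if not models:
        raise ValueError("first weak learner had error rate above 1/2")
    alphas = np.array(alphas)
    if alphas.sum() > 0:
        alphas = alphas / alphas.sum()       # normalize contributions
    else:
        alphas = np.full(len(models), 1.0 / len(models))

    def f_fin(Z):
        return alphas @ np.stack([f(Z) for f in models])

    return f_fin, alphas


# Usage on a noisy linear toy problem.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.5, size=200)
f_fin, alphas = robust_adaboost_rt(X, y, T=10, lam=0.9)
rmse = np.sqrt(np.mean((f_fin(X) - y) ** 2))
print("learners kept:", len(alphas), " training RMSE:", rmse)
```

Unlike the fixed-threshold version, no threshold value has to be supplied in the units of the target variable; only the relative factor is chosen, and the actual cut-off follows from the error statistics of each weak learner.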

At each iteration of the proposed robust AdaBoost.RT algorithm, the standard deviation of the approximation error distribution is used as a criterion. In probability and statistics theory, the standard deviation measures the amount of variation or dispersion from the average. If the data points tend to be very close to the mean value, the standard deviation is low. On the other hand, if the data points are spread out over a large range of values, the standard deviation is high.

The standard deviation may serve as a measure of uncertainty for a set of repeated predictions. When deciding whether predictions agree with their corresponding true values, the standard deviation of the predictions made by the underlying approximation function is of crucial importance: if the average distance from the predictions to the true values is large, then the regression model being tested probably needs to be revised. In that case, the sample points fall outside the range of values that could reasonably be expected to occur, and the prediction accuracy of the model is low.

In the proposed robust AdaBoost.RT algorithm, the approximation error of the $t$th weak learner, WL_{t}, for an input dataset can be represented as a statistical distribution with parameters $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value, $\sigma_t$ stands for the standard deviation, and $\lambda$ is an adjustable relative factor that ranges from 0 to 1. The threshold value for WL_{t} is defined by the scaled standard deviation, $\lambda\sigma_t$. In the hybrid learning algorithm, the trained weak learners are assumed to be able to generate small prediction errors ($\varepsilon_t < 1/2$). For all $t = 1, \ldots, T$, $|\mu_t| < \eta$, where $\eta$ denotes a small error limit approaching zero. The population mean of a regression error distribution is closer to the targeted zero than the individual elements in the population. Therefore, the means of the obtained regression errors fluctuate around zero within a small range. The standard deviation, $\sigma_t$, is solely determined by the individual samples and the generalization performance of the weak learner WL_{t}. Generally, $\sigma_t$ is relatively large, such that most of the outputs will be located within the range [$\mu_t - \sigma_t$, $\mu_t + \sigma_t$], which tends to make the boosting process unstable. To maintain a stable adjustment of the threshold value, a relative factor $\lambda$ is applied to the standard deviation $\sigma_t$, which results in the robust threshold $\lambda\sigma_t$. Samples that fall within the threshold range [$\mu_t - \lambda\sigma_t$, $\mu_t + \lambda\sigma_t$] are treated as “accepted” samples. Other samples are treated as “rejected” samples. With the introduction of the robust threshold, the algorithm will be stable and resistant to noise in the data. According to the error rate of each weak learner’s regression model, each weak learner WL_{t} will be assigned an accordingly computed weight $\alpha_t$.
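The accept/reject rule based on the scaled standard deviation can be made concrete with a short NumPy sketch. The helper `robust_split`, the synthetic Gaussian errors, and the particular values of the relative factor are assumptions chosen for illustration only.

```python
import numpy as np


def robust_split(errors, lam):
    """Mark samples whose error lies within mean +/- lam * std as accepted."""
    mu = errors.mean()
    sigma = errors.std()
    accepted = np.abs(errors - mu) <= lam * sigma
    return accepted, mu, sigma


# Synthetic regression errors centred near zero (assumed Gaussian).
rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.3, size=2000)

acc_fracs = {}
for lam in (0.25, 0.5, 0.75, 1.0):
    accepted, mu, sigma = robust_split(errors, lam)
    acc_fracs[lam] = accepted.mean()
    print(f"lambda = {lam:<4}  threshold = {lam * sigma:.3f}  "
          f"accepted fraction = {acc_fracs[lam]:.3f}")
```

For approximately Gaussian errors and uniform sample weights, about 68% of the samples fall within one standard deviation of the mean, so the accepted fraction shrinks steadily as the relative factor is reduced below 1.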

For one regression problem, the performances of different weak learners may be different. The regression error distributions for different weak learners under robust threshold are shown in Figure 1.