#### Abstract

Extreme learning machine (ELM) has been well recognized as an effective learning algorithm with extremely fast learning speed and high generalization performance. However, to deal with regression applications involving big data, the stability and accuracy of ELM need to be further enhanced. In this paper, a new hybrid machine learning method called robust AdaBoost.RT based ensemble ELM (RAE-ELM) is proposed for regression problems. It combines ELM with a novel robust AdaBoost.RT algorithm to achieve better approximation accuracy than a single ELM network. The robust threshold for each weak learner adapts to that weak learner’s performance on the corresponding problem dataset. Therefore, RAE-ELM can output the final hypothesis as an optimally weighted ensemble of weak learners. Moreover, ELM is a quick learner with high regression performance, which makes it a good candidate for the “weak” learner. We prove that the empirical error of RAE-ELM is bounded above in terms of the error rates of the weak learners. The experimental verification shows that the proposed RAE-ELM outperforms other state-of-the-art algorithms on many real-world regression problems.

#### 1. Introduction

In the past decades, computational intelligence methodologies have been widely adopted and effectively utilized in various areas of scientific research and engineering applications [1, 2]. Recently, Huang et al. introduced an efficient learning algorithm, named extreme learning machine (ELM), for single-hidden layer feedforward neural networks (SLFNs) [3, 4]. Unlike conventional learning algorithms such as back-propagation (BP) methods [5] and support vector machines (SVMs) [6], ELM randomly generates the hidden neuron parameters (the input weights and the hidden layer biases) before seeing the training data and analytically determines the output weights without tuning the hidden layer of SLFNs. As the randomly generated hidden neuron parameters are independent of the training data, ELM can reach not only the smallest training error but also the smallest norm of output weights. ELM overcomes several limitations of the conventional learning algorithms, such as local minima and slow learning speed, and exhibits very good generalization performance.

As ELM is a popular learning algorithm, many variants of it have been investigated in order to further improve its generalization performance. Rong et al. [7] proposed an online sequential fuzzy extreme learning machine (OS-Fuzzy-ELM) for function approximation and classification problems. Cao et al. [8] combined the voting based extreme learning machine [9] with the online sequential extreme learning machine [10] into a new methodology, called voting based online sequential extreme learning machine (VOS-ELM). In addition, to solve the two drawbacks of the basic ELM, namely, the overfitting problem and the unstable accuracy, Luo et al. [11] presented a novel algorithm, called sparse Bayesian extreme learning machine (SB-ELM), which estimates the marginal likelihood of the output weights while automatically pruning the redundant hidden nodes. What is more, to overcome the limitations of supervised learning algorithms, according to the theory of semisupervised learning [12].

Although ELM has good generalization performance for classification and regression problems, efficiently performing training and testing on big data remains challenging for ELM as well. As a single learning machine, ELM is quite stable compared to other learning algorithms, but its classification and regression performance may still vary slightly among different trials on a big dataset. Many researchers have sought ensemble methods that integrate a set of ELMs into a combined network structure and have verified that such ensembles could perform better than an individual ELM. Lan et al. [13] proposed an ensemble of online sequential ELM (EOS-ELM), which is comprised of several OS-ELM networks. The mean of the OS-ELM networks’ outputs was used as the performance indicator of the ensemble networks. Liu and Wang [14] presented an ensemble-based ELM (EN-ELM) algorithm, where a cross-validation scheme was used to create an ensemble of ELM classifiers for decision making. Besides, Xue et al. [15] proposed a genetic ensemble of extreme learning machine (GE-ELM), which adopted genetic algorithms (GAs) to produce a group of candidate networks first. According to a specific ranking strategy, some of the networks were then selected to form a new ensemble network. More recently, Wang et al. [16] presented a parallelized ELM ensemble method based on M^{3}-network, called M^{3}-ELM. It could improve the computation efficiency by parallelism and solve imbalanced classification tasks through task decomposition.

When learning from data whose volume and variety grow exponentially, traditional learning algorithms tend to suffer from the overfitting problem. Hence, a robust and stable ensemble algorithm is of great importance. Dasarathy and Sheela [17] first introduced an ensemble system, whose idea is to partition the feature space using multiple classifiers. Furthermore, Hansen and Salamon [18] presented an ensemble of neural networks with a plurality consensus scheme to obtain far better performance in classification issues than approaches using single neural networks. After that, the ensemble-based algorithms have been widely explored [19–23]. Among the ensemble-based algorithms, Bagging and Boosting are the most prevailing methods for training neural network ensembles. The Bagging (short for Bootstrap Aggregation) algorithm randomly selects bootstrap samples from the original training set, and the diversity in the bagging-based ensembles is ensured by the variations within the bootstrapped replicas on which each classifier is trained. By using relatively weak classifiers, the decision boundaries measurably vary with respect to relatively small perturbations in the training data. As an iterative method presented by Schapire [20] for generating a strong classifier, boosting could achieve arbitrarily low training error from an ensemble of weak classifiers, each of which can barely do better than random guessing. Subsequently, a novel boosting algorithm, called adaptive boosting (AdaBoost), was presented by Schapire and Freund [21]. The AdaBoost algorithm improves on traditional boosting methods in two respects. One is that the instances are drawn into subsequent subdatasets from an iteratively updated sample distribution of the same training dataset. AdaBoost replaces random subsamples by weighted versions of the same training dataset, which can be repeatedly utilized. The training dataset is therefore not required to be very large. The other is that it defines an ensemble classifier through weighted majority voting of a set of weak classifiers, where the voting weights are based on the classifiers’ training errors.

However, many of the existing investigations on ensemble algorithms focus on classification problems. The ensemble algorithms for classification problems, unfortunately, cannot be directly applied to regression problems. Regression methods provide predicted results through analyzing historical data. Forecasting and prediction are important functional requirements for real-world applications, such as temperature prediction, inventory management, and positioning tracking in manufacturing execution systems. To solve regression problems, based on the AdaBoost algorithm for the classification problem [24–26], Schapire and Freund [21] extended AdaBoost.M2 to AdaBoost.R. In addition, Drucker [27] proposed the AdaBoost.R2 algorithm, which is based on an ad hoc modification of AdaBoost.R. Besides, Avnimelech and Intrator [28] presented the notion of weak and strong learning and an appropriate equivalence theorem between them so as to improve the boosting algorithm in regression issues. What is more, Solomatine and Shrestha [29, 30] proposed a novel boosting algorithm, called AdaBoost.RT. AdaBoost.RT projects the regression problem into the binary classification domain, which can be processed by the AdaBoost algorithm, while filtering out those examples whose relative estimation error is larger than the preset threshold value.

The proposed hybrid algorithm, which combines the effective learner, ELM, with the promising ensemble method, the AdaBoost.RT algorithm, could inherit their intrinsic properties and should be able to achieve good generalization performance when dealing with big data. As with the development effort on general ensemble algorithms, the available ELM ensemble algorithms are mainly aimed at classification problems, while regression problems with ensemble algorithms have received relatively little attention. Tian and Mao [31] presented an ensemble ELM based on a modified AdaBoost.RT algorithm (modified Ada-ELM) in order to predict the temperature of molten steel in a ladle furnace. That hybrid learning algorithm combined the modified AdaBoost.RT with ELM, which possesses the advantage of ELM and overcomes the limitation of basic AdaBoost.RT by a self-adaptively modifiable threshold value. The threshold value $\phi$ need not be constant; instead, it could be adjusted using a self-adaptive modification mechanism subject to the change trend of the prediction error at each iteration. The variation range of the threshold value is set to be [0, 0.4], as suggested by Solomatine and Shrestha [29, 30]. However, the initial value of $\phi$ is manually fixed to be the mean of the variation range of the threshold value, that is, 0.2, according to an empirical suggestion. When the error rate $\varepsilon_t$ is smaller than that of the previous iteration, $\varepsilon_t < \varepsilon_{t-1}$, the value of $\phi$ decreases, and vice versa. Hence, such an empirical suggestion based method is not fully self-adaptive over the whole threshold domain. Moreover, the manually fixed initial threshold is not related to the properties of the input dataset and the weak learners, which makes it hard for the ensemble ELM to reach a generally optimized learning effect.

This paper presents a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems, which combines ELM with the robust AdaBoost.RT algorithm. The robust AdaBoost.RT algorithm not only overcomes the limitation of the original AdaBoost.RT algorithm (original Ada-ELM), but also makes the threshold value adaptive to the input dataset and ELM networks instead of presetting it. The main idea of RAE-ELM is as follows. The ELM algorithm is selected as the “weak” learning machine to build the hybrid ensemble model. A new robust AdaBoost.RT algorithm is proposed that uses error statistics to dynamically determine the regression threshold value rather than relying on manual selection, which may be ideal for only very few regression cases. The mean and the standard deviation of the approximation errors are computed at each iteration. The *robust threshold* for each weak learner is defined to be a scaled standard deviation. Based on the concept of standard deviation, those individual training data with error exceeding the robust threshold are regarded as “flaws in this training process” and are rejected. The rejected data are processed in the later weak learners’ iterations.

We then analyze the convergence of the proposed robust AdaBoost.RT algorithm. It can be proved that the error of the final hypothesis output by the proposed ensemble algorithm is bounded above. The proposed robust AdaBoost.RT based ensemble extreme learning machine can avoid overfitting because of the characteristics of ELM: ELM tends to reach the solutions straightforwardly, and the error rate of the regression outcome at each training process is much smaller than 0.5. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the robust AdaBoost.RT based ensemble model. The experimental results demonstrate that the proposed robust AdaBoost.RT ensemble ELM (RAE-ELM) has superior learning properties in terms of stability and accuracy for regression issues and has better generalization performance than the other algorithms.

This paper is organized as follows. Section 2 gives a brief review of basic ELM. Section 3 introduces the original and the proposed robust AdaBoost.RT algorithm. The hybrid robust AdaBoost.RT ensemble ELM (RAE-ELM) algorithm is then presented in Section 4. The performance evaluation of RAE-ELM and its regression ability are verified using experiments in Section 5. Finally, the conclusion is drawn in the last section.

#### 2. Brief on ELM

Recently, Huang et al. [3, 4] proposed novel neural networks, called extreme learning machines (ELMs), for single-hidden layer feedforward neural networks (SLFNs) [32, 33]. ELM is based on the least-squares method: the input weights and the hidden layer biases are randomly assigned, and then the output weights between the hidden nodes and the output layer are analytically determined. Since the learning process in ELM takes place without iterative tuning, the ELM algorithm tends to reach the solutions straightforwardly without suffering from problems such as local minima, slow learning speed, and overfitting.

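From the standard optimization theory point of view, the objective of ELM in minimizing both the training errors and the output weights can be presented as [4]

$$\text{Minimize: } L_{P} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^{2}, \qquad \text{Subject to: } \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^{T} - \boldsymbol{\xi}_i^{T},\quad i = 1,\dots,N, \tag{1}$$

where $\boldsymbol{\xi}_i = [\xi_{i,1},\dots,\xi_{i,m}]^{T}$ is the training error vector of the $m$ output nodes with respect to the training sample $\mathbf{x}_i$. According to KKT theorem, training ELM is equivalent to solving the dual optimization problem:

$$L_{D} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^{2} - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\left(\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta}_j - t_{i,j} + \xi_{i,j}\right), \tag{2}$$

where $\boldsymbol{\beta}_j$ is the vector of the weights between the hidden layer and the $j$th output node, $\boldsymbol{\beta} = [\boldsymbol{\beta}_1,\dots,\boldsymbol{\beta}_m]$, and the $\alpha_{i,j}$ are the Lagrange multipliers. $C$ is the regularization parameter representing the trade-off between the minimization of training errors and the maximization of the marginal distance.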

According to KKT theorem, we can obtain different solutions as follows.

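*(1) Kernel Case.* Consider

$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}, \tag{3}$$

where $\mathbf{H}$ is the hidden-layer output matrix whose $i$th row is $\mathbf{h}(\mathbf{x}_i)$ and $\mathbf{T}$ is the matrix of training targets.

The ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \tag{4}$$

A kernel matrix for ELM is defined as follows:

$$\boldsymbol{\Omega}_{\mathrm{ELM}} = \mathbf{H}\mathbf{H}^{T}: \quad \Omega_{\mathrm{ELM}\,i,j} = \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i,\mathbf{x}_j). \tag{5}$$

Then, the ELM output function can be written as follows:

$$f(\mathbf{x}) = \begin{bmatrix} K(\mathbf{x},\mathbf{x}_1) \\ \vdots \\ K(\mathbf{x},\mathbf{x}_N) \end{bmatrix}^{T}\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{ELM}}\right)^{-1}\mathbf{T}. \tag{6}$$

In this special case, a corresponding kernel $K(\mathbf{u},\mathbf{v})$ is used in ELM, so the feature mapping **h**(**x**) need not be known. We call $K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j)$ the ELM random kernel when the feature mapping **h**(**x**) is randomly generated.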

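*(2) Nonkernel Case.* Similarly, based on KKT theorem, we have

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}. \tag{7}$$

In this case, the ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}. \tag{8}$$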
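The nonkernel solution can be illustrated with a minimal NumPy sketch. It assumes the standard regularized least-squares form $\boldsymbol{\beta} = (\mathbf{I}/C + \mathbf{H}^{T}\mathbf{H})^{-1}\mathbf{H}^{T}\mathbf{T}$ with a sigmoid hidden layer. The function names `elm_train` and `elm_predict`, the uniform sampling range of the random parameters, and the toy sine data are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def elm_train(X, T, n_hidden, C, rng):
    """Fit a basic ELM: random hidden layer, ridge-regularized output weights."""
    n_features = X.shape[1]
    # Hidden parameters are drawn at random and never tuned (range is an assumption).
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # sigmoid feature mapping h(x)
    # Nonkernel case: beta = (I/C + H^T H)^{-1} H^T T
    A = np.eye(n_hidden) / C + H.T @ H
    beta = np.linalg.solve(A, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Evaluate f(x) = h(x) beta for each row of X."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
T = np.sin(3.0 * X[:, 0])
W, b, beta = elm_train(X, T, n_hidden=30, C=1e6, rng=rng)
rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2)))
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse. The hidden layer is generated once and only the output weights are computed, which is why no iterative tuning is needed.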

#### 3. The Proposed Robust AdaBoost.RT Algorithm

We first describe the original AdaBoost.RT algorithm for regression problem and then present a new robust AdaBoost.RT algorithm in this section. The corresponding analysis on the novel algorithm will also be given.

##### 3.1. The Original AdaBoost.RT Algorithm

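Solomatine and Shrestha proposed AdaBoost.RT [29, 30], a boosting algorithm for regression problems, where the letters R and T represent regression and threshold, respectively. The original AdaBoost.RT algorithm is described as follows.

Input the following: (i) sequence of $m$ examples $(x_1, y_1), \dots, (x_m, y_m)$, where output $y \in \mathbb{R}$, (ii) weak learning algorithm (weak learner), (iii) integer $T$ specifying number of iterations (machines), (iv) threshold $\phi$ for demarcating correct and incorrect predictions.

Initialize the following: (i) machine number or iteration $t = 1$, (ii) distribution $D_t(i) = 1/m$ for all $i$, (iii) error rate $\varepsilon_t = 0$.

Learning steps (iterate while $t \le T$) are as follows.

*Step 1.* Call weak learner, providing it with distribution $D_t$.

*Step 2.* Build the regression model: $f_t(x) \to y$.

*Step 3.* Calculate absolute relative error (ARE) for each training example as

$$\mathrm{ARE}_t(i) = \left|\frac{f_t(x_i) - y_i}{y_i}\right|. \tag{9}$$

*Step 4.* Calculate the error rate of $f_t(x)$: $\varepsilon_t = \sum_{i:\,\mathrm{ARE}_t(i) > \phi} D_t(i)$.

*Step 5.* Set $\beta_t = \varepsilon_t^{n}$, where $n$ is the power coefficient (e.g., linear, square, or cubic).

*Step 6.* Update distribution $D_t$ as follows: if $\mathrm{ARE}_t(i) \le \phi$, then $D_{t+1}(i) = (D_t(i)/Z_t)\,\beta_t$; else, $D_{t+1}(i) = D_t(i)/Z_t$, where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution.

*Step 7.* Set $t = t + 1$.

Output the following: the final hypothesis:

$$f_{\mathrm{fin}}(x) = \frac{\sum_{t}\log(1/\beta_t)\, f_t(x)}{\sum_{t}\log(1/\beta_t)}. \tag{10}$$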
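A single iteration of Steps 3 to 6 can be sketched in plain Python. The sketch assumes the usual AdaBoost.RT definitions: the absolute relative error $|(f_t(x_i) - y_i)/y_i|$, an error rate equal to the distribution mass of samples whose ARE exceeds the threshold, and $\beta_t$ equal to the error rate raised to the power coefficient. The function name `adaboost_rt_step` and the four-sample toy data are hypothetical.

```python
def adaboost_rt_step(y_true, y_pred, dist, phi, power=2):
    """One AdaBoost.RT iteration; returns (error_rate, beta, new_distribution).

    Targets must be nonzero because the error is relative to y.
    """
    # Step 3: absolute relative error of every training example.
    are = [abs((p - y) / y) for y, p in zip(y_true, y_pred)]
    wrong = [a > phi for a in are]
    # Step 4: error rate = distribution mass of the incorrect predictions.
    eps = sum(d for d, w in zip(dist, wrong) if w)
    # Step 5: beta = eps ** n (n = 2 is the "square" choice).
    beta = eps ** power
    # Step 6: shrink the weight of correct predictions, then renormalize.
    unnorm = [d * (1.0 if w else beta) for d, w in zip(dist, wrong)]
    z = sum(unnorm)
    return eps, beta, [u / z for u in unnorm]

# Toy example: only the second sample misses the 10% relative-error threshold.
y_true = [2.0, 4.0, 5.0, 10.0]
y_pred = [2.1, 3.0, 5.2, 10.5]
dist = [0.25] * 4
eps, beta, new_dist = adaboost_rt_step(y_true, y_pred, dist, phi=0.1)
```

After the update, most of the probability mass sits on the one sample that was predicted incorrectly, so the next weak learner concentrates on it.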

The AdaBoost.RT algorithm projects the regression problem into the binary classification domain. Based on the boosting regression estimators [28] and BEM [34], the AdaBoost.RT algorithm introduces the absolute relative error (ARE) to demarcate samples as either correct or incorrect predictions. If the ARE of any particular sample is greater than the threshold $\phi$, the predicted value for this sample is regarded as an incorrect prediction. Otherwise, it is marked as a correct prediction. Such an indication method is similar to the “misclassification” and “correct-classification” labeling used in classification problems. The algorithm assigns relatively large weights to those weak learners at the front of the learner list that reach a high correct prediction rate. The samples with incorrect predictions are handled as ad hoc cases by the subsequent weak learners. The outputs from the weak learners are combined into the final hypothesis using the corresponding computed weights.

The AdaBoost.RT algorithm requires manual selection of the threshold $\phi$, which is a main factor sensitively affecting the performance of committee machines. If $\phi$ is too small, very few samples will be treated as correct predictions, and the remaining samples will easily get boosted. This requires the subsequent learners to handle a large number of ad hoc samples and makes the ensemble algorithm unstable. On the other hand, if $\phi$ is too large, say, greater than 0.4, most of the samples will be treated as correct predictions, and the algorithm fails to reject the false samples. In fact, this causes low convergence efficiency and overfitting. The initial AdaBoost.RT and its variant both suffer from this limitation in setting the threshold value, which is specified either as a manually chosen constant or as a variable changing in the vicinity of 0.2. Neither strategy is related to the regression capability of the weak learner. In order to determine the value of $\phi$ effectively, a novel improvement of AdaBoost.RT is proposed in the following section.

##### 3.2. The Proposed Robust AdaBoost.RT Algorithm

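To overcome the limitations suffered by the current works on AdaBoost.RT, we embed statistics theory into the AdaBoost.RT algorithm. This overcomes the difficulty of optimally determining the initial threshold value and enables the intermediate threshold values to be dynamically self-adjustable according to the intrinsic properties of the input data samples. The proposed robust AdaBoost.RT algorithm is described as follows.

(1) Input the following: sequence of $m$ samples $(x_1, y_1), \dots, (x_m, y_m)$, where output $y \in \mathbb{R}$; weak learning algorithm (weak learner); maximum number of iterations (machines) $T$.

(2) Initialize the following: iteration index $t = 1$; distribution $D_t(i) = 1/m$ for all $i$; the weight vector $w_i^{t} = D_t(i)$ for all $i$; error rate $\varepsilon_t = 0$.

(3) Iterate while $t \le T$ the following.

(1) Call weak learner, WL_{t}, providing it with distribution

$$\mathbf{p}^{t} = \frac{\mathbf{w}^{t}}{Z_t}, \tag{11}$$

where $Z_t$ is a normalization factor chosen such that $\mathbf{p}^{t}$ will be a distribution.

(2) Build the regression model: $f_t(x) \to y$.

(3) Calculate each error: $e_t(i) = f_t(x_i) - y_i$.

(4) Calculate the error rate: $\varepsilon_t = \sum_{i \in P} p_i^{t}$, where $P = \{\, i : |e_t(i) - \mu_t| > \lambda\sigma_t,\ i \in [1, m] \,\}$, $\mu_t$ stands for the expected value, and $\lambda\sigma_t$ is defined as robust threshold ($\sigma_t$ stands for the standard deviation, the relative factor $\lambda$ is defined as $0 < \lambda < 1$). If $\varepsilon_t > 1/2$, then set $T = t - 1$ and abort loop.

(5) Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.

(6) Calculate contribution of $f_t(x)$ to the final result: $\alpha_t = -\log(\beta_t)$.

(7) Update the weight vectors: if $i \notin P$, then $w_i^{t+1} = w_i^{t}\beta_t$; else, $w_i^{t+1} = w_i^{t}$.

(8) Set $t = t + 1$.

(4) Normalize $\alpha_1, \dots, \alpha_T$, such that $\sum_{t=1}^{T}\alpha_t = 1$. Output the final hypothesis: $f_{\mathrm{fin}}(x) = \sum_{t=1}^{T}\alpha_t f_t(x)$.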
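One iteration of the robust update can be sketched with the standard library alone. The sketch assumes that a sample is rejected when its error deviates from the mean error by more than the relative factor times the standard deviation, that $\beta_t = \varepsilon_t/(1-\varepsilon_t)$, and that only accepted samples have their weights multiplied by $\beta_t$. The use of the population standard deviation, the `ValueError` for unusable error rates, and the toy error values are assumptions made for this illustration.

```python
import math
import statistics

def robust_step(errors, p, lam=0.5):
    """One robust AdaBoost.RT update from signed errors e_t(i) and distribution p."""
    mu = statistics.fmean(errors)
    sigma = statistics.pstdev(errors)      # population std (an assumption)
    # A sample is "rejected" when its error lies outside mu +/- lam * sigma.
    rejected = [abs(e - mu) > lam * sigma for e in errors]
    eps = sum(pi for pi, r in zip(p, rejected) if r)
    if not 0.0 < eps < 0.5:
        # The boosting loop should stop here instead of using this learner.
        raise ValueError("error rate outside (0, 0.5); stop boosting")
    beta = eps / (1.0 - eps)
    alpha = -math.log(beta)                # contribution of this learner
    # Accepted samples are down-weighted; rejected samples keep their weight.
    new_w = [pi if r else pi * beta for pi, r in zip(p, rejected)]
    return eps, beta, alpha, new_w

# Toy errors: six small residuals and two gross ones (hypothetical values).
errors = [0.02, -0.02, 0.01, -0.01, 0.03, -0.03, 0.5, -0.5]
p = [1.0 / len(errors)] * len(errors)
eps, beta, alpha, new_w = robust_step(errors, p)
```

In this toy case the two gross residuals are the only rejected samples, so the error rate is 0.25 and the accepted samples lose two thirds of their weight. No threshold had to be supplied by hand: it follows from the spread of the errors themselves.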

At each iteration of the proposed robust AdaBoost.RT algorithm, the standard deviation of the approximation error distribution is used as a criterion. In probability and statistics theory, the standard deviation measures the amount of variation or dispersion from the average. If the data points tend to be very close to the mean value, the standard deviation is low. On the other hand, if the data points are spread out over a large range of values, the standard deviation is high.

Standard deviation may serve as a measure of uncertainty for a set of repeated predictions. When deciding whether predictions agree with their corresponding true values, the standard deviation of the predictions made by the underlying approximation function is of crucial importance: if the average distance from the predictions to the true values is large, then the regression model being tested probably needs to be revised. In that case, many sample points fall outside the range of values that could reasonably be expected to occur, and the prediction accuracy of the model is low.

In the proposed robust AdaBoost.RT algorithm, the approximation error of the $t$th weak learner, WL_{t}, for an input dataset could be represented as one statistical distribution with parameters $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value, $\sigma_t$ stands for the standard deviation, and $\lambda$ is an adjustable relative factor that ranges from 0 to 1. The threshold value for WL_{t} is defined by the scaled standard deviation, $\lambda\sigma_t$. In the hybrid learning algorithm, the trained weak learners are assumed to be able to generate small prediction error ($\varepsilon_t < 1/2$). For all $t \in [1, T]$, the mean error is assumed to satisfy $|\mu_t| \le \eta$, where $\eta$ denotes a small error limit approaching zero. The population mean of a regression error distribution is closer to the targeted zero than the individual elements in the population. Therefore, the means of the obtained regression errors fluctuate around zero within a small range. The standard deviation, $\sigma_t$, is solely determined by the individual samples and the generalization performance of the weak learner WL_{t}. Generally, $\sigma_t$ is relatively large such that most of the outputs will be located within the range $[\mu_t - \sigma_t, \mu_t + \sigma_t]$, which tends to make the boosting process unstable. To maintain a stable adjustment of the threshold value, a relative factor $\lambda$ is applied to the standard deviation, $\sigma_t$, which results in the robust threshold, $\lambda\sigma_t$. Samples that fall within the threshold range $[\mu_t - \lambda\sigma_t, \mu_t + \lambda\sigma_t]$ are treated as “accepted” samples. Other samples are treated as “rejected” samples. With the introduction of the robust threshold, the algorithm becomes stable and resistant to noise in the data. According to the error rate $\varepsilon_t$ of each weak learner’s regression model, each weak learner WL_{t} is assigned one accordingly computed weight $\alpha_t$.

For one regression problem, the performances of different weak learners may be different. The regression error distributions for different weak learners under robust threshold are shown in Figure 1.

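*Figure 1: Regression error distributions of different weak learners under the robust threshold. (a) Weak learner WL_{t} with a relatively large standard deviation. (b) Weak learner WL_{t+j} with a small standard deviation.*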

In Figure 1(a), the weak learner WL_{t} generates a regression error distribution with a large error rate, where the standard deviation is relatively large. On the other hand, another weak learner WL_{t+j} may generate an error distribution with a small standard deviation, as shown in Figure 1(b). The red triangle points represent “rejected” samples, whose regression error is greater than the specified robust threshold, that is, $|e_t(i) - \mu_t| > \lambda\sigma_t$, as described in step (4) of the proposed algorithm. The green circular points represent the “accepted” samples, whose regression errors are less than the robust threshold. The weight for every “accepted” sample is dynamically changed for each weak learner WL_{t}, while that for “rejected” samples remains unchanged.

As illustrated in Figure 1, in terms of stability and accuracy, the regression capability of weak learner WL_{t+j} is superior to that of WL_{t}. The robust threshold values for WL_{t+j} and WL_{t} are computed separately, and the former is smaller than the latter, which reflects their different regression performances. The proposed method overcomes the limitation suffered by the existing methods, where the threshold value is set empirically. The critical factor used in the boosting process, the threshold, becomes robust and self-adaptive to the individual weak learners’ performance on the input data samples. Therefore, the proposed robust AdaBoost.RT algorithm is capable of outputting the final hypothesis as an optimally weighted ensemble of the weak learners.

In the following, we show that the training error of the proposed robust AdaBoost.RT algorithm is bounded. One lemma is needed in order to prove the convergence of this algorithm.

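Lemma 1. *The convexity argument, $x^{d} \le 1 - (1 - x)d$, holds for any $x \ge 0$ and $0 \le d \le 1$.*

Theorem 2. *The proposed robust AdaBoost.RT algorithm generates hypotheses with errors $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_T < 1/2$. Then, the error $E$ of the final hypothesis output by this algorithm is bounded above by*

$$E \le 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_t(1 - \varepsilon_t)}. \tag{12}$$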

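*Proof.* In this proof, we need to transform the regression problem $f_t(x) \to y$ into a binary classification problem in which each sample is either accepted or rejected. In the proposed robust AdaBoost.RT algorithm, the mean of errors ($\mu_t$) is assumed to be close to zero. Thus, the dynamically adaptive thresholds can ignore the mean of errors.

Let $I_t(i) = 1$, if $|e_t(i)| > \lambda\sigma_t$; otherwise, $I_t(i) = 0$.

The final hypothesis output makes a mistake on sample $i$ only if

$$\prod_{t=1}^{T}\beta_t^{-I_t(i)} \ge \left(\prod_{t=1}^{T}\beta_t\right)^{-1/2}. \tag{13}$$

The final weight of any sample $i$ is

$$w_i^{T+1} = D(i)\prod_{t=1}^{T}\beta_t^{1 - I_t(i)}. \tag{14}$$

Combining (13) and (14), the sum of the final weights is bounded below by the sum of the final weights of the rejected samples. Writing $M$ for the set of samples on which the final hypothesis makes a mistake, consider

$$\sum_{i=1}^{m} w_i^{T+1} \ge \sum_{i \in M} w_i^{T+1} \ge \left(\sum_{i \in M} D(i)\right)\left(\prod_{t=1}^{T}\beta_t\right)^{1/2} = E\left(\prod_{t=1}^{T}\beta_t\right)^{1/2}, \tag{15}$$

where $E$ is the error of the final hypothesis output.

Based on Lemma 1,

$$\sum_{i=1}^{m} w_i^{t+1} = \sum_{i=1}^{m} w_i^{t}\beta_t^{1 - I_t(i)} \le \sum_{i=1}^{m} w_i^{t}\bigl(1 - (1 - \beta_t)(1 - I_t(i))\bigr) = \left(\sum_{i=1}^{m} w_i^{t}\right)\bigl(1 - (1 - \varepsilon_t)(1 - \beta_t)\bigr). \tag{16}$$

Combining those inequalities for $t = 1, \dots, T$, and noting that the initial weights sum to one, the following inequality is obtained:

$$\sum_{i=1}^{m} w_i^{T+1} \le \prod_{t=1}^{T}\bigl(1 - (1 - \varepsilon_t)(1 - \beta_t)\bigr). \tag{17}$$

Combining (15) and (17), we obtain that

$$E \le \prod_{t=1}^{T}\frac{1 - (1 - \varepsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}}. \tag{18}$$

Considering that all factors in the product are positive, the right-hand side can be minimized by minimizing each factor individually. Setting the derivative of the $t$th factor to zero gives $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$. Substituting this $\beta_t$ into (18) makes each factor equal to $2\sqrt{\varepsilon_t(1 - \varepsilon_t)}$, which completes the proof.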
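The last step of the proof can be checked numerically. The sketch below assumes that the $t$th factor of the bound has the form $(1 - (1 - \varepsilon_t)(1 - \beta_t))/\sqrt{\beta_t}$, as in the classical AdaBoost analysis; this form, the function name `factor`, and the chosen error rate of 0.2 are assumptions for illustration. It compares a brute-force grid minimum with the closed-form minimizer $\varepsilon_t/(1 - \varepsilon_t)$.

```python
import math

def factor(eps, beta):
    """t-th factor of the bound: (1 - (1 - eps)(1 - beta)) / sqrt(beta)."""
    return (1.0 - (1.0 - eps) * (1.0 - beta)) / math.sqrt(beta)

eps = 0.2
beta_star = eps / (1.0 - eps)                    # closed-form minimizer
grid = [k / 1000.0 for k in range(1, 1000)]      # candidate beta values in (0, 1)
beta_best = min(grid, key=lambda b: factor(eps, b))
bound_factor = 2.0 * math.sqrt(eps * (1.0 - eps))
```

For this error rate the grid search lands on the same value as the closed form, and the minimized factor equals $2\sqrt{\varepsilon_t(1 - \varepsilon_t)}$, which is strictly below 1 whenever the error rate is below 1/2.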

Unlike the original AdaBoost.RT and its existing variants, the robust threshold in the proposed AdaBoost.RT algorithm is determined automatically and is self-adaptively adjusted according to the individual weak learners and data samples. Through the analysis of the convergence of the proposed robust AdaBoost.RT algorithm, it is proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E$, is bounded above. The study shows that the robust AdaBoost.RT algorithm proposed in this paper can overcome the limitations of the available AdaBoost.RT algorithms.

#### 4. A Robust AdaBoost.RT Ensemble-Based Extreme Learning Machine

In this paper, a robust AdaBoost.RT ensemble-based extreme learning machine (RAE-ELM), which combines ELM with the robust AdaBoost.RT algorithm described in the previous section, is proposed to improve the robustness and stability of ELM. A set of $T$ ELMs is adopted as the “weak” learners. In the training phase, RAE-ELM utilizes the proposed robust AdaBoost.RT algorithm to train every ELM model and assign an ensemble weight accordingly, so that each ELM receives a corresponding distribution based on the training output. The optimally weighted ensemble model of ELMs, $f_{\mathrm{fin}}(x)$, is the final hypothesis output used for making predictions on the testing dataset. The proposed RAE-ELM is illustrated in Figure 2.

##### 4.1. Initialization

The first weak learner, ELM_{1}, is supplied with the training samples under a uniform distribution of weights, so that each sample has an equal opportunity to be chosen during the first training process for ELM_{1}.

##### 4.2. Distribution Updating

The prediction errors are used to evaluate the performance of each ELM. The prediction error of the $t$th ELM, ELM_{t}, for the input data samples could be represented as one statistical distribution, $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value, and $\lambda\sigma_t$ is defined as robust threshold ($\sigma_t$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $0 < \lambda < 1$). The robust threshold is applied to demarcate prediction errors as “accepted” or “rejected.” If the prediction error of one particular sample falls into the region that is bounded by the robust thresholds, the prediction of this sample is regarded as “accepted” for ELM_{t}; otherwise, it is regarded as “rejected.” The probabilities of the “rejected” predictions are accumulated to calculate the error rate $\varepsilon_t$. ELM attempts to achieve the regression model $f_t(x)$ with a small error rate.

The robust AdaBoost.RT algorithm then calculates the distribution for the next learner, ELM_{t+1}. For every sample that is correctly predicted by the current ELM_{t}, the corresponding weight is multiplied by the error rate function $\beta_t$. Otherwise, the weight remains unchanged. This process is iterated for the next ELM_{t+1} until the last learner ELM_{T}, unless $\varepsilon_t$ is higher than 0.5. Once the error rate is higher than 0.5, the AdaBoost algorithm does not converge and tends toward overfitting [21]. Hence, the error rate must be less than 0.5.
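The training loop described above can be sketched end to end with NumPy. This is a minimal illustration under several assumptions that the text does not specify: each ELM is trained on a resample drawn according to the current distribution, the population standard deviation is used, the loop also stops when the error rate is exactly 0 or 0.5 (to keep the logarithm finite and the ensemble weight positive), and the toy data with a few gross outliers are invented for the example. The names `fit_elm`, `rae_elm_train`, and `rae_elm_predict` are hypothetical.

```python
import numpy as np

def fit_elm(X, y, n_hidden, C, rng):
    """Basic ELM regressor: random sigmoid hidden layer, ridge output weights."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)

    def hidden(Z):
        return 1.0 / (1.0 + np.exp(-(Z @ W + b)))

    H = hidden(X)
    beta_out = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ y)
    return lambda Z: hidden(Z) @ beta_out

def rae_elm_train(X, y, T=20, lam=0.5, n_hidden=20, C=1e3, seed=0):
    """Robust AdaBoost.RT loop over ELM weak learners (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    w = np.full(m, 1.0 / m)                  # uniform initial weights
    models, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                      # distribution for ELM_t
        idx = rng.choice(m, size=m, p=p)     # assumption: resample according to p
        f = fit_elm(X[idx], y[idx], n_hidden, C, rng)
        e = f(X) - y                         # signed prediction errors
        rejected = np.abs(e - e.mean()) > lam * e.std()
        eps = p[rejected].sum()              # error rate of ELM_t
        if eps >= 0.5 or eps == 0.0:         # boundary cases added as a guard
            break
        beta = eps / (1.0 - eps)
        models.append(f)
        alphas.append(-np.log(beta))
        w = np.where(rejected, w, w * beta)  # only accepted samples are down-weighted
    alphas = np.array(alphas)
    if len(alphas):
        alphas = alphas / alphas.sum()       # normalized ensemble weights
    return models, alphas

def rae_elm_predict(models, alphas, X):
    """Weighted sum of the weak learners' outputs."""
    return sum(a * f(X) for a, f in zip(alphas, models))

# Toy data (hypothetical): a sine curve with a few gross outliers.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0])
y[::20] += 3.0
models, alphas = rae_elm_train(X, y)
```

How many learners survive depends on the data and on the relative factor: when the error distribution is not sharply peaked, a small factor can push the rejected mass above one half and stop the loop early.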

##### 4.3. Decision Making on RAE-ELM

The weight updating parameter $\beta_t$ is used as an indicator of the regression effectiveness of ELM_{t} in the current iteration. According to the relationship between $\varepsilon_t$ and $\beta_t$, if $\varepsilon_t$ increases, $\beta_t$ becomes larger as well, and RAE-ELM grants a small ensemble weight to ELM_{t}. On the other hand, an ELM_{t} with relatively superior regression performance is granted a larger ensemble weight. The hybrid RAE-ELM model combines the set of ELMs under their different weights as the final hypothesis for decision making.
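The relationship between error rate and ensemble weight can be made concrete with a short sketch. It assumes $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$ and an unnormalized weight of $-\log\beta_t$ that is then normalized to sum to one. The function names and the three example error rates and predictions are hypothetical.

```python
import math

def ensemble_weights(error_rates):
    """Map weak-learner error rates (each in (0, 0.5)) to normalized weights."""
    betas = [e / (1.0 - e) for e in error_rates]
    alphas = [-math.log(b) for b in betas]   # larger error -> larger beta -> smaller alpha
    total = sum(alphas)
    return [a / total for a in alphas]

def combine(predictions, weights):
    """Final hypothesis for one input: weighted sum of the learners' outputs."""
    return sum(w * p for w, p in zip(weights, predictions))

weights = ensemble_weights([0.10, 0.25, 0.40])
y_hat = combine([1.00, 1.20, 2.00], weights)
```

The learner with error rate 0.10 receives more than half of the total weight, so the combined prediction stays close to its output even though the weakest learner disagrees strongly.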

#### 5. Performance Evaluation of RAE-ELM

In this section, the performance of the proposed RAE-ELM learning algorithm is compared with other popular algorithms on 14 real-world regression problems covering different domains from the UCI Machine Learning Repository [35]. The specifications of the benchmark datasets are shown in Table 1. The algorithms to be compared include basic ELM [4], original AdaBoost.RT based ELM (original Ada-ELM) [30], the modified self-adaptive AdaBoost.RT ELM (modified Ada-ELM) [31], support vector regression (SVR) [36], and least-square support vector regression (LS-SVR) [37]. All the evaluations are conducted in the Matlab environment running on a Windows 7 machine with a 3.20 GHz CPU and 4 GB RAM.

In our experiments, all the input attributes and the outputs are normalized into fixed ranges. The real-world benchmark datasets are embedded with noise, their distributions are unknown, and they range from small to large sizes and from low to high dimensions. For each trial of simulations, the whole dataset of the application is randomly partitioned into a training dataset and a testing dataset with the numbers of samples shown in Table 1. 25% of the training data samples are used as the validation dataset. Each partitioned training, validation, and testing dataset is kept fixed as input for all the algorithms.

For the RAE-ELM, basic ELM, original Ada-ELM, and modified Ada-ELM algorithms, the suitable numbers of hidden nodes are determined using the preserved validation dataset. The sigmoid function is selected as the activation function in all these algorithms. Fifty trials of simulations have been conducted for each problem, with training, validation, and testing samples randomly split for each trial. The performances of the algorithms are verified using the average root mean square error (RMSE) in testing. The significantly better results are highlighted in boldface.

##### 5.1. Model Selection

In ensemble algorithms, the number of networks in the ensemble needs to be determined. According to Occam’s Razor, excessively complex models are affected by statistical noise, whereas simpler models may capture the underlying structure better and may thus have better predictive performance. Therefore, the parameter $T$, which is the number of weak learners, need not be very large.

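We define $\varepsilon_t$ in (12) as $\varepsilon_t = 1/2 - \gamma_t$, where $\gamma_t > 0$ is constant; it results in

$$E \le \prod_{t=1}^{T}\sqrt{1 - 4\gamma_t^{2}} = \exp\left(-\sum_{t=1}^{T}\mathrm{KL}\left(\tfrac{1}{2}\,\Big\|\,\tfrac{1}{2} - \gamma_t\right)\right) \le \exp\left(-2\sum_{t=1}^{T}\gamma_t^{2}\right), \tag{19}$$

where KL is the *Kullback-Leibler divergence*.

We then simplify (19) by using $\gamma$ instead of $\gamma_t$; that is, each $\gamma_t$ is set to be the same. We can get

$$E \le \left(1 - 4\gamma^{2}\right)^{T/2} = \exp\left(-T\,\mathrm{KL}\left(\tfrac{1}{2}\,\Big\|\,\tfrac{1}{2} - \gamma\right)\right) \le \exp\left(-2T\gamma^{2}\right). \tag{20}$$

From (20), we can obtain the upper bound on the number of iterations of this algorithm. Consider

$$T \le \frac{1}{\mathrm{KL}\left(\tfrac{1}{2}\,\|\,\tfrac{1}{2} - \gamma\right)}\ln\frac{1}{E} \le \frac{1}{2\gamma^{2}}\ln\frac{1}{E}. \tag{21}$$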
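The iteration bound can be evaluated for concrete numbers. The sketch assumes the standard AdaBoost form of the bound, in which every weak learner has error rate $1/2 - \gamma$ and the final error is at most $(1 - 4\gamma^{2})^{T/2}$, and it uses the closed form $\mathrm{KL}(1/2 \,\|\, 1/2 - \gamma) = -\tfrac{1}{2}\ln(1 - 4\gamma^{2})$. The function name `iteration_bound` and the example values of the margin and target error are hypothetical.

```python
import math

def iteration_bound(gamma, target_error):
    """Number of iterations after which the bound drops to target_error.

    Returns (tight, loose): the KL-based count and the simpler 1/(2 gamma^2) count.
    """
    kl = -0.5 * math.log(1.0 - 4.0 * gamma ** 2)   # KL(1/2 || 1/2 - gamma)
    tight = math.log(1.0 / target_error) / kl
    loose = math.log(1.0 / target_error) / (2.0 * gamma ** 2)
    return math.ceil(tight), math.ceil(loose)

# Weak learners with error rate 0.4 (gamma = 0.1) and a target error of 1%.
tight, loose = iteration_bound(gamma=0.1, target_error=0.01)
```

Because the bound shrinks geometrically in the number of learners, a larger margin reduces the required count sharply, which is consistent with using only a few tens of ELMs.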

For RAE-ELM, the number of ELM networks need be determined. The number of ELM networks is set to be 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal parameter is selected as the one which results in the best average RMSE in testing.

Besides, in our simulation trials, the relative factor $\lambda$ in RAE-ELM is the parameter which needs to be optimized within the range $\lambda \in (0, 1)$. We start the simulations for $\lambda$ at 0.1 and increase it at an interval of 0.1. Table 2 shows examples of setting both $T$ and $\lambda$ for our simulation trials.
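The parameter sweep amounts to a small grid search, sketched below. The callback `evaluate_rmse`, which would train RAE-ELM for one setting and return the average RMSE, is hypothetical and is replaced here by a toy function. The grid for the relative factor is assumed to stop at 0.9, since the text gives only its starting value and step.

```python
import itertools

def select_parameters(evaluate_rmse, Ts=(5, 10, 15, 20, 25, 30), lams=None):
    """Return the (T, lambda) pair with the lowest RMSE reported by evaluate_rmse."""
    if lams is None:
        # 0.1, 0.2, ..., 0.9 (the upper end of the grid is an assumption)
        lams = [round(0.1 * k, 1) for k in range(1, 10)]
    return min(itertools.product(Ts, lams), key=lambda tl: evaluate_rmse(*tl))

# Toy stand-in for "train RAE-ELM and return the average RMSE" (not real results).
def toy_rmse(T, lam):
    return abs(T - 20) / 100.0 + abs(lam - 0.5)

best_T, best_lam = select_parameters(toy_rmse)
```

With 6 ensemble sizes and 9 values of the relative factor the grid has 54 settings, each of which would be averaged over the repeated trials in a real run.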

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function achieves good generalization performance on the Parkinson disease dataset as long as the number of ELM networks is larger than 15. For a given number of ELM networks, the RMSE is relatively insensitive to the variation of and tends to be smaller when is around 0.5. For a fair comparison, we set RAE-ELM with and in the following experiments. For both original Ada-ELM and modified Ada-ELM, the ensemble model is unstable when the number of ELM networks is less than 7. The number of ELM networks is also set to be 20 for both original Ada-ELM and modified Ada-ELM.
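The two-parameter search described above amounts to an exhaustive grid search, sketched here under stated assumptions. The `evaluate` callback is hypothetical: it stands for training an ensemble with the given settings and returning its average RMSE. The upper end of the relative-factor grid (0.9) is also an assumption, since the range is not legible in this section.

```python
def grid_search(evaluate, n_networks_grid, factor_grid):
    """Return (best_score, n_networks, factor) minimizing evaluate(n, f)."""
    best = None
    for n_networks in n_networks_grid:
        for factor in factor_grid:
            score = evaluate(n_networks, factor)
            if best is None or score < best[0]:
                best = (score, n_networks, factor)
    return best

# Candidate ensemble sizes as listed in the text.
N_NETWORKS_GRID = [5, 10, 15, 20, 25, 30]

# Relative factor starting at 0.1 in steps of 0.1; stopping at 0.9 is an
# assumption made only for this sketch.
FACTOR_GRID = [round(0.1 * k, 1) for k in range(1, 10)]
```

In practice `evaluate` would wrap the training and scoring of an RAE-ELM model. Any stand-in function used to try out the sketch says nothing about the reported results.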

We use the popular Gaussian kernel function in both SVR and LS-SVR. It is well known that the performances of SVR and LS-SVR are sensitive to the combination of (, ). Hence, the cost parameter and the kernel parameter need to be adjusted over a wide range so as to obtain good generalization performance. For each dataset, 50 different values of and 50 different values of , that is, 2500 pairs of (, ), are applied as the adjustment parameters. The different values of and are . In both SVR and LS-SVR, the best-performing combination of (, ) is selected for each dataset, as presented in Table 3.

For basic ELM and the other ELM-based ensemble methods, the sigmoid function is selected as the activation function. The parameters (, ) need to be selected so as to achieve the best generalization performance, where the cost parameter is selected from the range and the different values of the hidden nodes are .

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold must be chosen before seeing the datasets. It is selected manually according to an empirical suggestion and is a sensitive factor affecting the regression performance. If is too low, it is generally difficult to obtain a sufficient number of “accepted” samples. However, if is too high, some wrong samples are treated as “accepted” ones and the ensemble model tends to be unstable. According to Shrestha and Solomatine’s experiments, the threshold should be set between 0 and 0.4 in order to make the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4. The original Ada-ELM with threshold values at could generate satisfactory results for all the regression problems, where the best-performing original Ada-ELM is shown in boldface. Moreover, the modified Ada-ELM algorithm needs an initial value of from which the subsequent thresholds are calculated in the iterations. Tian and Mao suggested setting the default initial value of to be 0.2 [31]. Because the manually fixed initial threshold is not related to how well ELM predicts on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we compare the performances of different modified Ada-ELMs at correspondingly different initial threshold values set to be . The best-performing modified Ada-ELM is also presented in Table 3.
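The sensitivity to a fixed threshold can be illustrated with a small sketch of the accept/reject step. It assumes the absolute relative error criterion of the original AdaBoost.RT [30]; the function name, the `weights` argument (the sample distribution at the current iteration), and the toy numbers in the usage note are illustrative only.

```python
def adaboost_rt_error_rate(y_true, y_pred, weights, phi):
    """Weighted error rate of one weak learner under a fixed threshold phi.

    A sample is "rejected" when its absolute relative error exceeds phi.
    Targets must be nonzero, since the relative error is undefined at zero.
    """
    rejected = [abs((p - t) / t) > phi for t, p in zip(y_true, y_pred)]
    error_rate = sum(w for w, r in zip(weights, rejected) if r)
    return error_rate, rejected
```

With targets `[1.0, 2.0, 4.0, 5.0]`, predictions `[1.1, 2.6, 4.2, 5.0]`, and uniform weights, a threshold of 0.04 rejects three of the four samples, whereas 0.4 rejects none. The same weak learner therefore looks either very poor or perfect depending only on the manually chosen threshold.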

##### 5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms

In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including basic ELM [4], original Ada-ELM [30], modified Ada-ELM [31], support vector regression [36], and least-square support vector regression [37]. The comparison results of RAE-ELM and the other learning algorithms for real-world data regression are shown in Table 4.

Table 4 lists the results averaged over multiple trials for the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] on fourteen representative real-world data regression problems. The selected datasets include large-scale and small-scale data, as well as high-dimensional and low-dimensional problems. The averaged testing RMSE obtained by RAE-ELM is the best among the six algorithms in all fourteen cases. For original Ada-ELM, the performance is sensitive to the selection of the threshold value. The best-performing original Ada-ELM models for different regression problems have correspondingly different threshold values, so manually choosing the threshold is not a good strategy. The generalization performance of modified Ada-ELM is, in general, better than that of original Ada-ELM. However, the empirically suggested initial threshold value of 0.2 does not guarantee the best-performing regression model. The three AdaBoost.RT-based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than the basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than an individual ELM used as the predictor. The averaged generalization performance of basic ELM is better than that of SVR, while it is slightly worse than that of LS-SVR.

To find the best-performing original Ada-ELM or modified Ada-ELM model for a regression problem, the optimal parameter needs to be searched by brute force, because the threshold selection is not related to the input dataset or the ELM networks. One needs to carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model. Such a process is time consuming. Moreover, the generalization performance of the optimized ensemble ELMs using original Ada-ELM or modified Ada-ELM can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is always the best-performing learner among the six candidates for all fourteen real-world regression problems.

#### 6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner, ELM, with the novel ensemble method, the robust AdaBoost.RT algorithm, yields a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than an individual ELM used as the predictor. ELM tends to reach the solutions straightforwardly, and the error rate of regression prediction is, in general, much smaller than 0.5. Therefore, selecting ELM as the “weak” learner can avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitations of the available AdaBoost.RT algorithm and its variants, in which the threshold value is manually specified and may be ideal only for a very limited set of cases. The new robust AdaBoost.RT algorithm uses the statistical distribution of the approximation errors to dynamically determine a robust threshold. The robust threshold for each weak learner WL_{t} is self-adjustable and is defined as the scaled standard deviation of the approximation errors, . We analyze the convergence of the proposed robust AdaBoost.RT algorithm. It has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, , is within a significantly superior bound.
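A minimal sketch of such a self-adjusting threshold is given below. It follows the description "scaled standard deviation of the approximation errors"; the use of the population standard deviation, the centering of errors on their mean before comparison, and the names `robust_threshold` and `rejected_mask` are assumptions of this sketch rather than details confirmed by this section.

```python
import statistics

def robust_threshold(errors, relative_factor):
    """Threshold = relative factor times the std. dev. of the errors."""
    return relative_factor * statistics.pstdev(errors)

def rejected_mask(errors, relative_factor):
    """Flag samples whose error deviates from the mean by more than the threshold.

    Centering on the mean error is an assumption made for illustration.
    """
    mean_error = statistics.mean(errors)
    threshold = robust_threshold(errors, relative_factor)
    return [abs(e - mean_error) > threshold for e in errors]
```

Because the threshold is recomputed from each weak learner's own errors, it scales with the difficulty of the dataset, and only the relative factor remains to be tuned.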

The proposed RAE-ELM is robust across different regression problems and to variations in approximation error rates, which do not significantly affect its stable generalization performance. The threshold value, one of the key parameters in the ensemble algorithm, does not need any human intervention; instead, it is self-adjusted according to the actual regression performance of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to adapt to the intrinsic properties of the given regression problem. The experimental comparisons in terms of stability and accuracy among the six algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) for regression problems verify that all the AdaBoost.RT-based ensemble ELMs perform better than SVR and, more remarkably, that the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual learner (ELM network) could already be sufficient to handle such problems well. It is worth pointing out that the proposed RAE-ELM performs better than the others especially for high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors would like to thank Professor Guang-Bin Huang from Nanyang Technological University, for providing inspiring comments and suggestions on our research. This work is financially supported by the University of Macau with Grant no. MYRG079(Y1-L2)-FST13-YZX.