Abstract
A novel ensemble scheme for the extreme learning machine (ELM), named Stochastic Gradient Boosting-based Extreme Learning Machine (SGBELM), is proposed in this paper. Instead of naively plugging the stochastic gradient boosting method into the ELM ensemble procedure, SGBELM constructs a sequence of weak ELMs in which each individual ELM is trained additively by optimizing a regularized objective. Specifically, we design an objective function based on the boosting mechanism and add a regularization term to alleviate overfitting. The derivation formula for the output-layer weights of each weak ELM is then obtained by second-order optimization. Because this formula is hard to evaluate analytically and the regularized objective favors simple functions, we take the output-layer weights learned from the current pseudo residuals as an initial heuristic item and obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. In comparison with several typical ELM ensemble methods, SGBELM achieves better generalization performance and prediction robustness, which demonstrates the feasibility and effectiveness of SGBELM.
1. Introduction
Extreme learning machine (ELM) was proposed by Huang [1–3] as a promising learning algorithm for single-hidden-layer feedforward neural networks (SLFN). It randomly chooses the weights and biases of the hidden nodes and analytically determines the output-layer weights by using the Moore-Penrose (MP) generalized inverse [4]. Because it avoids iterative parameter adjustment and time-consuming weight updating, ELM learns extremely fast and has therefore attracted a lot of attention. However, random initialization of the input-layer weights and hidden biases might generate some suboptimal parameters, which have a negative impact on generalization performance and prediction robustness.
To alleviate this weakness, many works have been proposed to further improve the generalization capability and stability of ELM, among which ELM ensemble algorithms are the most representative. Three of them are summarized as follows. The earliest ensemble-based ELM (ENELM) method was presented by Liu and Wang in [5]. ENELM introduced a cross-validation scheme into its training phase: the original training dataset was partitioned into subsets, and pairs of training and validation sets were then formed so that each training set consisted of all but one of the subsets. With updated input weights and hidden biases, individual ELMs were trained on each pair of training and validation sets, and all of the resulting ELMs were used for decision-making in the ENELM algorithm. Cao et al. [6] proposed a voting-based ELM (VELM) ensemble algorithm, which makes the final decision by majority voting in classification applications. All the individual ELMs in VELM were trained on the same training dataset, and the learning parameters of each basic ELM were randomly initialized independently. Moreover, a genetic ensemble of ELM (GEELM) method was designed by Xue et al. in [7]; it used a genetic algorithm to produce optimal input weights and hidden biases for the individual ELMs and selected, from the candidate networks, the ELMs with both higher fitness values and smaller norms of output weights. In GEELM, the fitness value of each individual ELM was evaluated on a validation set randomly selected from the entire training dataset. Several other types of ELM ensemble algorithms can be found in the literature [8–13].
For ensembles of traditional neural networks, the most prevalent approaches are Bagging and Boosting. Bagging [14] generates several training datasets from the original training dataset and then trains a component neural network on each of them. Boosting [15] generates a series of component neural networks whose training datasets are determined by the performance of the former ones. There are also many other approaches for training the component neural networks. Hampshire [16] utilizes different objective functions to train distinct component neural networks. Xu et al. [17] introduce the stochastic gradient boosting ensemble scheme to bioinformatics applications. Yao et al. [18] regard all the individuals in an evolved population of neural networks as component networks.
In this paper, we propose a new ELM ensemble scheme called Stochastic Gradient Boosting-based Extreme Learning Machine (SGBELM), which makes use of the mechanism of stochastic gradient boosting [19, 20]. SGBELM constructs an ensemble model by training a sequence of ELMs in which the output weights of each individual ELM are learned by optimizing a regularized objective in an additive manner. More specifically, we design an objective based on the training mechanism of the boosting method. To alleviate overfitting, we add to the objective function a regularization term that controls the complexity of the ensemble model. The derivation formula for the output weights of the newly added ELM is then obtained by optimizing the objective with a second-order approximation. Because the output weights of the newly added ELM at each iteration are hard to calculate analytically from the derivation formula, we take the output weights learned on the pseudo-residuals-based training dataset as an initial heuristic item and obtain the optimal output weights by using the derivation formula to update the heuristic item iteratively. Because the regularized objective favors functions that are simple as well as predictive, and because a randomly selected subset rather than the whole training set is used to minimize the training residuals at each iteration, SGBELM can continually improve the generalization capability of ELM while effectively avoiding overfitting. Experimental comparisons with Bagging ELM, Boosting ELM, ENELM, and VELM show that SGBELM obtains better classification and regression performance, which demonstrates the feasibility and effectiveness of the SGBELM algorithm.
The rest of this paper is organized as follows. In Section 2, we briefly summarize the basic ELM model as well as the stochastic gradient boosting method. Section 3 introduces our proposed SGBELM algorithm. Experimental results are presented in Section 4. Finally, we conclude the paper in Section 5.
2. Preliminaries
In this section, we briefly review the principles of the basic ELM model and the stochastic gradient boosting method to provide the necessary background for the development of the SGBELM algorithm in Section 3.
2.1. Extreme Learning Machine
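ELM is a special learning algorithm for SLFN, which randomly selects the weights (linking the input layer to the hidden layer) and biases of the hidden nodes and analytically determines the output weights (linking the hidden layer to the output layer) by using the MP generalized inverse. Suppose we have a training dataset with $N$ instances, $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{n}$ and $\mathbf{t}_i \in \mathbb{R}^{m}$; the form of the target $\mathbf{t}_i$ depends on whether the task is regression or classification. In ELM, the input weights and hidden biases can be randomly chosen according to any continuous probability distribution [2]. Namely, the input weight vectors $\mathbf{w}_j$ and hidden biases $b_j$, $j = 1, \ldots, L$, are drawn at random within a fixed range, where $L$ is the number of hidden-layer nodes in the SLFN. According to the theory proved in [2], the output-layer weights of the ELM model can be analytically calculated by
$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}.$$
Here, $\mathbf{H}^{\dagger}$ is the MP generalized inverse of the hidden-layer output matrix
$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L},$$
where $g(z) = 1/(1 + e^{-z})$ is the sigmoid activation function, and
$$\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^{\mathsf{T}}$$
is the $N \times m$ target matrix. Generally, for an unseen instance $\mathbf{x}$, ELM predicts its output as
$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\,\boldsymbol{\beta},$$
where $\mathbf{h}(\mathbf{x}) = [g(\mathbf{w}_1 \cdot \mathbf{x} + b_1), \ldots, g(\mathbf{w}_L \cdot \mathbf{x} + b_L)]$ is the hidden-layer output vector of $\mathbf{x}$.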
Because it avoids iterative adjustment of the input-layer weights and hidden biases, ELM can train thousands of times faster than traditional gradient-based learning algorithms [2]. At the same time, ELM also produces good generalization performance. It has been verified that ELM can achieve generalization performance equal to that of the typical Support Vector Machine algorithm [3].
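The training procedure described in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the toy data, and the sampling range `[-scale, scale]` for the random parameters are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit(X, T, n_hidden, rng, scale=1.0):
    """Train a basic ELM: random hidden layer, least-squares output weights.

    `scale` is an assumed sampling range [-scale, scale] for the random
    input weights and biases; it is not taken from the paper.
    """
    n_features = X.shape[1]
    W = rng.uniform(-scale, scale, size=(n_features, n_hidden))  # input weights
    b = rng.uniform(-scale, scale, size=n_hidden)                # hidden biases
    H = sigmoid(X @ W + b)                                       # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                                 # MP generalized inverse solution
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Predict with a trained ELM: hidden-layer output times output weights."""
    return sigmoid(X @ W + b) @ beta


# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(3.0 * X[:, :1]) + X[:, 1:] ** 2        # target matrix of shape (200, 1)
W, b, beta = elm_fit(X, T, n_hidden=30, rng=rng)
rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2)))
```

No iterative weight update appears anywhere in the sketch: the only fitted quantity is `beta`, which is obtained in one step from the pseudo-inverse.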
2.2. Stochastic Gradient Boosting
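The stochastic gradient boosting scheme was proposed by Friedman in [20] as a variant of the gradient boosting method presented in [19]. Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, the goal is to learn a hypothesis $F(\mathbf{x})$ that maps $\mathbf{x}$ to $y$ and minimizes the training loss
$$\sum_{i=1}^{N} \ell\bigl(y_i, F_K(\mathbf{x}_i)\bigr), \qquad F_K(\mathbf{x}) = \sum_{k=0}^{K} f_k(\mathbf{x}),$$
where $\ell$ is the loss function, which evaluates the difference between the predicted value and the target, and K denotes the number of iterations. In the boosting mechanism, K additive individual learners are trained sequentially by
$$f_k = \arg\min_{f} \sum_{i=1}^{N} \ell\bigl(y_i, F_{k-1}(\mathbf{x}_i) + f(\mathbf{x}_i)\bigr)$$
and
$$F_k(\mathbf{x}) = F_{k-1}(\mathbf{x}) + f_k(\mathbf{x}),$$
where $k = 1, \ldots, K$. This optimization problem depends heavily on the loss function and becomes intractable when $\ell$ is complex. Gradient boosting instead constructs the weak individuals from the pseudo residuals, which are the negative gradients of the loss function with respect to the model values predicted at the current learning step. For instance, let $r_{ik}$ be the pseudo residual of the $i$-th sample at the $k$-th iteration, written as
$$r_{ik} = -\left[\frac{\partial \ell\bigl(y_i, F(\mathbf{x}_i)\bigr)}{\partial F(\mathbf{x}_i)}\right]_{F = F_{k-1}};$$
the $k$-th weak learner is then trained by
$$f_k = \arg\min_{f} \sum_{i=1}^{N} \bigl(r_{ik} - f(\mathbf{x}_i)\bigr)^2.$$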
Because gradient boosting constructs the additive ensemble model by sequentially fitting a weak individual learner to the current pseudo residuals of the whole training dataset at each iteration, it costs much training time and may suffer from overfitting. In view of that, a minor modification named stochastic gradient boosting was proposed to incorporate randomization into the procedure. Specifically, at each iteration a randomly selected subset, instead of the full training dataset, is used to fit the individual learner and compute the model update for the current iteration. Namely, let $\{\pi(i)\}_{i=1}^{N}$ be a random permutation of the integers $\{1, \ldots, N\}$; a subset of size $\tilde{N} < N$ of the entire training dataset is then given by $\{(\mathbf{x}_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{\tilde{N}}$. The $k$-th weak learner under the stochastic gradient boosting scheme is trained by solving
$$f_k = \arg\min_{f} \sum_{i=1}^{\tilde{N}} \bigl(r_{\pi(i)k} - f(\mathbf{x}_{\pi(i)})\bigr)^2,$$
where $r_{\pi(i)k}$ denotes the pseudo residual of sample $\pi(i)$ at the $k$-th iteration. Given the base learner $f_0$, which is trained on the initial training dataset, the final ensemble model constructed by stochastic gradient boosting predicts an unknown testing instance $\mathbf{x}$ as
$$F_K(\mathbf{x}) = f_0(\mathbf{x}) + \sum_{k=1}^{K} f_k(\mathbf{x}).$$
Stochastic gradient boosting can also be regarded as a special line-search optimization algorithm, in which the newly added individual learner fits the steepest descent direction of the training loss on the sampled subset at each learning step.
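The stochastic gradient boosting loop reviewed above can be sketched as follows for the squared loss, where the pseudo residual is simply the target minus the current prediction. The weak learner used here (a tiny random-feature network fit by least squares), the constant base learner, and all parameter values are assumptions chosen to keep the example short; they are not the configuration used in this paper.

```python
import numpy as np


def fit_weak_learner(X, r, n_hidden, rng):
    """Hypothetical weak learner: a tiny random-feature network fit by least squares."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    beta, *_ = np.linalg.lstsq(H, r, rcond=None)
    return lambda Z: (1.0 / (1.0 + np.exp(-(Z @ W + b)))) @ beta


def stochastic_gradient_boost(X, y, n_rounds, subset_size, rng, n_hidden=5):
    """Additive model f_0 + sum_k f_k, with each f_k fit on a random subset."""
    f0 = float(np.mean(y))                       # constant base learner (assumption)
    learners = []
    F = np.full(len(y), f0)                      # current predictions on the training set
    for _ in range(n_rounds):
        # Take the first `subset_size` entries of a random permutation.
        idx = rng.permutation(len(y))[:subset_size]
        residual = y[idx] - F[idx]               # pseudo residuals for 0.5 * (y - F)^2
        f_k = fit_weak_learner(X[idx], residual, n_hidden, rng)
        learners.append(f_k)
        F = F + f_k(X)                           # additive update of the ensemble
    return lambda Z: f0 + sum(f(Z) for f in learners)


# Toy data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(300, 1))
y = np.sin(3.0 * X[:, 0])
model = stochastic_gradient_boost(X, y, n_rounds=20, subset_size=150, rng=rng)
mse_start = float(np.mean((y - np.mean(y)) ** 2))   # error of the base learner alone
mse_end = float(np.mean((y - model(X)) ** 2))       # error of the boosted ensemble
```

Each round sees only half of the training instances, so the individual learners differ from one another even though they share the same architecture.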
3. Stochastic Gradient Boosting-Based Extreme Learning Machine (SGBELM)
SGBELM is a novel hybrid learning algorithm that introduces the stochastic gradient boosting method into the ELM ensemble procedure. Because the boosting mechanism focuses on gradually reducing the training residuals at each iteration, and ELM is a network with many parameters (particularly for classification tasks), we do not combine ELM and stochastic gradient boosting naively; instead, we design an enhanced training scheme to alleviate possible overfitting. The detailed implementation of SGBELM is presented in Algorithm 2, and the determination of the optimal output weights for each individual ELM learner is illustrated in Algorithm 1.


There are many existing second-order approximation methods, including sequential quadratic programming (SQP) [21] and the majorization-minimization (MM) algorithm [22]. SQP is an effective method for nonlinearly constrained optimization that works by solving quadratic subproblems. MM optimizes a local surrogate objective that is easier to solve than the original cost function. Instead of using second-order approximation directly, SGBELM designs an optimization criterion for the output-layer weights of each individual ELM. In that sense, quadratic approximation is employed merely as an optimization tool in SGBELM.
In SGBELM, the key issue is to determine the optimal output-layer weights of each weak individual ELM, which is expected to further decrease the training loss while keeping the network structure simple. Consequently, we design a learning objective that considers not only the fitting ability on the training instances but also the complexity of the ensemble model:
$$\mathcal{L} = \sum_{i=1}^{N} \ell(y_i, \hat{y}_i) + \lambda \sum_{k=1}^{K} \Omega(f_k),$$
where $\ell$ is a differentiable loss function that measures the difference between the predicted output $\hat{y}_i$ and the target value $y_i$. The second term represents the complexity of the ensemble model consisting of $K$ weak individual learners $f_k$, with $\Omega(f_k)$ denoting the complexity of the $k$-th learner. Moreover, $\lambda$ is a regularization factor that balances training loss against architectural risk. Clearly, the objective falls back to the traditional gradient boosting method when the regularization factor is set to zero.
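In the boosting training mechanism, each individual ELM is greedily added to the current ensemble model in sequence so that it most improves the model according to (8). Specifically, let $\hat{y}_i^{(k)}$ be the predicted value of the $i$-th instance at the $k$-th iteration and $f_k$ be the $k$-th weak ELM learner that needs to be incorporated into the ensemble model; then the prediction of the $i$-th instance at the $k$-th iteration can be written as
$$\hat{y}_i^{(k)} = \hat{y}_i^{(k-1)} + f_k(\mathbf{x}_i).$$
In order to obtain the newly added individual ELM, we first introduce $f_k$ to the existing learned ensemble model and then minimize the objective
$$\mathcal{L}^{(k)} = \sum_{i=1}^{N} \ell\bigl(y_i, \hat{y}_i^{(k-1)} + f_k(\mathbf{x}_i)\bigr) + \lambda \sum_{j=1}^{k} \Omega(f_j),$$
where $\hat{y}_i^{(k-1)}$ is already obtained at the previous iterations. As a consequence, the complexity of the learned ensemble model, $\lambda \sum_{j=1}^{k-1} \Omega(f_j)$, is a constant, and we only need to take $\Omega(f_k)$ into consideration. Removing the constant item, the objective is simplified as
$$\mathcal{L}^{(k)} = \sum_{i=1}^{N} \ell\bigl(y_i, \hat{y}_i^{(k-1)} + f_k(\mathbf{x}_i)\bigr) + \lambda\, \Omega(f_k).$$
Stochastic gradient boosting selects a random subset of size $\tilde{N}$ of the whole training set to fit the individual learner at each iteration. Namely, let $\{\pi(i)\}_{i=1}^{N}$ be a random permutation of the integers $\{1, \ldots, N\}$; then we can define a stochastic subset as $\{(\mathbf{x}_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{\tilde{N}}$. Accordingly, the objective using stochastic gradient boosting is transformed as
$$\tilde{\mathcal{L}}^{(k)} = \sum_{i=1}^{\tilde{N}} \ell\bigl(y_{\pi(i)}, \hat{y}_{\pi(i)}^{(k-1)} + f_k(\mathbf{x}_{\pi(i)})\bigr) + \lambda\, \Omega(f_k).$$
We use a second-order approximation to optimize the above learning objective, where the loss function is expanded by Taylor expansion as
$$\ell\bigl(y_j, \hat{y}_j^{(k-1)} + f_k(\mathbf{x}_j)\bigr) \approx \ell\bigl(y_j, \hat{y}_j^{(k-1)}\bigr) + g_j f_k(\mathbf{x}_j) + \tfrac{1}{2} h_j f_k(\mathbf{x}_j)^2,$$
where $j = \pi(i)$ is the new index for $i$ in the randomly generated subset,
$$g_j = \frac{\partial \ell\bigl(y_j, \hat{y}_j^{(k-1)}\bigr)}{\partial \hat{y}_j^{(k-1)}}$$
is the first-order gradient statistic of the loss function with respect to the current predicted output $\hat{y}_j^{(k-1)}$, and
$$h_j = \frac{\partial^2 \ell\bigl(y_j, \hat{y}_j^{(k-1)}\bigr)}{\partial \bigl(\hat{y}_j^{(k-1)}\bigr)^2}$$
is the second-order gradient statistic of the loss function with respect to the current predicted output $\hat{y}_j^{(k-1)}$. Because the training loss is approximated in this way, we can provide a general solution scheme regardless of the specific type of loss function. In addition, second-order optimization tends to achieve better convergence than the traditional gradient method [23].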
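As a worked illustration of the first-order and second-order gradient statistics, consider a squared loss with a factor of one half. The scaling is an assumption made here for convenience, and the symbols $\hat{y}_i^{(k-1)}$ (current prediction), $f_k$ (new learner), $g_i$, and $h_i$ are notation chosen for this example. In this case the second-order Taylor expansion is exact rather than approximate:

```latex
\begin{align*}
\ell(y_i, \hat{y}_i) &= \tfrac{1}{2}\,(y_i - \hat{y}_i)^2, \\
g_i &= \left.\frac{\partial \ell}{\partial \hat{y}_i}\right|_{\hat{y}_i = \hat{y}_i^{(k-1)}}
     = \hat{y}_i^{(k-1)} - y_i, \\
h_i &= \left.\frac{\partial^2 \ell}{\partial \hat{y}_i^{2}}\right|_{\hat{y}_i = \hat{y}_i^{(k-1)}}
     = 1, \\
\ell\bigl(y_i, \hat{y}_i^{(k-1)} + f_k(\mathbf{x}_i)\bigr)
    &= \ell\bigl(y_i, \hat{y}_i^{(k-1)}\bigr) + g_i\, f_k(\mathbf{x}_i)
       + \tfrac{1}{2}\, h_i\, f_k(\mathbf{x}_i)^2 .
\end{align*}
```

Under this loss, $-g_i = y_i - \hat{y}_i^{(k-1)}$ coincides with the pseudo residual of gradient boosting, so fitting the new learner to the pseudo residuals and minimizing the quadratic expansion agree when no regularization is applied.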
Obviously, the loss at the previous prediction is a fixed value and can be dropped, so the objective depends only on the first-order and second-order gradient statistics, the output of the new learner, and the regularization term. The objective can then be rewritten in matrix form. The $k$-th individual learner is a basic ELM model, which randomly selects its input-layer weights and hidden biases. Given the resulting hidden-layer output matrix, the output of the learner on the sampled subset is the product of that matrix and the output-layer weight matrix, which is the quantity that needs to be determined. As Bartlett [24] pointed out, networks tend to generalize better when they have not only a small training error but also a small norm of weights, so we use the L2-norm of the output-layer weights to evaluate the complexity of a basic ELM model. Accordingly, the final objective is the matrix-form second-order loss plus the L2-norm penalty on the output-layer weights. From (30), we can find that the objective depends only on the output-layer weight matrix at the $k$-th iteration. Taking the partial derivative of the objective with respect to each element of this matrix yields the derivation formula (32). This formula is difficult to evaluate analytically. Since our regularized objective tends to generate an ensemble model that employs predictive as well as simple hypotheses, (32) can nevertheless be used as an optimization criterion. Specifically, we take the output-layer weights determined on the pseudo-residuals dataset as an initial heuristic item and obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. Algorithm 1 illustrates how the optimal output weight matrix is determined, and the detailed implementation of SGBELM is presented in Algorithm 2.
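To make the roles of the gradient statistics and the L2-norm penalty concrete, the sketch below treats a simplified single-output case in which the second-order objective is assumed to be the sum over samples of g_i f_i + 0.5 h_i f_i^2 plus lam times the squared norm of the output weights, with f = H @ beta. Under that assumption the stationarity condition is a linear system. This is not Algorithm 1 and does not reproduce formula (32); the function name, the toy data, and the value of `lam` are hypothetical.

```python
import numpy as np


def second_order_output_weights(H, g, h, lam):
    """Minimize sum_i [g_i f_i + 0.5 h_i f_i^2] + lam * ||beta||^2 with f = H @ beta.

    Setting the gradient H^T g + H^T diag(h) H beta + 2 lam beta to zero gives
    a linear system in beta (simplified single-output case only).
    """
    n_hidden = H.shape[1]
    A = H.T @ (h[:, None] * H) + 2.0 * lam * np.eye(n_hidden)
    return np.linalg.solve(A, -H.T @ g)


# Toy data and a random hidden layer (hypothetical, for illustration only).
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(100, 3))
y = X[:, 0] - 2.0 * X[:, 1] * X[:, 2]
y_hat = np.zeros_like(y)                       # current ensemble prediction

W = rng.uniform(-1.0, 1.0, size=(3, 8))
b = rng.uniform(-1.0, 1.0, size=8)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))         # hidden-layer output matrix

g = y_hat - y                                  # first-order statistics, loss 0.5 * (y - y_hat)^2
h = np.ones_like(y)                            # second-order statistics for the same loss

beta_init = np.linalg.pinv(H) @ (y - y_hat)    # weights fit directly to the pseudo residuals
beta_reg = second_order_output_weights(H, g, h, lam=1.0)
```

In this simplified setting the regularized solution `beta_reg` has a smaller norm than `beta_init`, which is the effect the L2-norm penalty is meant to have on each weak learner.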
In Algorithm 2, all the input weights and hidden biases of the individual ELMs are randomly chosen within a fixed range. In boosting-based ensemble methods, the initial base learner is expected to be enhanced by adding weak individual learners to the current ensemble model step by step. A high-precision initial base learner might therefore reduce the effectiveness of the ensemble. In order to control the fitting ability of the initial base learner while reducing the instability caused by random determination of the input weights and hidden biases, SGBELM conducts multiple random initializations of the parameters of the base learner and takes their average; for instance, the average of 100 random initializations. For the weak individual ELMs, each of which plays a smaller role in the whole ensemble model, random initialization of the parameters instead increases the diversity between the weak individual learners.
4. Performance Validation
In this section, a series of experiments is conducted to validate the feasibility and effectiveness of the proposed SGBELM algorithm. We compare its generalization performance and prediction stability with those of several typical ensemble learning methods (ENELM [5], VELM [6], Bagging [14], and Adaboost [15]) on 4 KEEL [25] regression datasets and 5 UCI [26] classification datasets. In all of the above ensemble methods, the basic ELM model proposed in [2] is used as the individual learner, with the sigmoid function as the activation function. All the experiments are carried out on a desktop computer with the Windows 10 operating system, an Intel(R) i5-4590 3.30 GHz CPU, and 12 GB of memory, and are implemented in MATLAB 9.0. All the experimental results are averages over 50 repeated trials. The experiments are divided into two parts: one evaluates the performance of SGBELM, and the other measures the effect of the learning parameters on training SGBELM.
4.1. Performance Evaluation of SGBELM
For regression problems, the performance of SGBELM and of the comparative algorithms is measured by the Root Mean Square Error (RMSE), which reflects the difference between the predicted value and the target. In this paper, we take the squared loss as the loss function of SGBELM for regression tasks: if $\hat{y}_i$ and $y_i$ are the predicted value and the target of the $i$-th instance, respectively, the loss is the squared difference between them. Since VELM and ENELM are designed for classification applications, we compare the generalization capability of SGBELM with that of the basic ELM, simple ensemble ELM, Bagging ELM, and Adaboost ELM in regression tasks. Among them, simple ensemble ELM can be considered a variant of the VELM method: it trains a number of individual ELMs independently and takes the simple average of all their predictions. Adaboost ELM is implemented with the Adaboost.R2 method [27], which adapts the original Adaboost algorithm designed for classification tasks [15] to regression. Furthermore, to train each individual learner in Adaboost.R2 ELM, we resample the original training dataset rather than assigning a weight to every instance.
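The two ingredients mentioned above, the RMSE metric and training on a resampled dataset instead of weighting instances, can be sketched as follows. The helper names and the toy arrays are hypothetical, and the resampling shown is the generic weighted bootstrap; the exact weight update of Adaboost.R2 is not reproduced here.

```python
import numpy as np


def rmse(y_pred, y_true):
    """Root mean square error between predictions and targets."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def resample_by_weight(X, y, weights, rng):
    """Draw a same-size training set with replacement, proportional to `weights`."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                   # normalize to a probability vector
    idx = rng.choice(len(y), size=len(y), replace=True, p=p)
    return X[idx], y[idx]


rng = np.random.default_rng(0)
X = np.arange(6, dtype=float).reshape(3, 2)
y = np.array([1.0, 2.0, 5.0])
error = rmse([1.0, 2.0, 3.0], y)                      # only the third prediction is off, by 2
X_new, y_new = resample_by_weight(X, y, [0.0, 0.0, 1.0], rng)
```

Instances with larger weights appear more often in the resampled set, so a learner that cannot accept per-instance weights still concentrates on them.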
The performance of the traditional ELM, simple ensemble ELM, Bagging ELM, Adaboost ELM, and the proposed SGBELM is compared on 4 representative regression datasets selected from the KEEL [25] repository. All the inputs of each dataset are normalized into a common range. The characteristics of these datasets are summarized in Table 1, where each original dataset is divided into a training set and a testing set. In our regression experiments, the number of hidden nodes is selected from a set of candidate values for each dataset, and the parameters of SGBELM are set to fixed values. The settings of the other comparative algorithms can be found in Table 2. Figure 1 shows the training and testing RMSE of the different learning methods over 50 trials on the Friedman dataset. The detailed comparison between SGBELM and the other learning algorithms on the 4 regression benchmark datasets is shown in Table 2. Furthermore, we compare the training and testing performance of SGBELM with that of Adaboost.R2 with regard to the number of iterations on the Mortgage dataset, as presented in Figure 3(a).
As for classification problems, like other typical feedforward neural networks (for instance, BP neural networks [28]), SGBELM evaluates the predicted output by the sum of squared errors. Specifically, let $\hat{\mathbf{y}}_i$ be the predicted output vector and $\mathbf{y}_i$ be the one-hot encoded target [29] of the $i$-th sample; the loss function of SGBELM for classification is defined as the sum of squared errors between them. SGBELM thus aims to reduce the training RMSE step by step in classification problems as well. Accordingly, we compare SGBELM with several representative ensemble learning methods, including VELM, ENELM, Bagging ELM, and Adaboost ELM. Among them, VELM and ENELM have been briefly summarized in Section 1, and Adaboost ELM is implemented with the Adaboost.SAMME method [30], which extends the original Adaboost designed for binary classification to multiclass problems.
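A small sketch of the classification setup described above: targets are one-hot encoded, the loss is the per-sample sum of squared errors, and a class is read off from the output vector. Two points are assumptions of this example rather than statements from the paper: the loss is written without any constant scaling factor, and the predicted class is taken as the index of the largest output.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Encode integer class labels 0..n_classes-1 as one-hot row vectors."""
    T = np.zeros((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T


def squared_error_loss(Y_hat, T):
    """Per-sample sum of squared errors between outputs and one-hot targets."""
    return np.sum((Y_hat - T) ** 2, axis=1)


labels = np.array([0, 2, 1])
T = one_hot(labels, n_classes=3)
Y_hat = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])             # hypothetical network outputs
loss = squared_error_loss(Y_hat, T)
predicted = np.argmax(Y_hat, axis=1)            # class with the largest output (assumption)
accuracy = float(np.mean(predicted == labels))
```

Because the loss is differentiable in every output component, the same first-order and second-order gradient statistics used for regression apply component by component.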
Similarly, we select 5 popular classification datasets from the UCI Machine Learning Repository [26] to verify the performance of the proposed SGBELM algorithm. For each dataset, all the decision attributes are encoded with the one-hot scheme [29]. The characteristics of these datasets are described in Table 3, where each original dataset is equally divided into a training set and a testing set. The number of hidden nodes is again selected from a set of candidate values for each dataset, and the parameters of SGBELM are set to fixed values. ENELM uses tenfold cross-validation, and VELM uses an ensemble of 7 individual ELMs. Other settings can be found in Table 4. Figure 2 shows the training and testing accuracy of the different algorithms over 50 trials on the Segmentation dataset. The detailed performance of SGBELM in comparison with the other learning algorithms on the 5 classification benchmark datasets is summarized in Table 4. Lastly, the training and testing accuracy of SGBELM and Adaboost.SAMME with regard to the number of iterations on the Segmentation dataset is presented in Figure 3(b).
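Figure 3: Training and testing performance of Adaboost and SGBELM with regard to the number of iterations. (a) On the Mortgage dataset (regression). (b) On the Segmentation dataset (classification).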
Tables 2 and 4 present the comparison results, including training time, training RMSE/accuracy, and testing RMSE/accuracy, for the regression and classification tasks, respectively. SGBELM obtains better generalization capability in most cases without significantly increasing the training time. At the same time, SGBELM tends to have smaller training and testing deviations (Dev) than the comparative learning algorithms, which supports the robustness and stability of the proposed SGBELM algorithm. In particular, since SGBELM adopts a training mechanism similar to that of Adaboost, integrating multiple weak individual learners sequentially, the number of hidden nodes is set to a smaller value in both SGBELM and Adaboost. It is worth noting that SGBELM achieves better performance than the existing methods with fewer hidden nodes and outperforms Adaboost with the same number of hidden nodes.
From Figures 1 and 2, we can see that SGBELM is more stable than the traditional ELM, simple ensemble, Bagging, and Adaboost.R2 in regression problems and is also more robust than VELM, ENELM, Bagging, and Adaboost.SAMME in classification problems. SGBELM not only focuses on reducing the prediction bias, as other boosting-like methods do, but also generates a robust ensemble model with low variance. As observed in Figure 2, although Adaboost.SAMME yields higher training accuracy than SGBELM in most of the 50 trials, SGBELM obtains better generalization capability (testing accuracy). This can be explained by two reasons: first, we introduce a regularization term (L2-norm) into the learning objective to control the complexity of the ensemble model; second, a randomly selected subset rather than the whole training dataset is used to minimize the training loss at each iteration of the proposed SGBELM algorithm.
Figure 3 shows the training and testing RMSE/accuracy of Adaboost (Adaboost.R2 for regression and Adaboost.SAMME for classification) and SGBELM with regard to the number of iterations. The fixed reference line denotes the training and testing performance of a traditional ELM equipped with many more hidden nodes. As shown in Figure 3, SGBELM clearly improves the generalization capability of the initial base ELM in both regression and classification tasks. In Figure 3(a), the training and testing RMSE decline gradually as the number of iterations increases. Similarly, both the training and testing accuracy curves show an increasing trend in Figure 3(b). Because we conduct multiple random initializations of the parameters of the initial base learner and take the average, the fitting ability of the base learner is artificially weakened to some extent. As a result, the initial training and testing performance of SGBELM is much worse than that of the initial Adaboost. Both SGBELM and Adaboost outperform the traditional ELM equipped with more hidden nodes after a small number of learning steps. Furthermore, SGBELM performs better than Adaboost after only 5 iterations in the regression task and 10 iterations in the classification task. This supports the fast convergence of the second-order optimization method incorporated into SGBELM.
From the experimental results on both regression and classification problems, we can conclude that the proposed SGBELM algorithm not only achieves better generalization capability (low prediction bias) than the typical existing variants of ELM but also yields a sufficiently robust ELM ensemble model (low prediction variance).
4.2. Impact of Learning Parameters on Training SGBELM
To achieve good generalization performance, three learning parameters of SGBELM, namely the number of hidden nodes, the regularization factor, and the size of the subset, need to be chosen appropriately. In this section, we evaluate the impact of these learning parameters on training SGBELM and provide some empirical guidance for choosing them.
For the basic ELM model, the number of hidden nodes decides the model's capacity. In other words, an ELM with more hidden nodes is more complex and can deal with more training instances. However, it tends to overfit when the number of hidden nodes is set too large. The regularization factor balances the training loss against the complexity of the model; in other words, it controls the capacity or complexity of the model. The size of the subset is the number of training instances used at each iteration, and it introduces randomization into the training procedure of SGBELM. Firstly, we use a grid-search method to observe the training and testing performance of SGBELM with different numbers of hidden nodes and regularization factors, with the subset size held fixed. The training and testing performance of SGBELM with regard to the combination of these two parameters on the Spambase dataset is shown in Figure 4. Secondly, as we empirically find that the optimal subset size depends strongly on the size of the training dataset, we conduct two experiments (one on a small dataset and one on a large dataset) to measure the impact of the subset size on training SGBELM. We choose the optimal values of the number of hidden nodes and the regularization factor according to the grid-search results. Figure 5 shows the training and testing performance of SGBELM with different sampling fractions on the Wizmir and Spambase datasets.
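The grid search over the number of hidden nodes and the regularization factor can be sketched as below. To keep the example short, a single L2-regularized ELM scored on a held-out split stands in for the full SGBELM ensemble, and the candidate grids, the toy data, and the function name are all hypothetical; the grids used in the experiments are not reproduced here.

```python
import itertools
import numpy as np


def ridge_elm_rmse(X_tr, y_tr, X_va, y_va, n_hidden, lam, seed):
    """Validation RMSE of one L2-regularized ELM (a stand-in for the full ensemble)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X_tr.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    hidden = lambda Z: 1.0 / (1.0 + np.exp(-(Z @ W + b)))
    H = hidden(X_tr)
    # Regularized least squares for the output weights.
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y_tr)
    return float(np.sqrt(np.mean((hidden(X_va) @ beta - y_va) ** 2)))


# Toy data split into training and validation parts (hypothetical).
rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = np.sin(3.0 * X[:, 0]) * X[:, 1] + 0.1 * rng.standard_normal(400)
X_tr, y_tr, X_va, y_va = X[:300], y[:300], X[300:], y[300:]

hidden_grid = [5, 10, 20, 40]             # hypothetical candidate numbers of hidden nodes
lambda_grid = [1e-3, 1e-2, 1e-1, 1.0]     # hypothetical candidate regularization factors

scores = {
    (n_hidden, lam): ridge_elm_rmse(X_tr, y_tr, X_va, y_va, n_hidden, lam, seed=0)
    for n_hidden, lam in itertools.product(hidden_grid, lambda_grid)
}
best_hidden, best_lam = min(scores, key=scores.get)
```

Plotting `scores` over the two grids gives the kind of surface from which a suitable combination of the two parameters can be read off.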
Figure 5: Training and testing performance of SGBELM with different sampling fractions. (a) On the Wizmir dataset. (b) On the Spambase dataset.
As shown in Figure 4, changing the parameter values has a significant effect on the training and testing accuracy of the SGBELM algorithm. SGBELM with excess hidden nodes is more likely to overfit when the regularization factor is set to a small value, and it can effectively reduce overfitting when the regularization factor is assigned a proper value. In addition, Figure 4 shows that SGBELM achieves better performance with enough hidden nodes and a proper regularization factor. This can be explained as follows: although SGBELM with a small number of hidden nodes intuitively avoids overfitting, it also lacks the capacity to fit the current training residuals appropriately.
From Figure 5, it is clear that randomization improves the performance of SGBELM substantially. Because each weak individual ELM is learned on a randomly selected subset of the whole training dataset, the diversity among the individuals increases. On the other hand, randomization introduces a noisy estimate of the total training loss. As a result, it slows down convergence and even makes the learning curve fluctuate (higher variance) if the subset is too small. The best value of the sampling fraction differs between the Wizmir dataset and the Spambase dataset, and in both cases there is a clear improvement in testing performance compared with no sampling at all. Because the optimal values differ, the sampling fraction should be determined according to the specific learning task and assigned a larger value on training datasets containing more instances.
5. Conclusions
In this paper, we proposed a novel ensemble model named Stochastic Gradient Boosting-based Extreme Learning Machine (SGBELM). Instead of combining ELM and stochastic gradient boosting naively, we construct a sequence of ELMs in which the output-layer weights of each weak ELM are determined by optimizing the regularized objective additively. First, by minimizing the objective with a second-order approximation, we obtain the derivation formula for the output-layer weights of each individual ELM. Then we take the output-layer weights learned from the current pseudo residuals as a heuristic item and obtain the optimal output-layer weights by updating the heuristic item iteratively. The performance of SGBELM was evaluated on 4 regression and 5 classification datasets. In comparison with several typical ELM ensemble methods, SGBELM obtained better performance and robustness, which demonstrates the feasibility and effectiveness of the SGBELM algorithm.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Authors’ Contributions
Hua Guo and Jikui Wang contributed equally to this work.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61503252 and 61473194), the China Postdoctoral Science Foundation (2016T90799), and the Natural Science Foundation of Gansu Province (17JR5RA177).