Computational Intelligence and Neuroscience
Volume 2018, Article ID 4058403, 14 pages
https://doi.org/10.1155/2018/4058403
Research Article

SGB-ELM: An Advanced Stochastic Gradient Boosting-Based Ensemble Scheme for Extreme Learning Machine

1School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China
2College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China

Correspondence should be addressed to Jikui Wang; wjkweb@szu.edu.cn

Received 11 December 2017; Revised 10 May 2018; Accepted 4 June 2018; Published 26 June 2018

Academic Editor: Pedro Antonio Gutierrez

Copyright © 2018 Hua Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

A novel ensemble scheme for extreme learning machine (ELM), named Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM), is proposed in this paper. Instead of directly incorporating the stochastic gradient boosting method into the ELM ensemble procedure, SGB-ELM constructs a sequence of weak ELMs where each individual ELM is trained additively by optimizing a regularized objective. Specifically, we design an objective function based on the boosting mechanism, in which a regularization term is introduced to alleviate overfitting. Then the derivation formula for solving the output-layer weights of each weak ELM is determined using second-order optimization. As the derivation formula is hard to calculate analytically and the regularized objective tends to employ simple functions, we take the output-layer weights learned from the current pseudo residuals as an initial heuristic item and then obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. In comparison with several typical ELM ensemble methods, SGB-ELM achieves better generalization performance and prediction robustness, which demonstrates the feasibility and effectiveness of SGB-ELM.

1. Introduction

Extreme learning machine (ELM) was proposed by Huang et al. [1–3] as a promising learning algorithm for single-hidden-layer feedforward neural networks (SLFN); it randomly chooses weights and biases for hidden nodes and analytically determines the output-layer weights by using the Moore-Penrose (MP) generalized inverse [4]. Because it avoids iterative parameter adjustment and time-consuming weight updating, ELM obtains an extremely fast learning speed and thus attracts a lot of attention. However, random initialization of input-layer weights and hidden biases might generate some suboptimal parameters, which have a negative impact on its generalization performance and prediction robustness.

To alleviate this weakness, many works have been proposed to further improve the generalization capability and stability of ELM, among which ELM ensemble algorithms are the most representative. Three representative ELM ensemble algorithms are summarized as follows. The earliest ensemble based ELM (EN-ELM) method was presented by Liu and Wang in [5]. EN-ELM introduced the cross-validation scheme into its training phase, where the original training dataset was partitioned into several subsets, and pairs of training and validation sets were then formed so that each training set consists of all but one of the subsets. Additionally, with updated input weights and hidden biases, individual ELMs were trained on each pair of training and validation sets, and the resulting collection of ELMs was used for decision-making in the EN-ELM algorithm. Cao et al. [6] proposed a voting-based ELM (V-ELM) ensemble algorithm, which made the final decision based on the majority voting mechanism in classification applications. All the individual ELMs in V-ELM were trained on the same training dataset, and the learning parameters of each basic ELM were randomly initialized independently. Moreover, a genetic ensemble of ELM (GE-ELM) method was designed by Xue et al. in [7], which used the genetic algorithm to produce optimal input weights as well as hidden biases for individual ELMs and selected, from the candidate networks, ELMs with not only higher fitness values but also a smaller norm of output weights. In GE-ELM, the fitness value of each individual ELM was evaluated on a validation set randomly selected from the entire training dataset. Several other types of ELM ensemble algorithms can be found in the literature [8–13].

As for ensembles of traditional neural networks, the most prevailing approaches are Bagging and Boosting. Bagging [14] generates several training datasets from the original training dataset and then trains a component neural network on each of them. Boosting [15] generates a series of component neural networks whose training datasets are determined by the performance of the former ones. There are also many other approaches for training the component neural networks. Hampshire [16] utilizes different objective functions to train distinct component neural networks. Xu et al. [17] introduce the stochastic gradient boosting ensemble scheme to bioinformatics applications. Yao et al. [18] regard all the individuals in an evolved population of neural networks as component networks.

In this paper, we propose a new ELM ensemble scheme called Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM), which makes use of the mechanism of stochastic gradient boosting [19, 20]. SGB-ELM constructs an ensemble model by training a sequence of ELMs where the output weights of each individual ELM are learned by optimizing a regularized objective in an additive manner. More specifically, we design an objective based on the training mechanism of the boosting method. In order to alleviate overfitting, we add to the objective function a regularization term that controls the complexity of our ensemble model. Then the derivation formula for solving the output weights of the newly added ELM is determined by optimizing the objective using second-order approximation. As the output weights of the newly added ELM at each iteration are hard to calculate analytically from the derivation formula, we take the output weights learned on the pseudo-residuals-based training dataset as an initial heuristic item and then obtain the optimal output weights by using the derivation formula to update the heuristic item iteratively. Because the regularized objective tends to employ functions that are not only predictive but also simple, and because a randomly selected subset rather than the whole training set is used to minimize the training residuals at each iteration, SGB-ELM can continually improve the generalization capability of ELM while effectively avoiding overfitting. The experimental results in comparison with Bagging ELM, Boosting ELM, EN-ELM, and V-ELM show that SGB-ELM obtains better classification and regression performance, which demonstrates the feasibility and effectiveness of the SGB-ELM algorithm.

The rest of this paper is organized as follows. In Section 2, we briefly summarize the basic ELM model as well as the stochastic gradient boosting method. Section 3 introduces our proposed SGB-ELM algorithm. Experimental results are presented in Section 4. Finally, we conclude this paper and make some discussions in Section 5.

2. Preliminaries

In this section, we briefly review the principles of basic ELM model and the stochastic gradient boosting method to provide necessary backgrounds for the development of SGB-ELM algorithm in Section 3.

2.1. Extreme Learning Machine

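ELM is a special learning algorithm for SLFN, which randomly selects the weights (linking the input layer to the hidden layer) and biases for hidden nodes and analytically determines the output weights (linking the hidden layer to the output layer) by using the MP generalized inverse. Suppose we have a training dataset with $N$ instances $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ and $y_i \in \mathbb{R}^{m}$. It is known that $m = 1$ for regression and $m > 1$ for classification. In ELM, the input weights and hidden biases can be randomly chosen according to any continuous probability distribution [2]. Namely, we randomly select the learning parameters within a fixed range as

$$W = [w_1, w_2, \ldots, w_L] \in \mathbb{R}^{d \times L}$$

and

$$b = [b_1, b_2, \ldots, b_L] \in \mathbb{R}^{L},$$

where $L$ is the number of hidden-layer nodes in SLFN. Depending on the theory proved in [2], the output-layer weights in the ELM model can be analytically calculated by

$$\beta = H^{\dagger} Y.$$

Here, $H^{\dagger}$ is the MP generalized inverse of the hidden-layer output matrix

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L},$$

where $w_j \in \mathbb{R}^{d}$, $b_j \in \mathbb{R}$, and $g(z) = 1/(1 + e^{-z})$ is the sigmoid activation function, and

$$Y = [y_1, y_2, \ldots, y_N]^{T} \in \mathbb{R}^{N \times m}$$

is the target matrix. Generally, for an unseen instance $x$, ELM predicts its output as follows:

$$f(x) = h(x)\beta,$$

where $h(x) = [g(w_1 \cdot x + b_1), \ldots, g(w_L \cdot x + b_L)]$ is the hidden-layer output vector of $x$.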
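The training and prediction steps of a basic ELM can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names `elm_fit` and `elm_predict`, the sampling range $[-1, 1]$, and the toy dataset are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, Y, n_hidden, rng):
    """Train a basic ELM and return (W, b, beta)."""
    n_features = X.shape[1]
    # Random input weights and hidden biases; the range [-1, 1] is an assumption.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)            # hidden-layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ Y      # output weights via the MP generalized inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predict f(x) = h(x) beta for each row of X."""
    return sigmoid(X @ W + b) @ beta

# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = np.sin(3.0 * X[:, :1]) + X[:, 1:] ** 2     # target matrix, shape (200, 1)
W, b, beta = elm_fit(X, Y, n_hidden=30, rng=rng)
train_rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - Y) ** 2)))
print(f"training RMSE: {train_rmse:.4f}")
```

Only `beta` is learned from data; `W` and `b` stay at their random values, which is why training reduces to a single pseudo-inverse.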

Because it avoids the iterative adjustment of input-layer weights and hidden biases, ELM’s training speed can be thousands of times faster than that of traditional gradient-based learning algorithms [2]. At the same time, ELM also produces good generalization performance. It has been verified that ELM can achieve generalization performance comparable to that of the typical Support Vector Machine algorithm [3].

2.2. Stochastic Gradient Boosting

The stochastic gradient boosting scheme was proposed by Friedman in [20], and it is a variant of the gradient boosting method presented in [19]. Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$, the goal is to learn a hypothesis $F(x)$ that maps $x$ to $y$ and minimizes the training loss as follows:

$$\min_{F_K} \sum_{i=1}^{N} l(F_K(x_i), y_i),$$

where $l$ is the loss function which evaluates the difference between the predicted value and the target and $K$ denotes the number of iterations. In the boosting mechanism, $K$ additive individual learners are trained sequentially by

$$f_k = \arg\min_{f} \sum_{i=1}^{N} l(F_{k-1}(x_i) + f(x_i), y_i)$$

and

$$F_k(x) = F_{k-1}(x) + f_k(x),$$

where $k = 1, 2, \ldots, K$. The optimization problem depends heavily on the loss function and becomes intractable when $l$ is complex. Gradient boosting instead constructs the weak individuals based on the pseudo residuals, which are the negative gradients of the loss function with respect to the model values predicted at the current learning step. For instance, let $r_{ik}$ be the pseudo residual of the $i$th sample at the $k$th iteration, written as

$$r_{ik} = -\left[\frac{\partial l(F(x_i), y_i)}{\partial F(x_i)}\right]_{F = F_{k-1}},$$

and thus the $k$th weak learner is trained by

$$f_k = \arg\min_{f} \sum_{i=1}^{N} \left(r_{ik} - f(x_i)\right)^2.$$

As gradient boosting constructs an additive ensemble model by sequentially fitting a weak individual learner to the current pseudo residuals of the whole training dataset at each iteration, it costs much training time and may suffer from overfitting. In view of that, a minor modification named stochastic gradient boosting was proposed to incorporate some randomization into the procedure. Specifically, at each iteration a randomly selected subset instead of the full training dataset is used to fit the individual learner and compute the model update for the current iteration. Namely, let $\{\pi(i)\}_{i=1}^{N}$ be a random permutation of the integers $\{1, \ldots, N\}$; the subset with size $\tilde{N} < N$ of the entire training dataset is then given by $\{(x_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{\tilde{N}}$. Furthermore, the $k$th weak learner using the stochastic gradient boosting ensemble scheme is trained by solving the following optimization problem:

$$f_k = \arg\min_{f} \sum_{i=1}^{\tilde{N}} \left(r_{\pi(i)k} - f(x_{\pi(i)})\right)^2.$$

Given the base learner $f_0$ which is trained on the initial training dataset, the final ensemble learning model constructed by the stochastic gradient boosting scheme predicts an unknown testing instance $x$ as follows:

$$F_K(x) = f_0(x) + \sum_{k=1}^{K} f_k(x).$$
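The procedure described above can be sketched as a short training loop. The sketch makes several assumptions that are not taken from the paper: the loss is $\frac{1}{2}(F(x) - y)^2$, so the pseudo residual is simply $y - F(x)$; the base learner is the constant mean of the targets; each weak learner is a small random-feature network fitted by least squares; and no shrinkage factor is applied. The names `fit_weak`, `sgb_fit`, and `sgb_predict` are hypothetical.

```python
import numpy as np

def fit_weak(X, r, n_hidden, rng):
    """Weak learner: a small random sigmoid network fitted to residuals r."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    hidden = lambda Z: 1.0 / (1.0 + np.exp(-(Z @ W + b)))
    beta = np.linalg.pinv(hidden(X)) @ r
    return lambda Z: hidden(Z) @ beta

def sgb_fit(X, y, n_iter, subset_size, n_hidden, rng):
    f0 = float(np.mean(y))                  # constant base learner (assumption)
    F = np.full(len(y), f0)                 # current ensemble prediction
    learners = []
    for _ in range(n_iter):
        # The first subset_size entries of a random permutation form the subset.
        idx = rng.permutation(len(y))[:subset_size]
        r = y[idx] - F[idx]                 # pseudo residuals for 1/2 (F - y)^2
        f_k = fit_weak(X[idx], r, n_hidden, rng)
        learners.append(f_k)
        F = F + f_k(X)                      # additive update on the full set
    return f0, learners

def sgb_predict(X, f0, learners):
    return f0 + sum(f(X) for f in learners)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2    # toy target (hypothetical data)
f0, learners = sgb_fit(X, y, n_iter=20, subset_size=150, n_hidden=5, rng=rng)
mse_start = float(np.mean((y - f0) ** 2))
mse_end = float(np.mean((y - sgb_predict(X, f0, learners)) ** 2))
print(f"MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

Each weak learner sees only half of the data here, which is the source of both the speed-up and the extra diversity between learners.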

Stochastic gradient boosting can also be regarded as a special line search optimization algorithm, which makes the newly added individual learner fit the steepest descent direction of the partial training loss at each learning step.

3. Stochastic Gradient Boosting-Based Extreme Learning Machine (SGB-ELM)

SGB-ELM is a novel hybrid learning algorithm, which introduces the stochastic gradient boosting method into the ELM ensemble procedure. As the boosting mechanism focuses on gradually reducing the training residuals at each iteration and ELM is a network with many parameters (particularly for classification tasks), instead of directly combining ELM and stochastic gradient boosting, we design an enhanced training scheme to alleviate possible overfitting in our proposed SGB-ELM algorithm. The detailed implementation of SGB-ELM is presented in Algorithm 2, where the determination of the optimal output weights for each individual ELM learner is illustrated in Algorithm 1.

Algorithm 1: The determination of the optimal output-layer weights $\beta$.
Algorithm 2: SGB-ELM.

There are many existing second-order approximation methods including sequential quadratic programming (SQP) [21] and majorization-minimization algorithm (MM) [22]. SQP is an effective method for nonlinearly constrained optimization by solving quadratic subproblems. MM aims to optimize the local alternative objective which is easier to solve in comparison with the original cost function. Instead of using second-order approximation directly, SGB-ELM designs an optimization criterion for the output-layer weights of each individual ELM. In view of that, quadratic approximation is merely employed as an optimization tool in SGB-ELM.

In SGB-ELM, the key issue is to determine the optimal output-layer weights of each weak individual ELM, which is expected to further decrease the training loss while keeping a simple network structure. Consequently, we design a learning objective considering not only the fitting ability for training instances but also the complexity of our ensemble model as follows:

$$\mathcal{L} = \sum_{i=1}^{N} l(\hat{y}_i, y_i) + \lambda \sum_{k=1}^{K} \Omega(f_k),$$

where $l$ is a differentiable loss function that measures the difference between the predicted output $\hat{y}_i$ and the target value $y_i$. The second term represents the complexity of the ensemble model consisting of $K$ weak individual learners. Moreover, $\lambda$ is a regularization factor that balances the training loss against the architectural risk. The objective falls back to the traditional gradient boosting method when the regularization factor $\lambda$ is set to zero.

As for the boosting training mechanism, each individual ELM is greedily added to the current ensemble model sequentially so that it most improves our model according to (8). Specifically, let $\hat{y}_i^{(k-1)}$ be the predicted value of the $i$th instance at the $(k-1)$th iteration and $f_k$ be the $k$th weak ELM learner that needs to be incorporated into the ensemble model; then the prediction of the $i$th instance at the $k$th iteration can be written as

$$\hat{y}_i^{(k)} = \hat{y}_i^{(k-1)} + f_k(x_i).$$

In order to obtain the newly added individual ELM, we first introduce $f_k$ to the existing learned ensemble model and then minimize the following objective:

$$\mathcal{L}^{(k)} = \sum_{i=1}^{N} l\left(\hat{y}_i^{(k-1)} + f_k(x_i), y_i\right) + \lambda \sum_{j=1}^{k-1} \Omega(f_j) + \lambda \Omega(f_k),$$

where $\hat{y}_i^{(k-1)}$ is already obtained at the previous iterations. As a consequence, the complexity $\lambda \sum_{j=1}^{k-1} \Omega(f_j)$ of the learned ensemble model is a constant, and we only need to take $\Omega(f_k)$ into consideration. Removing the constant item, the objective is simplified as

$$\mathcal{L}^{(k)} = \sum_{i=1}^{N} l\left(\hat{y}_i^{(k-1)} + f_k(x_i), y_i\right) + \lambda \Omega(f_k).$$

Stochastic gradient boosting selects a random subset with size $\tilde{N}$ of the whole training set to fit the individual learner at each iteration. Namely, let $\{\pi(i)\}_{i=1}^{N}$ be a random permutation of the integers $\{1, \ldots, N\}$; then we can define a stochastic subset as $\{(x_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{\tilde{N}}$. Accordingly, the objective using stochastic gradient boosting is transformed as

$$\tilde{\mathcal{L}}^{(k)} = \sum_{i=1}^{\tilde{N}} l\left(\hat{y}_{\pi(i)}^{(k-1)} + f_k(x_{\pi(i)}), y_{\pi(i)}\right) + \lambda \Omega(f_k).$$

We use second-order approximation to optimize the above learning objective, where the loss function is expanded by Taylor expansion as follows:

$$l\left(\hat{y}_i^{(k-1)} + f_k(x_i), y_i\right) \approx l\left(\hat{y}_i^{(k-1)}, y_i\right) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^{2}(x_i),$$

where $i$ is the new index for $\pi(i)$ in the randomly generated subset,

$$g_i = \frac{\partial\, l\left(\hat{y}_i^{(k-1)}, y_i\right)}{\partial\, \hat{y}_i^{(k-1)}}$$

is the first-order gradient statistics on the loss function with respect to the current predicted output $\hat{y}_i^{(k-1)}$, and

$$h_i = \frac{\partial^{2} l\left(\hat{y}_i^{(k-1)}, y_i\right)}{\partial \left(\hat{y}_i^{(k-1)}\right)^{2}}$$

is the second-order gradient statistics on the loss function with respect to the current predicted output $\hat{y}_i^{(k-1)}$. Thanks to this approximation of the training loss, we can provide a general solution scheme regardless of the specific type of loss function. In addition, second-order optimization tends to achieve better convergence in comparison with the traditional gradient method [23].

Since $l(\hat{y}_i^{(k-1)}, y_i)$ is a fixed value, the objective can be further expressed as

$$\tilde{\mathcal{L}}^{(k)} = \sum_{i=1}^{\tilde{N}} \left[ g_i f_k(x_i) + \frac{1}{2} h_i f_k^{2}(x_i) \right] + \lambda \Omega(f_k).$$

Collecting the first-order and second-order gradient statistics of the subset into matrices, the objective can be rewritten in a matrix form. The $k$th individual learner $f_k$ is a basic ELM model, which randomly selects its input-layer weights and hidden biases. Given the subset, the output of $f_k$ can be expressed as $H\beta$, where $H$ is the hidden-layer output matrix and $\beta$ is the output-layer weight matrix that needs to be determined. As Bartlett [24] pointed out, networks tend to generalize better with not only a small training error but also a small norm of weights ($\|\beta\|$), so we use the L2-norm of $\beta$ to evaluate the complexity $\Omega(f_k)$ of a basic ELM model. Substituting both expressions into the objective gives the conclusive objective (30). From (30), we can find that the objective depends only on $\beta$ at the $k$th iteration. Taking the partial derivative of the objective with respect to each element of $\beta$, we obtain the derivation formula (32). It turns out that $\beta$ is difficult to calculate analytically from this formula. Since our regularized objective tends to generate an ensemble model employing predictive as well as simple hypotheses, (32) can instead be used as an optimization criterion. Specifically, we take the output-layer weights determined on the pseudo-residuals dataset as an initial heuristic item and then obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. Algorithm 1 illustrates how the optimal output weight matrix is determined, and the detailed implementation of SGB-ELM is presented in Algorithm 2.
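The second-order approximation used in the derivation above can be checked numerically. The following sketch computes the first-order and second-order gradient statistics for two losses and compares the quadratic approximation with the true loss after a step $f_k(x_i)$. The squared loss $(\hat{y} - y)^2$ matches the loss used later for regression; the logistic loss is included only as an illustration of a non-quadratic loss and is not used in the paper. All function names are hypothetical.

```python
import math

def sq_loss(y_hat, y):
    return (y_hat - y) ** 2

def sq_grad(y_hat, y):          # first-order statistic g
    return 2.0 * (y_hat - y)

def sq_hess(y_hat, y):          # second-order statistic h (constant)
    return 2.0

def logistic_loss(y_hat, y):    # y in {-1, +1}; illustration only
    return math.log(1.0 + math.exp(-y * y_hat))

def logistic_grad(y_hat, y):
    return -y / (1.0 + math.exp(y * y_hat))

def logistic_hess(y_hat, y):
    p = 1.0 / (1.0 + math.exp(-y * y_hat))
    return p * (1.0 - p)

def second_order(loss, grad, hess, y_hat, y, step):
    """Quadratic Taylor approximation of loss(y_hat + step, y) around y_hat."""
    return loss(y_hat, y) + grad(y_hat, y) * step + 0.5 * hess(y_hat, y) * step ** 2

# Squared loss: the expansion is exact because the loss is quadratic.
sq_exact = sq_loss(0.3 + 0.25, 1.0)
sq_approx = second_order(sq_loss, sq_grad, sq_hess, 0.3, 1.0, 0.25)

# Logistic loss: the expansion is only approximate, with a small error
# for a small step.
lg_exact = logistic_loss(0.3 + 0.1, 1.0)
lg_approx = second_order(logistic_loss, logistic_grad, logistic_hess, 0.3, 1.0, 0.1)
print(sq_exact, sq_approx, abs(lg_exact - lg_approx))
```

Because only `g` and `h` enter the approximated objective, swapping the loss function changes these two statistics and nothing else in the training loop.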

In Algorithm 2, all the input weights and hidden biases of individual ELMs are randomly chosen within a fixed range. For boosting-based ensemble methods, the initial base learner is expected to be enhanced by adding weak individual learners to the current ensemble model step by step. In view of that, a high-precision initial base learner might negatively affect the effectiveness of the ensemble. In order to control the fitting ability of the initial base learner and meanwhile reduce the instability brought by the random determination of the input weights and hidden biases, SGB-ELM conducts multiple random initializations for the parameters in the initial base learner $f_0$ and takes the average of the results. For instance, we take the average of 100 random initializations. For the weak individual ELMs, each of which plays a smaller role in the whole ensemble model, random initialization of parameters instead increases the diversity between weak individual learners.
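One possible reading of this averaging step is sketched below, under the assumption that the average is taken over the predictions of independently initialized ELMs; the text does not state whether predictions or parameters are averaged. The helper names, the sampling range $[-1, 1]$, and the toy data are hypothetical.

```python
import numpy as np

def elm_train_predict(X, Y, X_new, n_hidden, rng):
    """Train one randomly initialized ELM on (X, Y) and predict on X_new."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    hidden = lambda Z: 1.0 / (1.0 + np.exp(-(Z @ W + b)))
    beta = np.linalg.pinv(hidden(X)) @ Y
    return hidden(X_new) @ beta

def averaged_base_learner(X, Y, X_new, n_hidden, n_init, rng):
    """Average the predictions of n_init independently initialized ELMs."""
    preds = np.stack([elm_train_predict(X, Y, X_new, n_hidden, rng)
                      for _ in range(n_init)])
    return preds.mean(axis=0), preds

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
Y = np.sin(3.0 * X[:, :1]) + X[:, 1:] ** 2     # toy target, shape (100, 1)
f0_pred, all_preds = averaged_base_learner(X, Y, X, n_hidden=3, n_init=100, rng=rng)
mse_avg = float(np.mean((f0_pred - Y) ** 2))
mse_each = np.mean((all_preds - Y) ** 2, axis=(1, 2))   # one MSE per initialization
print(f"averaged learner MSE: {mse_avg:.4f}, mean individual MSE: {mse_each.mean():.4f}")
```

By convexity of the squared error, the averaged prediction is never worse than the mean of the individual errors, and it varies far less from run to run than any single random initialization.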

4. Performance Validation

In this section, a series of experiments are conducted to validate the feasibility and effectiveness of our proposed SGB-ELM algorithm, and meanwhile we compare the generalization performance and predicted stability of several typical ensemble learning methods (EN-ELM [5], V-ELM [6], Bagging [14], and Adaboost [15]) on 4 KEEL [25] regression and 5 UCI [26] classification datasets. Among all the above-mentioned ensemble methods, the basic ELM model proposed in [2] is used as the individual learner, where the sigmoid function is selected as the activation function. All the experiments are carried out on a desktop computer with Win10 operating system, Intel (R) i5-4590 3.30GHz CPU, and 12GB memory and implemented with Matlab 9.0 version. Meanwhile, all the experimental results are the average of 50 repeated trials. The experiments are generally divided into two parts: one part is to evaluate the performance of SGB-ELM, and the other part is to measure the effect of learning parameters on training SGB-ELM algorithm.

4.1. Performance Evaluation of SGB-ELM

For regression problems, the performance of SGB-ELM and of the other comparative algorithms is measured by the Root Mean Square Error (RMSE), which reveals the difference between the predicted value and the target. Additionally, in this paper, we take the squared loss as our loss function in the SGB-ELM algorithm for regression tasks. Suppose $\hat{y}_i$ and $y_i$ are the predicted value and the target of the $i$th instance, respectively; the loss function is given by

$$l(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2.$$

Since V-ELM and EN-ELM are designed for classification applications, we compare the generalization capability of SGB-ELM with the basic ELM, simple ensemble ELM, Bagging ELM, and Adaboost ELM in regression tasks. Among them, simple ensemble ELM can be considered a variant of the V-ELM method, which trains a number of individual ELMs independently and takes the simple average of all the predictions. Adaboost ELM is implemented by the Adaboost.R2 method [27], which applies the original Adaboost algorithm designed for classification tasks [15] to the regression field. Furthermore, to train each individual learner in Adaboost.R2 ELM, we resample the original training dataset rather than assigning a weight to every instance.
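As a small worked example of the evaluation metric, the snippet below computes the squared loss of single predictions and the RMSE over a set of predictions. The function names and the three sample values are made up for illustration.

```python
import math

def squared_loss(y_hat, y):
    """Squared loss of a single prediction."""
    return (y_hat - y) ** 2

def rmse(y_pred, y_true):
    """Root Mean Square Error over paired predictions and targets."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    total = sum(squared_loss(p, t) for p, t in zip(y_pred, y_true))
    return math.sqrt(total / len(y_true))

# Two exact predictions and one that is off by 2: RMSE = sqrt(4 / 3).
value = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(value, 4))
```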

The performance of the traditional ELM, simple ensemble ELM, Bagging ELM, Adaboost ELM, and our proposed SGB-ELM is compared on 4 representative regression datasets, which are selected from the KEEL [25] repository. All the inputs of each dataset are normalized before training. The characteristics of these datasets are summarized in Table 1, where each original dataset is divided into a training set and a testing set. In our regression experiments, the number of hidden nodes is selected from a set of candidate values for each dataset, and the remaining parameters of SGB-ELM are set to fixed values. The settings of the other comparative algorithms can be found in Table 2. Figure 1 shows the training and testing RMSE of the different learning methods during 50 trials on the Friedman dataset. The detailed comparison results between SGB-ELM and the other learning algorithms on the 4 regression benchmark datasets are shown in Table 2. Furthermore, we compare the training and testing performance of SGB-ELM with that of Adaboost.R2 with regard to the number of iterations on the Mortgage dataset, which is presented in Figure 3(a).

Table 1: Details of 4 KEEL regression datasets.
Table 2: The comparison results between SGB-ELM and other representative algorithms on 4 regression datasets.
Figure 1: The training and testing performance of ELM, simple ensemble, Bagging, Adaboost.R2, and SGB-ELM during 50 trials on the Friedman dataset.

As for classification problems, like other typical feedforward neural networks (for instance, BP neural networks [28]), SGB-ELM evaluates the predicted output by calculating the sum of squared errors. Specifically, let $\hat{y}_i$ be the predicted output vector and $y_i$ be the target encoded by the One-Hot scheme [29] of the $i$th sample, respectively; we define the loss function in SGB-ELM for classification as follows:

$$l(\hat{y}_i, y_i) = \|\hat{y}_i - y_i\|_2^2.$$

SGB-ELM therefore aims at reducing the training RMSE step by step for classification problems. Accordingly, we compare SGB-ELM with several representative ensemble learning methods including V-ELM, EN-ELM, Bagging ELM, and Adaboost ELM. Among them, V-ELM and EN-ELM have been briefly summarized in Section 1, and Adaboost ELM is implemented by the Adaboost.SAMME method [30], which extends the original Adaboost designed for binary classification to multiclass classification problems.
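The One-Hot target encoding and the sum-of-squared-errors loss can be illustrated as follows. Reading the predicted class off the output vector with an argmax is an assumption made for this example (the text does not spell out the decision rule), and the function names and sample values are hypothetical.

```python
def one_hot(label, n_classes):
    """Encode an integer class label as a One-Hot target vector."""
    vec = [0.0] * n_classes
    vec[label] = 1.0
    return vec

def squared_error(y_hat, y):
    """Sum of squared errors between an output vector and its target."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))

def predicted_class(y_hat):
    """Index of the largest output (assumed decision rule)."""
    return max(range(len(y_hat)), key=lambda c: y_hat[c])

target = one_hot(2, 3)                  # class 2 of 3 -> [0.0, 0.0, 1.0]
output = [0.1, 0.2, 0.7]                # hypothetical network output
loss = squared_error(output, target)    # 0.01 + 0.04 + 0.09
print(target, round(loss, 2), predicted_class(output))
```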

Similarly, we select 5 popular classification datasets from the UCI Machine Learning Repository [26] to verify the performance of our proposed SGB-ELM algorithm. For each dataset, all the decision attributes are encoded by the One-Hot scheme [29]. The characteristics of these datasets are described in Table 3, where each original dataset is equally divided into a training set and a testing set. The number of hidden nodes is again selected from a set of candidate values for each dataset, and the remaining parameters of SGB-ELM are set to fixed values. The cross-validation is tenfold in EN-ELM. The number of individual ELMs in the ensemble is 7 in V-ELM. Other settings can be found in Table 4. Figure 2 shows the training and testing accuracy of the different algorithms during 50 trials on the Segmentation dataset. The detailed performance of SGB-ELM in comparison with the other learning algorithms on the 5 classification benchmark datasets is summarized in Table 4. Lastly, the training and testing accuracy of SGB-ELM and Adaboost.SAMME with regard to the number of iterations on the Segmentation dataset are presented in Figure 3(b).

Table 3: Details of 5 UCI classification datasets.
Table 4: The comparison results between SGB-ELM and other representative algorithms on 5 classification datasets.
Figure 2: The training and testing performance of V-ELM, EN-ELM, Bagging, Adaboost.SAMME, and SGB-ELM during 50 trials on the Segmentation dataset.
Figure 3: The training and testing RMSE/accuracy of SGB-ELM with regard to the number of iterations in comparison with Adaboost method.

Tables 2 and 4 present the comparison results including training time, training RMSE/accuracy, and testing RMSE/accuracy for regression and classification tasks, respectively. SGB-ELM obtains better generalization capability in most cases without significantly increasing the training time. At the same time, SGB-ELM tends to have smaller training and testing deviations (Dev) than the comparative learning algorithms, which validates the robustness and stability of our proposed SGB-ELM algorithm. In particular, since SGB-ELM adopts a training mechanism similar to that of Adaboost, which integrates multiple weak individual learners sequentially, the number of hidden nodes is set to a smaller value in both SGB-ELM and Adaboost. It is worth noting that SGB-ELM can achieve better performance than the existing methods with fewer hidden nodes and outperforms Adaboost with the same number of hidden nodes.

From Figures 1 and 2, we can find that SGB-ELM is more stable than the traditional ELM, simple ensemble, Bagging, and Adaboost.R2 in regression problems and also produces better robustness than V-ELM, EN-ELM, Bagging, and Adaboost.SAMME in classification problems. SGB-ELM not only focuses on reducing the prediction bias like other boosting-like methods, but also generates a robust ensemble model with a low variance. As observed in Figure 2, although Adaboost.SAMME generates higher training accuracy than SGB-ELM during most of the 50 trials, SGB-ELM obtains better generalization capability (testing accuracy). This can be explained by two reasons: first, we introduce a regularization term (L2-norm) into the learning objective to control the complexity of our ensemble learning model; second, a randomly selected subset rather than the whole training dataset is used to minimize the training loss at each iteration of our proposed SGB-ELM algorithm.

Figure 3 shows the training RMSE/accuracy and testing RMSE/accuracy of Adaboost (Adaboost.R2 for regression and Adaboost.SAMME for classification) and SGB-ELM with regard to the number of iterations. The fixed reference line denotes the training and testing performance of a traditional ELM, which is equipped with many more hidden nodes. As shown in Figure 3, SGB-ELM clearly improves the generalization capability of the initial base ELM in both regression and classification tasks. From Figure 3(a), we can find that the training and testing RMSE decline gradually as the number of iterations increases. Similarly, both the training and testing accuracy curves show an increasing trend in Figure 3(b). Because we conduct multiple random initializations for the parameters in the initial base learner $f_0$ and take the average, the fitting ability of $f_0$ is artificially weakened to some extent. As a result, the initial training and testing performance of SGB-ELM is much worse than that of Adaboost. Both SGB-ELM and Adaboost outperform the traditional ELM equipped with more hidden nodes after a small number of learning steps. Furthermore, we can find that SGB-ELM produces better performance than Adaboost after only 5 iterations in regression tasks and 10 iterations in classification tasks. This supports the fast convergence of the second-order optimization method incorporated into the procedure of SGB-ELM.

From the experimental results of both regression and classification problems, we can conclude that our proposed SGB-ELM algorithm can not only achieve better generalization capability (low prediction bias) than the typical existing variants of ELM, but also obtain a sufficiently robust ELM ensemble learning model (low prediction variance).

4.2. Impact of Learning Parameters on Training SGB-ELM

To achieve good generalization performance, three learning parameters of SGB-ELM, namely the number of hidden nodes $L$, the regularization factor $\lambda$, and the size of the subset $\tilde{N}$, need to be chosen appropriately. In this section, we evaluate the impact of these learning parameters on training the SGB-ELM algorithm and provide some empirical guidance for choosing them.

For the basic ELM model, the number of hidden nodes $L$ decides the model's capacity. In other words, an ELM with more hidden nodes is more complex and can deal with more training instances. However, it tends to overfit when $L$ is set too large. The regularization factor $\lambda$ balances the training loss against the complexity of the model, which means that $\lambda$ can control the capacity or the complexity of our model. The size of the subset $\tilde{N}$ represents the number of training instances used at each iteration, and it introduces some randomization into the training procedure of SGB-ELM. Firstly, we use the grid-search method to observe the training and testing performance of SGB-ELM with different $L$ and $\lambda$, varying both over a grid of values while keeping $\tilde{N}$ fixed. The training and testing performance of SGB-ELM with regard to the combination $(L, \lambda)$ on the Spambase dataset is shown in Figure 4. Secondly, as we empirically find that the optimal $\tilde{N}$ depends much on the size of the training dataset, we conduct two experiments (one on a small dataset and one on a large dataset) to measure the impact of $\tilde{N}$ on training SGB-ELM. We choose the optimal value of $(L, \lambda)$ according to the grid-search results and vary $\tilde{N}$. Figure 5 shows the training and testing performance of SGB-ELM with different sampling fractions ($\tilde{N}/N$) on the Wizmir and Spambase datasets.

Figure 4: The training and testing performance of SGB-ELM with different combinations of $(L, \lambda)$ on the Spambase dataset.
Figure 5: The training and testing performance of SGB-ELM with different sampling fractions ($\tilde{N}/N$) on the Wizmir and Spambase datasets.

As shown in Figure 4, changing the values of $L$ and $\lambda$ has a significant effect on the training and testing accuracy of the SGB-ELM algorithm. SGB-ELM with excess hidden nodes is more likely to overfit when the regularization factor $\lambda$ is set to a small value. Figure 4 also demonstrates that SGB-ELM can effectively reduce overfitting when $\lambda$ is assigned a proper value. In addition, SGB-ELM achieves better performance with enough hidden nodes and a proper $\lambda$. The explanation is that although SGB-ELM with a small number of hidden nodes intuitively avoids overfitting, it is also hindered from fitting the current training residuals appropriately.

From Figure 5, it is clear that randomization improves the performance of SGB-ELM substantially. As each weak individual ELM is learned on a randomly selected subset of the whole training dataset, the diversity between the individuals increases. On the other hand, randomization introduces a noisy estimate of the total training loss. As a result, it slows down the convergence and even makes the learning curve fluctuate (higher variance) if the sampling fraction $\tilde{N}/N$ is too small. The best value of the sampling fraction differs between the Wizmir dataset and the Spambase dataset, and in both cases there is a clear improvement in testing performance compared with no sampling at all. Since the optimal values of $\tilde{N}$ are different on the Wizmir and Spambase datasets, the sampling fraction ($\tilde{N}/N$) should be determined based on the specific learning task and assigned a bigger value on training datasets containing more instances.

5. Conclusions

In this paper, we proposed a novel ensemble model named Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM). Instead of combining ELM and stochastic gradient boosting primitively, we construct an ELM flow or ELM sequence where the output-layer weights of each weak ELM are determined by optimizing the regularized objective additively. Firstly, by minimizing the objective using second-order approximation, the derivation formula aimed at solving the output-layer weights of each individual ELM is determined. Then we take the output-layer weights learned by the current pseudo residuals as a heuristic item and thus obtain the optimal output-layer weights by updating the heuristic item iteratively. The performance of SGB-ELM was evaluated on 4 regression and 5 classification datasets. In comparison with several typical ELM ensemble methods, SGB-ELM obtained better performance and robustness, which demonstrated the feasibility and effectiveness of SGB-ELM algorithm.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Authors’ Contributions

Hua Guo and Jikui Wang contributed equally to this work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61503252 and 61473194), the China Postdoctoral Science Foundation (2016T90799), and the Natural Science Foundation of Gansu Province (17JR5RA177).

References

  1. G. B. Huang, Q. Y. Zhu, and C. K. Siew, “Extreme learning machine: a new learning scheme of feedforward neural networks,” in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985–990, July 2004.
  2. G. B. Huang, Q. Y. Zhu, and C. K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.
  3. G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.
  4. R. Penrose, “A generalized inverse for matrices,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 51, no. 3, pp. 406–413, 1955.
  5. N. Liu and H. Wang, “Ensemble based extreme learning machine,” IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754–757, 2010.
  6. J. Cao, Z. Lin, G.-B. Huang, and N. Liu, “Voting based extreme learning machine,” Information Sciences, vol. 185, pp. 66–77, 2012.
  7. X. Xue, M. Yao, Z. Wu, and J. Yang, “Genetic ensemble of extreme learning machine,” Neurocomputing, vol. 129, pp. 175–184, 2014.
  8. A. O. M. Abuassba, D. Zhang, X. Luo, A. Shaheryar, and H. Ali, “Improving Classification Performance through an Advanced Ensemble Based Heterogeneous Extreme Learning Machines,” Computational Intelligence and Neuroscience, vol. 2017, 2017.
  9. M. Han and B. Liu, “Ensemble of extreme learning machine for remote sensing image classification,” Neurocomputing, vol. 149, pp. 65–70, 2015.
  10. H.-J. Lu, C.-L. An, E.-H. Zheng, and Y. Lu, “Dissimilarity based ensemble of extreme learning machine for gene expression data classification,” Neurocomputing, vol. 128, pp. 22–30, 2014.
  11. B. Mirza, Z. Lin, and N. Liu, “Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift,” Neurocomputing, vol. 149, pp. 316–329, 2015.
  12. D. Wang and M. Alhamdoosh, “Evolutionary extreme learning machine ensembles with size control,” Neurocomputing, vol. 102, pp. 98–110, 2013.
  13. X.-Z. Wang, R. Wang, H.-M. Feng, and H.-C. Wang, “A new approach to classifier fusion based on upper integral,” IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 620–635, 2014.
  14. L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
  15. Y. Freund and R. Schapire, “A short introduction to boosting,” Journal of Japanese Society For Artificial Intelligence, vol. 14, pp. 771–780, 1999.
  16. J. B. Hampshire and A. H. Waibel, “Novel objective function for improved phoneme recognition using time-delay neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 1, no. 2, pp. 216–228, 1990.
  17. Q. Xu, Y. Xiong, H. Dai et al., “PDC-SGB: Prediction of effective drug combinations using a stochastic gradient boosting algorithm,” Journal of Theoretical Biology, vol. 417, pp. 1–7, 2017.
  18. X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol. 28, no. 3, pp. 417–425.
  19. J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
  20. J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
  21. P. T. Boggs and J. W. Tolle, “Sequential Quadratic Programming,” Acta Numerica, vol. 4, pp. 1–51, 1995.
  22. M. A. Figueiredo, J. M. Bioucas-Dias, and R. D. Nowak, “Majorization-minimization algorithms for wavelet-based image restoration,” IEEE Transactions on Image Processing, vol. 16, no. 12, pp. 2980–2991, 2007.
  23. R. Battiti, “First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method,” Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.
  24. P. L. Bartlett, “The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525–536, 1998.
  25. J. Alcalá-Fdez, A. Fernández, J. Luengo et al., “KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
  26. M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, 2013.
  27. H. Drucker, “Improving regressors using boosting techniques,” in Proceedings of the International Conference on Machine Learning, vol. 97, pp. 107–115, 1997.
  28. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
  29. A. Coates and A. Y. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in Proceedings of the 28th International Conference on Machine Learning (ICML '11), pp. 921–928, July 2011.
  30. J. Zhu, H. Zou, S. Rosset, and T. Hastie, “Multi-class AdaBoost,” Statistics and Its Interface, vol. 2, no. 3, pp. 349–360, 2009.