Research Article  Open Access
Online Regularized and Kernelized Extreme Learning Machines with Forgetting Mechanism
Abstract
To apply single hidden-layer feedforward neural networks (SLFN) to the identification of time-varying systems, this paper presents an online regularized extreme learning machine (ELM) with forgetting mechanism (FORELM) and an online kernelized ELM with forgetting mechanism (FOKELM). The FORELM updates the output weights of the SLFN recursively by using the Sherman-Morrison formula, and it combines the advantages of the online sequential ELM with forgetting mechanism (FOSELM) and the regularized online sequential ELM (ReOSELM); that is, it can capture the latest properties of the identified system by studying a certain number of the newest samples and can also avoid the problem of ill-conditioned matrix inversion by regularization. The FOKELM tackles the problem of matrix expansion of the kernel-based incremental ELM (KBIELM) by deleting the oldest sample according to the block matrix inverse formula when samples arrive continually. The experimental results show that the proposed FORELM and FOKELM have better stability than FOSELM and higher accuracy than ReOSELM in nonstationary environments; moreover, FORELM and FOKELM have a time-efficiency superiority over the dynamic regression extreme learning machine (DRELM) under certain conditions.
1. Introduction
Plenty of research work has shown that single hidden-layer feedforward neural networks (SLFN) can approximate any function and form decision boundaries with arbitrary shapes if the activation function is chosen properly [1–3]. However, most traditional approaches (such as the BP algorithm) for training SLFN are slow due to their iterative steps. To train SLFN fast, Huang et al. proposed a learning algorithm called the extreme learning machine (ELM), which randomly assigns the hidden node parameters (the input weights and hidden-layer biases of additive networks, or the centers and impact factors of RBF networks) and then determines the output weights by the Moore-Penrose generalized inverse [4, 5]. The original ELM is a batch learning algorithm.
For some practical fields where the training data are generated gradually, online sequential learning algorithms are preferred over batch learning algorithms, as sequential learning algorithms do not require retraining whenever a new sample is received. Hence, Liang et al. developed a kind of online sequential ELM (OSELM) using recursive least squares [6]. OSELM for SLFN produces better generalization performance at faster learning speed compared with previous sequential learning algorithms. Moreover, for time-varying environments, several incremental sequential ELMs have recently been presented; they apply a constant or adaptive forgetting factor [7, 8] or an iteration approach [9] to strengthen new samples' contribution to the model. Theoretically speaking, they cannot thoroughly eliminate old samples' effect on the model. To let ELM study the latest properties of the identified object, Zhao et al. developed the online sequential ELM with forgetting mechanism (FOSELM) [10]. The fixed-memory extreme learning machine (FMELM) of Zhang and Wang [11] can be thought of as a special case of FOSELM with the corresponding parameter in [10] being 1. Although experimental results show FOSELM has higher accuracy [10], it may encounter the matrix singularity problem and run unstably.
As a variant of ELM, the regularized ELM (RELM) [12–14], which is mathematically equivalent to the constrained optimization based ELM [15, 16] and absorbs the idea of structural risk minimization from statistical learning theory [17], can overcome the overfitting problem of ELM and provides better generalization ability than the original ELM when noises or outliers exist in the dataset [12]. Furthermore, the regularized OSELM (ReOSELM) developed by Huynh and Won [13], which is essentially equivalent to the sequential regularized ELM (SRELM) [14] and the least square incremental ELM (LSIELM) [18], can avoid the singularity problem.
If the feature mapping in the SLFN is unknown to users, the kernel-based ELM (KELM) can be constructed [15, 16]. For applications where samples arrive gradually, Guo et al. developed the kernel-based incremental ELM (KBIELM) [18].
However, in time-varying or nonstationary applications, the newer training data usually carry more information about the system, and the older ones may carry less, or even misleading, information; that is, the training samples usually have timeliness. ReOSELM and KBIELM cannot reflect the timeliness of sequential training data well. On the other hand, if a huge number of samples emerge, the storage space required by KBIELM for the kernel matrix will increase without bound as learning goes on and new samples arrive ceaselessly, and at last storage overflow will necessarily happen, so KBIELM cannot be utilized at all under these circumstances.
In this paper, we combine the advantages of FOSELM and ReOSELM and propose the online regularized ELM with forgetting mechanism (FORELM) for time-varying applications. FORELM can overcome the potential matrix singularity problem by using regularization and eliminate the effects of outdated data on the model by incorporating the forgetting mechanism. As in FOSELM, the ensemble technique may also be employed in FORELM to enhance its stability; that is, FORELM comprises several ReOSELMs with forgetting mechanism, each of which trains a SLFN, and the average of their outputs represents the final output of the ensemble of these SLFNs. Additionally, the forgetting mechanism is also incorporated into KBIELM, and the online kernelized ELM with forgetting mechanism (FOKELM) is presented, which can deal with the problem of matrix expansion in KBIELM. The designed FORELM and FOKELM update their models recursively. The experimental results show the better performance of the FORELM and FOKELM approaches in nonstationary environments.
It should be noted that our methods adjust the output weights of the SLFN upon the addition and deletion of samples one by one, namely, learning and forgetting samples sequentially, while the network architecture is fixed. They are completely different from the offline incremental ELMs (IELM) [19–21] and the incremental RELM [22], which seek an optimal network architecture by adding hidden nodes one by one and learning the data in batch mode.
The rest of this paper is organized as follows. Section 2 gives a brief review of the basic concepts and related works of ReOSELM and KBIELM. Section 3 proposes new online learning algorithms, namely, FORELM and FOKELM. Performance evaluation is conducted in Section 4. Conclusions are drawn in Section 5.
2. Brief Review of the ReOSELM and KBIELM
For simplicity, the ELM based learning algorithm for SLFN with multiple inputs and a single output is discussed.
The output of a SLFN with L hidden nodes (additive or RBF nodes) can be represented by

f(x) = Σ_{i=1}^{L} β_i G(a_i, b_i, x) = h(x)β,  (1)

where a_i and b_i are the learning parameters of the hidden nodes, β = [β_1, β_2, ..., β_L]^T is the vector of the output weights, and G(a_i, b_i, x) denotes the output of the ith hidden node with respect to the input x, that is, the activation function. h(x) = [G(a_1, b_1, x), ..., G(a_L, b_L, x)] is a feature mapping from the n-dimensional input space to the L-dimensional hidden-layer feature space. In ELM, a_i and b_i are randomly determined first.
For a given set of N distinct training data {(x_i, t_i)}_{i=1}^{N}, where x_i is an n-dimensional input vector and t_i is the corresponding scalar observation, the RELM, that is, the constrained optimization based ELM, can be formulated as

Minimize: (1/2)||β||^2 + (C/2) Σ_{i=1}^{N} e_i^2,  (2a)
Subject to: h(x_i)β = t_i − e_i, i = 1, ..., N,  (2b)

where e_i denotes the training error of the ith sample. T = [t_1, ..., t_N]^T indicates the target values of all the samples. H = [h(x_1)^T, ..., h(x_N)^T]^T is the N × L mapping matrix for the inputs of all the samples. C is the regularization parameter (a positive constant).
Based on the KKT theorem, the constrained optimization of (2a) and (2b) can be transferred to the following dual optimization problem:

L_D = (1/2)||β||^2 + (C/2) Σ_{i=1}^{N} e_i^2 − Σ_{i=1}^{N} α_i (h(x_i)β − t_i + e_i),  (3)

where α = [α_1, ..., α_N]^T is the Lagrange multipliers vector. Using the KKT optimality conditions, the following equations can be obtained:

∂L_D/∂β = 0 → β = H^T α,
∂L_D/∂e_i = 0 → α_i = C e_i,  (4)
∂L_D/∂α_i = 0 → h(x_i)β − t_i + e_i = 0.

Ultimately, β can be obtained as follows [12, 15, 16]:

β = (I/C + H^T H)^{-1} H^T T,  (5a)
β = H^T (I/C + H H^T)^{-1} T.  (5b)
In order to reduce computational costs, when N > L, one may prefer to apply solution (5a), which inverts an L × L matrix, and when N < L, one may prefer to apply solution (5b), which inverts an N × N matrix.
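The equivalence of the two solutions and the cost trade-off can be illustrated with a small NumPy sketch (an illustration under our own naming, not code from the paper); here H is the N × L mapping matrix and T the target vector:

```python
import numpy as np

def relm_beta_primal(H, T, C):
    """Solution (5a): beta = (I/C + H^T H)^(-1) H^T T -- inverts an L x L matrix."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

def relm_beta_dual(H, T, C):
    """Solution (5b): beta = H^T (I/C + H H^T)^(-1) T -- inverts an N x N matrix."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
```

Both return the same output weights; one simply chooses whichever inverse is smaller.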
If the feature mapping h(x) is unknown, one can apply Mercer's condition on RELM. The kernel matrix Ω = HH^T is defined by Ω_{i,j} = h(x_i)h(x_j)^T = K(x_i, x_j). Then, the output of the SLFN by kernel-based RELM can be given as

f(x) = h(x)H^T (I/C + Ω)^{-1} T = [K(x, x_1), ..., K(x, x_N)] (I/C + Ω)^{-1} T.  (6)
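A corresponding kernel-RELM sketch (illustrative; the Gaussian kernel and the helper names are our assumptions) trains by solving (I/C + Ω)α = T and predicts with the kernel row vector of (6):

```python
import numpy as np

def gauss(x, y, gamma=1.0):
    # Gaussian kernel, assumed here purely for illustration
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kelm_train(X, T, C, kernel):
    """Solve (I/C + Omega) alpha = T, where Omega[i, j] = K(x_i, x_j)."""
    Omega = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(np.eye(len(X)) / C + Omega, T)

def kelm_predict(x, X, alpha, kernel):
    """Output of (6): [K(x, x_1), ..., K(x, x_N)] (I/C + Omega)^(-1) T."""
    return np.array([kernel(x, xj) for xj in X]) @ alpha
```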
2.1. ReOSELM
The ReOSELM, that is, SRELM or LSIELM, can be restated as follows.
For time k, let H_k and T_k denote the mapping matrix and target vector of the samples learned so far, and let P_k = (I/C + H_k^T H_k)^{-1}; then, according to (5a), the solution of RELM can be expressed as

β_k = P_k H_k^T T_k.  (7)
For time k + 1, the new sample (x_{k+1}, t_{k+1}) arrives; thus H_{k+1} = [H_k^T, h_{k+1}^T]^T and T_{k+1} = [T_k^T, t_{k+1}]^T, where h_{k+1} = h(x_{k+1}). Applying the Sherman-Morrison-Woodbury (SMW) formula [23], the current P_{k+1} and β_{k+1} can be computed as

P_{k+1} = P_k − P_k h_{k+1}^T h_{k+1} P_k / (1 + h_{k+1} P_k h_{k+1}^T),  (8a)
β_{k+1} = β_k + P_{k+1} h_{k+1}^T (t_{k+1} − h_{k+1} β_k).  (8b)
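One incremental step of ReOSELM can be sketched as follows (illustrative NumPy, with h the hidden-output row of the new sample stored as a 1-D array); starting from P = C·I and β = 0, repeated calls reproduce the batch solution (5a) exactly:

```python
import numpy as np

def reoselm_update(P, beta, h, t):
    """Fold the new sample into P = (I/C + H^T H)^(-1) and beta via (8a)-(8b)."""
    Ph = P @ h
    P = P - np.outer(Ph, Ph) / (1.0 + h @ Ph)   # (8a): Sherman-Morrison rank-1 update
    beta = beta + P @ h * (t - h @ beta)        # (8b): correct weights with the new P
    return P, beta
```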
2.2. KBIELM
For time k, let Ω_k denote the kernel matrix of the learned samples, with (Ω_k)_{i,j} = K(x_i, x_j), and let

A_k = (I/C + Ω_k)^{-1}.  (9)
For time k + 1, the new sample (x_{k+1}, t_{k+1}) arrives; thus

I/C + Ω_{k+1} = [ I/C + Ω_k    v_{k+1} ]
                [ v_{k+1}^T    d_{k+1} ],  (10)

where v_{k+1} = [K(x_1, x_{k+1}), ..., K(x_k, x_{k+1})]^T and d_{k+1} = K(x_{k+1}, x_{k+1}) + 1/C. Using the block matrix inverse formula [23], A_{k+1} can be calculated from A_k as

A_{k+1} = [ A_k + A_k v_{k+1} v_{k+1}^T A_k / s    −A_k v_{k+1} / s ]
          [ −v_{k+1}^T A_k / s                     1/s              ],  (11)

where s = d_{k+1} − v_{k+1}^T A_k v_{k+1}.
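The growth step (11) can be sketched as follows (illustrative NumPy; Ainv plays the role of A_k, v the new kernel column, and d the new corner entry):

```python
import numpy as np

def kbielm_add(Ainv, v, d):
    """Grow the inverse of I/C + Omega by one sample using the block inverse (11)."""
    Av = Ainv @ v
    s = d - v @ Av                       # Schur complement (a scalar)
    k = len(v)
    new = np.empty((k + 1, k + 1))
    new[:k, :k] = Ainv + np.outer(Av, Av) / s
    new[:k, k] = new[k, :k] = -Av / s
    new[k, k] = 1.0 / s
    return new
```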
3. The Proposed FORELM and FOKELM
When a SLFN is employed to model a time-varying system online, training samples are not only generated one by one but also often have the property of timeliness; that is, training data have a period of validity. Therefore, during the learning process of an online sequential learning algorithm, the older or outdated training data, whose effectiveness lessens or is lost after several unit times, should be abandoned; this is the idea of the forgetting mechanism [10]. ReOSELM (i.e., SRELM or LSIELM) and KBIELM cannot reflect the timeliness of sequential training data. In this section, the forgetting mechanism is added to them to eliminate the outdated data that might have a misleading or bad effect on the built SLFN. On the other hand, for KBIELM, abandoning samples prevents the matrix in (11) from expanding infinitely. The computing procedures for deleting a sample are given, and the complete online regularized ELM and kernelized ELM with forgetting mechanism are presented.
3.1. Decremental RELM and FORELM
After RELM has studied a given number of samples and the SLFN has been applied for prediction, RELM should discard the oldest sample from the sample set.
Let the oldest sample be (x_1, t_1) with h_1 = h(x_1), and partition the current quantities as

H = [h_1^T, H'^T]^T,    T = [t_1, T'^T]^T,  (12)

so that the reduced inverse satisfies P'^{-1} = I/C + H'^T H' = P^{-1} − h_1^T h_1. Furthermore, using the SMW formula,

P' = P + P h_1^T h_1 P / (1 − h_1 P h_1^T).  (13)

Moreover,

β' = β − P' h_1^T (t_1 − h_1 β).  (14)

At the next time step, P and β can be calculated from P' (viewed as P_k) and β' (viewed as β_k) according to (8a) and (8b), respectively.
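The deletion step, referenced later as (13) and (14), mirrors the incremental update with a sign change in the denominator; an illustrative NumPy sketch (the names are ours):

```python
import numpy as np

def relm_delete(P, beta, h, t):
    """Remove the oldest sample (h = h(x_1), t = t_1) via the SMW downdate (13)-(14)."""
    Ph = P @ h
    P = P + np.outer(Ph, Ph) / (1.0 - h @ Ph)   # (13): note the minus sign vs. (8a)
    beta = beta - P @ h * (t - h @ beta)        # (14): uses the downdated P
    return P, beta
```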
Suppose that FORELM consists of m ReOSELMs with forgetting mechanism, whose trained SLFNs have the same hidden-node output function and the same number L of hidden nodes. In the following FORELM algorithm, the variables and parameters with superscript (j) are relevant to the jth SLFN, which is trained by the jth ReOSELM with forgetting mechanism. Synthesizing ReOSELM and the decremental RELM, we can obtain FORELM as follows.
Step 1. Initialization:
(1) Choose the hidden output function of the SLFN with a certain activation function and the number L of hidden nodes. Set the value of the regularization parameter C.
(2) Randomly assign the hidden parameters a_i^{(j)}, b_i^{(j)}, i = 1, ..., L, j = 1, ..., m.
(3) Determine the window length W; set P^{(j)} = C·I and β^{(j)} = 0, j = 1, ..., m.
Step 2. Incrementally learn the initial W samples; that is, repeat the following procedure W times:
(1) Get the current sample (x_k, t_k).
(2) Calculate h_k^{(j)} and P_k^{(j)}: h_k^{(j)} = h^{(j)}(x_k); calculate P_k^{(j)} by (8a).
(3) Calculate β_k^{(j)}: if the sample is the first one, then β_k^{(j)} = P_k^{(j)} (h_k^{(j)})^T t_k; else calculate β_k^{(j)} by (8b).
Step 3. Online modeling and prediction: repeat the following procedure at every step:
(1) Acquire the current t_k, form the new sample (x_k, t_k), and calculate h_k^{(j)}, P_k^{(j)}, and β_k^{(j)} by (8a) and (8b).
(2) Prediction: form x_{k+1}; the output of the jth SLFN, that is, its prediction of t_{k+1}, can be calculated by (1): t̂_{k+1}^{(j)} = h^{(j)}(x_{k+1}) β_k^{(j)}; the final prediction is the ensemble average t̂_{k+1} = (1/m) Σ_{j=1}^{m} t̂_{k+1}^{(j)}.
(3) Delete the oldest sample (x_{k−W+1}, t_{k−W+1}): calculate P'^{(j)} and β'^{(j)} by (13) and (14), respectively.
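Steps 1–3 for a single ReOSELM-with-forgetting unit can be sketched as a sliding-window learner (an illustrative NumPy class under our own naming; the full FORELM would average the predictions of several such units):

```python
import numpy as np

class ForgettingReOSELM:
    """One ReOSELM with forgetting mechanism: learn the newest sample, forget the
    oldest once the window holds more than W samples; P = (I/C + H^T H)^(-1)."""
    def __init__(self, n_in, L, C, W, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.uniform(-1.0, 1.0, (L, n_in))   # random input weights
        self.b = rng.uniform(-1.0, 1.0, L)           # random biases
        self.C, self.W = C, W
        self.P = C * np.eye(L)                       # inverse with no data yet
        self.beta = np.zeros(L)
        self.window = []                             # stored (h, t) for later deletion

    def h(self, x):
        return 1.0 / (1.0 + np.exp(-(self.a @ x + self.b)))  # sigmoid hidden outputs

    def predict(self, x):
        return self.h(x) @ self.beta

    def learn(self, x, t):
        h = self.h(x)
        Ph = self.P @ h
        self.P -= np.outer(Ph, Ph) / (1.0 + h @ Ph)           # incremental step (8a)
        self.beta += self.P @ h * (t - h @ self.beta)         # incremental step (8b)
        self.window.append((h, t))
        if len(self.window) > self.W:                         # forget the oldest
            h0, t0 = self.window.pop(0)
            Ph = self.P @ h0
            self.P += np.outer(Ph, Ph) / (1.0 - h0 @ Ph)      # decremental step (13)
            self.beta -= self.P @ h0 * (t0 - h0 @ self.beta)  # decremental step (14)
```

After streaming past the window length, the weights coincide with a batch RELM fit on only the W newest samples, which is exactly the forgetting behavior described above.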
3.2. Decremental KELM and FOKELM
After KELM has studied the given number of samples and the SLFN has been applied for prediction, KELM should discard the oldest sample from the sample set.
Let the current window contain the W samples (x_1, t_1), ..., (x_W, t_W), with (x_1, t_1) being the oldest, let M = I/C + Ω be the regularized kernel matrix of these samples, and let M' denote the corresponding matrix after (x_1, t_1) is deleted; then M can be written in the following partitioned matrix form:

M = [ d_1    u^T ]
    [ u      M'  ],  (15)

where d_1 = K(x_1, x_1) + 1/C and u = [K(x_2, x_1), ..., K(x_W, x_1)]^T.
Moreover, using the block matrix inverse formula, the following equation can be obtained:

M^{-1} = [ 1/s               −u^T M'^{-1} / s                     ]
         [ −M'^{-1} u / s    M'^{-1} + M'^{-1} u u^T M'^{-1} / s  ],  (16)

where s = d_1 − u^T M'^{-1} u.
Rewrite A = M^{-1} in the partitioned matrix form as

A = [ a_{11}    b^T ]
    [ b         A_1 ].  (17)
Comparing (16) and (17), M'^{-1} can be calculated as

M'^{-1} = A_1 − b b^T / a_{11}.  (18)
At the next time step, the new inverse A can be computed from M'^{-1} (viewed as A_k) according to (11).
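Equation (18) shrinks the stored inverse by one row and column without any fresh matrix inversion; an illustrative NumPy sketch (assuming the oldest sample occupies the first row and column):

```python
import numpy as np

def kelm_delete_oldest(Ainv):
    """Given A = (I/C + Omega)^(-1) with the oldest sample first, return the
    inverse of the reduced matrix via (18): A_1 - b b^T / a11."""
    a11 = Ainv[0, 0]
    b = Ainv[1:, 0]
    return Ainv[1:, 1:] - np.outer(b, b) / a11
```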
Integrating KBIELM with the decremental KELM, we can obtain FOKELM as follows.
Step 1. Initialization: choose the kernel function K with corresponding parameter values, and determine the regularization parameter C and the window length W.
Step 2. Incrementally learn the initial W samples: calculate A: if there exists only one sample (x_1, t_1), then A = [1/(K(x_1, x_1) + 1/C)]; else calculate A by (11).
Step 3. Online modeling and prediction:
(1) Acquire the new sample (x_k, t_k) and calculate A by (11).
(2) Prediction: form x_{k+1}, and calculate the prediction of t_{k+1} by (6); namely,

t̂_{k+1} = [K(x_{k+1}, x_{k−W+1}), ..., K(x_{k+1}, x_k)] A T.  (19)

(3) Delete the oldest sample (x_{k−W+1}, t_{k−W+1}): calculate M'^{-1} by (18).
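Steps 1–3 of FOKELM can be sketched as a sliding-window kernel learner that combines the growth step (11) with the deletion step (18) (an illustrative NumPy class; the Gaussian kernel and all names are our assumptions):

```python
import numpy as np

class FOKELMSketch:
    """Sliding window of the W newest samples; Ainv tracks (I/C + Omega)^(-1)."""
    def __init__(self, C, W, gamma=1.0):
        self.C, self.W, self.gamma = C, W, gamma
        self.X, self.T = [], []
        self.Ainv = None

    def k(self, x, y):
        return np.exp(-self.gamma * np.sum((x - y) ** 2))   # Gaussian kernel

    def predict(self, x):
        kv = np.array([self.k(x, xj) for xj in self.X])
        return kv @ self.Ainv @ np.array(self.T)            # prediction as in (6)

    def learn(self, x, t):
        if not self.X:                                      # very first sample
            self.Ainv = np.array([[1.0 / (self.k(x, x) + 1.0 / self.C)]])
        else:                                               # grow the inverse, as in (11)
            v = np.array([self.k(x, xj) for xj in self.X])
            d = self.k(x, x) + 1.0 / self.C
            Av = self.Ainv @ v
            s = d - v @ Av                                  # Schur complement
            n = len(v)
            new = np.empty((n + 1, n + 1))
            new[:n, :n] = self.Ainv + np.outer(Av, Av) / s
            new[:n, n] = new[n, :n] = -Av / s
            new[n, n] = 1.0 / s
            self.Ainv = new
        self.X.append(x); self.T.append(t)
        if len(self.X) > self.W:                            # delete the oldest, as in (18)
            a11, b = self.Ainv[0, 0], self.Ainv[1:, 0]
            self.Ainv = self.Ainv[1:, 1:] - np.outer(b, b) / a11
            self.X.pop(0); self.T.pop(0)
```

Because the window never exceeds W samples, the stored inverse stays W × W, which is precisely how FOKELM avoids the matrix expansion of KBIELM.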
4. Performance Evaluation
In this section, the performance of the presented FORELM and FOKELM is verified via time-varying nonlinear process identification simulations. These simulations assess the accuracy, stability, and computational complexity of the proposed FORELM and FOKELM by comparison with FOSELM, ReOSELM (i.e., SRELM or LSIELM), and the dynamic regression extreme learning machine (DRELM) [24]. DRELM is also a kind of online sequential RELM and was designed by Zhang and Wang using solution (5b) of RELM and the block matrix inverse formula.
All the performance evaluations were executed in the MATLAB 7.0.1 environment running on Windows XP with an Intel Core i3-3220 3.3 GHz CPU and 4 GB RAM.
Simulation 1. The unknown identified system is a modified version of the one addressed in [25], obtained by changing the constant term and the coefficients of the variables to form a time-varying system, as done in [11]:
The system (20) can be expressed as follows: where is a nonlinear function and is the regression input data vector with , , and being model structure parameters. Apply SLFN to approximate (20); accordingly is the learning sample of SLFN.
Denote , . The input is set as follows: where generates random numbers which are uniformly distributed in the interval .
In all experiments, the output of a hidden node with respect to the input x of a SLFN in (1) is set to the sigmoidal additive function, that is, G(a, b, x) = 1/(1 + exp(−(a·x + b))); the components of a and the bias b, that is, the input weights and biases, are randomly chosen from a fixed interval. In FOKELM, the Gaussian kernel function is applied, namely, K(x, y) = exp(−||x − y||^2/σ^2).
The root-mean-square error (RMSE) of prediction and the maximal absolute prediction error (MAPE) are regarded as the measure indices of model accuracy and stability, respectively:

RMSE = sqrt( (1/N_t) Σ_{k=1}^{N_t} (t_k − t̂_k)^2 ),    MAPE = max_k |t_k − t̂_k|,

where t̂_k is the prediction of t_k and N_t is the number of predicted instances. Simulation is carried out for 650 instances. In model (21), the structure parameters are set to fixed values. Due to the randomness of the hidden parameters a and b during the initial stage, the simulation results vary between runs; for each approach, the results are averaged over 5 trials.
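For concreteness, the two indices can be computed as follows (a trivial NumPy sketch; note that MAPE here denotes the paper's maximal absolute prediction error, not the mean absolute percentage error):

```python
import numpy as np

def rmse(t, t_hat):
    """Root-mean-square error of the predictions (accuracy index)."""
    t, t_hat = np.asarray(t), np.asarray(t_hat)
    return np.sqrt(np.mean((t - t_hat) ** 2))

def mape(t, t_hat):
    """Maximal absolute prediction error (the stability index used in the paper)."""
    t, t_hat = np.asarray(t), np.asarray(t_hat)
    return np.max(np.abs(t - t_hat))
```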
ReOSELM does not discard any old samples; thus the window length W is not defined for it.
In offline modeling, the training sample set is fixed; thus one may search for relatively optimal values of the model parameters of the ELMs. Nevertheless, in online modeling for a time-varying system, the training sample set keeps changing, and it is difficult to choose optimal parameter values in practice. Therefore, we manually set the same parameter of these ELMs to the same value (for example, their parameters are all set to 250) and then compare their performances.
RMSE and MAPE of the proposed ELMs and the other aforementioned ELMs are listed in Tables 1 and 2, respectively, and the corresponding running times (i.e., training time plus prediction time) of these ELMs are given in Table 3.
 
“—” represents nondefinition or inexistence in the case. “×” represents nullification owing to the too large RMSE or MAPE. 


From Tables 1–3, one can see the following results.

(1) RMSE and MAPE of FORELM are smaller than those of FOSELM with the same L and W values. The reason is that the matrix H^T H in FOSELM may be (nearly) singular at some instances; thus the recursively calculated output weights become meaningless and unreliable, and when W < L, FOSELM cannot work at all owing to its too large RMSE or MAPE; accordingly, "×" represents nullification in Tables 1–3. FORELM does not suffer from such a problem. In addition, RMSE and MAPE of FOKELM are also smaller than those of FOSELM with the same W values.

(2) RMSE of FORELM and FOKELM is smaller than that of ReOSELM with the same L and C when the parameter W, namely, the length of the sliding time window, is set properly. The reason is that ReOSELM neglects the timeliness of the samples of the time-varying process and does not remove the effects of old samples; contrarily, FORELM and FOKELM stress the actions of the newest ones. It should be noticed that the proper W is relevant to the characteristics of the specific time-varying system and its inputs. In Table 1, when W is set to moderate values such as 70 or 100, the effect is good.

(3) When W is fixed, the RMSEs of these ELMs tend to decrease at first as L increases, but later the changes are not obvious.

(4) FORELM requires nearly the same time as FOSELM, but more time than ReOSELM. This is because both FORELM and FOSELM involve incremental and decremental learning procedures, while ReOSELM performs the incremental learning procedure only.

(5) Both FORELM and DRELM use the regularization trick, so they should theoretically obtain the same or similar prediction effect, and Tables 1 and 2 also show that they have statistically similar simulation results. However, their time efficiencies are different. From Table 3, it can be seen that, when L is small or smaller than W, FORELM takes less time than DRELM. Thus, when L is small and modeling speed is preferred to accuracy, one may try to employ FORELM.

(6) When W is fixed, if L is large enough, there are no significant differences among the RMSEs of FORELM, DRELM, and FOKELM with appropriate parameter values. In Table 3, it is obvious that, as L increases, DRELM costs more and more time, but the time cost of FOKELM is irrelevant to L. According to the procedures of DRELM and FOKELM, if L is large enough to make calculating the hidden-layer outputs more complex than calculating the kernel matrix, FOKELM will take less time than DRELM.
To intuitively observe and compare the accuracy and stability of these ELMs with the same L and W values and the same initial signal, the absolute prediction error (APE) curves of one trial of every approach are shown in Figure 1. Clearly, Figure 1(a) shows that, at a few instances, the prediction errors of FOSELM are much greater, although they are very small at the other instances; thus FOSELM is unstable. Comparing Figures 1(b), 1(c), 1(d), and 1(e), we can see that, at most instances, the prediction errors of DRELM, FORELM, and FOKELM are smaller than those of ReOSELM, and the prediction effect of FORELM is similar to that of DRELM.
(a) APE curves of FOSELM
(b) APE curves of ReOSELM
(c) APE curves of DRELM
(d) APE curves of FORELM
(e) APE curves of FOKELM
On the whole, FORELM and FOKELM have higher accuracy than FOSELM and ReOSELM.
Simulation 2. In this subsection, the proposed methods are tested on modeling a second-order bioreactor process described by the following differential equations [26]: where the cell concentration, considered as the output of the process, and the amount of nutrients per unit volume are the two states, and the flow rate is the control input (the excitation signal for modeling); the two states can take values between zero and one, and the input is allowed a magnitude in a bounded interval.
In the simulation, with the growth rate parameter fixed, the nutrient inhibition parameter is considered as the time-varying parameter; that is,
Let indicate sampling interval. Denote , . The input is set below: where, at every sampling instance, generates random numbers which are uniformly distributed in the interval .
With the sampling interval fixed and the same parameter values and initial signal, the APE curves of every approach for one trial are drawn in Figure 2. Clearly, on the whole, the APE curves of FORELM and FOKELM are lower than those of FOSELM and ReOSELM, and FORELM has nearly the same prediction effect as DRELM. Further, the RMSEs of FOSELM, ReOSELM, DRELM, FORELM, and FOKELM are 0.096241, 0.012203, 0.007439, 0.007619, and 0.007102, respectively.
(a) APE curves of FOSELM
(b) APE curves of ReOSELM
(c) APE curves of DRELM
(d) APE curves of FORELM
(e) APE curves of FOKELM
Through many comparative trials, we reach the same conclusions as in Simulation 1.
5. Conclusions
ReOSELM (i.e., SRELM or LSIELM) can yield models with good generalization and does not suffer from matrix singularity or ill-posed problems, but it is unsuitable for time-varying applications. On the other hand, FOSELM, thanks to its forgetting mechanism, can reflect the timeliness of data and train SLFN in nonstationary environments, but it may encounter the matrix singularity problem and run unstably.
In this paper, the forgetting mechanism is incorporated into ReOSELM, and we obtain FORELM, which blends the advantages of ReOSELM and FOSELM. In addition, the forgetting mechanism is also added to KBIELM; consequently, FOKELM is obtained, which overcomes the matrix expansion problem of KBIELM.
Performance comparisons between the proposed ELMs and other ELMs were carried out on the identification of time-varying systems in terms of accuracy, stability, and computational complexity. The experimental results show that, statistically, FORELM and FOKELM have better stability than FOSELM and higher accuracy than ReOSELM in nonstationary environments. When the number L of hidden nodes is small or smaller than the length W of the sliding time window, FORELM has a time-efficiency advantage over DRELM. On the other hand, if L is large enough, FOKELM will be faster than DRELM.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The work is supported by the Hunan Provincial Science and Technology Foundation of China (2011FJ6033).
References
[1] J. Park and I. W. Sandberg, "Universal approximation using radial basis function networks," Neural Computation, vol. 3, no. 2, pp. 246–257, 1991.
[2] G. B. Huang, Y. Q. Chen, and H. A. Babri, "Classification ability of single hidden layer feedforward neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 799–801, 2000.
[3] S. Ferrari and R. F. Stengel, "Smooth function approximation using neural networks," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 24–38, 2005.
[4] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985–990, July 2004.
[5] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.
[6] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411–1423, 2006.
[7] J. S. Lim, "Partitioned online sequential extreme learning machine for large ordered system modeling," Neurocomputing, vol. 102, pp. 59–64, 2013.
[8] J. S. Lim, S. Lee, and H. S. Pang, "Low complexity adaptive forgetting factor for online sequential extreme learning machine (OS-ELM) for application to nonstationary system estimations," Neural Computing and Applications, vol. 22, no. 3-4, pp. 569–576, 2013.
[9] Y. Gu, J. F. Liu, Y. Q. Chen, X. L. Jiang, and H. C. Yu, "TOSELM: timeliness online sequential extreme learning machine," Neurocomputing, vol. 128, pp. 119–127, 2014.
[10] J. W. Zhao, Z. H. Wang, and D. S. Park, "Online sequential extreme learning machine with forgetting mechanism," Neurocomputing, vol. 87, pp. 79–89, 2012.
[11] X. Zhang and H. L. Wang, "Fixed-memory extreme learning machine and its applications," Control and Decision, vol. 27, no. 8, pp. 1206–1210, 2012.
[12] W. Y. Deng, Q. H. Zheng, and L. Chen, "Regularized extreme learning machine," in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pp. 389–395, April 2009.
[13] H. T. Huynh and Y. Won, "Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks," Pattern Recognition Letters, vol. 32, no. 14, pp. 1930–1935, 2011.
[14] X. Zhang and H. L. Wang, "Time series prediction based on sequential regularized extreme learning machine and its application," Acta Aeronautica et Astronautica Sinica, vol. 32, no. 7, pp. 1302–1308, 2011.
[15] G. B. Huang, D. H. Wang, and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107–122, 2011.
[16] G. B. Huang, H. M. Zhou, X. J. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.
[17] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
[18] L. Guo, J. H. Hao, and M. Liu, "An incremental extreme learning machine for online sequential learning problems," Neurocomputing, vol. 128, pp. 50–58, 2014.
[19] G. B. Huang, L. Chen, and C. K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.
[20] G. B. Huang and L. Chen, "Convex incremental extreme learning machine," Neurocomputing, vol. 70, no. 16–18, pp. 3056–3062, 2007.
[21] G. B. Huang and L. Chen, "Enhanced random search based incremental extreme learning machine," Neurocomputing, vol. 71, no. 16–18, pp. 3460–3468, 2008.
[22] X. Zhang and H. L. Wang, "Incremental regularized extreme learning machine based on Cholesky factorization and its application to time series prediction," Acta Physica Sinica, vol. 60, no. 11, Article ID 110201, 2011.
[23] G. H. Golub and C. F. van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
[24] X. Zhang and H. L. Wang, "Dynamic regression extreme learning machine and its application to small-sample time series prediction," Information and Control, vol. 40, no. 5, pp. 704–709, 2011.
[25] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4–27, 1990.
[26] M. O. Efe, E. Abadoglu, and O. Kaynak, "A novel analysis and design of a neural network assisted nonlinear controller for a bioreactor," International Journal of Robust and Nonlinear Control, vol. 9, no. 11, pp. 799–815, 1999.
Copyright
Copyright © 2014 Xinran Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.