Online Regularized and Kernelized Extreme Learning Machines with Forgetting Mechanism
To apply single hidden-layer feedforward neural networks (SLFN) to identify time-varying systems, an online regularized extreme learning machine (ELM) with forgetting mechanism (FORELM) and an online kernelized ELM with forgetting mechanism (FOKELM) are presented in this paper. The FORELM updates the output weights of the SLFN recursively using the Sherman-Morrison formula, and it combines the advantages of online sequential ELM with forgetting mechanism (FOS-ELM) and regularized online sequential ELM (ReOS-ELM); that is, it can capture the latest properties of the identified system by studying a certain number of the newest samples, and it can also avoid the issue of ill-conditioned matrix inversion by regularization. The FOKELM tackles the problem of matrix expansion of kernel based incremental ELM (KB-IELM) by deleting the oldest sample according to the block matrix inverse formula as new samples arrive continually. The experimental results show that the proposed FORELM and FOKELM have better stability than FOS-ELM and higher accuracy than ReOS-ELM in nonstationary environments; moreover, FORELM and FOKELM are more time-efficient than the dynamic regression extreme learning machine (DR-ELM) under certain conditions.
Plenty of research has shown that single hidden-layer feedforward neural networks (SLFN) can approximate any function and form decision boundaries with arbitrary shapes if the activation function is chosen properly [1-3]. However, most traditional approaches (such as the BP algorithm) for training SLFN are slow due to their iterative steps. To train SLFN fast, Huang et al. proposed a learning algorithm called extreme learning machine (ELM), which randomly assigns the hidden-node parameters (the input weights and hidden-layer biases of additive networks or the centers and impact factors of RBF networks) and then determines the output weights by the Moore-Penrose generalized inverse [4, 5]. The original ELM is a batch learning algorithm.
For some practical fields where the training data are generated gradually, online sequential learning algorithms are preferred over batch learning algorithms, as sequential learning algorithms do not require retraining whenever a new sample is received. Hence, Liang et al. developed a kind of online sequential ELM (OS-ELM) using recursive least squares [6]. OS-ELM for SLFN produced better generalization performance at faster learning speed compared with the previous sequential learning algorithms. Moreover, for time-varying environments, several incremental sequential ELMs have recently been presented; they apply a constant or adaptive forgetting factor [7, 8] or an iteration approach [9] to strengthen a new sample's contribution to the model. Theoretically speaking, however, they cannot thoroughly eliminate old samples' effect on the model. To let ELM study the latest properties of the identified object, Zhao et al. developed the online sequential ELM with forgetting mechanism (FOS-ELM) [10]. The fixed-memory extreme learning machine (FM-ELM) of Zhang and Wang [11] can be thought of as a special case of FOS-ELM with the block size in [10] being 1. Although experimental results show that FOS-ELM has higher accuracy [10], it may encounter the matrix singularity problem and run unstably.
As a variant of ELM, the regularized ELM (RELM) [12-14], which is mathematically equivalent to the constrained optimization based ELM [15, 16] and absorbs the thought of structural risk minimization from statistical learning theory [17], can overcome the overfitting problem of ELM and provides better generalization ability than the original ELM when noises or outliers exist in the dataset [12]. Furthermore, the regularized OS-ELM (ReOS-ELM) developed by Huynh and Won [13], which is essentially equivalent to the sequential regularized ELM (SRELM) [14] and the least square incremental ELM (LS-IELM) [18], can avoid the singularity problem.
If the feature mapping in SLFN is unknown to users, the kernel based ELM (KELM) can be constructed [15, 16]. For applications where samples arrive gradually, Guo et al. developed the kernel based incremental ELM (KB-IELM) [18].
However, in time-varying or nonstationary applications, the newer training data usually carry more information about the system, and the older ones possibly carry less, or even misleading, information; that is, the training samples usually have timeliness. ReOS-ELM and KB-IELM cannot reflect the timeliness of sequential training data well. Moreover, if a huge number of samples emerge, the storage space required by KB-IELM for its kernel matrix will increase without bound as learning goes on and new samples arrive ceaselessly, and at last storage overflow will necessarily happen, so KB-IELM cannot be utilized at all under such circumstances.
In this paper, we combine the advantages of FOS-ELM and ReOS-ELM and propose the online regularized ELM with forgetting mechanism (FORELM) for time-varying applications. FORELM can overcome the potential matrix singularity problem by using regularization and eliminate the effects of outdated data on the model by incorporating a forgetting mechanism. As in FOS-ELM, the ensemble technique may also be employed in FORELM to enhance its stability; that is, FORELM comprises $M$ ReOS-ELMs with forgetting mechanism, each of which trains a SLFN; the average of their outputs represents the final output of the ensemble of these SLFNs. Additionally, the forgetting mechanism is also incorporated into KB-IELM, and the online kernelized ELM with forgetting mechanism (FOKELM) is presented, which can deal with the matrix expansion problem of KB-IELM. The designed FORELM and FOKELM update the model recursively. The experimental results show the better performance of the FORELM and FOKELM approaches in nonstationary environments.
It should be noted that our methods adjust the output weights of the SLFN due to addition and deletion of samples one by one, namely, learn and forget samples sequentially, and the network architecture is fixed. They are completely different from the offline incremental ELMs (I-ELM) [19-21] and the incremental RELM [22], which seek an optimal network architecture by adding hidden nodes one by one and learn the data in batch mode.
The rest of this paper is organized as follows. Section 2 gives a brief review of the basic concepts and related works of ReOS-ELM and KB-IELM. Section 3 proposes new online learning algorithms, namely, FORELM and FOKELM. Performance evaluation is conducted in Section 4. Conclusions are drawn in Section 5.
2. Brief Review of the ReOS-ELM and KB-IELM
For simplicity, the ELM-based learning algorithm for SLFN with multiple inputs and a single output is discussed.
The output of a SLFN with $L$ hidden nodes (additive or RBF nodes) can be represented by
\[ f(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x) = h(x)\beta, \quad (1) \]
where $a_i$ and $b_i$ are the learning parameters of the hidden nodes, $\beta = [\beta_1, \ldots, \beta_L]^T$ is the vector of the output weights, and $G(a_i, b_i, x)$ denotes the output of the $i$th hidden node with respect to the input $x$, that is, the activation function. $h(x) = [G(a_1, b_1, x), \ldots, G(a_L, b_L, x)]$ is a feature mapping from the $d$-dimensional input space to the $L$-dimensional hidden-layer feature space. In ELM, $a_i$ and $b_i$ are randomly determined firstly.
For a given set of $N$ distinct training data $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i$ is a $d$-dimensional input vector and $t_i$ is the corresponding scalar observation, the RELM, that is, constrained optimization based ELM, can be formulated as
\[ \min_{\beta, e} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|e\|^2 \quad (2a) \]
\[ \text{subject to } H\beta = T - e, \quad (2b) \]
where $e = [e_1, \ldots, e_N]^T$ denotes the training error, $T = [t_1, \ldots, t_N]^T$ indicates the target values of all the samples, $H = [h(x_1)^T, \ldots, h(x_N)^T]^T$ is the mapping matrix for the inputs of all the samples, and $C$ is the regularization parameter (a positive constant).
Based on the KKT theorem, the constrained optimization of (2a) and (2b) can be transferred to the following dual optimization problem:
\[ L_D = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|e\|^2 - \alpha^T (H\beta - T + e), \quad (3) \]
where $\alpha = [\alpha_1, \ldots, \alpha_N]^T$ is the Lagrange multipliers vector. Using the KKT optimality conditions, the following equations can be obtained:
\[ \frac{\partial L_D}{\partial \beta} = 0 \Rightarrow \beta = H^T\alpha, \qquad \frac{\partial L_D}{\partial e} = 0 \Rightarrow \alpha = Ce, \qquad \frac{\partial L_D}{\partial \alpha} = 0 \Rightarrow H\beta - T + e = 0. \quad (4) \]
Ultimately, $\beta$ can be obtained as follows [12, 15, 16]:
\[ \beta = \left(\frac{I}{C} + H^T H\right)^{-1} H^T T, \quad (5a) \]
\[ \beta = H^T \left(\frac{I}{C} + H H^T\right)^{-1} T. \quad (5b) \]
In order to reduce computational costs, when $N > L$, one may prefer to apply solution (5a), and when $N < L$, one may prefer to apply solution (5b), so that the smaller of the two matrices is inverted.
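The equivalence of (5a) and (5b) is easy to check numerically. The sketch below uses random data; the sizes $N$, $L$ and the value of $C$ are illustrative choices, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, C = 50, 20, 100.0            # samples, hidden nodes, regularization
H = rng.standard_normal((N, L))    # hidden-layer output matrix
T = rng.standard_normal(N)         # target vector

# (5a): inverts an L x L matrix, cheaper when N > L
beta_a = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

# (5b): inverts an N x N matrix, cheaper when N < L
beta_b = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)

# the two closed forms agree by the push-through identity
assert np.allclose(beta_a, beta_b)
```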
If the feature mapping $h(x)$ is unknown, one can apply Mercer's conditions on RELM. The kernel matrix $\Omega = HH^T$ is defined as $\Omega_{i,j} = h(x_i)h(x_j)^T = K(x_i, x_j)$. Then, the output of SLFN by kernel based RELM can be given as
\[ f(x) = [K(x, x_1), \ldots, K(x, x_N)] \left(\frac{I}{C} + \Omega\right)^{-1} T. \quad (6) \]
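A minimal sketch of (6) with a Gaussian kernel follows; the kernel width, the value of $C$, and the toy data are assumptions for illustration, not the paper's settings. Since $(I/C + \Omega)\alpha = T$, the training residual $T - \Omega\alpha$ equals $\alpha/C$ exactly, which gives a convenient self-check:

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))   # toy training inputs
T = np.sin(X).sum(axis=1)          # toy targets
C = 1000.0

Omega = gaussian_kernel(X, X)                          # kernel matrix
alpha = np.linalg.solve(np.eye(len(X)) / C + Omega, T)  # (I/C + Omega)^{-1} T

def predict(x_new):
    # f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega)^{-1} T, eq. (6)
    return gaussian_kernel(np.atleast_2d(x_new), X) @ alpha

# exact identity: (I/C + Omega) alpha = T  =>  T - Omega alpha = alpha / C
assert np.allclose(T - Omega @ alpha, alpha / C)
```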
The ReOS-ELM, that is, SRELM and LS-IELM, can be restated as follows.
For time $k$, let $H_k$ and $T_k$ denote the mapping matrix and target vector of the samples learned so far, and let $P_k = (I/C + H_k^T H_k)^{-1}$; then, according to (5a), the solution of RELM can be expressed as
\[ \beta_k = P_k H_k^T T_k. \quad (7) \]
For time $k+1$, the new sample $(x_{k+1}, t_{k+1})$ arrives; thus $H_{k+1} = [H_k^T, h(x_{k+1})^T]^T$ and $T_{k+1} = [T_k^T, t_{k+1}]^T$. Applying the Sherman-Morrison-Woodbury (SMW) formula [23], the current $P_{k+1}$ and $\beta_{k+1}$ can be computed as
\[ P_{k+1} = P_k - \frac{P_k h(x_{k+1})^T h(x_{k+1}) P_k}{1 + h(x_{k+1}) P_k h(x_{k+1})^T}, \quad (8a) \]
\[ \beta_{k+1} = \beta_k + P_{k+1} h(x_{k+1})^T \left(t_{k+1} - h(x_{k+1}) \beta_k\right). \quad (8b) \]
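The recursion (8a)-(8b) can be verified against batch retraining. In this sketch the hidden outputs are random placeholders (no actual activation function is evaluated), and $L$, $C$, and the sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
L, C = 10, 100.0
H = rng.standard_normal((15, L))   # hidden outputs of the first k samples
T = rng.standard_normal(15)

P = np.linalg.inv(np.eye(L) / C + H.T @ H)   # P_k
beta = P @ H.T @ T                           # beta_k, eq. (7)

h = rng.standard_normal((1, L))   # hidden output row of the new sample
t = 0.7                           # its target

# (8a): Sherman-Morrison rank-1 update of P
P_new = P - (P @ h.T @ h @ P) / (1.0 + (h @ P @ h.T).item())
# (8b): recursive update of the output weights
beta_new = beta + (P_new @ h.T).ravel() * (t - (h @ beta).item())

# must agree with retraining from scratch on all k+1 samples
H_all = np.vstack([H, h])
T_all = np.append(T, t)
assert np.allclose(beta_new,
                   np.linalg.solve(np.eye(L) / C + H_all.T @ H_all, H_all.T @ T_all))
```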
For time $k$, let $A_k = (I/C + \Omega_k)^{-1}$, where $\Omega_k$ is the kernel matrix of the $k$ samples learned so far; by (6), the model output at $x$ is then $[K(x, x_1), \ldots, K(x, x_k)] A_k T_k$.
For time $k+1$, the new sample $(x_{k+1}, t_{k+1})$ arrives; thus
\[ \frac{I}{C} + \Omega_{k+1} = \begin{bmatrix} \frac{I}{C} + \Omega_k & v \\ v^T & c \end{bmatrix}, \]
where $v = [K(x_1, x_{k+1}), \ldots, K(x_k, x_{k+1})]^T$ and $c = 1/C + K(x_{k+1}, x_{k+1})$. Using the block matrix inverse formula [23], $A_{k+1}$ can be calculated from $A_k$ as
\[ A_{k+1} = \begin{bmatrix} A_k + \dfrac{A_k v v^T A_k}{r} & -\dfrac{A_k v}{r} \\ -\dfrac{v^T A_k}{r} & \dfrac{1}{r} \end{bmatrix}, \quad (11) \]
where $r = c - v^T A_k v$.
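Update (11) can likewise be checked against a direct inverse. The Gaussian kernel and the parameter values below are illustrative assumptions:

```python
import numpy as np

def rbf(X, Z, gamma=0.5):
    # pairwise Gaussian kernel matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
C = 100.0
X = rng.standard_normal((8, 2))                 # the first k samples
A = np.linalg.inv(np.eye(8) / C + rbf(X, X))    # A_k = (I/C + Omega_k)^{-1}

x_new = rng.standard_normal((1, 2))
v = rbf(X, x_new)                               # kernel column, shape (k, 1)
r = 1.0 / C + rbf(x_new, x_new).item() - (v.T @ A @ v).item()   # Schur complement

# (11): block-inverse update of A_{k+1} from A_k
A_new = np.block([
    [A + (A @ v @ v.T @ A) / r, -(A @ v) / r],
    [-(v.T @ A) / r,            np.array([[1.0 / r]])],
])

# must agree with inverting the enlarged matrix directly
X_all = np.vstack([X, x_new])
assert np.allclose(A_new, np.linalg.inv(np.eye(9) / C + rbf(X_all, X_all)))
```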
3. The Proposed FORELM and FOKELM
When a SLFN is employed to model a time-varying system online, the training samples are not only generated one by one but also often have the property of timeliness; that is, training data have a period of validity. Therefore, during the learning process of an online sequential learning algorithm, the older or outdated training data, whose effectiveness lessens or is lost after several unit times, should be abandoned, which is the idea of the forgetting mechanism [10]. ReOS-ELM (i.e., SRELM or LS-IELM) and KB-IELM cannot reflect the timeliness of sequential training data. In this section, the forgetting mechanism is added to them to eliminate the outdated data that might have a misleading or bad effect on the built SLFN. Additionally, for KB-IELM, abandoning samples prevents the matrix in (11) from expanding infinitely. The computing procedures for deleting a sample are given, and the complete online regularized ELM and kernelized ELM with forgetting mechanism are presented.
3.1. Decremental RELM and FORELM
After RELM has studied the given number of samples and the SLFN has been applied for prediction, RELM will discard the oldest sample from the sample set.
Let $h_1 = h(x_1)$ denote the hidden-layer output row vector of the oldest sample $(x_1, t_1)$, and partition $H_k = [h_1^T, H'^T]^T$, $T_k = [t_1, T'^T]^T$; then $H_k^T H_k = h_1^T h_1 + H'^T H'$, $H_k^T T_k = h_1^T t_1 + H'^T T'$, and
\[ (P')^{-1} = \frac{I}{C} + H'^T H' = P_k^{-1} - h_1^T h_1. \quad (12) \]
Furthermore, using the SMW formula, then
\[ P' = P_k + \frac{P_k h_1^T h_1 P_k}{1 - h_1 P_k h_1^T}. \quad (13) \]
Moreover,
\[ \beta' = \beta_k - P' h_1^T (t_1 - h_1 \beta_k). \quad (14) \]
Next time, $P_{k+1}$ and $\beta_{k+1}$ can be calculated from $P'$ (viewed as $P_k$) and $\beta'$ (viewed as $\beta_k$) according to (8a) and (8b), respectively.
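The downdate (13)-(14) admits the same kind of check as the incremental update: after discarding the oldest sample, the recursively downdated weights must coincide with retraining on the remaining samples. Sizes and $C$ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
L, C = 10, 100.0
H = rng.standard_normal((20, L))
T = rng.standard_normal(20)

P = np.linalg.inv(np.eye(L) / C + H.T @ H)
beta = P @ H.T @ T

h1, t1 = H[0:1, :], T[0]   # the oldest sample, to be discarded

# (13): rank-1 downdate of P via the SMW formula
# (1 - h1 P h1^T > 0 is guaranteed because P^{-1} contains the I/C term)
P_new = P + (P @ h1.T @ h1 @ P) / (1.0 - (h1 @ P @ h1.T).item())
# (14): downdate of the output weights
beta_new = beta - (P_new @ h1.T).ravel() * (t1 - (h1 @ beta).item())

# must agree with retraining on the remaining samples
H_rest, T_rest = H[1:], T[1:]
assert np.allclose(beta_new,
                   np.linalg.solve(np.eye(L) / C + H_rest.T @ H_rest, H_rest.T @ T_rest))
```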
Suppose that FORELM consists of $M$ ReOS-ELMs with forgetting mechanism, by which the $M$ trained SLFNs have the same hidden output function and the same number of hidden nodes $L$. In the following FORELM algorithm, the variables and parameters with superscript $j$ are relevant to the $j$th SLFN to be trained by the $j$th ReOS-ELM with forgetting mechanism. Synthesizing ReOS-ELM and the decremental RELM, we can get FORELM as follows.
Step 1. Initialization:
(1) Choose the hidden output function $G(a, b, x)$ of the SLFN with a certain activation function and the number of hidden nodes $L$. Set the value of the regularization parameter $C$.
(2) Randomly assign the hidden parameters $a_i^j$, $b_i^j$, $i = 1, \ldots, L$, $j = 1, \ldots, M$.
(3) Determine the window length $p$; set $k = 0$ and $P_0^j = CI$.
Step 2. Incrementally learn the initial $p$ samples; that is, repeat the following procedure $p$ times:
(1) Get the current sample $(x_k, t_k)$.
(2) Calculate $h^j(x_k)$ and $P_k^j$: $h^j(x_k) = [G(a_1^j, b_1^j, x_k), \ldots, G(a_L^j, b_L^j, x_k)]$; calculate $P_k^j$ by (8a).
(3) Calculate $\beta_k^j$: if the sample is the first one, then $\beta_k^j = P_k^j h^j(x_k)^T t_k$, else calculate $\beta_k^j$ by (8b).
Step 3. Online modeling and prediction: repeat the following procedure during every step:
(1) Acquire the current output, form the new sample $(x_k, t_k)$, and calculate $h^j(x_k)$, $P_k^j$, and $\beta_k^j$ by (8a) and (8b).
(2) Prediction: form $x_{k+1}$; the output of the $j$th SLFN, that is, the prediction of $t_{k+1}$, can be calculated by (1): $\hat{t}_{k+1}^j = h^j(x_{k+1}) \beta_k^j$; the final prediction is $\hat{t}_{k+1} = (1/M) \sum_{j=1}^{M} \hat{t}_{k+1}^j$.
(3) Delete the oldest sample: calculate $(P')^j$ and $(\beta')^j$ by (13) and (14), respectively.
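The steps above can be sketched as a single-network loop ($M = 1$, no ensemble); the drifting toy target, the parameter values, and the sigmoid construction are illustrative assumptions, not the paper's simulation. After any number of learn/forget cycles, $\beta$ must equal the batch RELM solution over the current window:

```python
import numpy as np

rng = np.random.default_rng(5)
L, C, p, d = 20, 100.0, 60, 2      # hidden nodes, regularization, window, input dim

# Step 1: randomly assign input weights and biases of sigmoidal additive nodes
W = rng.uniform(-1.0, 1.0, (L, d))
b = rng.uniform(-1.0, 1.0, L)

def hidden(x):
    # h(x): hidden-layer output row, shape (L,)
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

P = C * np.eye(L)      # P_0 = (I/C)^{-1}: no samples learned yet
beta = np.zeros(L)
window = []            # sliding window of the p newest samples

def learn(x, t):
    # incremental update (8a)-(8b) for one new sample
    global P, beta
    h = hidden(x)[None, :]
    P = P - (P @ h.T @ h @ P) / (1.0 + (h @ P @ h.T).item())
    beta = beta + (P @ h.T).ravel() * (t - (h @ beta).item())
    window.append((x, t))

def forget():
    # decremental update (13)-(14) for the oldest sample
    global P, beta
    x, t = window.pop(0)
    h = hidden(x)[None, :]
    P = P + (P @ h.T @ h @ P) / (1.0 - (h @ P @ h.T).item())
    beta = beta - (P @ h.T).ravel() * (t - (h @ beta).item())

# drive the loop on a toy slowly drifting stream
for k in range(200):
    x = rng.uniform(-1.0, 1.0, d)
    t = (1.0 + 0.01 * k) * np.sin(x).sum()   # time-varying target
    pred = hidden(x) @ beta                  # predict before learning
    learn(x, t)
    if len(window) > p:
        forget()
```

The design point is that adds and deletes both act on the same $L \times L$ matrix $P$, so the per-step cost is independent of how many samples have ever been seen.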
3.2. Decremental KELM and FOKELM
After KELM has studied the given number of samples and the SLFN has been applied for prediction, KELM will discard the oldest sample from the sample set.
Let the oldest of the $k+1$ samples be $(x_1, t_1)$. Denote $u = [K(x_1, x_2), \ldots, K(x_1, x_{k+1})]^T$, $c_1 = 1/C + K(x_1, x_1)$, and let $\Omega'$ be the kernel matrix of the remaining $k$ samples; then $I/C + \Omega_{k+1}$ can be written in the following partitioned matrix form:
\[ \frac{I}{C} + \Omega_{k+1} = \begin{bmatrix} c_1 & u^T \\ u & \frac{I}{C} + \Omega' \end{bmatrix}. \quad (15) \]
Moreover, using the block matrix inverse formula, the following equation can be obtained:
\[ A_{k+1} = \left(\frac{I}{C} + \Omega_{k+1}\right)^{-1} = \begin{bmatrix} \dfrac{1}{s} & -\dfrac{u^T A'}{s} \\ -\dfrac{A' u}{s} & A' + \dfrac{A' u u^T A'}{s} \end{bmatrix}, \quad (16) \]
where $A' = (I/C + \Omega')^{-1}$ and $s = c_1 - u^T A' u$.
Rewrite $A_{k+1}$ in the partitioned matrix form as
\[ A_{k+1} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad (17) \]
where $A_{11}$ is a scalar.
Comparing (16) and (17), $A'$ can be calculated as
\[ A' = A_{22} - \frac{A_{21} A_{12}}{A_{11}}. \quad (18) \]
Next time, $A_{k+1}$ can be computed from $A'$ (viewed as $A_k$) according to (11).
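Equation (18) is a one-line Schur-complement computation; the sketch below verifies it on random data with an illustrative Gaussian kernel:

```python
import numpy as np

def rbf(X, Z, gamma=0.5):
    # pairwise Gaussian kernel matrix
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
C = 100.0
X = rng.standard_normal((9, 2))
A = np.linalg.inv(np.eye(9) / C + rbf(X, X))   # oldest sample in row/column 0

# partition A as in (17): scalar A11, row A12, block A22 (A21 = A12^T by symmetry)
A11 = A[0, 0]
A12 = A[0:1, 1:]
A22 = A[1:, 1:]

# (18): inverse over the remaining samples
A_prime = A22 - (A12.T @ A12) / A11

assert np.allclose(A_prime, np.linalg.inv(np.eye(8) / C + rbf(X[1:], X[1:])))
```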
Integrate KB-IELM with the decremental KELM; further, we can obtain FOKELM as follows.
Step 1. Initialization: choose the kernel $K(\cdot, \cdot)$ with corresponding parameter values, and determine the regularization parameter $C$ and the window length $p$.
Step 2. Incrementally learn the initial $p$ samples: calculate $A$: if there exists a sample only, then $A = [1/(1/C + K(x_1, x_1))]$, else calculate $A$ by (11).
Step 3. Online modeling and prediction:
(1) Acquire the new sample $(x_{k+1}, t_{k+1})$ and calculate $A_{k+1}$ by (11).
(2) Prediction: form the next regression vector $x$, and calculate its prediction by (6); namely, $\hat{t} = [K(x, x_1), \ldots, K(x, x_{k+1})] A_{k+1} T_{k+1}$.
(3) Delete the oldest sample $(x_1, t_1)$: calculate $A'$ by (18).
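Putting the pieces together, a minimal FOKELM loop maintains $A$ over a sliding window by alternating update (11) and downdate (18); the kernel, parameter values, and toy stream here are illustrative assumptions, not the paper's simulation:

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    # Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2)))

rng = np.random.default_rng(7)
C, p = 100.0, 30          # regularization and window length
X, T = [], []             # sliding window of samples
A = np.zeros((0, 0))      # A = (I/C + Omega)^{-1} over the window

def learn(x, t):
    # incremental KB-IELM update (11)
    global A
    if not X:
        A = np.array([[1.0 / (1.0 / C + rbf(x, x))]])
    else:
        v = np.array([[rbf(xi, x)] for xi in X])        # kernel column, (k, 1)
        r = 1.0 / C + rbf(x, x) - (v.T @ A @ v).item()  # Schur complement
        A = np.block([[A + (A @ v @ v.T @ A) / r, -(A @ v) / r],
                      [-(v.T @ A) / r,            np.array([[1.0 / r]])]])
    X.append(x); T.append(t)

def forget():
    # decremental update (18): drop the oldest sample (first row/column)
    global A
    A = A[1:, 1:] - (A[1:, 0:1] @ A[0:1, 1:]) / A[0, 0]
    X.pop(0); T.pop(0)

def predict(x):
    # prediction (6) restricted to the samples in the window
    k_vec = np.array([rbf(xi, x) for xi in X])
    return k_vec @ A @ np.array(T)

# drive the loop on a toy stream
for step in range(100):
    x = rng.uniform(-1.0, 1.0, 2)
    t = float(np.sin(x).sum())
    if X:
        _ = predict(x)    # predict before learning
    learn(x, t)
    if len(X) > p:
        forget()
```

Note that $A$ never grows beyond $p \times p$, which is exactly how the forgetting mechanism prevents the matrix expansion of KB-IELM.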
4. Performance Evaluation
In this section, the performance of the presented FORELM and FOKELM is verified via time-varying nonlinear process identification simulations. These simulations evaluate the accuracy, stability, and computational complexity of the proposed FORELM and FOKELM by comparison with FOS-ELM, ReOS-ELM (i.e., SRELM or LS-IELM), and the dynamic regression extreme learning machine (DR-ELM) [24]. DR-ELM is also a kind of online sequential RELM and was designed by Zhang and Wang using solution (5b) of RELM and the block matrix inverse formula.
All the performance evaluations were executed in MATLAB 7.0.1 environment running on Windows XP with Intel Core i3-3220 3.3 GHz CPU and 4 GB RAM.
Simulation 1. The unknown identified system is a modified version of the one addressed in [25], obtained by changing the constant and the coefficients of the variables to form a time-varying system, as done in [10]:
The system (20) can be expressed as $y(k) = f(x(k))$, where $f(\cdot)$ is a nonlinear function and $x(k)$ is the regression input data vector built from past inputs and outputs, with the model orders and delay being the model structure parameters. Applying a SLFN to approximate (20), $(x(k), y(k))$ is accordingly the learning sample of the SLFN.
Denote the system input by $u(k)$ and the output by $y(k)$. The input is set as a piecewise-random excitation signal, where rand generates random numbers which are uniformly distributed in the given interval.
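For illustration, regression samples of this form can be assembled as follows; the helper name and the default orders `nu`, `ny` are hypothetical, not the paper's structure parameters:

```python
import numpy as np

def make_samples(u, y, nu=2, ny=2):
    """Build regression pairs x(k) = [y(k-1..k-ny), u(k-1..k-nu)] -> y(k)."""
    X, T = [], []
    for k in range(max(nu, ny), len(y)):
        # most recent values first
        X.append(np.r_[y[k - ny:k][::-1], u[k - nu:k][::-1]])
        T.append(y[k])
    return np.array(X), np.array(T)
```

Each row of `X` then serves as one SLFN input $x(k)$ with target $y(k)$.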
In all experiments, the output of a hidden node with respect to the input $x$ of a SLFN in (1) is set to the sigmoidal additive function, that is, $G(a, b, x) = 1/(1 + \exp(-(a \cdot x + b)))$, and the components of the hidden parameters, that is, the input weights $a$ and the biases $b$, are randomly chosen. In FOKELM, the Gaussian kernel function is applied; namely, $K(x, z) = \exp(-\|x - z\|^2/\sigma^2)$.
The root-mean-square error (RMSE) of prediction and the maximal absolute prediction error (MAPE) are regarded as the measure indices of model accuracy and stability, respectively:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\hat{y}(k) - y(k)\right)^2}, \qquad \mathrm{MAPE} = \max_{k}\left|\hat{y}(k) - y(k)\right|, \]
where $\hat{y}(k)$ denotes the prediction of $y(k)$ and $n$ is the number of prediction instances. The simulation is carried on for 650 instances, with the structure parameters in model (21) held fixed. Due to the randomness of the hidden parameters and of the input during the initial stage, the results of the simulation must possess variation. For each approach, the results are averaged over 5 trials.
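The two indices can be computed with small helpers; the function names are ours, and "MAPE" here follows the paper's usage (maximal absolute prediction error), not the more common mean absolute percentage error:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error over the n prediction instances."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mape(y_true, y_pred):
    """Maximal absolute prediction error (the stability index used here)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.max(np.abs(y_pred - y_true))
```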
ReOS-ELM does not discard any old sample; thus, in effect, its window covers all the samples received so far.
In offline modeling, the training sample set is fixed; thus one may search for relatively optimal values of the model parameters of the ELMs. Nevertheless, in online modeling for a time-varying system, the training sample set keeps changing, and it is difficult to choose optimal parameter values in practice. Therefore, we manually set the corresponding parameters of these ELMs to the same values, for example, 250 in all of them, and then compare their performances.
RMSE and MAPE of the proposed ELMs and other aforementioned ELMs are listed in Tables 1 and 2, respectively, and the corresponding running time (i.e., training time plus predicting time) of these various ELMs is given in Table 3.
From Tables 1-3, one can see the following results.
(1) RMSE and MAPE of FORELM are smaller than those of FOS-ELM with the same parameter values. The reason is that the matrix to be inverted in FOS-ELM may be (nearly) singular at some instances; thus the recursively calculated output weights are meaningless and unreliable, and in some settings FOS-ELM cannot work at all owing to its excessively large RMSE or MAPE; accordingly, "×" represents nullification in Tables 1-3. FORELM does not suffer from such a problem. In addition, RMSE and MAPE of FOKELM are also smaller than those of FOS-ELM with the same window length.
(2) RMSE of FORELM and FOKELM is smaller than that of ReOS-ELM with the same $L$ and $C$ when the parameter $p$, namely, the length of the sliding time window, is set properly. The reason is that ReOS-ELM neglects the timeliness of samples of the time-varying process and does not get rid of the effects of old samples; contrarily, FORELM and FOKELM stress the actions of the newest ones. It should be noticed that the proper $p$ is relevant to the characteristics of the specific time-varying system and its inputs. In Table 1, when $p$ is set to 70 or 100, for example, the effect is good.
(3) When $p$ is fixed, with $L$ increasing, the RMSEs of these ELMs tend to decrease at first, but later the changes are not obvious.
(4) FORELM requires nearly the same time as FOS-ELM, but more time than ReOS-ELM. This is because both FORELM and FOS-ELM involve incremental and decremental learning procedures, whereas ReOS-ELM performs one incremental learning procedure only.
(5) Both FORELM and DR-ELM use the regularization trick, so they should obtain the same or similar prediction effect theoretically, and Tables 1 and 2 also show that they have statistically similar simulation results. However, their time efficiencies are different. From Table 3, it can be seen that, when $L$ is small relative to $p$, FORELM takes less time than DR-ELM.
Thus, when $L$ is small and modeling speed is preferred to accuracy, one may try to employ FORELM.
(6) When $p$ is fixed, if $L$ is large enough, there are no significant differences between the RMSE of FORELM and DR-ELM and the RMSE of FOKELM with appropriate parameter values. In Table 3, it is obvious that, with $L$ increasing, DR-ELM costs more and more time, but the time cost by FOKELM is irrelevant to $L$. According to the procedures of DR-ELM and FOKELM, if $L$ is large enough to make calculating the hidden-layer outputs more complex than calculating the kernel values, FOKELM will take less time than DR-ELM.
To intuitively observe and compare the accuracy and stability of these ELMs with the same parameter values and initial signal, the absolute prediction error (APE) curves of one trial of every approach are shown in Figure 1. Clearly, Figure 1(a) shows that, at a few instances, the prediction errors of FOS-ELM are much greater, although they are very small at other instances; thus FOS-ELM is unstable. Comparing Figures 1(b), 1(c), 1(d), and 1(e), we can see that, at most instances, the prediction errors of DR-ELM, FORELM, and FOKELM are smaller than those of ReOS-ELM, and the prediction effect of FORELM is similar to that of DR-ELM.
(a) APE curves of FOS-ELM
(b) APE curves of ReOS-ELM
(c) APE curves of DR-ELM
(d) APE curves of FORELM
(e) APE curves of FOKELM
On the whole, FORELM and FOKELM have higher accuracy than FOS-ELM and ReOS-ELM.
Simulation 2. In this subsection, the proposed methods are tested on modeling a second-order bioreactor process described by differential equations [26], in which the cell concentration is considered as the output of the process, the second state is the amount of nutrients per unit volume, and the flow rate represents the control input (the excitation signal for modeling); both state variables can take values between zero and one, and the input is allowed a bounded magnitude.
In the simulation, with the growth rate parameter held constant, the nutrient inhibition parameter is considered as the time-varying parameter.
Let $T_s$ indicate the sampling interval, and denote the sampled input and output sequences by $u(k)$ and $y(k)$. The input is set so that, at every sampling instance, rand generates random numbers which are uniformly distributed in the given interval.
With the same $L$, $C$, $p$, and initial signal, the APE curves of every approach for one trial are drawn in Figure 2. Clearly, on the whole, the APE values of FORELM and FOKELM are smaller than those of FOS-ELM and ReOS-ELM, and FORELM has nearly the same prediction effect as DR-ELM. Further, the RMSE values of FOS-ELM, ReOS-ELM, DR-ELM, FORELM, and FOKELM are 0.096241, 0.012203, 0.007439, 0.007619, and 0.007102, respectively.
(a) APE curves of FOS-ELM
(b) APE curves of ReOS-ELM
(c) APE curves of DR-ELM
(d) APE curves of FORELM
(e) APE curves of FOKELM
Through many comparative trials, we arrive at the same conclusions as those in Simulation 1.
5. Conclusions
ReOS-ELM (i.e., SRELM or LS-IELM) can yield good generalization models and does not suffer from matrix singularity or ill-posed problems, but it is unsuitable for time-varying applications. On the other hand, FOS-ELM, thanks to its forgetting mechanism, can reflect the timeliness of data and train SLFN in nonstationary environments, but it may encounter the matrix singularity problem and run unstably.
In this paper, the forgetting mechanism is incorporated into ReOS-ELM, yielding FORELM, which blends the advantages of ReOS-ELM and FOS-ELM. In addition, the forgetting mechanism is also added to KB-IELM; consequently, FOKELM is obtained, which overcomes the matrix expansion problem of KB-IELM.
Performance comparison between the proposed ELMs and the other ELMs was carried out on the identification of time-varying systems in the aspects of accuracy, stability, and computational complexity. The experimental results show that, statistically, FORELM and FOKELM have better stability than FOS-ELM and higher accuracy than ReOS-ELM in nonstationary environments. When the number of hidden nodes $L$ is small relative to the length $p$ of the sliding time window, FORELM has a time efficiency superiority over DR-ELM. On the other hand, if $L$ is large enough, FOKELM will be faster than DR-ELM.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The work is supported by the Hunan Provincial Science and Technology Foundation of China (2011FJ6033).
References
[1] J. Park and I. W. Sandberg, "Universal approximation using radial basis function networks," Neural Computation, vol. 3, no. 2, pp. 246-257, 1991.
[2] G. B. Huang, Y. Q. Chen, and H. A. Babri, "Classification ability of single hidden layer feedforward neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 799-801, 2000.
[3] S. Ferrari and R. F. Stengel, "Smooth function approximation using neural networks," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 24-38, 2005.
[4] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985-990, July 2004.
[5] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[6] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.
[7] J. S. Lim, "Partitioned online sequential extreme learning machine for large ordered system modeling," Neurocomputing, vol. 102, pp. 59-64, 2013.
[8] J. S. Lim, S. Lee, and H. S. Pang, "Low complexity adaptive forgetting factor for online sequential extreme learning machine (OS-ELM) for application to nonstationary system estimations," Neural Computing and Applications, vol. 22, no. 3-4, pp. 569-576, 2013.
[9] Y. Gu, J. F. Liu, Y. Q. Chen, X. L. Jiang, and H. C. Yu, "TOSELM: timeliness online sequential extreme learning machine," Neurocomputing, vol. 128, pp. 119-127, 2014.
[10] J. W. Zhao, Z. H. Wang, and D. S. Park, "Online sequential extreme learning machine with forgetting mechanism," Neurocomputing, vol. 87, pp. 79-89, 2012.
[11] X. Zhang and H. L. Wang, "Fixed-memory extreme learning machine and its applications," Control and Decision, vol. 27, no. 8, pp. 1206-1210, 2012.
[12] W. Y. Deng, Q. H. Zheng, and L. Chen, "Regularized extreme learning machine," in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pp. 389-395, April 2009.
[13] H. T. Huynh and Y. Won, "Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks," Pattern Recognition Letters, vol. 32, no. 14, pp. 1930-1935, 2011.
[14] X. Zhang and H. L. Wang, "Time series prediction based on sequential regularized extreme learning machine and its application," Acta Aeronautica et Astronautica Sinica, vol. 32, no. 7, pp. 1302-1308, 2011.
[15] G. B. Huang, D. H. Wang, and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107-122, 2011.
[16] G. B. Huang, H. M. Zhou, X. J. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics B: Cybernetics, vol. 42, no. 2, pp. 513-529, 2012.
[17] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
[18] L. Guo, J. H. Hao, and M. Liu, "An incremental extreme learning machine for online sequential learning problems," Neurocomputing, vol. 128, pp. 50-58, 2014.
[19] G. B. Huang, L. Chen, and C. K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
[20] G. B. Huang and L. Chen, "Convex incremental extreme learning machine," Neurocomputing, vol. 70, no. 16-18, pp. 3056-3062, 2007.
[21] G. B. Huang and L. Chen, "Enhanced random search based incremental extreme learning machine," Neurocomputing, vol. 71, no. 16-18, pp. 3460-3468, 2008.
[22] X. Zhang and H. L. Wang, "Incremental regularized extreme learning machine based on Cholesky factorization and its application to time series prediction," Acta Physica Sinica, vol. 60, no. 11, Article ID 110201, 2011.
[23] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
[24] X. Zhang and H. L. Wang, "Dynamic regression extreme learning machine and its application to small-sample time series prediction," Information and Control, vol. 40, no. 5, pp. 704-709, 2011.
[25] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, 1990.
[26] M. O. Efe, E. Abadoglu, and O. Kaynak, "A novel analysis and design of a neural network assisted nonlinear controller for a bioreactor," International Journal of Robust and Nonlinear Control, vol. 9, no. 11, pp. 799-815, 1999.