Abstract

Accurate and fast prediction of nonstationary time series is challenging and of great interest in both practical and academic areas. In this paper, an online sequential extreme learning machine with a new weighted strategy is proposed for nonstationary time series prediction. First, a new leave-one-out (LOO) cross-validation error estimation for online sequential data is proposed based on block matrix inversion. Second, a new weighted strategy based on the proposed LOO error estimation is proposed. This strategy ranks the samples' importance by means of the LOO error of each newly added sample and then assigns the corresponding weights. Performance comparisons of the proposed method with other existing algorithms are presented based on chaotic and real-world nonstationary time series data. The results show that the proposed method outperforms the classical ELM and OS-ELM in terms of generalization performance and numerical stability.

1. Introduction

Time series prediction plays an important role in many engineering fields, for example, dynamic mechanics, weather diagnostics, and so on. The key goal of time series prediction is to mine the inner regular patterns in temporal data in order to predict future data effectively [1]. Many traditional methods such as AR, ARMA, and ARIMA are well suited to stationary time series prediction. However, in practical applications, time series are almost always nonstationary, which restricts the stationary methods above. According to Takens' phase space delay reconstruction theory [2], this kind of data generally needs a phase space reconstruction via delay coordinates first. For example, for a chaotic time series $x(1), x(2), \ldots$, the $m$-dimensional vector is defined as follows:
$$\mathbf{x}_n = [x(n), x(n-\tau), \ldots, x(n-(m-1)\tau)], \qquad (1)$$
where $m$ is the embedding dimension and $\tau$ is the delay constant. The prediction model can then be described as $x(n+1) = f(\mathbf{x}_n)$, where $f$ is a nonlinear map. Through this reconstruction, the temporal correlation is transformed into a spatial correlation. Then support vector machines (SVMs) [3, 4], neural networks (NNs) [5, 6], and other machine learning methods [7, 8] have been successfully introduced to approximate the spatial correlation in nonstationary time series data.
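For concreteness, a minimal sketch of the delay-coordinate reconstruction in (1) is given below; the function name, the toy series, and the one-step-ahead target are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def delay_embed(series, m, tau):
    """Reconstruct the phase space of a scalar series.

    Each row is x_n = [x(n), x(n - tau), ..., x(n - (m - 1) * tau)];
    the target is the next value x(n + 1).
    """
    series = np.asarray(series, dtype=float)
    start = (m - 1) * tau            # first index with a full delay vector
    X, y = [], []
    for n in range(start, len(series) - 1):
        X.append([series[n - j * tau] for j in range(m)])
        y.append(series[n + 1])
    return np.array(X), np.array(y)

# Example: embed a toy series with m = 4, tau = 1
X, y = delay_embed(np.sin(0.3 * np.arange(200)), m=4, tau=1)
```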

Generally speaking, there are two main challenges in predicting nonstationary time series effectively. One is how to choose a proper baseline algorithm, which should be computationally inexpensive yet accurate enough. The other is how to distinguish the importance of different samples in the time series. Different from SVMs and NNs, the extreme learning machine (ELM), introduced by Huang et al. [9], has shown very high learning speed and good generalization performance in solving many regression estimation and pattern recognition problems [10, 11]. As a sequential modification of ELM, the online sequential ELM (OS-ELM) proposed by Liang et al. [12] can learn data one-by-one or chunk-by-chunk. In many applications such as time series forecasting, OS-ELM also shows good generalization at extremely fast learning speed. Therefore, OS-ELM is a proper solution to the first challenge. Much research has been devoted to the second challenge. As recent data usually carry more important information than data from the distant past, a typical and effective method is weight-setting. Lin and Wang [13] gave the first sample in the time series the lowest importance and the most recent sample the highest importance, and then assigned fuzzy memberships to every sample. Tay and Cao [14] used an exponential function to calculate every sample's importance in financial time series prediction. Very different from these static weight-setting strategies, Mao et al. [15] established a heuristic algorithm to dynamically choose the optimal weights. Bao et al. [16] solved this problem from a multi-input multi-output perspective: they regarded the time samples as multiple outputs in a time slot and utilized a multidimensional SVM to establish the model. Considering ELM, Wang and Han [17] introduced the kernel trick into OS-ELM for nonstationary time series; its essential idea is to transform the sample space for better approximation. Grigorievskiy et al. [18] used an optimally pruned ELM to tackle long-term time series prediction and obtained results comparable to those of SVM.

However, although the ELM-based methods mentioned above work well in time series prediction [19], they still do not successfully solve the second challenge, that is, distinguishing different samples' significance. Specifically speaking, as a sample is added sequentially, it is not clear whether this sample is the most important one, even though it is the newest. In this scenario, the inner structure hidden in the time series data determines the samples' significance, especially in the nonstationary setting. In other words, there is no guarantee that the newly added sample is the most valuable for prediction. Therefore, to solve this problem, this paper first develops a new leave-one-out (LOO) cross-validation error estimation for OS-ELM aimed at time series prediction. Based on block matrix inversion, this LOO estimation is fast enough for time series data. As proved by many theoretical works [20], the LOO error based on the PRESS statistic is approximately unbiased and has been successfully applied to ELM with Tikhonov regularization [21]. To the best of our knowledge, this LOO error estimation is the first attempt to evaluate the generalization performance of OS-ELM on time series data. Moreover, this paper utilizes the LOO error estimation of each newly added sample to measure its importance. Following this weight-setting strategy, this paper then proposes a new weighted learning method for OS-ELM. A short version of this paper was published in the proceedings of the 5th International Conference on Extreme Learning Machine (ELM 2014). Experimental results on chaotic and real-life time series data demonstrate that the proposed method outperforms the traditional ELMs in generalization performance and numerical stability.

The paper is organized as follows. In Section 2, a brief review on OS-ELM and LOO cross-validation estimation is provided. In Section 3, we describe the LOO error estimation and the weighted learning algorithm of OS-ELM on time series data. Section 4 is devoted to computer experiments on two different types of time series data sets, followed by a conclusion of the paper in Section 5.

2. Brief Review

As the theoretical foundation of ELM, [22] studied the learning performance of SLFNs on small-size data sets and found that an SLFN with at most $N$ hidden neurons can learn $N$ distinct samples with zero error by adopting any bounded nonlinear activation function. Then, based on this result, Huang et al. [9] pointed out that ELM can analytically determine the output weights by a simple matrix inversion procedure as soon as the input weights and hidden layer biases are generated randomly, and can then obtain good generalization performance at very high learning speed. A brief summary of ELM is provided here.

Given a set of $N$ i.i.d. training samples $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$, standard SLFNs with $L$ hidden nodes are mathematically formulated as [9]
$$\sum_{j=1}^{L}\boldsymbol{\beta}_j\, g(\mathbf{w}_j\cdot\mathbf{x}_i + b_j) = \mathbf{o}_i, \qquad i = 1, \ldots, N,$$
where $g(\cdot)$ is the activation function, $\mathbf{w}_j$ is the input weight vector connecting the input nodes and the $j$th hidden node, $\boldsymbol{\beta}_j$ is the output weight vector connecting the $j$th hidden node and the output nodes, and $b_j$ is the bias of the $j$th hidden node. Huang et al. [9] rigorously proved that, for $N$ arbitrary distinct samples and parameters $(\mathbf{w}_j, b_j)$ randomly chosen according to any continuous probability distribution, the hidden layer output matrix $\mathbf{H}$ of a standard SLFN with $N$ hidden nodes is invertible and $\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\| = 0$ with probability one if the activation function is infinitely differentiable in any interval. Then, given $\mathbf{H}$, training an SLFN is equivalent to finding a least-squares solution $\hat{\boldsymbol{\beta}}$ of the following equation [9]:
$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T}, \qquad (3)$$
where
$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1\cdot\mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}_1 + b_L)\\ \vdots & \ddots & \vdots\\ g(\mathbf{w}_1\cdot\mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}_N + b_L)\end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix}\boldsymbol{\beta}_1^T\\ \vdots\\ \boldsymbol{\beta}_L^T\end{bmatrix}, \qquad \mathbf{T} = \begin{bmatrix}\mathbf{t}_1^T\\ \vdots\\ \mathbf{t}_N^T\end{bmatrix}.$$
Considering most cases in which $L \ll N$, $\boldsymbol{\beta}$ cannot be computed through direct matrix inversion. Therefore, the smallest norm least-squares solution of (3) is calculated as follows:
$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T},$$
where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $\mathbf{H}$. Based on the analysis above, Huang et al. [9] proposed ELM, whose framework can be stated as follows.

Step 1. Randomly generate the input weights $\mathbf{w}_j$ and biases $b_j$, $j = 1, \ldots, L$.

Step 2. Compute the hidden layer output matrix $\mathbf{H}$.

Step 3. Compute the output weight vector $\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}$.

Therefore, the output of the SLFN can be calculated from $\mathbf{H}$ and $\hat{\boldsymbol{\beta}}$ as $f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\hat{\boldsymbol{\beta}}$, where $\mathbf{h}(\mathbf{x})$ is the hidden layer output row for input $\mathbf{x}$.
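As a concrete illustration of Steps 1-3, a minimal batch ELM for scalar regression might be sketched as follows; the sigmoid activation, the Gaussian random initialization, and the function names are our illustrative choices (the paper's experiments use an RBF activation), not the authors' code.

```python
import numpy as np

def elm_train(X, T, L, rng=np.random.default_rng(0)):
    """Train a basic ELM: random hidden layer, least-squares output weights."""
    d = X.shape[1]
    W = rng.normal(size=(d, L))                 # random input weights
    b = rng.normal(size=L)                      # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # sigmoid hidden layer outputs
    beta = np.linalg.pinv(H) @ T                # Moore-Penrose solution of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```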

Like ELM, all the hidden node parameters in OS-ELM are randomly generated, and the output weights are analytically determined based on the sequentially arriving data. The OS-ELM process is divided into two phases: an initialization phase and a sequential learning phase [12].

Step 1. Initialization phase: choose a small chunk of initial training data $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N_0}$, where $N_0 \ge L$.
(1) Randomly generate the input weights $\mathbf{w}_j$ and biases $b_j$, $j = 1, \ldots, L$. Calculate the initial hidden layer output matrix $\mathbf{H}_0$:
$$\mathbf{H}_0 = \begin{bmatrix} g(\mathbf{w}_1\cdot\mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}_1 + b_L)\\ \vdots & \ddots & \vdots\\ g(\mathbf{w}_1\cdot\mathbf{x}_{N_0} + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}_{N_0} + b_L)\end{bmatrix}.$$
(2) Calculate the output weight vector:
$$\boldsymbol{\beta}^{(0)} = \mathbf{P}_0\mathbf{H}_0^T\mathbf{T}_0,$$
where $\mathbf{P}_0 = (\mathbf{H}_0^T\mathbf{H}_0)^{-1}$ and $\mathbf{T}_0 = [t_1, \ldots, t_{N_0}]^T$.
(3) Set $k = 0$.

Step 2. Sequential learning phase.
(1) Learn the $(k+1)$th training data: $(\mathbf{x}_{k+1}, t_{k+1})$.
(2) Calculate the partial hidden layer output matrix:
$$\mathbf{H}_{k+1} = [g(\mathbf{w}_1\cdot\mathbf{x}_{k+1} + b_1), \ldots, g(\mathbf{w}_L\cdot\mathbf{x}_{k+1} + b_L)].$$
Set $\mathbf{T}_{k+1} = [t_{k+1}]$.
(3) Calculate the output weight vector:
$$\mathbf{P}_{k+1} = \mathbf{P}_k - \mathbf{P}_k\mathbf{H}_{k+1}^T\bigl(\mathbf{I} + \mathbf{H}_{k+1}\mathbf{P}_k\mathbf{H}_{k+1}^T\bigr)^{-1}\mathbf{H}_{k+1}\mathbf{P}_k,$$
$$\boldsymbol{\beta}^{(k+1)} = \boldsymbol{\beta}^{(k)} + \mathbf{P}_{k+1}\mathbf{H}_{k+1}^T\bigl(\mathbf{T}_{k+1} - \mathbf{H}_{k+1}\boldsymbol{\beta}^{(k)}\bigr).$$
(4) Set $k = k + 1$. Go to Step 2.
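A compact sketch of the two OS-ELM phases for one-by-one learning follows; the sigmoid activation and the class layout are our own choices, intended only to make the recursive update above concrete.

```python
import numpy as np

def sigmoid_H(X, W, b):
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

class OSELM:
    """Minimal OS-ELM: recursive least-squares update of the output weights."""

    def __init__(self, X0, T0, L, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = rng.normal(size=(X0.shape[1], L))
        self.b = rng.normal(size=L)
        H0 = sigmoid_H(X0, self.W, self.b)       # initial hidden output matrix
        self.P = np.linalg.inv(H0.T @ H0)        # P_0 = (H_0^T H_0)^{-1}, assumed invertible
        self.beta = self.P @ H0.T @ T0           # beta_0

    def partial_fit(self, x, t):
        """Learn one new sample (x, t)."""
        h = sigmoid_H(x.reshape(1, -1), self.W, self.b)
        # P_{k+1} = P_k - P_k h^T (1 + h P_k h^T)^{-1} h P_k
        Ph = self.P @ h.T
        self.P = self.P - (Ph @ h @ self.P) / (1.0 + (h @ Ph).item())
        # beta_{k+1} = beta_k + P_{k+1} h^T (t - h beta_k)
        self.beta = self.beta + self.P @ h.T @ (np.atleast_1d(t) - h @ self.beta)
        return self
```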

The generalization ability of ELM has been analyzed by many researchers. Lan et al. [23] added a refinement stage that used the leave-one-out (LOO) error to evaluate the neurons' significance in each backward step. Feng et al. [24] presented a fast LOO error estimation for regularized ELM. From the incremental learning point of view, Feng et al. [24] proposed an error-minimized extreme learning machine which measures the residual error caused by a newly added hidden node in an incremental manner. The following work is the most relevant to this paper. For ELM, Liu et al. [25] derived a fast LOO error estimation. The generalization error in the $i$th LOO iteration can be expressed as follows:
$$e_i^{\mathrm{LOO}} = \frac{t_i - \mathbf{h}_i\hat{\boldsymbol{\beta}}}{1 - \mathbf{h}_i(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{h}_i^T},$$
where $t_i$ is the $i$th element of the target vector $\mathbf{T}$, $\mathbf{H}$ is the hidden layer matrix, and $\mathbf{h}_i$ is the row of $\mathbf{H}$ corresponding to the $i$th sample.

Liu et al. [25] have shown that this algorithm can accurately calculate the LOO error and can avoid the $N$ repeated model training processes of the original cross-validation method. Simulation experiments on artificial and real data sets have verified that the LOO cross-validation algorithm based on ELM is efficient and has good generalization performance.
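To make the formula above concrete, here is a small NumPy sketch of the PRESS-style LOO computation; it is our own minimal implementation of the standard formula rather than the authors' code, and it assumes $\mathbf{H}^T\mathbf{H}$ is invertible.

```python
import numpy as np

def elm_loo_errors(H, T):
    """PRESS leave-one-out residuals for a linear-in-beta model H beta ~ T.

    e_i = (t_i - h_i beta) / (1 - h_i (H^T H)^{-1} h_i^T),
    computed without retraining the model N times.
    """
    P = np.linalg.inv(H.T @ H)                     # assumes H^T H is invertible
    beta = P @ H.T @ T
    residuals = T - H @ beta                       # ordinary residuals
    hat_diag = np.einsum('ij,jk,ik->i', H, P, H)   # diagonal of H P H^T
    return residuals / (1.0 - hat_diag)
```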

3. OS-ELM with LOO Weighted Strategy

As shown above, in the training process of the classic OS-ELM algorithm all samples are treated equally. Whenever a new sample arrives, the network weights are updated. This rigid weight-updating mechanism lacks the flexibility to adjust to the actual situation. Moreover, it tends to incur unnecessary computation.

To improve the generalization ability of OS-ELM while maintaining the model's simplicity, this paper improves the rigid weight-updating mechanism of traditional OS-ELM by adopting a dynamic weighting strategy. This strategy determines each sample's importance according to its LOO error estimation in the online scenario. Consequently, a new OS-ELM based on an online LOO cross-validation weight-setting strategy (LW-OSELM) is proposed.

3.1. LOO Error Estimation of ELM

As discussed in Section 2, the fast LOO error estimation of ELM proposed by Feng et al. [24] gives the generalization error in the $i$th LOO iteration as follows:
$$e_i^{\mathrm{LOO}} = \frac{t_i - \mathbf{h}_i\hat{\boldsymbol{\beta}}}{1 - \mathbf{h}_i(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{h}_i^T}. \qquad (12)$$

Obviously, (12) works mainly in the offline learning setting rather than the online sequential scenario. The key reason is that $(\mathbf{H}^T\mathbf{H})^{-1}$ cannot be directly updated in the online stage. Considering the sequential arrival of samples in the online setting, (12) can be extended to the online sequential scenario, and it provides a channel to calculate the LOO error of each sample. The key is calculating $(\mathbf{H}^T\mathbf{H})^{-1}$ in (12) in an online manner. However, recomputing this matrix inversion from scratch whenever a new sample arrives is computationally expensive, so the modeling time would increase significantly with the number of training patterns. To avoid complex calculation and keep the established model simple, we adopt the idea of block matrix inversion, which transforms the complex calculation into linear operations and decreases the computation greatly.

As pointed out by many theoretical researches, the LOO error is an almost unbiased estimate of the true generalization performance. The smaller a sample's LOO error is, the greater its contribution to the decision model. In order to highlight the samples' contributions and ensure the generalization of the model, we set the corresponding weight of each sample according to the value of its LOO error in the online process. At the same time, to ensure the simplicity of the model, the oldest sample, which is farthest from the current moment, is eliminated; namely, its weight is set to zero. Following the idea of block matrix inversion [26], the required updates remain linear operations in the online learning stage, which decreases the computation greatly. Thus, a new kind of extreme learning machine based on online leave-one-out cross-validation is put forward, as described in the following sections.

3.2. Initial Stage of Training

Suppose there are $N$ initial training samples $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$. The hidden layer output matrix is $\mathbf{H}_N$, and the output vector is $\mathbf{T}_N = [t_1, \ldots, t_N]^T$. The output weight vector is calculated as
$$\boldsymbol{\beta}_N = (\mathbf{H}_N^T\mathbf{H}_N)^{-1}\mathbf{H}_N^T\mathbf{T}_N. \qquad (14)$$
Let $\mathbf{P}_N = (\mathbf{H}_N^T\mathbf{H}_N)^{-1}$; then (14) can be rewritten as $\boldsymbol{\beta}_N = \mathbf{P}_N\mathbf{H}_N^T\mathbf{T}_N$.

3.3. Add New Sample

Add the newly arrived sample $(\mathbf{x}_{N+1}, t_{N+1})$ into the training set. The output vector becomes $\mathbf{T}_{N+1} = [\mathbf{T}_N^T, t_{N+1}]^T$, and the hidden layer matrix becomes $\mathbf{H}_{N+1} = [\mathbf{H}_N^T, \mathbf{h}_{N+1}^T]^T$, where $\mathbf{h}_{N+1}$ is the hidden layer output row of the new sample. Then we have
$$\boldsymbol{\beta}_{N+1} = (\mathbf{H}_{N+1}^T\mathbf{H}_{N+1})^{-1}\mathbf{H}_{N+1}^T\mathbf{T}_{N+1}. \qquad (16)$$
Let $\mathbf{P}_{N+1} = (\mathbf{H}_{N+1}^T\mathbf{H}_{N+1})^{-1}$; then
$$\mathbf{P}_{N+1}^{-1} = \mathbf{H}_{N+1}^T\mathbf{H}_{N+1} = \mathbf{H}_N^T\mathbf{H}_N + \mathbf{h}_{N+1}^T\mathbf{h}_{N+1} = \mathbf{P}_N^{-1} + \mathbf{h}_{N+1}^T\mathbf{h}_{N+1}, \qquad (17)$$
because $\mathbf{H}_{N+1}^T\mathbf{H}_{N+1} = [\mathbf{H}_N^T, \mathbf{h}_{N+1}^T]\,[\mathbf{H}_N^T, \mathbf{h}_{N+1}^T]^T$. For (17), according to block matrix inversion, we have
$$\mathbf{P}_{N+1} = \mathbf{P}_N - \frac{\mathbf{P}_N\mathbf{h}_{N+1}^T\mathbf{h}_{N+1}\mathbf{P}_N}{1 + \mathbf{h}_{N+1}\mathbf{P}_N\mathbf{h}_{N+1}^T}, \qquad (18)$$
where the denominator $1 + \mathbf{h}_{N+1}\mathbf{P}_N\mathbf{h}_{N+1}^T$ is a scalar. So $\mathbf{P}_{N+1}$ can be calculated based on $\mathbf{P}_N$, which largely reduces the computational cost. Then we obtain $\boldsymbol{\beta}_{N+1}$ by substituting (18) into (16).
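A direct transcription of the update (17)-(18) into NumPy might look as follows; the function name is ours, and the quick check against direct inversion is only for illustration.

```python
import numpy as np

def add_sample(P, h):
    """Update P = (H^T H)^{-1} after appending one hidden-layer row h.

    Uses the rank-one form of the matrix inversion lemma, so no full
    re-inversion is needed.
    """
    h = h.reshape(1, -1)                  # row vector
    Ph = P @ h.T                          # (L, 1)
    denom = 1.0 + (h @ Ph).item()         # 1 + h P h^T
    return P - (Ph @ Ph.T) / denom

# Quick check against direct inversion
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 10)); h = rng.normal(size=10)
P = np.linalg.inv(H.T @ H)
P_new = add_sample(P, h)
assert np.allclose(P_new, np.linalg.inv(H.T @ H + np.outer(h, h)))
```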

3.4. Calculate the LOO Error

Using $\mathbf{P}_{N+1}$ and $\boldsymbol{\beta}_{N+1}$ to set up the online LOO model, the LOO error in the $i$th LOO iteration can be expressed as follows:
$$e_i^{\mathrm{LOO}} = \frac{t_i - \mathbf{h}_i\boldsymbol{\beta}_{N+1}}{1 - \mathbf{h}_i\mathbf{P}_{N+1}\mathbf{h}_i^T}. \qquad (19)$$
Then we can obtain the corresponding LOO error $e_i^{\mathrm{LOO}}$ of each sample from (19). According to the value of $e_i^{\mathrm{LOO}}$, $i = 1, \ldots, N+1$, we set the relevant weight $v_i$ of each sample. Note that the smaller $e_i^{\mathrm{LOO}}$ is, the bigger $v_i$ is. To emphasize the newest sample and keep the decision model simple, we reset the weight of the newest sample to a fixed value, which this paper defines as 1.02. And we set the weight of the oldest sample to zero; namely, we set its contribution to the model as zero.
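The mapping from LOO errors to weights is only described qualitatively above, so the ranking-based rule in the sketch below is an illustrative assumption of ours, not the paper's exact formula; the fixed newest-sample weight of 1.02 and the zero weight for the oldest sample do follow the text.

```python
import numpy as np

def loo_weights(loo_errors, newest_weight=1.02):
    """Assign per-sample weights from LOO errors (illustrative only).

    Samples are ranked so that a smaller |LOO error| gives a larger weight;
    the newest sample is forced to `newest_weight` and the oldest to 0,
    as described in Section 3.4.
    """
    e = np.abs(np.asarray(loo_errors, dtype=float))
    ranks = np.argsort(np.argsort(e))          # rank 0 = smallest error
    v = 1.0 - ranks / len(e)                   # smaller error -> larger weight
    v[-1] = newest_weight                      # emphasise the newest sample
    v[0] = 0.0                                 # drop the oldest sample
    return v
```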

3.5. Weighted Training

After adding the new sample $(\mathbf{x}_{N+1}, t_{N+1})$, we set the weight of the oldest sample $(\mathbf{x}_1, t_1)$ to zero, namely, excluding this sample. After excluding $(\mathbf{x}_1, t_1)$, the output vector becomes $\widetilde{\mathbf{T}} = [t_2, \ldots, t_{N+1}]^T$, and the hidden matrix becomes $\widetilde{\mathbf{H}} = [\mathbf{h}_2^T, \ldots, \mathbf{h}_{N+1}^T]^T$. Then we have
$$\widetilde{\boldsymbol{\beta}} = (\widetilde{\mathbf{H}}^T\widetilde{\mathbf{H}})^{-1}\widetilde{\mathbf{H}}^T\widetilde{\mathbf{T}}.$$
Let $\widetilde{\mathbf{P}} = (\widetilde{\mathbf{H}}^T\widetilde{\mathbf{H}})^{-1}$; then
$$\widetilde{\boldsymbol{\beta}} = \widetilde{\mathbf{P}}\,\widetilde{\mathbf{H}}^T\widetilde{\mathbf{T}}. \qquad (21)$$
From (21), $\widetilde{\boldsymbol{\beta}}$ contains two parts: $\widetilde{\mathbf{P}}$ and $\widetilde{\mathbf{H}}^T\widetilde{\mathbf{T}}$. Because the calculation of $\widetilde{\mathbf{P}}$ involves matrix inversion, we only set the weights on $\widetilde{\mathbf{H}}^T\widetilde{\mathbf{T}}$ in order to avoid the huge computational cost in calculating the LOO error. Then the hidden matrix $\widetilde{\mathbf{H}}$ becomes
$$\widetilde{\mathbf{H}}_v = [v_2\mathbf{h}_2^T, \ldots, v_{N+1}\mathbf{h}_{N+1}^T]^T.$$

And $\widetilde{\mathbf{T}}$ becomes
$$\widetilde{\mathbf{T}}_v = [v_2 t_2, \ldots, v_{N+1}t_{N+1}]^T, \qquad (24)$$
where $v_2, \ldots, v_{N+1}$ are the corresponding weights assigned in Section 3.4, because
$$\widetilde{\mathbf{P}}^{-1} = \widetilde{\mathbf{H}}^T\widetilde{\mathbf{H}} = \mathbf{H}_{N+1}^T\mathbf{H}_{N+1} - \mathbf{h}_1^T\mathbf{h}_1 = \mathbf{P}_{N+1}^{-1} - \mathbf{h}_1^T\mathbf{h}_1. \qquad (25)$$
From (25), there is a relationship between $\widetilde{\mathbf{P}}$ and $\mathbf{P}_{N+1}$. So $\widetilde{\mathbf{P}}$ can be calculated on the basis of $\mathbf{P}_{N+1}$ to simplify the calculation.

Assume that $\widetilde{\mathbf{P}}^{-1}$ in (25) can be partitioned and expressed as follows:
$$\widetilde{\mathbf{P}}^{-1} = \mathbf{A} - \mathbf{B}\mathbf{B}^T,$$
where $\mathbf{A} = \mathbf{P}_{N+1}^{-1}$, $\mathbf{B} = \mathbf{h}_1^T$, and $\mathbf{A}^{-1} = \mathbf{P}_{N+1}$ is already known.

As in (25), let $\widetilde{\mathbf{P}} = (\mathbf{A} - \mathbf{B}\mathbf{B}^T)^{-1}$; then (25) is equivalent to
$$\widetilde{\mathbf{P}}(\mathbf{A} - \mathbf{B}\mathbf{B}^T) = \mathbf{I}.$$
By the definition of matrix inversion, $\widetilde{\mathbf{P}}\mathbf{A} = \mathbf{I} + \widetilde{\mathbf{P}}\mathbf{B}\mathbf{B}^T$; namely,
$$\widetilde{\mathbf{P}} = \mathbf{A}^{-1} + \widetilde{\mathbf{P}}\mathbf{B}\mathbf{B}^T\mathbf{A}^{-1}.$$
Through the block matrix multiplication, we have
$$\widetilde{\mathbf{P}}\mathbf{B} = \mathbf{A}^{-1}\mathbf{B} + \widetilde{\mathbf{P}}\mathbf{B}\bigl(\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}\bigr), \quad \text{i.e.,} \quad \widetilde{\mathbf{P}}\mathbf{B} = \frac{\mathbf{A}^{-1}\mathbf{B}}{1 - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}}. \qquad (29)$$
Calculating (29), we have
$$\widetilde{\mathbf{P}} = \mathbf{P}_{N+1} + \frac{\mathbf{P}_{N+1}\mathbf{h}_1^T\mathbf{h}_1\mathbf{P}_{N+1}}{1 - \mathbf{h}_1\mathbf{P}_{N+1}\mathbf{h}_1^T}. \qquad (30)$$
Thus, we obtain $\widetilde{\boldsymbol{\beta}}$ by substituting (24) and (30) into (21).

Then we can update the network weights according to the following equation:
$$\widetilde{\boldsymbol{\beta}} = \widetilde{\mathbf{P}}\,\widetilde{\mathbf{H}}_v^T\widetilde{\mathbf{T}}_v.$$
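The downdate (30) and the weighted update can be sketched as below; the helper names are ours, and the way the weights enter the $\widetilde{\mathbf{H}}^T\widetilde{\mathbf{T}}$ term follows the reconstruction above, which may differ in detail from the original paper.

```python
import numpy as np

def remove_oldest(P, h1):
    """Downdate P = (H^T H)^{-1} after deleting the oldest row h1 (cf. (30))."""
    h1 = h1.reshape(1, -1)
    Ph = P @ h1.T
    denom = 1.0 - (h1 @ Ph).item()            # 1 - h1 P h1^T
    return P + (Ph @ Ph.T) / denom

def weighted_beta(P, H, T, v):
    """Weighted output weights: P is kept unweighted, the H^T T term is weighted."""
    Hv = H * v[:, None]                        # row i scaled by its weight v_i
    Tv = T * v
    return P @ Hv.T @ Tv
```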

3.6. Algorithm

The LW-OSELM algorithm can be described in two steps.

Step 1 (The initial stage of training). For the initial training sample set $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$, calculate the network weights based on the $N$ training samples by the following equation:
$$\boldsymbol{\beta}_N = \mathbf{P}_N\mathbf{H}_N^T\mathbf{T}_N,$$
where $\mathbf{H}_N$ is the hidden layer matrix based on the initial training set and $\mathbf{P}_N = (\mathbf{H}_N^T\mathbf{H}_N)^{-1}$.
Set $k = N$.

Step 2 (Online learning stage). Set the $(k+1)$th arrived sample as $(\mathbf{x}_{k+1}, t_{k+1})$. Now the output vector becomes $\mathbf{T}_{k+1} = [\mathbf{T}_k^T, t_{k+1}]^T$, and the hidden layer matrix becomes $\mathbf{H}_{k+1} = [\mathbf{H}_k^T, \mathbf{h}_{k+1}^T]^T$. Calculate $\mathbf{P}_{k+1}$:
$$\mathbf{P}_{k+1} = \mathbf{P}_k - \frac{\mathbf{P}_k\mathbf{h}_{k+1}^T\mathbf{h}_{k+1}\mathbf{P}_k}{1 + \mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T},$$
where $\mathbf{h}_{k+1}$ is the hidden layer output row of the new sample, and $\boldsymbol{\beta}_{k+1} = \mathbf{P}_{k+1}\mathbf{H}_{k+1}^T\mathbf{T}_{k+1}$. Calculate the LOO error of each sample by the following equation:
$$e_i^{\mathrm{LOO}} = \frac{t_i - \mathbf{h}_i\boldsymbol{\beta}_{k+1}}{1 - \mathbf{h}_i\mathbf{P}_{k+1}\mathbf{h}_i^T}.$$
Then each sample is assigned the corresponding weight $v_i$ according to the value of $e_i^{\mathrm{LOO}}$: the smaller $e_i^{\mathrm{LOO}}$ is, the bigger $v_i$ is. In a similar way to Section 3.4, the weight of the newest sample is set to 1.02.
Set the oldest sample's weight to zero. Now the output vector becomes the weighted vector $\widetilde{\mathbf{T}}_v$, and the hidden layer matrix becomes the weighted matrix $\widetilde{\mathbf{H}}_v$, both excluding the oldest sample, whose hidden layer row is denoted $\mathbf{h}_{\mathrm{old}}$. Calculate $\widetilde{\mathbf{P}}$ by the following equation:
$$\widetilde{\mathbf{P}} = (\widetilde{\mathbf{H}}^T\widetilde{\mathbf{H}})^{-1}. \qquad (37)$$
Represent $\widetilde{\mathbf{H}}^T\widetilde{\mathbf{H}}$ in a new block form:
$$\widetilde{\mathbf{H}}^T\widetilde{\mathbf{H}} = \mathbf{H}_{k+1}^T\mathbf{H}_{k+1} - \mathbf{h}_{\mathrm{old}}^T\mathbf{h}_{\mathrm{old}} = \mathbf{P}_{k+1}^{-1} - \mathbf{h}_{\mathrm{old}}^T\mathbf{h}_{\mathrm{old}}.$$
In (37), consider
$$\bigl(\mathbf{P}_{k+1}^{-1} - \mathbf{h}_{\mathrm{old}}^T\mathbf{h}_{\mathrm{old}}\bigr)^{-1} = \mathbf{P}_{k+1} + \frac{\mathbf{P}_{k+1}\mathbf{h}_{\mathrm{old}}^T\mathbf{h}_{\mathrm{old}}\mathbf{P}_{k+1}}{1 - \mathbf{h}_{\mathrm{old}}\mathbf{P}_{k+1}\mathbf{h}_{\mathrm{old}}^T}. \qquad (39)$$
Then substitute (39) into (37) to get $\widetilde{\mathbf{P}}$.
Use (40) to update the network weights:
$$\boldsymbol{\beta}_{k+1} = \widetilde{\mathbf{P}}\,\widetilde{\mathbf{H}}_v^T\widetilde{\mathbf{T}}_v. \qquad (40)$$
Let $\mathbf{H}_{k+1} = \widetilde{\mathbf{H}}$, $\mathbf{T}_{k+1} = \widetilde{\mathbf{T}}$, and $\mathbf{P}_{k+1} = \widetilde{\mathbf{P}}$. Let $k = k + 1$, and go to Step 2.
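Putting Step 2 together, a single online iteration could be consolidated as in the sketch below. This is our own consolidation of the updates above; in particular, the LOO-error-to-weight mapping and the way the weights enter the update are illustrative assumptions rather than the paper's exact rules.

```python
import numpy as np

def lw_oselm_step(P, H, T, h_new, t_new, newest_weight=1.02):
    """One online iteration of the LW-OSELM scheme sketched above.

    P : (L, L) current (H^T H)^{-1};  H : (N, L) window of hidden-layer rows;
    T : (N,) window of targets;  h_new, t_new : the newly arrived sample.
    Returns the updated (P, H, T, beta).
    """
    h_new = h_new.reshape(1, -1)
    # 1) add the new sample (rank-one update of P)
    Ph = P @ h_new.T
    P = P - (Ph @ Ph.T) / (1.0 + (h_new @ Ph).item())
    H = np.vstack([H, h_new])
    T = np.append(T, t_new)
    beta = P @ H.T @ T
    # 2) LOO errors of all samples currently in the window
    hat = np.einsum('ij,jk,ik->i', H, P, H)
    e_loo = (T - H @ beta) / (1.0 - hat)
    # 3) weights: smaller |LOO error| -> larger weight; newest fixed to 1.02
    ranks = np.argsort(np.argsort(np.abs(e_loo)))
    v = 1.0 - ranks / len(e_loo)
    v[-1] = newest_weight
    # 4) remove the oldest sample (rank-one downdate of P), then weighted beta
    h_old = H[0:1, :]
    Ph = P @ h_old.T
    P = P + (Ph @ Ph.T) / (1.0 - (h_old @ Ph).item())
    H, T, v = H[1:], T[1:], v[1:]
    beta = P @ (H * v[:, None]).T @ (T * v)
    return P, H, T, beta
```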

For better understanding, a flow chart of the proposed algorithm is shown in Figure 1.

4. Experimental Results

In this section, we run experiments to test the proposed algorithm. Our goal is to demonstrate that the proposed algorithm can efficiently improve the generalization performance of OS-ELM in time series prediction. For comparison, we choose two baselines: the classical ELM [9] and OS-ELM [12]. The proposed OS-ELM algorithm with the fast LOO weighted strategy is named LW-OSELM.

For completeness, we examine two types of nonstationary time series data sets. We start with three benchmark chaotic time series. We further present results on a real-world data set, that is, air pollutant forecasting in Macau [27]. These data sets are described below. The goal is to test the generalization performance on time series data as well as the running speed. In each experiment, all results are the mean of 100 trials. The RBF activation function is used in each algorithm, and each variable is linearly rescaled. A method with lower prediction error is better.

4.1. Chaotic Time Series

To verify the effectiveness of LW-OSELM for online time series prediction, we conduct simulation experiments on three benchmark chaotic time series: Mackey-Glass, Henon, and Lorenz.

As a widely used chaotic benchmark, the Mackey-Glass time series is generated by the following delay differential equation:
$$\frac{dx(t)}{dt} = \frac{a\,x(t-\tau_{\mathrm{MG}})}{1 + x^{10}(t-\tau_{\mathrm{MG}})} - b\,x(t).$$

The initial values are , , and . Here we generate a set of 1300 points, with the first 700 points for training and the last 600 points for test. In the phase space reconstruction, the delay constant $\tau$ is set to 1, and the embedding dimension $m$ is set to 4.
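As an illustration, one common way to generate such a series is a simple Euler discretisation of the delay differential equation above. The parameter values a = 0.2, b = 0.1, and delay 17 used here are standard choices from the literature and are our assumptions, since the paper's exact settings are not reproduced in the text.

```python
import numpy as np

def mackey_glass(n_points, a=0.2, b=0.1, delay=17, dt=1.0, x0=1.2):
    """Generate a Mackey-Glass series with a simple Euler discretisation."""
    x = np.full(n_points + delay, x0)
    for t in range(delay, n_points + delay - 1):
        x[t + 1] = x[t] + dt * (a * x[t - delay] / (1.0 + x[t - delay] ** 10)
                                - b * x[t])
    return x[delay:]

series = mackey_glass(1300)      # e.g., 700 points for training, 600 for test
```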

The Henon time series is given by the following equation (42):
$$x_{n+1} = 1 - a\,x_n^2 + y_n, \qquad y_{n+1} = b\,x_n. \qquad (42)$$

The initial values are , , , and . We generate a set of 600 points, with the first 500 points for training and the last 100 points for test. In the phase reconstruction stage, $\tau$ is set to 1, and the embedding dimension $m$ is set to 3.

The Lorenz time series is generated by the following equations:
$$\frac{dx}{dt} = \sigma(y - x), \qquad \frac{dy}{dt} = x(r - z) - y, \qquad \frac{dz}{dt} = xy - bz.$$

The initial values are = 16, , , , and . We generate a set of 1600 points, with the first 600 points for training and the last 1000 points for test. In the phase reconstruction stage, $\tau$ is set to 1, and the embedding dimension $m$ is set to 6.

First, we report the numerical results on the three standard data sets. The mean RMSE over 100 trials on the three data sets is listed in Tables 1, 2, and 3. Here the numbers of hidden neurons are 16, 25, and 25, respectively.

As shown in Tables 1 to 3, although the training time of LW-OSELM is the longest, it achieves not only the smallest training errors but also the smallest test errors compared with the others, which is the key factor in measuring the performance of a model. Specifically speaking, on the Mackey-Glass data, LW-OSELM obtains a 40% improvement over ELM and 10.6% over OS-ELM in terms of test error. These two values are 4.8% and 4.4%, respectively. In addition, the training times of LW-OSELM on two data sets are 2.844 and 0.2194, respectively, which is in an acceptable range, so it does not affect the practicality of the model too much. Therefore, LW-OSELM has better performance than ELM and OS-ELM.

As the number of hidden neurons is the only adjustable parameter, it is necessary to test the performance as this number changes. Figures 2 and 3, respectively, illustrate how the training error and test error of LW-OSELM, OS-ELM, and ELM change with different numbers of hidden neurons.

From Figures 2 and 3, it is obvious that the trends of training error and test error of the three algorithms are roughly the same, which shows that LW-OSELM is reasonable, and that LW-OSELM has smaller errors than the others with the same number of hidden neurons in most cases. So we can conclude that LW-OSELM has better performance.

In order to further support these comparative results, we illustrate the predictive performance of the three models on the Mackey-Glass, Henon, and Lorenz data sets, as shown in Figures 4, 5, and 6, respectively.

Note that the figures in the right column enlarge the predictive results for better illustration. The original data contain no noise, so the prediction curves of the three models are all very close to the real curve. From Figures 4(a) to 6(a), it is obvious that the predictions of the three models roughly coincide with the real values, which confirms the rationality of ELM-based methods. Furthermore, in the enlarged parts (chosen randomly) in Figures 4(b) to 6(b), we can easily find that the prediction curve of LW-OSELM is nearer to the real curve than those of ELM and OS-ELM. These experimental results demonstrate that the proposed algorithm has higher generalization ability.

In brief, LW-OSELM has smaller training and test errors than ELM and OS-ELM. By calculating the LOO error of each sample, it is feasible to determine sample importance. Through setting the corresponding weight of each sample by means of its LOO error estimation, LW-OSELM is endowed with better generalization and stability.

4.2. Macao Meteorological Time Series

Air quality monitoring, where data often arrive in an online sequential manner, produces a typical kind of nonstationary time series. In order to further test the stability and generalization of LW-OSELM, we choose suspended particulate matter (SPM) to conduct the experiment. Due to data acquisition limitations, this paper adopts the air quality data of the Macao meteorological bureau to conduct the simulation experiments [27].

4.2.1. Data Pre-Processing

We define the training data set $\{(\mathbf{x}_i, y_i)\}$, where $\mathbf{x}_i$ represents the input variables and $y_i$ represents the output variable. The input sample is constructed from the suspended particulate matter (SPM), nitrogen dioxide (NO$_2$), and sulfur dioxide (SO$_2$) concentrations of day $i$; that is, $\mathbf{x}_i = [x_{\mathrm{SPM}}(i), x_{\mathrm{NO_2}}(i), x_{\mathrm{SO_2}}(i)]$. The output sample is the next day's SPM value; that is, $y_i = x_{\mathrm{SPM}}(i+1)$. The training data and test data were linearly normalized as follows:
$$\bar{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x$ is a variable in $\mathbf{x}_i$ or $y_i$, and $x_{\min}$ and $x_{\max}$ are the minimum value and maximum value of $x$ in the study period, respectively.
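A minimal sketch of this rescaling is given below; the [0, 1] target range and the helper name are illustrative assumptions.

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Rescale a variable linearly to [lo, hi] using its min/max over the study period."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min)

# Apply the same scaling to every input/output variable, e.g.:
# spm_scaled = min_max_scale(spm_raw)
```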

4.2.2. Experiment Result

To verify the validity of LW-OSELM for time series prediction, we use the data collected from 2010 to 2013 to conduct the experiment. Specifically, the data in 2010 are used for initial offline training, the data in 2011 are used for online training, the data in 2012 are used for testing, and the data in 2013 are used for testing the generalization performance.

Considering the deficiency of traditional OS-ELM in time series prediction, we put forward a new kind of weighted extreme learning machine based on online leave-one-out cross-validation, which assigns the corresponding weight to each sample according to the value of its online LOO error estimation. To verify the rationality of this dynamic weight-setting strategy, it is necessary to compare LW-OSELM with a fixed-weighting OS-ELM (namely, WELM). WELM sets the weight to 0.98 for all old samples and 1.02 for the latest sample. We set the number of hidden neurons to 25. The comparison of LW-OSELM and WELM is illustrated in Table 4.

From Table 4, compared with WELM, the training error and test error of LW-OSELM are much smaller, which demonstrates that the dynamic weight-setting strategy, that is, setting weights on samples according to their online LOO errors, is feasible.

We compare the performance of ELM, OS-ELM, and LW-OSELM. With the RBF kernel as the hidden layer activation function, the mean RMSE of 100 trials on the Macao meteorological time series is listed in Table 5. Here the number of neurons is 15.

By contrast, LW-OSELM has slightly smaller training and test errors, while its training time is a little longer. Note that, in this experiment, the comparative improvements are not as remarkable as in Section 4.1. The reason is most likely that we did not employ an embedding dimension, that is, we did not reconstruct the phase space as in (1). Here we merely use the data of the latest day as the input sample, rather than using the data of the past few days.

We also examine the effect of the number of hidden neurons. Figures 7 and 8, respectively, show the change of training error and test error with different numbers of hidden neurons.

From Figures 7 and 8, the training error and test error of LW-OSELM are both smaller than those of the others for most numbers of hidden neurons. ELM tends to be the most unstable, with drastic fluctuations, while LW-OSELM is similarly stable to OS-ELM, which is consistent with the results in Table 5.

Moreover, we report the generalization performance of the three algorithms with different prediction steps, as shown in Figure 9.

Obviously, LW-OSELM is better than the others. For further clarification, we also compare the mean and variance of the three models. The mean errors of ELM, OS-ELM, and LW-OSELM are 0.0726, 0.0617, and 0.0605, respectively, and the variances are 3.8340e-4, 3.5584e-4, and 3.0809e-4, respectively. The average error and variance of LW-OSELM are both the smallest, which indicates that LW-OSELM has better accuracy and stronger stability.

5. Conclusion and Future Work

In this paper, nonstationary time series prediction is addressed. The key idea is to distinguish the importance of samples in a time series using their LOO cross-validation errors, which is a new attempt at a weight-setting strategy. To realize this strategy, this paper utilizes OS-ELM as the baseline algorithm and proposes a new LOO error estimation that is fast and well suited to time series prediction. Based on this estimation, this paper proposes a dynamic weight-setting algorithm for OS-ELM. The experimental results on three benchmark chaotic time series data sets and a real-world data set demonstrate the effectiveness of the proposed approach. One remaining issue is choosing a more efficient method for calculating the leave-one-out error. Another problem is how to extend the proposed algorithm to multi-input multi-output regression, which should be achievable by introducing the corresponding background knowledge and will be studied in our future research.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors wish to thank the author C. M. Vong of [27] for useful discussion and instruction. This work was supported by the National Natural Science Foundation of China (nos. U1204609, 61173071), China Postdoctoral Science Foundation (no. 2014M550508), and Postgraduate Technology Innovation Project of Henan Normal University (no. YL201420).