Abstract

The embedding dimension and the number of nearest neighbors are key parameters in the prediction of a chaotic time series. To reduce the uncertainty in determining these two parameters, a new adaptive local linear prediction method is proposed in this study. In the new method, the embedding dimension and the number of nearest neighbors are combined into a parameter set that changes adaptively during prediction. The generalized degrees of freedom are used to help select the optimal parameters. Real hydrological time series are used to examine the performance of the new method. The prediction results indicate that the new method can choose the optimal embedding dimension and number of nearest neighbors adaptively during prediction, and that nonlinear hydrological time series may be modeled better by the new method.

1. Introduction

The global and regional climates have already begun changing [1], and meteorologically driven processes have been studied by a number of researchers, as have problems in signal analysis and other fields [2–8]. Many hydrological processes, such as runoff, are nonlinear, complex, and dynamic because of the physical processes involved and their considerable spatial and temporal variability [9]. Simulating nonlinear hydrological time series has turned out to be very difficult with traditional deterministic mathematical models. However, the emergence of chaos theory provides a new way to study this kind of highly complex system and makes it possible to extract deterministic regularity from seemingly disordered hydrological phenomena. Chaos theory is an important part of nonlinear science, and its theory and applications have been studied by several authors [10, 11]. Over the last decades, a series of theories and methods for identifying chaotic behavior in dynamic systems has gradually been established [12–17]. The first application of chaos theory to hydrological processes can be traced back to the analysis of a series of 1008 monthly rainfall values recorded on Nauru Island by Hense in 1987 [18]; since then, chaos theory has become an increasingly reliable tool for the study of hydrological processes.

With the development of chaos theory and its application techniques, many methods have been proposed to predict chaotic time series. They can broadly be divided into two main categories: global methods [19] and local methods [20, 21]. A global method attempts to approximate the whole time series over the entire attractor and seeks a function valid at every point. Its obvious disadvantage is that, when new information is added to the model, the parameters may change and considerable time is spent re-estimating them. A local method overcomes this drawback by building the model only on a local part of the attractor and using only part of the past information. Farmer and Sidorowich [20] showed that the local prediction method performs better than the global prediction method. In addition, combining techniques can improve the forecast accuracy of both global and local methods [21].

In local prediction, phase space reconstruction is the first step, and three parameters must be determined: the embedding dimension $m$, the time delay $\tau$, and the number of nearest neighbors $k$. Studies of nonlinear prediction indicate that the choice of delay time does not affect whether the reconstructed attractor unambiguously reflects the dynamics of the system, so the key problem is the determination of the optimal $m$ and $k$ for prediction. (Usually, the number of the nearest neighbors is taken greater than in general condition.)

Although there are many discussions on how to determine the optimal embedding dimension, the three basic approaches presented below are the most commonly used. The first approach is based on calculating some geometrical invariant of the attractor, as in the Grassberger-Procaccia method (G-P method) [16]. As the embedding dimension is increased, the minimum embedding dimension is selected at the point where the value of the invariant becomes saturated. The typical problems of this approach are that it is computationally time-consuming, unsuited to short time series, somewhat subjective, and sensitive to noise. The second approach, which includes the false nearest neighbors method (FNN) [22, 23] and the Aleksić method [24], is based on the idea of false neighbors: points that are far apart in the original phase space move closer together in the reconstructed phase space when the embedding dimension is too low. The criterion for judging false neighbors is subjective, however, and Cao has improved the method [25]. The third approach, which includes principal component analysis (PCA) [26], singular-spectrum analysis (SSA) [27], and singular-value decomposition (SVD) [28], is based on the singular value decomposition proposed by Broomhead and King [26]. However, singular value decomposition is essentially a linear method based on the covariance matrix, which reflects only linear dependence [29]. This approach is also subjective to some extent, because the number of large singular values may depend on the details of the embedding process and the accuracy of the data. Numerical experience has led several researchers [28, 30] to question the reliability of this approach in the analysis of nonlinear time series [31]. All of the methods mentioned above have shortcomings, such as inapplicability to short time series, some degree of subjectivity, and sensitivity to noise [31].

The uncertainties in the determination of the embedding dimension and the number of nearest neighbors can affect forecast precision in traditional local prediction. To improve prediction accuracy, the question is therefore how to reduce these parameter uncertainties. Because the time series data are continually updated as the prediction proceeds, the parameters estimated by phase space reconstruction of the original time sequence may not be able to reconstruct the chaotic attractor of the new sequence. To reduce the uncertainties caused by this data updating, the embedding dimension and the number of nearest neighbors should change adaptively during the forecast procedure. To obtain better prediction results, Jayawardena et al. [32] proposed a method that uses the generalized degrees of freedom (GDF) to determine the optimal number of neighbors for the prediction. In their method, however, the embedding dimension keeps the same value throughout the forecast process, as in traditional local prediction, so this method can still be improved.

In this study, a parameter set $(m, k)$ is established on the basis of the two uncertain parameters, that is, the embedding dimension $m$ and the number of nearest neighbors $k$. A new adaptive local prediction method is then proposed, in which $m$ and $k$ are not fixed at certain values but change as the prediction steps advance. To select the optimal parameter set for each prediction step, the generalized degrees of freedom (GDF) are used and the error variance is calculated for each combination $(m, k)$. The optimal parameter set is the one for which the variance is smallest. To examine the validity of the new method, real hydrological time series are used.

2. The New Local Linear Prediction Method

2.1. The Phase Space Reconstruction

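For a scalar time series $\{x_1, x_2, \ldots, x_N\}$, the multidimensional phase space can be reconstructed by the Takens embedding theorem [15], according to

$$X_i = \left(x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau}\right), \tag{2.1}$$

where $i = 1, 2, \ldots, M$, $M$ is the total number of points in the phase space, and $M = N - (m-1)\tau$; $\tau$ is the delay time, $m$ is the dimension of the vector $X_i$, called the embedding dimension, and $N$ is the length of the time series.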
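As a minimal sketch of the reconstruction described above, the following Python function builds the delay vectors from a scalar series. The function name `delay_embed`, the use of NumPy, and the forward-delay ordering of the coordinates are assumptions made for illustration; they are not taken from the original study.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Return the M x m matrix of delay vectors, with M = N - (m - 1) * tau."""
    x = np.asarray(x, dtype=float)
    M = len(x) - (m - 1) * tau
    if M <= 0:
        raise ValueError("series too short for this (m, tau)")
    # Column j holds x[i + j * tau] for i = 0..M-1 (forward-delay convention, assumed).
    return np.column_stack([x[j * tau : j * tau + M] for j in range(m)])

# Toy series 0, 1, ..., 9 embedded with m = 3 and tau = 2.
X = delay_embed(np.arange(10), m=3, tau=2)
print(X.shape)  # (6, 3)
```

Each row of `X` is one phase point, so the last row is the current state from which a prediction would start.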

2.1.1. The Determination of Delay Time

Many methods have been developed for estimating the delay time. The autocorrelation function [15, 16, 33, 34], the most widely used tool for determining the delay time, is employed in this study. The autocorrelation function can be described as

$$C(\tau) = \frac{1}{(N-\tau)\,\sigma^{2}} \sum_{i=1}^{N-\tau} (x_i - \mu)(x_{i+\tau} - \mu), \tag{2.2}$$

where $C(\tau)$ is the autocorrelation coefficient, $\tau$ is the lag time, and $\mu$ and $\sigma$ are the mean and standard deviation of the time series, respectively. The delay time is selected as the lag at which the autocorrelation coefficient has dropped to a fixed fraction of its initial value, expressed in terms of $e$, the base of the natural logarithm.
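The delay-time rule can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the sample estimator is one common form of the autocorrelation coefficient, and the threshold is passed in as `fraction` because its exact value should be taken from the text. The value 0.5 used in the demonstration is only a placeholder.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation coefficients for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    n = len(x)
    return np.array([
        np.mean((x[: n - lag] - mu) * (x[lag:] - mu)) / var
        for lag in range(max_lag + 1)
    ])

def first_lag_below(acf, fraction):
    """Smallest lag at which acf has dropped to `fraction` of acf[0], else None."""
    below = np.nonzero(acf <= fraction * acf[0])[0]
    return int(below[0]) if below.size else None

# Demonstration on a synthetic sine wave with a period of 50 samples.
t = np.arange(1000)
acf = autocorrelation(np.sin(2 * np.pi * t / 50), max_lag=30)
delay = first_lag_below(acf, fraction=0.5)  # 0.5 is a placeholder threshold
print(delay)
```

Returning `None` when the threshold is never reached makes it explicit that `max_lag` was too small, rather than silently returning the last lag.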

2.1.2. The Determination of Embedding Dimension

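Among the many methods for calculating the embedding dimension, the false nearest neighbors (FNN) method [22] is the most widely used. Cao has proposed a new method on the basis of the FNN method. The Cao method [25] can be described as follows. Assume that

$$a(i, m) = \frac{\left\| X_i(m+1) - X_{n(i,m)}(m+1) \right\|}{\left\| X_i(m) - X_{n(i,m)}(m) \right\|}, \tag{2.3}$$

where $\|\cdot\|$ is some measure of Euclidean distance, $X_i(m)$ is the $i$th phase point in the $m$-dimensional reconstructed phase space, and $X_{n(i,m)}(m)$ is the nearest neighbor of $X_i(m)$. The mean value of all $a(i, m)$ is

$$E(m) = \frac{1}{N - m\tau} \sum_{i=1}^{N - m\tau} a(i, m). \tag{2.4}$$

When the embedding dimension changes from $m$ to $m+1$, $E_1(m)$ can be defined as

$$E_1(m) = \frac{E(m+1)}{E(m)}. \tag{2.5}$$

If $E_1(m)$ stops changing when $m$ is greater than some value $m_0$, then $m_0 + 1$ can be taken as the minimum embedding dimension to reconstruct the phase space.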
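A brute-force Python sketch of the Cao statistic is given below for readers who want to experiment with it. The helper names, the Euclidean norm, the forward-delay embedding, and the logistic-map test series are all assumptions of this sketch; the nearest-neighbor search is quadratic in the series length and is meant only for short series.

```python
import numpy as np

def _embed(x, m, tau, n_points):
    # First n_points delay vectors of dimension m (forward-delay convention, assumed).
    return np.column_stack([x[j * tau : j * tau + n_points] for j in range(m)])

def cao_E(x, m, tau):
    """Mean ratio of neighbor distances in dimension m + 1 to those in dimension m."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m * tau                    # points that exist in dimension m + 1
    low, high = _embed(x, m, tau, n), _embed(x, m + 1, tau, n)
    ratios = []
    for i in range(n):
        d = np.linalg.norm(low - low[i], axis=1)
        d[d == 0.0] = np.inf                # ignore the point itself and exact duplicates
        j = int(np.argmin(d))               # nearest neighbor in dimension m
        if np.isfinite(d[j]):
            ratios.append(np.linalg.norm(high[i] - high[j]) / d[j])
    return float(np.mean(ratios))

def cao_E1(x, m_max, tau):
    """Ratios E(m + 1) / E(m) for m = 1..m_max."""
    E = [cao_E(x, m, tau) for m in range(1, m_max + 2)]
    return [E[m] / E[m - 1] for m in range(1, m_max + 1)]

# Synthetic test series from the logistic map (illustrative only).
x = np.empty(400)
x[0] = 0.3
for t in range(399):
    x[t + 1] = 3.9 * x[t] * (1.0 - x[t])
e1 = cao_E1(x, m_max=5, tau=1)
print([round(v, 3) for v in e1])
```

In practice one would plot the returned values against the dimension and look for the point beyond which they stop changing.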

2.2. The Traditional Local Linear Prediction Method

Many local prediction methods have been developed during the last decades, and the local linear prediction method is considered in this study. The first step of the local linear prediction is to find the nearest neighbor points of the current phase point in the reconstructed phase space. The Euclidean distances between the current vector and its preceding delay vectors will be calculated and then nearest neighbor points (, is taken value great than in general condition) will be selected.

The local linear prediction model in an -dimensional reconstructed phase space is an autoregressive model and the prediction value, which is a linear superposition of the elements in the delay vector , can be given as follows: where is a coefficient vector that needs to be determined, and

Deterministic predictions assume that if the phase space point was similar to the current point , then the future point will also be close to the future point . The coefficient vector can be estimated by the current phase vector and its nearest neighbor points through the following equation: where is the next series value of the points , and

Because the value of and is known, so the estimation of coefficient vector can be obtained by the least squares method as from (2.8), and then by using (2.6), the prediction value can be calculated. The new prediction value is added to the original time series as the prediction steps developed, and the last phase point is by the phase space reconstruction now. Following the same scheme in the forecast of and reestimating the coefficient vector , can be computed.
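The one-step procedure of this subsection can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function name, the forward-delay embedding, the inclusion of an intercept term in the regression, and the sine-wave test series are assumptions of the sketch.

```python
import numpy as np

def local_linear_predict(x, m, tau, k):
    """Predict the next value of x from the k nearest neighbors of the last phase point."""
    x = np.asarray(x, dtype=float)
    M = len(x) - (m - 1) * tau
    X = np.column_stack([x[j * tau : j * tau + M] for j in range(m)])
    current = X[-1]                          # last phase point
    history = X[:-1]                         # phase points whose successor is known
    targets = x[(m - 1) * tau + 1 :]         # series value following each history point
    dist = np.linalg.norm(history - current, axis=1)
    idx = np.argsort(dist)[:k]               # indices of the k nearest neighbors
    A = np.column_stack([np.ones(len(idx)), history[idx]])  # intercept + delay coordinates
    coef, *_ = np.linalg.lstsq(A, targets[idx], rcond=None)
    return float(np.concatenate([[1.0], current]) @ coef)

# A sine wave obeys an exact linear two-term recurrence, so m = 2 should predict it well.
t = np.arange(200)
series = np.sin(0.3 * t)
prediction = local_linear_predict(series, m=2, tau=1, k=10)
print(prediction, np.sin(0.3 * 200))
```

For multi-step forecasts, the predicted value would be appended to the series and the function called again, which mirrors the iterative scheme described in the text.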

2.3. The Determination of the Optimal Parameters

In regression analysis, the degrees of freedom are often used as a measure of model complexity in various model selection criteria, such as Mallows' $C_p$, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). Yet these model selection criteria are asymptotic in nature and do not take into account the modeling procedure, which can often be very complex [32]. Ye [35] therefore developed the concept of generalized degrees of freedom (GDF), which is applicable to the evaluation of model selection.

The GDF is defined as the sum of the sensitivities of each fitted value of the model to perturbations in the corresponding observed value [32]. It is nonasymptotic in nature and thus free of the sample-size constraint [35]. In chaotic local linear prediction, the GDF can be viewed as the cost of the modeling process. Given the uncertainty in determining the optimal embedding dimension and number of nearest neighbors for the prediction, different unbiased estimates of the error variance can be obtained under different parameter sets $(m, k)$. A better $(m, k)$ for the model prediction is one that has a smaller error variance, so the optimal parameters can be selected by comparing the calculated error variances. At every step of the prediction, the GDF can be calculated and the optimal parameters selected; the GDF thus provides a novel way to guide the adaptive change of $(m, k)$.

Using the local linear prediction method introduced earlier, the future values of the data series can be obtained. Here we show how to choose the optimal $(m, k)$ for the prediction by using the GDF method.

From the foregoing description, the estimate of the coefficient vector in (2.8) can also be given as follows: where , , are the same in (2.8) and is a row vector of error values.

Using the least squares method, and ( is the mean vector of ) can be estimated as

The sum of squared residuals for a fixed embedding dimension RSS is then expressed as

Then an unbiased estimation of the error variance is given as where is the number of nearest neighbors. is the GDF and can be estimated by , .

This provides a tool to evaluate the goodness of the model for the chosen $(m, k)$. With different values of the parameter set $(m, k)$, the matrix and fitted vector functions will be different, and so different error variances are obtained. An optimal $(m, k)$ for the prediction is the one that has the smallest variance.
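The variance estimate can be illustrated with a short Python sketch. It rests on an explicit assumption: for an ordinary least-squares fit on a fixed set of neighbors, the sum of sensitivities of the fitted values to the observations equals the trace of the hat matrix, and that trace is used here as the GDF. Whether the original study estimates the GDF in exactly this way or by a perturbation procedure is not stated in the surviving text, and the function name `error_variance` is hypothetical.

```python
import numpy as np

def error_variance(A, y):
    """Return (RSS / (k - D), D) for a least-squares fit of y on the columns of A.

    k is the number of rows (neighbors); D is the trace of the hat matrix,
    used here as the generalized degrees of freedom (an assumption of this sketch).
    """
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    k = A.shape[0]
    H = A @ np.linalg.pinv(A)               # hat matrix: fitted values are H @ y
    D = float(np.trace(H))                  # sum of d(fitted_i) / d(y_i)
    rss = float(np.sum((y - H @ y) ** 2))   # residual sum of squares
    if k - D <= 1e-9:
        return float("inf"), D              # no residual degrees of freedom left
    return rss / (k - D), D

# Straight-line fit to ten points with an alternating +1 / -1 disturbance.
t = np.arange(10.0)
A = np.column_stack([np.ones(10), t])
y = 3.0 + 2.0 * t + np.where(t % 2 == 0, 1.0, -1.0)
variance, gdf = error_variance(A, y)
print(round(gdf, 6), variance)
```

Returning infinity when no residual degrees of freedom remain keeps saturated models from being selected as optimal.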

2.4. The New Local Linear Prediction Method

By comparing the estimated error variances for different $(m, k)$, the optimal parameter set can be selected and then used for prediction. The new local linear method proposed in this study is described as follows.

Step 1. Phase space reconstruction. Determine the embedding dimension and the time delay for the original time series.

Step 2. Calculate the error variances. Let the embedding dimension change from to , and the number of the nearest neighbor change from to (In this study, and are selected as 2 and , resp. This is not the only choice. and are taken as and , the same as Jayawardena’s method [32].) local linear models can be obtained. For each model, the error variance can be estimated using (2.13) and (2.14).

Step 3. Choose the optimal $(m, k)$, that is, the one with the minimum variance.

Step 4. Use the optimal $(m, k)$ to reconstruct the phase space of the original time series and construct a local model to predict the next value.

Step 5. Add the predicted value to the time series, so that the last phase point is updated, and compute the following value by the same scheme.
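The five steps above can be sketched in Python as a single loop. Everything specific in this sketch is an assumption made for illustration: the function names, the search ranges passed in as `m_values` and `k_values`, the use of the hat-matrix trace as the GDF, and the logistic-map test series. It is intended to show the control flow of the adaptive selection, not to reproduce the authors' implementation.

```python
import numpy as np

def fit_local(x, m, tau, k):
    """Prediction and error variance of one local linear model for a given (m, k)."""
    M = len(x) - (m - 1) * tau
    X = np.column_stack([x[j * tau : j * tau + M] for j in range(m)])
    history, current = X[:-1], X[-1]
    targets = x[(m - 1) * tau + 1 :]         # value following each history point
    idx = np.argsort(np.linalg.norm(history - current, axis=1))[:k]
    A = np.column_stack([np.ones(len(idx)), history[idx]])
    y = targets[idx]
    A_pinv = np.linalg.pinv(A)
    dof = len(idx) - np.trace(A @ A_pinv)    # neighbors minus GDF (hat-matrix trace, assumed)
    if dof <= 1e-9:
        return None, np.inf
    coef = A_pinv @ y
    variance = float(np.sum((y - A @ coef) ** 2) / dof)
    prediction = float(np.concatenate([[1.0], current]) @ coef)
    return prediction, variance

def adaptive_forecast(x, tau, m_values, k_values, steps):
    """Forecast `steps` values, re-selecting (m, k) by minimum variance at every step."""
    series = [float(v) for v in x]
    chosen = []
    for _ in range(steps):
        arr = np.asarray(series)
        best = None
        for m in m_values:
            for k in k_values(m):
                pred, var = fit_local(arr, m, tau, k)
                if pred is not None and (best is None or var < best[0]):
                    best = (var, m, k, pred)
        if best is None:
            raise ValueError("no admissible (m, k) combination")
        chosen.append((best[1], best[2]))
        series.append(best[3])               # the prediction becomes part of the series
    return series[len(x):], chosen

# Illustrative run on a logistic-map series with hypothetical search ranges.
x = np.empty(300)
x[0] = 0.3
for t in range(299):
    x[t + 1] = 3.9 * x[t] * (1.0 - x[t])
preds, chosen = adaptive_forecast(
    x, tau=1, m_values=range(2, 5), k_values=lambda m: range(m + 2, 2 * m + 3), steps=5
)
print(chosen)
```

Passing the neighbor range as a function of `m` keeps the sketch neutral about how the bounds on the number of neighbors depend on the embedding dimension.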

Some real hydrological time series are chosen to examine the performance of the new local linear method (NLLP-2), and the new method is compared with the traditional local linear prediction method (TLLP) and Jayawardena’s method (NLLP-1) [32].

3. Application in Hydrological Time Series

3.1. Study Area and Data Description

The real hydrological time series used in this study are the daily discharges of Peschanaya at Tochil'noye (N and E, basin area, 4720 km²) in Russia for the period January 1936–December 1980, and the daily discharges of Tim at Napas (N and E, basin area, 24500 km²) in Russia for the period January 1953–December 1994. Some important statistical values are shown in Table 1.

For the daily discharge of Peschanaya, the time series is divided into two data sets: the training set covers January 1936–December 1979, and the prediction set covers January 1, 1980–February 9, 1980. In the same way, the training set for Tim covers January 1953–December 1993, and the prediction set covers January 1, 1994–February 9, 1994. The prediction lead time for both time series is 40 days.

3.2. Phase Space Reconstruction

In this study, the delay time and the embedding dimension of the above two daily discharge time series are determined by the autocorrelation function method and the Cao method, respectively. The autocorrelation functions of both time series are presented in Figure 1, and the values of $E_1(m)$ from the Cao method are shown in Figure 2.

The delay time is selected as the lag at which the autocorrelation coefficient has dropped to the threshold fraction of its initial value given in Section 2.1.1. From Figure 1, the delay times for Peschanaya and Tim are determined as 9 and 20, respectively. (The red line in Figure 1 marks this threshold value of the autocorrelation function.)

From Figure 2, the value of $E_1(m)$ becomes saturated as the embedding dimension increases, so the embedding dimension is determined as 6 for Peschanaya. In the same way, the embedding dimension for Tim is also determined as 6.

The third parameter in the local linear prediction of the daily discharge time series is the number of nearest neighbors. In this study, only the traditional prediction method (TLLP) requires this parameter to be given a priori; the other two methods (NLLP-1 and NLLP-2) obtain the number of nearest neighbors adaptively during the prediction procedure.

3.3. The Prediction Results and Analysis

Both time series were predicted by TLLP, NLLP-1, and NLLP-2. To obtain the best forecast with TLLP, the number of nearest neighbors $k$ is varied over a range of values, and the value that gives the best prediction results is selected. The prediction results of TLLP are shown in Table 2.

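The accuracy of prediction is evaluated by three indices in this study: the mean absolute error (MAE), the root mean square error (RMSE), and the correlation coefficient (CC). They are defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{x}_i - x_i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( \hat{x}_i - x_i \right)^2}, \qquad \mathrm{CC} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(\hat{x}_i - \bar{\hat{x}})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (\hat{x}_i - \bar{\hat{x}})^2}},$$

where $n$ is the number of values in the time series under investigation, $\hat{x}_i$ and $x_i$ are the predicted and observed values, respectively, and $\bar{x}$ and $\bar{\hat{x}}$ are the means of the observed and predicted values, respectively.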

Generally, lower values of MAE and RMSE indicate better agreement, and higher positive values of CC indicate better agreement between the observed and predicted values.
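For reference, the three indices can be computed with a few lines of Python. The function names and the small numerical example are illustrative only; the formulas are the standard definitions of the mean absolute error, the root mean square error, and the Pearson correlation coefficient.

```python
import numpy as np

def mae(observed, predicted):
    """Mean absolute error."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(predicted - observed)))

def rmse(observed, predicted):
    """Root mean square error."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

def cc(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    do, dp = observed - observed.mean(), predicted - predicted.mean()
    return float(np.sum(do * dp) / np.sqrt(np.sum(do ** 2) * np.sum(dp ** 2)))

# Small made-up example, not data from the study.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 2.5, 4.5]
print(mae(obs, pred), rmse(obs, pred), round(cc(obs, pred), 4))
```

Note that CC measures only linear association, so a forecast with a constant bias can still have a CC close to 1 while its MAE and RMSE are large.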

From Table 2, it can be seen that the best forecast results for Peschanaya by the TLLP method are obtained when the number of nearest neighbors is 13 (MAE = 0.4781, RMSE = 0.6878, CC = 0.9092), and the best forecast results for Tim by the TLLP method are obtained when the number of nearest neighbors is 22 (MAE = 7.6517, RMSE = 9.4725, CC = 0.9089).

The results of NLLP-1, NLLP-2, and the optimal results of TLLP are shown in Figure 3. Comparisons of the three methods at lead times of 20 and 40 steps are shown in Tables 3 and 4.

From Figure 3(a) and Table 3, it can be seen that NLLP-1 gives the best prediction results at the beginning of the prediction process for Peschanaya (MAE = 0.2367, RMSE = 0.2767, and CC = 0.9636 for NLLP-1 at a lead time of 20 steps), and under this condition the results of both NLLP-1 and NLLP-2 are better than those of TLLP. As the number of prediction steps increases, however, the forecast precision of NLLP-1 and NLLP-2 becomes worse than that of TLLP. At a lead time of 40 steps, TLLP (MAE = 0.4781, RMSE = 0.6878, and CC = 0.9092) gives the best results.

From Figure 3(b) and Table 4, it can be seen that NLLP-2 gives the best prediction results for Tim not only at the beginning of the prediction procedure but also over the whole procedure up to a lead time of 40 steps (MAE = 2.6050, RMSE = 2.9928, and CC = 0.9113 at a lead time of 20 steps, and MAE = 1.7600, RMSE = 2.2413, and CC = 0.9054 at a lead time of 40 steps).

From the above analysis, the new method proposed in this study (NLLP-2) performs better than TLLP at the beginning of the prediction process for Peschanaya. For the daily discharge of Tim, NLLP-2 performs better than both TLLP and NLLP-1.

The changes of the embedding dimension and the number of nearest neighbors in the foregoing three methods are shown in Figures 4 and 5 (the lead time step is 20 days). It can be found that the embedding dimension and the number of nearest neighbors do change adaptively in the prediction processes.

4. Conclusion and Prospect

To obtain the optimum prediction results, this study proposes a new adaptive local linear prediction method in which the embedding dimension and the number of nearest neighbors are combined as a parameter set and change adaptively as the forecast steps increase. The main results are as follows.

(i) The optimal parameters can be obtained by using the GDF method. To select the optimal parameters for each prediction step, the generalized degrees of freedom (GDF) are used and the error variance is calculated for each combination of the embedding dimension and the number of nearest neighbors. The optimal parameters are those for which the variance is smallest. They are then used to reconstruct the phase space and predict the next value of the time series.

(ii) To compare the performance of the different prediction methods, real nonlinear hydrological time series are examined. The results indicate that the new adaptive local linear prediction method can choose the embedding dimension and the number of nearest neighbors adaptively during prediction. Compared with TLLP and NLLP-1, the new method is better than TLLP at the beginning of the prediction steps for both time series, and NLLP-2 is better than TLLP and NLLP-1 for the Tim time series over the whole forecast period.

(iii) The new adaptive local linear prediction method can be used to predict other nonlinear time series in the future, and its theory will be studied further.

Acknowledgments

This work was supported by the Key Program of the National Science Foundation of China (Grant no. 50939001), the National Basic Research Program of China (Grant no. 2010CB429003), and the National Key Technology R&D Program of China (Grant no. 2006BAB04A09).