Initialization by a Novel Clustering for Wavelet Neural Network as Time Series Predictor
The architecture and parameter initialization of wavelet neural networks are discussed and a novel initialization method is proposed. The new approach can be regarded as a dynamic clustering procedure which derives the number of neurons as well as the initial values of the translation and dilation parameters from the input patterns and the activating wavelet functions. Three simulation examples are given to compare the performance of our method with Zhang's heuristic initialization approach. The results show that the new approach not only decides the WNN structure automatically but also provides superior initial parameter values that make the optimization process more stable and faster.
1. Introduction
An artificial neural network (ANN) is a highly parallel distributed network of connected processing units called neurons. Due to their fascinating characteristics of robustness, fault tolerance, adaptive learning ability, and massive parallel processing capabilities, ANNs possess the capability of learning from examples with both linear and nonlinear relationships between the input and output signals, which makes them a popular tool for time series prediction [1, 2], feature extraction [3, 4], pattern recognition [5, 6], and classification [7, 8]. However, ANNs have limited ability to characterize local features, such as discontinuities in curvature, jumps in value, or other edges.
Instead of using common sigmoid activation functions, the wavelet neural network (WNN) employs nonlinear wavelet basis functions [9, 10], which are localized in both time and frequency, and has been developed as an alternative approach to nonlinear fitting problems. It has been proven that families of wavelet frames are universal approximators, which gives a theoretical basis for their use in function approximation and process modeling.
There are two different WNN architectures. The first (WNN-Type1) has fixed wavelet bases with fixed dilation and translation parameters, so only the output layer weights are adjustable. The second (WNN-Type2) has variable wavelet bases whose dilation and translation parameters are adjustable along with the output layer weights. Several WNN models have been proposed in the literature. One is a four-layer self-constructing wavelet network (SCWN) controller for nonlinear systems control, which adopts orthogonal wavelet functions as its node functions. Another is the local linear wavelet neural network (LLWNN), in which the connection weights between the hidden layer and the output layer of a conventional WNN are replaced by a local linear model. A third is a model of multiwavelet-based neural networks, whose structure is similar to that of the wavelet network except that the orthonormal scaling functions are replaced by orthonormal multiscaling functions.
A time series is a sequence of observations taken sequentially in time. Time series prediction is an important research and application area. Much effort has been devoted over the past several decades to the development and improvement of time series prediction models. Besides the well-known linear models such as moving average, exponential smoothing, and the autoregressive integrated moving average, nonlinear models including artificial neural networks, wavelet neural networks, and fuzzy system models have also become well-established time series models. In this paper, the wavelet neural network (WNN) is used as the time series predictor, and the details are described in the following sections.
We adopt WNN-Type2, with adjustable translation and dilation parameters and the product form of multidimensional wavelets, as the nonlinear model for time series prediction in this paper. The key problems in designing this type of WNN are determining the WNN architecture, initializing the translation and dilation vectors, and choosing a learning algorithm that can train the WNN effectively. This study focuses mainly on the first two. In practical applications, the number of hidden neurons, which determines the structure of the network, is often set by experience or by time-consuming trial-and-error tests, and the initial parameter values are often set randomly. Because wavelet functions vanish rapidly, random initialization of the dilation and translation parameters may place the wavelets' effective response regions outside the region of interest, which makes the learning performance very unstable. It is therefore inadvisable to adopt a random initialization scheme for the dilations and translations in a WNN. Zhang proposed a heuristic initialization procedure which considers the domain of interest of the input patterns. However, its implementation does not consider the wavelet functions used in the WNN, and the resolution is reduced gradually according to a fixed rule which does not take full account of the sample distribution.
In the present paper, inspired by the localization property of wavelet functions and considering the product form of multidimensional wavelets in the hidden neurons for multivariable inputs, we present a novel initialization approach for WNN based on a new clustering method. This approach determines the number of hidden units and initializes the translation and dilation vectors simultaneously. After training by the gradient descent method, we find that, besides determining the number of neurons, a WNN with our initialization method gives more satisfactory and more stable results for time series prediction than Zhang's heuristic initialization method, which has been used for this model in the literature [9, 16, 17].
The paper is organized as follows. A brief review of wavelet and wavelet-based function approximation is given in Section 2, followed by the introduction of the architecture of wavelet neural network in Section 3. The detailed description of the clustering based initialization approach and the training algorithm are given in Sections 4 and 5. Three simulation experiments on time series prediction problems and the comparison results with Zhang’s heuristic initialization method are presented in Section 6. Finally, some conclusions are drawn in the last section.
2. Wavelet-Based Function Approximation
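Wavelets in the following form,
$$\psi_{a,b}(x) = |a|^{-1/2}\,\psi\!\left(\frac{x-b}{a}\right), \quad a, b \in \mathbb{R},\ a \neq 0, \tag{1}$$
are a family of functions generated from one single function $\psi(x)$ by the operations of dilation and translation. $\psi(x) \in L^2(\mathbb{R})$ is called a mother wavelet function and satisfies the admissibility condition
$$C_\psi = \int_{-\infty}^{+\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\,d\omega < +\infty, \tag{2}$$
where $\hat{\psi}(\omega)$ is the Fourier transform of $\psi(x)$ [11, 18].
Grossmann and Morlet proved that any function $f(x)$ in $L^2(\mathbb{R})$ can be represented by
$$f(x) = \frac{1}{C_\psi}\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} W_f(a,b)\,\psi_{a,b}(x)\,\frac{da\,db}{a^2}, \tag{3}$$
where $W_f(a,b)$, given by
$$W_f(a,b) = \int_{-\infty}^{+\infty} f(x)\,\overline{\psi_{a,b}(x)}\,dx, \tag{4}$$
is the continuous wavelet transform of $f(x)$.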
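As a quick numerical illustration of dilation and translation, the following sketch checks two properties of a wavelet family: the mother wavelet has zero mean (which admissibility implies for integrable wavelets), and the factor $|a|^{-1/2}$ keeps the energy of every dilated and translated copy equal to that of the mother wavelet. It assumes the unnormalized Mexican Hat form $(1 - x^2)e^{-x^2/2}$ for the wavelet adopted later in the paper; the helper names `wavelet` and `integrate` and the values $a = 2.5$, $b = 1$ are illustrative choices, not taken from the paper.

```python
import math

def mexican_hat(x):
    """Mexican Hat mother wavelet (assumed unnormalized form)."""
    return (1.0 - x * x) * math.exp(-0.5 * x * x)

def wavelet(x, a, b):
    """Dilated and translated wavelet |a|^(-1/2) * psi((x - b) / a)."""
    return abs(a) ** -0.5 * mexican_hat((x - b) / a)

def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

# Zero mean of the mother wavelet.
mean_psi = integrate(mexican_hat, -20.0, 20.0)

# Energy of the mother wavelet and of one dilated/translated copy.
energy_mother = integrate(lambda x: mexican_hat(x) ** 2, -20.0, 20.0)
energy_child = integrate(lambda x: wavelet(x, 2.5, 1.0) ** 2, -60.0, 60.0)
```

Both energies should agree with the closed-form value $\tfrac{3}{4}\sqrt{\pi}$ for this wavelet form.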
Superior to conventional Fourier transform, the wavelet transform (WT) in its continuous form provides a flexible time-frequency window, which narrows when observing high frequency phenomena and widens when analyzing low frequency behavior. Thus, time resolution becomes arbitrarily good at high frequencies, while the frequency resolution becomes arbitrarily good at low frequencies. This kind of analysis is suitable for signals composed of high frequency components with short duration and low frequency components with long duration, which is often the case in practical situations.
Because the parameters $a$ and $b$ take continuous values, the resulting continuous wavelet transform (CWT) is a highly redundant representation, which also makes it impracticable. Therefore, the scale and shift parameters are evaluated on a discrete time-scale grid, leading to a discrete set of continuous wavelet functions:
$$\psi_i(x) = |a_i|^{-1/2}\,\psi\!\left(\frac{x-b_i}{a_i}\right), \quad i \in \mathbb{Z}. \tag{5}$$
The continuous inverse wavelet transform (3) is discretized as
$$f(x) = \sum_{i} w_i\,\psi_i(x). \tag{6}$$
If there exist two constants $A > 0$ and $B < +\infty$ such that, for any $f$ in $L^2(\mathbb{R})$, the following inequalities hold:
$$A\,\|f\|^2 \le \sum_{i} |\langle f, \psi_i\rangle|^2 \le B\,\|f\|^2, \tag{7}$$
where $\|f\|$ denotes the norm of function $f$ and $\langle f, \psi_i\rangle$ denotes the inner product of functions $f$ and $\psi_i$, then the family $\{\psi_i\}$ is said to be a frame of $L^2(\mathbb{R})$. It has been proved that families of wavelet frames of $L^2(\mathbb{R})$ are universal approximators.
3. Architecture of Wavelet Neural Network
A brief review of wavelet decomposition theory has been given in Section 2, where functions of a single variable were considered. For the modeling of multivariable processes, multidimensional wavelets must be defined. In the present work, multidimensional wavelets are defined as the product of one-dimensional wavelet functions:
$$\Psi(\mathbf{x}) = \prod_{j=1}^{n} \psi\!\left(\frac{x_j - t_j}{d_j}\right), \tag{8}$$
where $\mathbf{x} = (x_1, \ldots, x_n)^T$ is the input vector and $\mathbf{t} = (t_1, \ldots, t_n)^T$ and $\mathbf{d} = (d_1, \ldots, d_n)^T$ are the translation and dilation vectors, respectively.
Generalized from the radial basis function neural network, the WNN is in fact a feed-forward neural network with one hidden layer, wavelet functions as activation functions in the hidden nodes, and a linear output layer. As a result, the network output is computed as
$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{M} w_i\,\Psi_i(\mathbf{x}) + \bar{y} = \sum_{i=1}^{M} w_i \prod_{j=1}^{n} \psi\!\left(\frac{x_j - t_{ij}}{d_{ij}}\right) + \bar{y}, \tag{9}$$
where $w_i$ and $\bar{y}$ define the connecting weights and the bias term between the hidden layer and the output layer, respectively, and $M$ is the number of units in the hidden layer. These wavelet neurons are usually referred to as wavelons. The architecture of a WNN is illustrated in Figure 1.
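The forward computation described above can be sketched as follows: each wavelon multiplies one-dimensional wavelet responses over the input dimensions, and the output layer forms a weighted sum plus a bias. This is a minimal sketch assuming a scalar output and the unnormalized Mexican Hat form $(1 - z^2)e^{-z^2/2}$; the function names and the two-wavelon parameter values are hypothetical and serve only as an illustration.

```python
import math

def mexican_hat(z):
    """Mexican Hat wavelet (assumed unnormalized form)."""
    return (1.0 - z * z) * math.exp(-0.5 * z * z)

def wavelon(x, t, d):
    """Product of 1-D wavelet responses over the input dimensions."""
    out = 1.0
    for xj, tj, dj in zip(x, t, d):
        out *= mexican_hat((xj - tj) / dj)
    return out

def wnn_output(x, weights, translations, dilations, bias):
    """Linear combination of wavelon responses plus a bias term."""
    return bias + sum(
        w * wavelon(x, t, d)
        for w, t, d in zip(weights, translations, dilations)
    )

# Hypothetical two-wavelon network on 2-D inputs.
weights = [0.5, -1.0]
translations = [[0.0, 0.0], [1.0, 1.0]]
dilations = [[1.0, 2.0], [0.5, 0.5]]
bias = 0.1

y_at_first_center = wnn_output([0.0, 0.0], weights, translations, dilations, bias)
y_far_away = wnn_output([50.0, 50.0], weights, translations, dilations, bias)
```

Far from every translation vector all wavelons vanish and the output reduces to the bias, which is the local response behavior that motivates a careful choice of initial translations and dilations.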
4. Initialization Approach of Wavelet Neural Network
Before training the WNN, some factors must be determined in advance: the number of wavelons and the initial values of the parameters (the connecting weights $w_i$, the bias $\bar{y}$, the translation vectors $\mathbf{t}_i$, and the dilation vectors $\mathbf{d}_i$). The former is fixed once the structure of the network is determined, while the latter are adjusted by the training algorithm. All these factors are crucial to how well the network simulates the real model. In this section, a brief description of the wavelet window is presented first, and then a novel initialization method based on dynamic clustering is proposed, which provides the number of hidden neurons and the initial values of the translation and dilation parameters at the same time.
4.1. Wavelet Window in Time Domain
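A mother wavelet function $\psi$ satisfying (2) has sufficient decay, so it can be regarded as having a "local response." In other words, $\psi$ is a window in the time domain with center $t^*$ and radius $\Delta_\psi$, which can be computed by
$$t^* = \frac{1}{\|\psi\|^2}\int_{-\infty}^{+\infty} x\,|\psi(x)|^2\,dx, \qquad \Delta_\psi = \frac{1}{\|\psi\|}\left(\int_{-\infty}^{+\infty} (x - t^*)^2\,|\psi(x)|^2\,dx\right)^{1/2}. \tag{10}$$
As a result, its translated and dilated version $\psi_{a,b}$ is concentrated in the region $[\,b + a t^* - |a|\Delta_\psi,\ b + a t^* + |a|\Delta_\psi\,]$ in the time domain.
In this paper, the Mexican Hat wavelet function, whose graph is symmetric (Figure 2), is employed. It is given by the following equation:
$$\psi(x) = (1 - x^2)\,e^{-x^2/2}. \tag{11}$$
From (10), the center and radius of the Mexican Hat wavelet window in the time domain can be derived as
$$t^* = 0, \qquad \Delta_\psi = \sqrt{7/6} \approx 1.08. \tag{12}$$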
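The window center and radius can be checked numerically. The sketch below assumes the usual energy-weighted definitions (center as the first moment of $|\psi|^2$, radius as the root of its second central moment, both normalized by the energy) and the unnormalized Mexican Hat form $(1 - x^2)e^{-x^2/2}$; under these assumptions the closed-form values are $0$ and $\sqrt{7/6}$. The helper names are illustrative.

```python
import math

def mexican_hat(x):
    """Mexican Hat wavelet (assumed unnormalized form)."""
    return (1.0 - x * x) * math.exp(-0.5 * x * x)

def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def window_center_radius(psi, lo=-20.0, hi=20.0):
    """Energy-weighted center and radius of psi in the time domain."""
    energy = integrate(lambda x: psi(x) ** 2, lo, hi)
    center = integrate(lambda x: x * psi(x) ** 2, lo, hi) / energy
    second = integrate(lambda x: (x - center) ** 2 * psi(x) ** 2, lo, hi)
    return center, math.sqrt(second / energy)

def effective_region(a, b, center, radius):
    """Interval where the dilated/translated wavelet is concentrated."""
    return (b + a * center - abs(a) * radius, b + a * center + abs(a) * radius)

center, radius = window_center_radius(mexican_hat)
region = effective_region(2.0, 1.0, center, radius)
```

For dilation 2 and translation 1, the effective region is roughly $[1 - 2.16,\ 1 + 2.16]$, so an input pattern outside that interval barely activates the wavelet.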
4.2. Initialization by a Novel Clustering Approach for WNN
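The structure of our network is illustrated in Figure 1. Suppose the input data for network training are $N$ vectors with $n$ components: $\mathbf{x}_k = (x_{k1}, \ldots, x_{kn})^T$, $k = 1, \ldots, N$. The procedure comprises the following steps:
(1) Create the first cluster $C_1$ with cluster mean $\mathbf{m}_1 = \mathbf{x}_1$ and dimensional radius $\mathbf{r}_1 = \mathbf{0}$. Set the number of clusters $K = 1$.
(2) Put $k = 2$.
(3) Compute the distance vectors between $\mathbf{x}_k$ and the cluster means $\mathbf{m}_j$, $j = 1, \ldots, K$.
If there is some cluster $C_j$ whose distance vector satisfies the threshold condition set by the vector $\boldsymbol{\theta}$ in every dimension, then $\mathbf{x}_k \in C_j$. The vector $\boldsymbol{\theta}$ is a threshold vector which is set before the algorithm is executed. The cluster mean is reset as
$$\mathbf{m}_j = \frac{1}{N_j}\sum_{\mathbf{x}_l \in C_j} \mathbf{x}_l, \tag{13}$$
and the dimensional radius $\mathbf{r}_j$ is recomputed, dimension by dimension, from the deviations of the patterns in $C_j$ about the new mean (14), where $N_j$ is the cardinal number of $C_j$ and $\mathbf{x}_l$ are the patterns that belong to $C_j$.
Otherwise, the number of clusters becomes $K = K + 1$; create the cluster $C_K$ with cluster mean $\mathbf{m}_K = \mathbf{x}_k$ and dimensional radius $\mathbf{r}_K = \mathbf{0}$.
(4) Put $k = k + 1$. If $k > N$, then stop; otherwise go to step (3).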
Remark 1. (i) The threshold vector $\boldsymbol{\theta}$ in the above procedure is crucial to the clustering result. Large elements of $\boldsymbol{\theta}$ lead to a coarse partition, namely, a small number of clusters $K$, whereas small elements lead to a large $K$. In practice, a reasonable $\boldsymbol{\theta}$ should be determined from the input patterns. In our experiments, we set $\boldsymbol{\theta}$ by formula (15) to control the cluster scale, which gives moderate results in most cases.
(ii) The condition in step (3), that some cluster must satisfy the threshold in every dimension, is derived from the "local response" property of the activation wavelet functions in the wavelons. As a result, patterns for which each feature activates the corresponding 1-D wavelet function are identified as one class.
(iii) After the clustering procedure (1)–(4), the results determine the number of wavelons in the WNN as $M = K$ and the initial values of the translation and dilation vectors from the cluster means and dimensional radii, as given by (16) and (17). In (17), a relaxation parameter scales the dilation, and $\Delta_\psi$ is the window radius of the wavelet function $\psi$.
(iv) To prevent the dilation parameters from being zero, the radius vector of a cluster with a single element must be redefined. A minimum value strategy is employed for this purpose.
(v) The connecting weights between the hidden layer and the output layer are randomly initialized within a fixed interval, and the bias term is initialized as the mean vector of input patterns.
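The clustering-based initialization of this section can be sketched in code as follows. Several details are assumptions made for illustration rather than the paper's exact formulas: the distance vector is taken as the per-dimension absolute difference to the cluster mean, a pattern joins the nearest cluster whose distance vector is within the threshold in every dimension, the dimensional radius is the per-dimension maximum absolute deviation from the mean, the initial translation is the cluster mean, the initial dilation is the relaxation parameter times the radius divided by the wavelet window radius, and a zero radius is replaced by the smallest positive radius in that dimension (falling back to the threshold). The function names and sample patterns are hypothetical.

```python
import math

# Window radius of the Mexican Hat wavelet (assumed form (1 - x^2) exp(-x^2 / 2)).
WINDOW_RADIUS = math.sqrt(7.0 / 6.0)

def cluster_patterns(patterns, theta):
    """Sequential clustering; returns (clusters, means)."""
    clusters = [[patterns[0]]]
    means = [list(patterns[0])]
    for x in patterns[1:]:
        best, best_dist = None, float("inf")
        for j, m in enumerate(means):
            diff = [abs(xi - mi) for xi, mi in zip(x, m)]
            # Assumed test: every dimension must lie within its threshold.
            if all(di <= ti for di, ti in zip(diff, theta)):
                dist = math.sqrt(sum(di * di for di in diff))
                if dist < best_dist:
                    best, best_dist = j, dist
        if best is None:
            # No cluster responds to this pattern: open a new one.
            clusters.append([x])
            means.append(list(x))
        else:
            clusters[best].append(x)
            members = clusters[best]
            means[best] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, means

def initial_parameters(clusters, means, theta, alpha=2.5):
    """Initial translations (cluster means) and dilations (scaled radii)."""
    n = len(theta)
    radii = [
        [max(abs(p[i] - m[i]) for p in members) for i in range(n)]
        for members, m in zip(clusters, means)
    ]
    for i in range(n):
        # Assumed minimum value strategy for zero radii.
        positive = [r[i] for r in radii if r[i] > 0.0]
        floor = min(positive) if positive else theta[i]
        for r in radii:
            if r[i] == 0.0:
                r[i] = floor
    dilations = [[alpha * ri / WINDOW_RADIUS for ri in r] for r in radii]
    return means, dilations

patterns = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.05, 0.1), (1.1, 0.9), (3.0, 3.0)]
theta = (0.5, 0.5)
clusters, means = cluster_patterns(patterns, theta)
translations, dilations = initial_parameters(clusters, means, theta)
```

On these six patterns the sketch forms three clusters, so the network would start with three wavelons; the isolated pattern at $(3, 3)$ receives a nonzero dilation through the minimum value rule. The default `alpha=2.5` matches the relaxation parameter value reported in Section 6.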
5. Training Algorithm
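The gradient descent method is used to train the WNN in this paper. Parameters are adjusted in the direction opposite to the gradient so that the objective function (18) of the model is minimized:
$$E = \frac{1}{2}\sum_{k=1}^{N} (\hat{y}_k - y_k)^2, \tag{18}$$
where $\hat{y}_k$ is the output of the network and $y_k$ is the desired output.
The corrections applied to the connecting weights $w_i$, the bias $\bar{y}$, the translations $t_{ij}$, and the dilations $d_{ij}$ are as follows:
$$\Delta w_i = -\eta_w \frac{\partial E}{\partial w_i}, \quad \Delta \bar{y} = -\eta_{\bar{y}} \frac{\partial E}{\partial \bar{y}}, \quad \Delta t_{ij} = -\eta_t \frac{\partial E}{\partial t_{ij}}, \quad \Delta d_{ij} = -\eta_d \frac{\partial E}{\partial d_{ij}}, \tag{19}$$
where $\eta_w$, $\eta_{\bar{y}}$, $\eta_t$, and $\eta_d$ are the learning rates, which should be set on the basis of the specific experiment.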
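A gradient descent step of this kind can be sketched for a one-dimensional network as follows. The sketch assumes a half sum-of-squares objective, the unnormalized Mexican Hat wavelet, and separate learning rates per parameter group; the derivatives follow from the chain rule with $z = (x - t)/d$, so that $\partial z/\partial t = -1/d$ and $\partial z/\partial d = -z/d$. The toy data, parameter values, and function names are hypothetical.

```python
import math

def psi(z):
    """Mexican Hat wavelet (assumed unnormalized form)."""
    return (1.0 - z * z) * math.exp(-0.5 * z * z)

def dpsi(z):
    """Derivative of the Mexican Hat wavelet with respect to z."""
    return (z ** 3 - 3.0 * z) * math.exp(-0.5 * z * z)

def predict(x, p):
    return p["bias"] + sum(
        w * psi((x - t) / d) for w, t, d in zip(p["w"], p["t"], p["d"])
    )

def loss(xs, ys, p):
    """Half sum of squared errors (assumed objective)."""
    return 0.5 * sum((predict(x, p) - y) ** 2 for x, y in zip(xs, ys))

def gradients(xs, ys, p):
    m = len(p["w"])
    g = {"w": [0.0] * m, "t": [0.0] * m, "d": [0.0] * m, "bias": 0.0}
    for x, y in zip(xs, ys):
        err = predict(x, p) - y
        g["bias"] += err
        for i in range(m):
            z = (x - p["t"][i]) / p["d"][i]
            g["w"][i] += err * psi(z)
            common = err * p["w"][i] * dpsi(z) / p["d"][i]
            g["t"][i] -= common        # dz/dt = -1/d
            g["d"][i] -= common * z    # dz/dd = -z/d
    return g

def train_step(xs, ys, p, rates):
    """Move every parameter against its gradient, in place."""
    g = gradients(xs, ys, p)
    for key in ("w", "t", "d"):
        p[key] = [v - rates[key] * gv for v, gv in zip(p[key], g[key])]
    p["bias"] -= rates["bias"] * g["bias"]

xs = [i / 10.0 for i in range(-20, 21)]
ys = [math.sin(x) for x in xs]
params = {"w": [0.5, -0.5], "t": [-1.0, 1.0], "d": [1.0, 1.0], "bias": 0.0}
rates = {"w": 1e-4, "t": 1e-4, "d": 1e-4, "bias": 1e-4}
loss_before = loss(xs, ys, params)
train_step(xs, ys, params, rates)
loss_after = loss(xs, ys, params)
```

The learning rates here are deliberately small so that a single step reliably lowers the objective; in practice they would be tuned per experiment, as the text notes.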
6. Simulation Examples
In this section, WNN model with two different initialization schemes is applied to three time series prediction problems, namely, the prediction of Mackey-Glass, Box-Jenkins, and traffic volume time series. The performance of WNN with the clustering based initialization approach (WNN-CIA) described in Section 4 is compared to Zhang’s heuristic initialization approach (WNN-HIA) in each simulation.
Because the architecture of WNN-HIA must be decided in advance, we adopt the same architecture as WNN-CIA in the experiments for direct comparison. The relaxation parameter in (17) of WNN-CIA is set to 2.5 in all simulations, and the Mexican Hat function defined in (11) is employed as the wavelet function in the hidden neurons of all models. The root mean square error (RMSE) of the training/testing set, given by (20), is used as the index for comparing the performance of WNN with different initialization schemes:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{k=1}^{N} (\hat{y}_k - y_k)^2}, \tag{20}$$
where $N$ is the number of samples in the set, $\hat{y}_k$ is the output of the network, and $y_k$ is the desired output.
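For reference, the RMSE index can be computed as in the following minimal sketch, which assumes the standard definition (square root of the mean squared difference between network outputs and desired outputs). The function name and sample values are illustrative.

```python
import math

def rmse(predicted, desired):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(desired) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - y) ** 2 for p, y in zip(predicted, desired))
    return math.sqrt(squared / len(predicted))

example = rmse([0.0, 0.0], [3.0, 4.0])   # sqrt((9 + 16) / 2)
perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```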
6.1. Prediction of Mackey-Glass Time Series
The Mackey-Glass chaotic time series is generated from the following delay differential equation:
$$\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1 + x^{c}(t-\tau)} - b\,x(t). \tag{21}$$
Here we predict a future value of the series using four earlier values as input variables. The parameters in (21) are set so that the equation exhibits chaotic behavior. One thousand input-output data points are extracted from the Mackey-Glass time series $x(t)$. The first 500 data pairs of the series are used as training data, while the remaining 500 data pairs are used to validate the proposed network. Applying the clustering-based initialization method of Section 4.2 yields the number of wavelons.
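A data set of this kind can be generated as in the sketch below. All numerical settings are assumptions taken from the commonly used benchmark configuration rather than from the text above: $a = 0.2$, $b = 0.1$, exponent 10, $\tau = 17$, $x(0) = 1.2$ with a constant history for $t \le 0$, inputs $x(t-18)$, $x(t-12)$, $x(t-6)$, $x(t)$ predicting $x(t+6)$, and $t$ running from 118 to 1117. The integration uses a simple Euler scheme with step 0.1, which is also an illustrative choice.

```python
def mackey_glass(n_steps, a=0.2, b=0.1, c=10, tau=17, x0=1.2, dt=0.1):
    """Euler integration; returns samples at integer times 0..n_steps."""
    per_unit = round(1.0 / dt)
    lag = tau * per_unit
    history = [x0] * (lag + 1)      # assumed constant history for t <= 0
    samples = [x0]
    for step in range(1, n_steps * per_unit + 1):
        x_now = history[-1]
        x_lag = history[-1 - lag]   # value tau time units earlier
        x_next = x_now + dt * (a * x_lag / (1.0 + x_lag ** c) - b * x_now)
        history.append(x_next)
        if step % per_unit == 0:
            samples.append(x_next)
    return samples

def make_pairs(series, t_start, t_end, lags=(18, 12, 6, 0), horizon=6):
    """Build (input vector, target) pairs from lagged samples."""
    times = range(t_start, t_end + 1)
    inputs = [[series[t - lag] for lag in lags] for t in times]
    targets = [series[t + horizon] for t in times]
    return inputs, targets

series = mackey_glass(1200)
inputs, targets = make_pairs(series, 118, 1117)
train_x, train_y = inputs[:500], targets[:500]
test_x, test_y = inputs[500:], targets[500:]
```

The 500/500 split mirrors the training and validation partition described in the text.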
For the performance comparison of WNN-CIA with WNN-HIA, some different architectures are employed for WNN-HIA. Table 1 shows the mean and standard deviation (std.) of RMSE for training and testing data obtained when 100 runs were performed by each model. The models are trained for 500 epochs in each run. Some results of different models for testing set are shown in Table 2. The RMSE reduction curve during training and testing of gradient descent algorithm corresponding to the best WNN-CIA model is drawn in Figure 3. Figures 4 and 5 show the prediction output of the best WNN-CIA model and the corresponding prediction error for training and testing data with the training and testing RMSE as 0.0080 and 0.0078.
From Table 1, it can be seen that the performance of WNN with structure and initial parameters derived by the proposed initialization approach is much better than that of WNN-HIA, even when more parameters are employed in the model.
6.2. Prediction of Box-Jenkins Time Series
The gas furnace data of Box and Jenkins (1970), that is, Box-Jenkins time series, was recorded from a combustion process of a methane-air mixture. It is well known and frequently used as a benchmark example for testing identification algorithms. During the process, the portion of methane was randomly changed, keeping a constant gas flow rate. The data set consists of 296 pairs of input-output measurements. The input is the gas flow into the furnace and the output is the CO2 concentration in outlet gas. The sampling interval is 9 s.
In this section, the data set used consists of 292 consecutive values of methane at time $(t-4)$ and CO2 produced in the furnace at time $(t-1)$ as input variables, with the CO2 produced at time $t$ as the output variable. Namely, variables $u(t-4)$ and $y(t-1)$ are used to predict $y(t)$. The data are partitioned into 200 data points as a training set and the remaining 92 points as a testing set for evaluating the performance of the proposed network. Applying the initialization method of Section 4.2 yields the number of wavelons.
As is done in Section 6.1, different architectures are employed for WNN-HIA for comparison with WNN-CIA whose structure and initial parameters are derived by the proposed approach. Table 3 shows the mean and standard deviation of RMSE for training and testing data obtained when 100 runs were performed by each model. The models are also trained for 500 epochs in each run. Table 4 shows some test results of different models. The RMSE reduction curve during training and testing of gradient descent algorithm corresponding to the best WNN-CIA model is drawn in Figure 6. Figures 7 and 8 show the prediction output of the best WNN-CIA model and the corresponding prediction error for training and testing data with the training and testing RMSE as 0.0186 and 0.0348.
From the data in Table 3, we can see that WNN-CIA outperforms WNN-HIA when the same architectures are employed. When more parameters are employed in WNN-HIA, its performance gradually improves. However, the WNN-CIA model delivers more stable performance than all WNN-HIA models in the experiments. To further examine the effectiveness of the proposed method, simulation experiments on a real-world example, traffic volume time series prediction, are carried out.
6.3. Prediction of the Traffic Volume Time Series (A Real-World Example)
Chen implemented neural network time series models for traffic volume forecasting. In this section, the hourly traffic volume data for station 5 from that work, which were collected on IR 271 and IR 90 in Cuyahoga County, are used as the real-world time series to examine the performance of WNN-CIA as well as WNN-HIA. There are 105 volume data points collected from June 4, 4:00 pm, to June 8, 12:00 pm, for training purposes, with the remaining 9 data points collected from 1:00 am to 9:00 am on June 9 reserved for checking model accuracy. This is one-step forecasting with the 6 preceding data points as the input vector. The raw time series is normalized into a fixed interval. Applying the initialization method of Section 4.2 yields the number of wavelons.
Both identical and different architectures are employed for WNN-HIA for comparison with WNN-CIA. After 100 experiments with 500 epochs in each run, Table 5 shows the mean and standard deviation of RMSE for training and testing data for the two WNN models with different initialization methods. Test results of different models are shown in Table 6. The RMSE reduction curve during training and testing of the gradient descent algorithm corresponding to the best WNN-CIA model is drawn in Figure 9. Figures 10 and 11 show the prediction output of the best WNN-CIA model and the corresponding prediction error for training and testing data, with training and testing RMSE of 0.0233 and 0.0335.
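The data preparation for one-step forecasting with 6 preceding points can be sketched as follows. The normalization interval $[0, 1]$ is an assumption (the text only states that the raw series is normalized), and the hourly counts below are made-up placeholder values, not the traffic data used in the paper.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Linearly map values into [low, high] (interval assumed)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot normalize a constant series")
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]

def sliding_windows(series, width=6):
    """Each input holds `width` consecutive points; the target is the next one."""
    count = len(series) - width
    inputs = [series[i:i + width] for i in range(count)]
    targets = [series[i + width] for i in range(count)]
    return inputs, targets

# Hypothetical hourly volumes, for illustration only.
volumes = [120, 95, 80, 150, 400, 820, 910, 760, 640, 600]
scaled = min_max_normalize(volumes)
inputs, targets = sliding_windows(scaled, width=6)
```

Predictions made on the normalized scale would need the inverse mapping before they are compared with raw volumes.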
From Table 5, we can see that the performance of WNN with the proposed clustering based initialization procedure is also superior to that with heuristic initialization approach even when more parameters are employed in WNN-HIA. It demonstrates again the validity of our methods.
7. Conclusions
In this paper, a novel initialization procedure for WNN as a time series predictor is proposed, which behaves as a dimensional clustering procedure. Taking into account the distribution of input patterns and the local response property of wavelet functions, the proposed approach classifies the input patterns dynamically. The architecture as well as the initial values of the translation and dilation parameters of the WNN model can then be determined accordingly. Simulation results demonstrate that, besides determining the number of neurons, WNN with our initialization method provides satisfactory and stable results for time series prediction.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors are grateful for the support of the National Science Foundation of China (61275120) and the Tian Yuan Special Foundation (11426207).
References
R. Setiono and H. Liu, “Feature extraction via neural networks,” in Feature Extraction, Construction and Selection, pp. 191–204, Springer US, 1998.
C. M. Bishop, Neural Networks for Pattern Recognition, The Clarendon Press, Oxford, UK, 1995.
G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control, John Wiley & Sons, 2013.
D. Veitch, Wavelet Neural Networks and Their Application in the Study of Dynamical Systems, University of York, 2005.
R. Liu, Research on Computational Intelligence-Based Structural Reliability Design Optimization, Jilin University, 2006.
S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1999.
J. Chen, Characterization and Implementation of Neural Network Time Series Models for Traffic Volume Forecasting, University of Toledo, Toledo, Ohio, USA, 1997.
H. Surmann, A. Kanstein, and K. Goser, “Self-organizing and genetic algorithms for an automatic design of fuzzy control and decision systems,” in Proceedings of the 1st European Congress on Fuzzy and Intelligent Technologies (EUFIT '93), Aachen, Germany, 1993.
J. S. R. Jang and C. T. Sun, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice-Hall, 1996.