Abstract

When a pure linear neural network (PLNN) is used to predict tropical cyclone tracks (TCTs) in the South China Sea, whether the data are normalized or not greatly affects the training process. In this paper, the min.-max. method and the normal distribution method (not restricted to the standard normal distribution) are applied to the TCT data before modeling. We propose experimental schemes in which, with the min.-max. method, the min.-max. value pair of each variable is mapped to (−1, 1) and to (0, 1), and, with the normal distribution method, each variable's mean and standard deviation pair is set to (0, 1) and to (100, 1). We present the following results: (1) data scaled to similar intervals have similar effects, no matter whether the min.-max. method or the normal distribution method is used; (2) mapping the data to an interval around 0 gives a much faster training speed than mapping them to intervals far away from 0 or using the unnormalized raw data, although the training error curves show that all configurations can approach the same lower level after enough steps. These findings can help decide on a data normalization method when a PLNN is used individually.

1. Introduction

Numerous prediction models have been proposed in the past decades to raise the forecasting precision of tropical cyclone tracks (TCTs) in the South China Sea (SCS) and thus reduce the losses caused by these disasters. TCT forecasting is a time series problem, and many improved time series techniques could be applied to this field [1–4]. The approaches are becoming more and more complicated, including ensembles [5–8]. The results presented here are part of a wider project on committee machines that aims to obtain higher performance by combining multiple simple experts. Such experts require reasonable accuracy and diversity [9]. We use pure linear neural networks (PLNNs) as the experts and then combine their results by means of fuzzy logic. This relies on a comprehensive understanding of the PLNNs, especially of their behavior on the particular data set [10].

One of the problems we face before modeling is whether the data should be normalized when predicting TCTs in the SCS and what the effect of normalization is. It has been reported in the literature that normalization can improve performance. Sola and Sevilla [11] pointed out the importance of data normalization prior to neural network training to speed up the calculations and obtain good results in a nuclear power plant application. Jayalakshmi and Santhakumaran [12] concluded that various statistical normalization techniques enhance the reliability of trained feedforward backpropagation neural networks and that the performance of a neural network classification model for diabetes data depends on the normalization method. Zhang and Sun [13] also provided a normalization method considered suitable for a particular data set from the UCI repository. So data normalization is important, but the reported results are by no means universally applicable to a particular data set. In predicting TCTs, we usually use the raw environmental data or simply map the raw data linearly to the 0-1 interval, but we are not certain whether that is beneficial.

In this paper, we try two commonly used normalization methods, the linear min.-max. method and the normal distribution method, propose experimental schemes that map the raw data variables to intervals near 0 as well as far away from 0, and then apply the normalized data to a standalone PLNN model. We intend to study whether data normalization really works and how it affects the PLNN model on TCT data, so that we can be more aware of how to pretreat the data and make the combining system more controllable once the PLNNs are used as ensemble members. We demonstrate that proper data normalization speeds up convergence remarkably on the TCT data. From the trends of the error curves, lower levels of training error can be achieved in all situations as long as enough training iterations are provided. However, mapping the data to a proper interval matters more than which of the two methods, min.-max. or normal distribution, is used to map them.

2. Pure Linear Neural Network

The pure linear neural network in this paper is a single-layer model named ADALINE (adaptive linear neuron), first proposed by Widrow and Hoff in 1960, as shown in Figure 1 [14]. Each node in the structure represents a neuron; it has its own weights and a bias and receives the same input vector p. The weighted inputs are summed with the bias to form the net output

n_i = w_{i,1} p_1 + w_{i,2} p_2 + ... + w_{i,R} p_R + b_i,

where w_{i,j} is the weight connecting input j to neuron i, b_i is the bias of neuron i, and R is the number of inputs.

The net output can also be expressed as n = Wp + b for simplicity, where W is the weight matrix whose rows are the weight vectors of the neurons and b is the vector of biases.

The neuron uses the pure linear transfer function, which takes the net output n as its input and forms the neuron output a = n. The neuron is shown in Figure 2.

The learning rule can be derived by minimizing the mean square error of the training instance, where the error e = t − a is the difference between the target (observed value) t and the network output (prediction) a. For the kth presented instance the rule is written as [15]

W(k+1) = W(k) + 2α e(k) p(k)^T,    b(k+1) = b(k) + 2α e(k),

where α is the learning rate.
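The following Python sketch (our illustration with NumPy, not the authors' MATLAB code) implements the forward pass and one incremental Widrow-Hoff update exactly as written above; W has shape (S, R), b and t have shape (S,), and p has shape (R,):

import numpy as np

def adaline_forward(W, b, p):
    # Pure linear neuron: the output equals the net input, a = n = W p + b.
    return W @ p + b

def widrow_hoff_update(W, b, p, t, alpha):
    # One incremental Widrow-Hoff (LMS) update for a single instance (p, t).
    e = t - adaline_forward(W, b, p)      # error between target and network output
    W = W + 2 * alpha * np.outer(e, p)    # W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T
    b = b + 2 * alpha * e                 # b(k+1) = b(k) + 2*alpha*e(k)
    return W, b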

In this paper, the learning rate α is obtained from a MATLAB function in the Neural Network Toolbox that returns a theoretically optimal learning rate for a linear network; this function assumes the Widrow-Hoff rule above as the learning rule.
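The exact toolbox routine is not reproduced in this paper; as a hedged sketch only, a theoretically safe learning rate for the Widrow-Hoff rule can be taken slightly below the reciprocal of the largest eigenvalue of the correlation matrix of the bias-augmented inputs, for example:

import numpy as np

def max_stable_learning_rate(P, margin=0.999):
    # P has shape (num_instances, num_inputs). The largest eigenvalue of the
    # correlation matrix of the bias-augmented inputs bounds the learning
    # rates for which LMS training stays stable; we stay slightly below it.
    Z = np.hstack([P, np.ones((P.shape[0], 1))])  # append the constant bias input
    R = (Z.T @ Z) / Z.shape[0]                    # input correlation matrix
    return margin / np.linalg.eigvalsh(R).max()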

We use the traditional incremental training strategy, in which the weights and biases are updated after each single instance, whereas with batch training the weights and biases are updated only after a batch of instances has been presented [16–18].
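To make the difference concrete, a minimal sketch of the two update schedules (our naming) is:

import numpy as np

def train_incremental(W, b, P, T, alpha):
    # Incremental training: weights and biases are updated after every instance.
    for p, t in zip(P, T):
        e = t - (W @ p + b)
        W = W + 2 * alpha * np.outer(e, p)
        b = b + 2 * alpha * e
    return W, b

def train_batch(W, b, P, T, alpha):
    # Batch training: the updates of all instances are averaged and applied once.
    dW, db = np.zeros_like(W), np.zeros_like(b)
    for p, t in zip(P, T):
        e = t - (W @ p + b)
        dW += 2 * alpha * np.outer(e, p)
        db += 2 * alpha * e
    return W + dW / len(P), b + db / len(P)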

3. Experiments and Results

3.1. Data Set

The TCT data set used in our experiment is published by the China Meteorological Administration. The objects are the TCs that formed in or moved into the SCS area and lasted for at least 48 h. The TCTs were sampled every 12 h, starting from the moment a TC moved into the sea area or developed in the area. We use the 750 instances sampled since 1960. Our objective is to predict the longitude and latitude of the TC center in the next 24 h [19].

A TCT and its changes are associated with the TC's intensity, the accumulation and replenishment of energy, and various nonlinear changes in its environmental flow field, which are referred to as variables in this paper. The variables used in our experiment include the climatology and persistence (CLIPER) factors representing changes of the TC itself, such as changes in the latitude, longitude, and intensity of a TC at 12 and 24 h before the prediction time. Table 1 lists the variables initially selected by the CLIPER method at the significance level of 0.01 [19, 20].

For our case study we use the preceding 720 instances for training and the following 30 as independent instances for testing. Eight and 11 predictors common to July, August, and September are carefully picked from the different numbers of variables shown in Table 1 for longitude and latitude, respectively. Some statistical information about the data set is listed in Table 2.

3.2. Experiment Setup

Two types of normalization methods, the min.-max. method and the normal distribution method, are used in our experiment. The min.-max. method linearly maps the original values to a new interval determined by the assigned min.-max. values (see Figure 3). The original minimum x_min and maximum x_max are obtained from the statistical information of the raw data, and the new minimum x'_min and maximum x'_max are assigned by us, so the new value x' can be calculated from the original value x by

x' = (x − x_min) / (x_max − x_min) · (x'_max − x'_min) + x'_min,

while the normal distribution method maps the original values to a new interval according to the assigned new mean and standard deviation (see Figure 4).

Similarly, the new value can be calculated by

x' = (x − μ) / σ · σ' + μ',

with the original mean μ and standard deviation σ and the assigned new mean μ' and standard deviation σ' in place of the min.-max. values.
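A small Python sketch of the two normalization formulas above (function names are ours; x is a matrix whose columns are variables):

import numpy as np

def minmax_normalize(x, new_min, new_max):
    # Linearly map each column of x from its [min, max] to [new_min, new_max].
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

def normal_normalize(x, new_mean, new_std):
    # Rescale each column of x to the assigned mean and standard deviation.
    return (x - x.mean(axis=0)) / x.std(axis=0) * new_std + new_mean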

Four normalized data sets derived from the raw data set, as well as the raw one itself, are used in our experiment; they are described in Table 3.
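Assuming Table 3 lists the configurations exactly as described in the abstract, the five data sets could be produced as follows (a sketch built on the helper functions above; raw is the matrix of raw predictor values):

configurations = {
    "Conf. 1": lambda raw: minmax_normalize(raw, -1.0, 1.0),   # min.-max. to (-1, 1)
    "Conf. 2": lambda raw: minmax_normalize(raw, 0.0, 1.0),    # min.-max. to (0, 1)
    "Conf. 3": lambda raw: normal_normalize(raw, 0.0, 1.0),    # mean 0, standard deviation 1
    "Conf. 4": lambda raw: normal_normalize(raw, 100.0, 1.0),  # mean 100, standard deviation 1
    "Conf. 5": lambda raw: raw,                                # unnormalized raw data
}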

The training process on each training data set is summarized in Algorithm 1 [21], in which we have the following. (1) Each instance in the training set is applied to the training process one by one in turn. (2) "Calculations of the instance" indicates the necessary calculations, such as the forward computation of the network output, the errors, and so on, as required.

(1) begin
(2)   repeat
(3)     Calculate the learning rate from the current training set
(4)     forall instances in the training set do
(5)       Calculations of the instance
(6)       Update the weights and biases making use of the instance and the incremental Widrow-Hoff learning rule
(7)   until the maximum number of iterations is reached;
(8) end
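Algorithm 1 can be read as the following Python sketch (our paraphrase, reusing the learning-rate helper sketched in Section 2):

import numpy as np

def train(W, b, P, T, max_iterations):
    # Algorithm 1: incremental training on the current training set (P, T).
    # One iteration presents every instance of the training set once.
    for _ in range(max_iterations):
        alpha = max_stable_learning_rate(P)      # line (3): learning rate from the training set
        for p, t in zip(P, T):                   # line (4): forall instances do
            e = t - (W @ p + b)                  # line (5): calculations of the instance
            W = W + 2 * alpha * np.outer(e, p)   # line (6): incremental Widrow-Hoff update
            b = b + 2 * alpha * e
    return W, b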

The experimental process is summarized in Algorithm 2, in which Algorithm 1 is used twice, and we have the following. (1) The network weights are initialized with uniformly distributed random real values between 0 and 1, and the experiments on the five data sets use the same initial weights. (2) The training set is initialized with the preceding 720 of the total 750 instances (including both longitude and latitude data, and the same below). (3) The testing set is initialized with the last 30 instances. (4) During the pretraining, the maximum number of iterations in Algorithm 1 is set to 1000. (5) During the repeat statement, it is set to 20.

(1) begin
(2)   Initialization begin
(3)     Network weights
(4)     Training set
(5)     Testing set
(6)   end
(7)   Pre-training on the initial training set using Algorithm 1
(8)   repeat
(9)     Training on the current training set using Algorithm 1
(10)    Test on the first instance of the testing set
(11)    Move the first instance from the testing set to the training set
(12)  until the testing set is empty;
(13) end
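The overall experiment of Algorithm 2 then reads roughly as follows (a sketch using the train function above; for illustration we assume P_all holds the 750 predictor rows and T_all a two-column target matrix, although in the paper longitude and latitude actually use different predictor sets):

import numpy as np

def run_experiment(P_all, T_all, seed=0):
    # Algorithm 2: pretrain on the first 720 instances, then predict the 30
    # independent instances one at a time, moving each into the training set.
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.0, 1.0, (T_all.shape[1], P_all.shape[1]))  # initial weights in (0, 1)
    b = rng.uniform(0.0, 1.0, T_all.shape[1])

    train_P, train_T = list(P_all[:720]), list(T_all[:720])      # initial training set
    test_P, test_T = list(P_all[720:]), list(T_all[720:])        # 30 independent instances

    W, b = train(W, b, np.array(train_P), np.array(train_T), max_iterations=1000)  # pretraining

    predictions = []
    while test_P:                                                # until the testing set is empty
        W, b = train(W, b, np.array(train_P), np.array(train_T), max_iterations=20)
        p, t = test_P.pop(0), test_T.pop(0)                      # test on the first instance
        predictions.append(W @ p + b)
        train_P.append(p)                                        # move the tested instance
        train_T.append(t)                                        # into the training set
    return np.array(predictions)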

In Algorithm 2, we arranged a pretraining process whose maximum number of iterations (1000) is much larger than that of the training process carried out for each independent instance (20). This arrangement reflects our concern that the prediction of each independent instance should be made by an adequately trained model. If the pretraining were given too few iterations, the model would be inadequately trained when it is applied to predict the independent instances; to compensate, the per-instance number of iterations would have to be set larger, and most of that effort would go into retraining the initial training set rather than the newly added independent instance. This is reasonable only for the first few independent instances. The per-instance training, however, is meant to incorporate the newly added instance into a model that has already been adequately trained on the former training set. For the later independent instances, only one new instance is added to the training set each time, so the training load is light and a large per-instance iteration number would waste resources. As a result, the pretraining iteration number is set large and the per-instance iteration number is set small. Another characteristic of Algorithm 2 is that each time only one instance is predicted, and afterwards the predicted instance is added to the training set, so the model is retrained before the next prediction. This is in accordance with actual practice: we use the 24-hour model to predict the track center for the next 24 hours, and once the observation becomes available, this instance is all the more valuable and is added to the training set to prepare for the next 24 hours.

3.3. Result and Analysis

Figure 5 shows the convergence curves during the pretraining process for the five normalization configurations. The left half of the figure is for longitude and the right half for latitude. Both halves share the same horizontal coordinate, namely, the number of iterations, labeled on the top edge and numbered along the bottom edge. The vertical coordinate is the mean absolute error (MAE) over all the instances in the training set after the given number of training iterations. The vertical scales differ for longitude and latitude and are numbered along the left edge of the figure. The five colored thick curves with various line styles reflect the training precision and speed: the smaller the error value, the higher the precision, and the steeper the curve, the faster the convergence. We should make it clear that all the results, including the MAEs, are denormalized before being drawn in the figures or listed in the tables, and the same applies below. As Algorithm 1 shows, one iteration of training means that every instance in the training set is applied to the learning rule once to update the weights.

The curves of Conf. 1 and Conf. 2 both use the min.-max. normalization method. The difference is that Conf. 1 places the normalized data around zero with radius 1, while Conf. 2 places them around 0.5 with radius 0.5. This minor difference makes the curve of Conf. 1 lower and steeper than that of Conf. 2 during the pretraining, for both longitude and latitude.

What Conf. 1 and Conf. 3 have in common is that their means are 0 and their radii are about 1; they differ in that Conf. 1 uses a uniform spread while Conf. 3 uses a normal distribution. The convergence curves of Conf. 1 and Conf. 3 are very close, for both the longitude and the latitude data. For the longitude data, the MAE values on the curve of Conf. 1 are smaller than those on the curve of Conf. 3, while for the latitude data the opposite holds. So it is hard to tell which configuration is superior to the other.

Conf. 3 and Conf. 4 both use the normal distribution method, but the mean of Conf. 3 is 0, while that of Conf. 4 is shifted from 0 to 100. As a result, the MAE of Conf. 4 is so large that its curve falls outside the axis limits of Figure 5.

Conf. 5 uses the raw data, whose values lie in different intervals, ranging from −45 to 525 for the different variables (see Table 2). The curve of Conf. 5 does not deviate as much as the curve of Conf. 4 does, although it is flatter than the curves of Conf. 1, Conf. 2, and Conf. 3. This implies that the initial weights, ranging between 0 and 1, are possibly not so decisive for large-valued data, although the data of Conf. 4 could be influenced to some extent by sharing the same initial weights with the other configurations.

Table 4 lists the numbers of iterations required for the PLNN to reach the same level of training error during the pretraining process, for both the longitude and the latitude data. It quantitatively illustrates that mapping the data to intervals around 0 (Conf. 1 and Conf. 3) reaches the smaller training error much faster; even a slight shift of the mean from 0 to 0.5 (Conf. 2) lengthens this process, let alone the giant shift from 0 to 100 (Conf. 4).

Figure 6 illustrates the fitting curves of longitude and latitude on the independent instances. The left and right halves of the figure are for longitude and latitude, respectively. Both halves share the same horizontal coordinate, the indexes of the independent instances, labeled on the top edge and numbered along the bottom edge. The vertical coordinate is the real east longitude or north latitude value. The five prediction curves, as well as the observed one (the targets, black solid line), are drawn for comparison, each curve connecting the adjacent fitting points. It is worth noting that the training set corresponding to each independent instance includes the original training set (720 instances) and the previously used independent instances; for example, for the fifth independent instance, the current training set contains 720 plus 4, in total 724 instances. The closer a point is to the target point, the more accurate the prediction. For east longitude, a larger value means the prediction is east of the target position and a smaller value means west; for north latitude, larger and smaller values mean north and south of the target, respectively.

The curves of Conf. 1, Conf. 2, and Conf. 3 are very consistent, with almost no visible difference in the figure. Conf. 5 uses the unnormalized data, and its predictions are similar to those of Conf. 1, Conf. 2, and Conf. 3. On the curve of Conf. 4, the deviations from the targets are sometimes large and sometimes small, showing more instability. However, when systems are combined, this instability could help cancel out errors, so the direction and magnitude of these unstable deviations are also worth further study. From a qualitative point of view, the prediction trends of all five configurations are consistent with the target. Table 5 lists the mean absolute errors on the testing set from a quantitative point of view.

The MAEs for longitude and latitude are calculated by

MAE = (1/N) · Σ_{i=1..N} |t_i − a_i|,

in which N is the size of the testing set and t_i and a_i are the target and the predictive output of the ith independent instance. The distance error (Dst., a row header in Table 5) is calculated by the commonly used formula for the distance between the predicted and the target positions given by their longitudes and latitudes.
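The MAE computation, together with one common way of turning longitude and latitude differences into a distance in kilometers, could look like the sketch below; the haversine form is our assumption, since the paper's own distance formula is not reproduced here:

import numpy as np

def mae(targets, outputs):
    # Mean absolute error over the testing set.
    return np.mean(np.abs(np.asarray(targets) - np.asarray(outputs)))

def distance_error_km(lon_t, lat_t, lon_p, lat_p, earth_radius_km=6371.0):
    # Great-circle (haversine) distance between target and predicted TC centers.
    lon_t, lat_t, lon_p, lat_p = map(np.radians, (lon_t, lat_t, lon_p, lat_p))
    h = (np.sin((lat_p - lat_t) / 2) ** 2
         + np.cos(lat_t) * np.cos(lat_p) * np.sin((lon_p - lon_t) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(h))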

The distance errors of Conf. 1 and Conf. 3, whose means are zero, differ by only 0.3 km. Among the errors of Conf. 1, Conf. 2, and Conf. 3, whose means are near zero, the maximum difference is only 7.8 km. For such a large-scale motion as a TC, these errors can be regarded as being on the same level, whereas the distance error of Conf. 4 exceeds the minimum error, that of Conf. 2, by as much as 56.8 km.

4. Discussion

The experiments show that data normalization does affect the performance, as we expected. Convergence is more likely if we use data scaled to an interval around zero, whether by the min.-max. method or by the normal distribution method. However, it is not absolutely necessary that the data center lie exactly at the zero point: in our experiments, the testing errors using the data scaled to (0, 1) are even smaller than those using the data scaled to (−1, 1), which we consider to be within the normal error tolerance range. But when the data center moves to 100 while the distribution radius remains 1, the errors increase greatly. This is probably because data located around zero make it easier to cancel out errors; the farther the data center is from zero and the shorter the data distribution radius, the harder the convergence.

The raw data are made up of all kinds of meteorological variables, so it is difficult to ensure that they all lie around zero, which could affect the performance. In our experiments, although values far from zero also exist in the raw data, as low as −45 and as high as 525, the errors are not so large and the convergence is not as slow as for Conf. 4. This is probably because of the diversity of the raw data: some variables are around zero, some are not, some are positive, and some are negative, so they can also cancel out the errors to some degree.

The value ranges probed for data normalization in our experiments are not systematic, for either the min.-max. method or the normal distribution method. The experiments could be improved by using more data configurations to confirm the observed trends, with more different data centers and more ranges. For example, the data centers could be 0, 1, 50, 100, 500, 1000, and so forth, and for each data center several radii of different orders of magnitude, depending on the center value, could be considered.

In our experiments, only one normalization method is applied to all variables in the same data set; in other words, in one experiment all the variables use the same normalization method. However, this differs from real situations, in which different variables should perhaps use different methods. For example, the longitude variable should be scaled to an evenly distributed range, because the Earth is round and longitudes are evenly distributed over its surface, while the surface wind speed should use the normal distribution method because of the distribution of its possible values.

Despite the slower convergence and larger mean errors of Conf. 4, individual points on its fitting curve for the independent instances show higher precision as well as lower precision, that is, more instability. However, when combining subsystems in committee machines, which require the experts to be reasonably accurate (better than guessing) and diverse (errors uncorrelated), some instability may be beneficial. The diversity among the curves of Conf. 1, Conf. 2, Conf. 3, and Conf. 5 in Figure 6 is clearly limited. So, in the study of committee machines, how to use the normalization methods to control the model outputs, so that they show some diversity while remaining bounded to some degree, is worthy of further exploration.

5. Conclusions

In this work, we have proposed experimental schemes that map the TCT data to different ranges, near to as well as far away from 0, using the min.-max. and normal distribution methods. The normalized data were then applied to the PLNN model to predict TC centers in the SCS. It has been shown that the min.-max. and normal distribution methods produce similar results as long as they map the data to similar value ranges. We also demonstrated that mapping the data to an interval around 0 remarkably speeds up training, although, given enough training iterations, a smaller error can be reached in all the situations considered.

Future work will be devoted to the initialization of the network weights, which is also expected to greatly affect the results. Being more aware of how data normalization and the initial weights of the networks influence the training process, together with the theoretically optimal learning rate provided by the MATLAB Neural Network Toolbox, we can gain a more comprehensive understanding of the individual PLNN and can better design and control the combining system.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61203301), the Major Special Funds for Guangxi Natural Science Foundation (no. 2011GXNSFE018006), and the National Natural Science Foundation of China (no. 41065002). Ming Li acknowledges support in part by the National Natural Science Foundation of China under Project Grants nos. 61272402, 61070214, and 60873264. Ming Li also thanks the Science and Technology Commission of Shanghai Municipality for support under Research Grant no. 14DZ2260800. The authors are grateful to Xiao-Yan Huang for providing and refining the variable information of the data set.