A Joint Optimization of Momentum Item and Levenberg-Marquardt Algorithm to Level Up the BPNN’s Generalization Ability
Back propagation neural network (BPNN), a kind of artificial neural network, is widely used in pattern recognition and trend prediction. The standard BPNN has several drawbacks, such as trapping in local optima, oscillation, and long training time, because it is trained by the gradient descent method with a fixed learning rate. The momentum item and the Levenberg-Marquardt (LM) algorithm are two ways to adjust the weights among the neurons and improve the BPNN’s performance. However, both algorithms still leave room for improvement. This paper proposes a hybrid optimization of the damping factor of LM and a dynamic momentum item. The improved BPNN is validated on the Fisher Iris data and wine data. It is then used to predict visit_spend on the database provided by Dunnhumby's Shopper Challenge. Compared with two other improved BPNNs, the proposed method performs better. Therefore, the proposed method can be used for pattern recognition and time series prediction more effectively.
1. Introduction
Artificial neural network is a computing model which is similar to the biological neural network. It is widely used in simulation, trend prediction, pattern recognition, and control systems. It can learn by itself without knowing the function and relationship among the training datasets, and it is widely used in practice and researched in academia. For example, Jing and Cheng developed a new optimal PID learning algorithm for training feedforward neural networks (FNN) for any purpose (system identification, function approximation, pattern recognition, control, etc.). In their paper, they compared its effect with several types of neural networks, such as BP, scaled conjugate BP, and LM-BP. Although the standard BPNN is one of the typical neural networks, it still has many shortcomings. Because the learning rate is fixed and the weights are adjusted by the gradient descent method according to error back propagation, the standard BPNN easily becomes trapped in local optima, oscillates, and needs a long training time. These issues make the standard BPNN unable to meet the demands of pattern recognition or data regression tasks which need fast processing (e.g., the real-time condition monitoring of complex mechanical equipment and some control systems). In recent decades, many improvements on the standard BPNN were developed to speed up the training process and get more accurate results, and some are variants of neural networks. Among the published works, the improvements aim at solving the three main drawbacks: slow learning speed, oscillation, and convergence to local optima.
(1) Alternative Learning Rate. In the standard BPNN, the learning rate is fixed during the whole iteration process. The magnitude of the learning rate decides the step size of the weight update, which is strongly related to the learning time. Roy introduced the near optimal learning rate into the adjustment of the learning rate. Song et al. adjusted the learning rate value according to the change of system error between two consecutive steps. Li et al. reported a self-adaptive learning rate based on the changes of weights and biases. Zhang et al. proposed a new dynamic optimal learning rate which is related to the previous approach after the first iteration. Hasan et al. developed the Hanning window neural network and used the Hanning window function to make the learning rate dynamic. Linear matrix inequality techniques were also used to find appropriate learning rates that guarantee fast and robust convergence. These methods improve the learning speed.
(2) Momentum Item Methods. The momentum method is another way to modify the adjustment of weights. It reflects the previous information of the weight adjustment and can accelerate the training while weakening the oscillation. A new way to accelerate the convergence by adjusting the learning rate and momentum factor in the same iteration was reported by Yu and Liu. Wang et al. proposed a restart strategy for the momentum in order to make cyclic and almost-cyclic learning converge with a single hidden layer neural network. Another way to adjust the momentum coefficient according to the error function and the weights in the network was given by Wu et al.
(3) Second Order Methods. The gradient descent method only uses the first-order derivative of the error function; many works have shown that second-order derivative methods are more efficient than gradient descent in increasing the convergence speed and getting more accurate results. These second order methods include quasi-Newton, conjugate gradient, and the LM algorithm [13, 14]. In particular, the LM algorithm blends the Gauss-Newton technique and the steepest-descent algorithm while avoiding many of their limitations. The blend is controlled by a damping factor. The parameter in the LM algorithm is adjusted according to the iteration effectiveness [15–17], while most works just use the LM algorithm as the training method without any improvement.
Although many improvements to BPNN have been proposed, it is still deficient. In particular, engineering applications such as real-time condition monitoring and trend prediction demand less training time and more accurate results from the BPNN so that timely operations can be implemented. Therefore, in this paper, a hybrid optimization of the dynamic momentum item and the damping factor of the LM algorithm is proposed to accelerate convergence and get more accurate results. The momentum item is added to weaken the oscillation, and the LM algorithm is involved to speed up the iteration. Unlike the standard BPNN, the proposed method adjusts the weights using the second order derivative information and the previous iteration. The weights in the next iteration are decided by three parts: the current weight, the weight adjustment, and the hybrid optimized previous weight adjustment. The influence of the previous weight adjustment is determined by the momentum coefficient and the damping factor in the LM algorithm, both of which vary in each iteration. The momentum coefficient is adjusted from its value in the last iteration: it increases if the error of the BPNN declines and, when it reaches the maximum, it returns to a certain value. Conversely, the damping factor in LM declines if the iteration performs better than before. Overall, the parameters vary according to their former values and the iteration efficiency.
The rest of the paper is structured as follows. Section 2 presents the novel momentum and LM algorithm to train the BPNN with one hidden layer. Section 3 validates the proposed training method on pattern recognition with the Fisher Iris data and wine data, comparing it with the BPNN trained by the LM algorithm of El-Alfy and the BPNN trained by the improved LM algorithm provided by Nørgaard. In Section 4, Dunnhumby’s Shopper Challenge dataset is used to demonstrate the improvement in prediction. Section 5 outlines the conclusions and presents the future work.
2. Modified Training Algorithm for BPNN
2.1. Basic BPNN
BPNN is a kind of feedforward neural network whose main principle is error back propagation. The adjustment of weights is based on gradient descent, which requires that the activation function has a first order derivative. The iteration terminates when the predetermined error goal is met or the maximum number of iteration steps is reached. Because training compares the network output with a desired output, it is a supervised network. Its learning rule with one hidden layer is shown in Figure 1.
For convenience of description, suppose that the network has one input layer, one hidden layer, and one output layer, and that $x_i$ and $d_i$ are the input vectors and the desired output vectors, respectively, where $i = 1, 2, \ldots, N$ indexes the samples. In the standard BPNN, the weights are adjusted according to the derivative of the error function, so the choice of error function is vital. Fontenla-Romero et al. considered the influence of the slope of the nonlinear activation functions and proposed a new way to measure the error. Nguyen et al. proposed a new cost function that considers the system error and the cluster weight, which represents an approximation to the probability mass. In this paper, the error function is defined as follows:
$$E(w) = \frac{1}{2}\sum_{i=1}^{N} \lVert d_i - y_i \rVert^{2}, \tag{1}$$
where $y_i$ is the real network output and $w$ is the weight vector.
For the standard BPNN, gradient descent is implemented to train the network, and the weights are updated from the last iteration and its changes:
$$w(k+1) = w(k) + \Delta w(k), \tag{2}$$
where $w(k)$ is the weight vector in the $k$th iteration, $\Delta w(k)$ denotes the changes of weights in the $k$th iteration, and $w(k+1)$ is the weight vector for the next iteration.
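The update of the weight vector from one iteration to the next can be illustrated with a minimal sketch. It assumes a fixed learning rate `eta` and a toy quadratic error function; the names `gradient_descent` and `toy_grad` are hypothetical and stand in for the gradient that back propagation would supply in a real BPNN.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Iterate w(k+1) = w(k) + dw(k) with dw(k) = -eta * grad(w(k))."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        dw = [-eta * gi for gi in g]               # weight change in this iteration
        w = [wi + dwi for wi, dwi in zip(w, dw)]   # weight vector for the next iteration
    return w


def toy_grad(w):
    """Gradient of the toy error E(w) = 0.5 * ((w0 - 3)^2 + (w1 + 1)^2)."""
    return [w[0] - 3.0, w[1] + 1.0]


# The toy error has its minimum at (3, -1); a fixed step converges there slowly.
w_final = gradient_descent(toy_grad, [0.0, 0.0], eta=0.1, steps=200)
print(w_final)
```

Because the step size never changes, the number of iterations needed depends heavily on the chosen `eta`, which is the drawback the adaptive schemes discussed in this paper try to remove.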
2.2. The Classical Momentum Item and LM Algorithm
Among the published improvements on the standard BPNN, the momentum item and the LM algorithm are two common and effective ways to improve the BPNN’s performance. They are used to weaken the oscillation and speed up convergence. The main characteristics of the two algorithms are as follows. More details can be found in [21, 22], respectively.
(1) Gradient descent with momentum item: the weight’s change is related to the previous weight update:
$$\Delta w(k) = -\eta\, g(k)\, x(k) + \alpha\, \Delta w(k-1), \tag{3}$$
where $\eta$ is the learning rate and $g$ is the gradient term derived from the standard BPNN derivation process. When the weight update of the input-hidden layer is calculated, $x$ is the input of the samples, while, when the weight update of the hidden-output layer is calculated, $x$ is the output of the hidden layer. $\alpha$ is the momentum coefficient, $\alpha \in (0, 1)$.
(2) Levenberg-Marquardt: the weight update rule is as follows:
$$\Delta w(k) = -\left(J^{T}J + \mu I\right)^{-1} J^{T} e, \tag{4}$$
where $e$ denotes the error vector of the neural network and $I$ is the identity matrix. $\mu$ is a damping factor which impacts the performance of the convergence. It also balances the steepest-descent method and the Gauss-Newton method: if $\mu$ is large, expression (4) approximates the steepest-descent method; otherwise, when $\mu$ is small, it approximates the Gauss-Newton method. $J$ is the Jacobian matrix of the error vector with respect to the weights, which is defined as follows:
$$J = \begin{bmatrix} \partial e_1/\partial w_1 & \cdots & \partial e_1/\partial w_n \\ \vdots & \ddots & \vdots \\ \partial e_m/\partial w_1 & \cdots & \partial e_m/\partial w_n \end{bmatrix}. \tag{5}$$
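The two classical update rules can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the error vector is taken as $e = d - y$, the Jacobian as the derivative of $e$ with respect to the weights, and the toy model $y = w_0 x + w_1$ replaces a real network. The helper names `solve`, `lm_step`, and `momentum_step` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lm_step(J, e, mu):
    """Levenberg-Marquardt step dw = -(J^T J + mu*I)^(-1) J^T e."""
    n = len(J[0])
    JtJ = [[sum(row[i] * row[j] for row in J) for j in range(n)] for i in range(n)]
    Jte = [sum(row[i] * ei for row, ei in zip(J, e)) for i in range(n)]
    A = [[JtJ[i][j] + (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [-d for d in solve(A, Jte)]


def momentum_step(grad, prev_dw, eta, alpha):
    """Momentum step dw(k) = -eta * grad + alpha * dw(k-1), grad taken w.r.t. the weights."""
    return [-eta * g + alpha * p for g, p in zip(grad, prev_dw)]


# Toy data generated by y = 2x + 1; the weights start at w = (0, 0).
xs = [0.0, 1.0, 2.0, 3.0]
ds = [2.0 * x + 1.0 for x in xs]
e = [d - 0.0 for d in ds]            # e = d - y with y = 0 at the starting weights
J = [[-x, -1.0] for x in xs]         # de/dw for the model y = w0*x + w1

dw_small_mu = lm_step(J, e, mu=1e-9)   # behaves like Gauss-Newton
dw_large_mu = lm_step(J, e, mu=1e6)    # behaves like a short steepest-descent step
print(dw_small_mu, dw_large_mu)
```

With a very small damping factor, a single step lands almost exactly on the least-squares solution of this linear toy problem, while a very large damping factor yields a tiny step along the negative gradient direction.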
2.3. The Hybrid Optimization in BPNN
In the standard algorithm, the momentum coefficient $\alpha$ is static: it does not change during the whole iteration process, which limits the impact of the momentum. Traditionally, the damping parameter $\mu$ becomes larger or smaller according to the performance. For example, it is decreased or increased by a factor of 10 depending on whether the performance has improved or not, respectively. However, the merits of the momentum item and the LM algorithm have not been fully exploited. So, in this paper, the two algorithms are optimized simultaneously and the weight equation for the proposed BPNN is as follows:
(1) The Adjustment of α. The proposed method treats the momentum coefficient as dynamic. $\alpha$ is updated according to the change in error and its value in the former iteration. If the error decreases in this iteration, the previous weight update is beneficial to convergence and the search direction is correct. Therefore, $\alpha$ should be larger to encourage searching in this direction next time. Otherwise, the momentum coefficient should decline. The weight update can be formulated as follows:
However, the momentum coefficient should not increase indefinitely; when it is too big, it can adversely influence the network. Therefore, it should be reset to a decimal in the interval (0,1). The restriction rule of $\alpha$ is described as follows:
(2) The Adjustment of μ. $\mu$ could be regarded as playing the role of the learning rate in the standard momentum method, while it is the damping parameter of the LM algorithm. The principle of updating $\mu$ is different from that of $\alpha$. If the error decreases, $\mu$ should decline so that LM behaves like the Gauss-Newton method, with its advantage of fast convergence in local search. Otherwise, $\mu$ increases so that LM behaves like the steepest-descent algorithm for searching for the global optimum. Too big or too small an update of $\mu$ makes the network take a long time to train, a disadvantage similar to that of the learning rate in the standard BPNN. An appropriate update of $\mu$ is more efficient for convergence. The rule for adjusting $\mu$ is as follows:
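The sketch below is not the paper's rule set in expressions (7)–(9); it only illustrates the direction of the adjustments described in this section. Every constant in it (`alpha_up`, `alpha_down`, `alpha_max`, `alpha_reset`, `mu_factor`) is a hypothetical placeholder chosen for the example, with the factor of 10 borrowed from the traditional damping update mentioned above.

```python
def adapt_parameters(alpha, mu, err_new, err_old,
                     alpha_up=1.1, alpha_down=0.5,
                     alpha_max=0.9, alpha_reset=0.01,
                     mu_factor=10.0):
    """Direction-only illustration of the alpha and mu adjustments.

    All default constants are hypothetical and are not taken from the paper.
    """
    if err_new < err_old:
        alpha *= alpha_up      # error fell: encourage the previous search direction
        mu /= mu_factor        # move LM toward Gauss-Newton behaviour
    else:
        alpha *= alpha_down    # error rose: rely less on the previous update
        mu *= mu_factor        # move LM toward steepest-descent behaviour
    if alpha >= alpha_max:
        alpha = alpha_reset    # reset to a small value inside (0, 1)
    return alpha, mu


# Error decreased from 0.2 to 0.1: alpha grows, mu shrinks.
print(adapt_parameters(0.5, 1.0, err_new=0.1, err_old=0.2))
# Error increased from 0.2 to 0.3: alpha shrinks, mu grows.
print(adapt_parameters(0.5, 1.0, err_new=0.3, err_old=0.2))
```

In a training loop, a helper of this kind would be called once per iteration, after the new error has been evaluated and before the next weight update is computed.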
3. Validation on Pattern Recognition
In this paper, we suppose there is just one hidden layer. The hidden neurons use the hyperbolic tangent activation function and the output neurons use a linear activation function.
The number of neurons in the hidden layer is important for convergence speed. Too many or too few neurons make the network take a long time to train. Traditionally, it is set by personal experience. In this paper, the number of neurons in the hidden layer, $h$, depends on the empirical formula which is shown as follows:
$$h = \sqrt{in + on} + a, \tag{10}$$
where $in$ and $on$ are the numbers of neurons in the input layer and output layer, respectively, and $a$ is a constant in the interval $[1, 10]$. $in$ and $on$ are determined by the dimensionality of the input vector and the output vector. For example, if the input and output datasets have 4 attributes and 1 attribute, respectively, the numbers of neurons in the input layer and output layer are 4 and 1.
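As a worked illustration of sizing the hidden layer, the sketch below assumes the widely used rule of thumb $h = \sqrt{in + on} + a$ with $a$ in $[1, 10]$. The function name `hidden_neurons`, the range check, and the rounding to the nearest integer are assumptions made for this example.

```python
import math


def hidden_neurons(n_in, n_out, a):
    """Rule-of-thumb hidden layer size: round(sqrt(n_in + n_out) + a).

    Assumes a is chosen in [1, 10]; rounding to an integer is an added assumption.
    """
    if not 1 <= a <= 10:
        raise ValueError("a is expected to lie in [1, 10]")
    return round(math.sqrt(n_in + n_out) + a)


# Four input attributes and one output attribute, with a = 3.
print(hidden_neurons(4, 1, 3))
```

Since $a$ is a free constant, the rule only narrows the search: several values of `a` would normally be tried and the fastest-converging network kept.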
To validate the proposed method on pattern recognition, the Fisher Iris data and wine data provided by UCI are used. Both datasets are normalized with the “mapminmax” function provided by the Matlab toolbox. All the parameters shared by the three compared methods are the same: the maximum number of iteration steps is 50000000, the training error goal is 1e-10, and the initial $\mu$ is 1. In our improved program, the initial momentum coefficient is 0.01. $\alpha$ and $\mu$ are adjusted according to (7)–(9).
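For readers without Matlab, the normalization step can be approximated as below. The sketch assumes the default behaviour of “mapminmax”, which linearly maps each attribute onto $[-1, 1]$; the function names are hypothetical, and leaving a constant attribute unchanged is a choice made for this example.

```python
def min_max_scale(values, lo=-1.0, hi=1.0):
    """Linearly map values onto [lo, hi]: y = (hi - lo) * (x - min) / (max - min) + lo."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return list(values)    # constant attribute: left unchanged in this sketch
    return [(hi - lo) * (v - vmin) / (vmax - vmin) + lo for v in values]


def scale_columns(rows):
    """Normalize a dataset given as a list of samples, one attribute at a time."""
    columns = [min_max_scale(list(col)) for col in zip(*rows)]
    return [list(sample) for sample in zip(*columns)]


samples = [[5.0, 10.0], [6.0, 20.0], [7.0, 30.0]]
print(scale_columns(samples))
```

The minimum and maximum found on the training set should be reused for the test set, so that both are mapped with the same linear transformation.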
3.1. The Fisher Iris Data Example
The Fisher Iris data is one of the most famous databases in pattern recognition. The dataset includes 3 classes of 50 instances each. The classes are “setosa,” “versicolor,” and “virginica,” which are labeled as “1,” “2,” and “3,” respectively. The attributes are sepal length, sepal width, petal length, and petal width, all in centimeters. The whole dataset is divided into 2 parts: one for training and the other for testing. The data classification and the corresponding classification map in the first three attributes are shown in Figures 2 and 3.
The output for the test data must be processed before it can be compared with the real value. Because the network computes real-valued outputs, the raw output may be nonintegral, so the “round” function provided by the Matlab toolbox is used to map it to a class label. For brevity, ELM-BPNN denotes the BPNN trained by El-Alfy’s LM algorithm, and NLM-BPNN denotes the LM-BPNN improved by Nørgaard. The compared results are shown in Table 1.
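The post-processing of the network output can be sketched as follows. Matlab's “round” rounds halves away from zero, whereas Python's built-in `round` rounds halves to the nearest even integer, so the sketch uses its own helper. The names `round_half_away` and `accuracy` and the sample outputs are hypothetical.

```python
import math


def round_half_away(x):
    """Round to the nearest integer, with halves rounded away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def accuracy(outputs, labels):
    """Fraction of samples whose rounded output equals the integer class label."""
    predicted = [round_half_away(o) for o in outputs]
    correct = sum(1 for p, t in zip(predicted, labels) if p == t)
    return correct / len(labels)


# Illustrative raw outputs for four test samples with labels 1, 2, 3, 2.
raw_outputs = [0.9, 2.2, 2.6, 1.4]
true_labels = [1, 2, 3, 2]
print(accuracy(raw_outputs, true_labels))
```

In this toy case the last output rounds to class 1 instead of class 2, so three of the four samples are counted as correct.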
In these 3D figures, the pink points mark the wrongly recognized samples. The above figures and table show that the proposed method achieves a higher correct rate.
The errors of iteration process are shown in Figure 6.
The statistics of iteration steps, iteration time, and accuracy are listed in Table 2.
3.2. Wine Data Example
The wine data is provided by UCI. The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from 3 different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. These attributes are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The numbers of instances are 59 in Class 1, 71 in Class 2, and 48 in Class 3. Among these instances, 19 are used as the test set, and the rest form the training set. The classification and the corresponding map in the first three attributes are shown in Figures 7 and 8.
Similarly, the outputs of the different networks are processed by the “round” function, and the results are listed in Table 3.
The errors of iteration process are shown in Figure 11.
The statistics of iteration steps, iteration time, and accuracy are listed in Table 4.
By comparison, the proposed LM-BPNN gets more accurate results than ELM-BPNN and NLM-BPNN. In terms of training speed, the proposed LM-BPNN is faster than ELM-BPNN.
4. Validation on Prediction
For prediction, Dunnhumby’s Shopper Challenge database is used to compare the performances of the three methods. The dataset consists of details of every visit made by 100,000 customers over a year from April 2010 to March 31, 2011. Each visit is stamped with the date and the customer’s spend in that visit. The challenge is to predict the visit_date and visit_spend of the next visit for each customer_id. In this section, however, we only predict the visit_spend from April to December 2010.
To be used for training the network, the data should be limited to a bounded interval. Because predicted values may fall outside the range of the training data, all the data is divided by a sufficiently large constant. Besides, two performance indexes are given to assess the trained networks:
Expressions (11) and (12) evaluate the deviation from the real value, where abs denotes the absolute value. The basic information of the test dataset and the performances of the three networks are given in Table 5.
Comparing the results in Tables 5 and 6, the proposed LM-BPNN performs better than ELM-BPNN in prediction. The two performance indexes illustrate that the proposed method is more stable.
The results from pattern recognition (in Section 3) and prediction (in Section 4) show that the proposed BPNN performs better than the other two methods. It finishes the calculation in less time and fewer iteration steps with higher accuracy, because it makes full use of the error information and the efficiency of the previous iteration. The momentum item overcomes the oscillation and steers the iteration in a direction of smaller error; simultaneously, the LM algorithm is a good balance of the steepest-descent method and the Gauss-Newton method. Therefore, the proposed method not only weakens the oscillation but also converges to the optimum in fewer iteration steps.
5. Conclusions
In this paper, a joint optimization of the momentum item and the Levenberg-Marquardt algorithm is proposed to train BPNN. Its performance is compared with ELM-BPNN and NLM-BPNN in pattern recognition and prediction. The validation data is provided by UCI and a public challenge. The results show that the proposed LM-BPNN performs better. Nevertheless, there is still much work to do. For example, the parameter adjustment strategy should be more adaptive. We also found that the initial parameter values could influence the network’s convergence speed, so how to select appropriate initial values is worth researching.
Conflict of Interests
The authors declare that they have no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the anonymous referees for their remarkable comments and acknowledge the great support of the Key Project supported by National Science Foundation of China (51035008), the Natural Science Foundation Project of Chongqing (CSTC, 2009BB3365), and the Fundamental Research Funds for the State Key Laboratory of Mechanical Transmission, Chongqing University (SKLMT-ZZKT-2012 MS 02).
References
X. J. Jing and L. Cheng, “An optimal PID control algorithm for training feedforward neural networks,” IEEE Transactions on Industrial Electronics, vol. 60, no. 6, pp. 2273–2283, 2013.
S. Roy, “Factors influencing the choice of a learning rate for a backpropagation neural network,” in Proceedings of the IEEE World Congress on Computational Intelligence, IEEE International Conference on Neural Networks, pp. 503–507, June-July 1994.
G. J. Song, J. L. Zhang, and Z. L. Sun, “The research of dynamic change learning rate strategy in BP neural network and application in network intrusion detection,” in Proceedings of the 3rd International Conference on Innovative Computing Information and Control (ICICIC '08), p. 513, June 2008.
Y. Li, Y. Fu, H. Li, and S.-W. Zhang, “The improved training algorithm of back propagation neural network with selfadaptive learning rate,” in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), pp. 73–76, June 2009.
T. Zhang, C. L. P. Chen, C. H. Wang, and S. C. Tam, “A new dynamic optimal learning rate for a two-layer neural network,” in Proceedings of the International Conference on System Science and Engineering (ICSSE '12), pp. 55–59, June-July 2012.
M. M. Hasan, A. Rahaman, M. Talukder, M. Islam, M. M. S. Maswood, and M. M. Rahman, “Neural network performance analysis using hanning window function as dynamic learning rate,” in Proceedings of the International Conference on Informatics, Electronics & Vision (ICIEV '13), pp. 1–5, May 2013.
C.-C. Yu and B.-D. Liu, “A backpropagation algorithm with adaptive learning rate and momentum coefficient,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN '02), pp. 1218–1223, May 2002.
P. S. Lande and A. S. Gadewar, “Application of artificial neural networks in prediction of compressive strength of concrete by using ultrasonic pulse velocities,” IOSR Journal of Mechanical and Civil Engineering, vol. 3, no. 1, pp. 34–42, 2012.
E. S. M. El-Alfy, “Detecting pixel-value differencing steganography using Levenberg-Marquardt neural network,” in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM '13), pp. 160–165, April 2013.
G. H. Nguyen, A. Bouzerdoum, and S. L. Phung, “A supervised learning approach for imbalanced data sets,” in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), pp. 1–4, December 2008.
M. Nørgaard, “Neural network based system identification toolbox,” Tech. Rep., Department of Automation, Technical University of Denmark, 2000.