Abstract

Convolutional neural networks (CNNs) are effective models for image classification and recognition. Gradient descent optimization (GD) is the basic algorithm for CNN model optimization. Since GD appeared, a series of improved algorithms have been derived, among which adaptive moment estimation (Adam) has been widely recognized. However, Adam ignores local gradient changes to some extent. In this paper, we introduce an adaptive learning rate factor based on the current and recent gradients. With this factor, we dynamically adjust the learning rate of each individual parameter and thereby adaptively adjust the global convergence process. The convergence of the proposed algorithm is proven by using the regret bound approach of the online learning framework. In the experimental section, comparisons are conducted between the proposed algorithm and other existing algorithms, such as AdaGrad, RMSprop, Adam, diffGrad, and AdaHMG, on test functions and the MNIST dataset. The results show that Adam and RMSprop combined with our algorithm not only find the global minimum faster on the test functions but also produce better convergence curves and higher test set accuracy on the dataset experiments. Our algorithm is a supplement to existing gradient descent algorithms and can be combined with many of them to improve the efficiency of iteration, speed up the convergence of the cost function, and improve the final recognition rate.

1. Introduction

As a basic technology of deep learning, convolutional neural networks (CNNs) have been widely used in many fields since their inception. CNNs play an important role in fields such as the classification of images [1, 2], the analysis of microexpressions [3], the recognition of faces [4], and the detection of objects [5]. Various improved CNN models have been applied to different image tasks. The basic CNN is a supervised learning algorithm; however, its outstanding image feature extraction ability comes with related problems [6–8]. This has prompted many scholars to conduct in-depth research. They proposed methods that combine a CNN with an algorithm that provides a classification basis to form an unsupervised learning algorithm, such as CSFL, which has broader application prospects [9, 10].

During the process of training the CNN model, we must identify the minimum value of the cost function. To this end, gradient descent optimization (GD) is the basic algorithm used [11–13]. From GD, a series of improved algorithms have been derived. Generally, these improved algorithms can be divided into two groups.

The first group is represented by the momentum method [14, 15], in which the introduction of the momentum factor effectively alleviates the problem of large noise in each iterative calculation. This factor flattens the convergence curve and improves the convergence speed to a certain extent. Then, Nesterov’s acceleration [16, 17] further smooths and stabilizes the convergence curve by using the prediction of future updates in the iteration process.

The other group includes the adaptive gradient algorithm (AdaGrad) [18] and the root mean square prop algorithm (RMSprop) [19], which consider the cumulative changes of each parameter in their respective iterative optimization processes. Taking the cumulative changes into consideration, the step of the next iteration is adjusted: a smaller step is performed on parameters with more cumulative change, while a larger step is performed on those with less cumulative change. This kind of adaptive method makes iterative parameter optimization more targeted. AdaDelta is another adaptive algorithm that is executed without the use of the global learning rate. Its window restriction on the accumulation of historical gradients enables AdaDelta to avoid the continual learning rate decay inherent in AdaGrad [20].

Diederik P. Kingma designed a ground-breaking combination of these two ideas with his adaptive moment estimation algorithm (Adam) [21]. The learning rate of each parameter is adaptively adjusted by using the first-order moment estimation and second-order moment estimation of the gradient so that the parameter optimization process is stabilized and the convergence rate of the cost function is improved. The excellent performance of Adam has made it widely recognized. Since then, algorithms for CNNs have mostly been improved based on Adam. For example, Jun Hu improved the second-moment estimation of Adam based on hybrid power [22]. Dozat T. used Nesterov’s acceleration to improve the first-moment estimation of Adam [23]. H. Iiduka designed an iterative algorithm that combines existing adaptive learning rate optimization algorithms with the conjugate gradient method; this work considers the nonconvex optimization problem of deep neural networks and can also be applied to the nonsmooth and convex optimization of deep neural networks [24]. However, these algorithms focus too much on the cumulative change in global gradients and ignore the importance of the current and recent gradients in the model optimization process. Sadaqat ur Rehman et al. proposed a modified resilient backpropagation (MRPROP) algorithm [25]. In this algorithm, an increase or decrease factor is applied to the weight update values based on the signs of the current and recent gradients. The algorithm also addresses the problem of CNN overtraining by introducing a tolerant band and a global best concept. These authors did not ignore the importance of the current and recent gradients, but they paid little attention to the cumulative change in global gradients. Shiv Ram Dubey et al. also improved Adam with respect to this problem: they introduced a new adaptive learning rate factor to adjust the step size according to the size of the gradient change and proposed the diffGrad algorithm [26]. In this approach, the adaptive learning rate factor provided by Adam regulates the global gradient, and the adaptive learning rate factor provided by diffGrad adjusts for the local gradient. The combination of the two adaptive learning rate factors has achieved good results.

In this paper, we also pay attention to local gradients and introduce a new adaptive learning rate factor. According to the positive and negative signs of the current and recent gradients, we can judge whether a parameter is still in the convergence stage of the cost function or is already near the value that minimizes the cost function. When the parameter is in the convergence stage of the cost function, a large learning rate is given to the parameter iteration so that the cost function converges quickly. When the parameter is near the value that minimizes the cost function, a small learning rate is given to the parameter iteration.

Due to the above idea, this paper introduces an adaptive learning rate factor related to the current and recent gradients. The main work of this paper can be summarized as follows:
(1) Based on Adam, we introduce an adaptive learning rate factor related to the current and recent gradients to optimize the CNN training process.
(2) We use an online learning framework to analyze the convergence of the proposed algorithm.
(3) Various test functions, such as Booth’s function, the Beale function, and the Styblinski–Tang function, are used to evaluate the convergence speed of the algorithm proposed in this paper. A comparison is made with Adam and diffGrad.
(4) We conduct a case study for the algorithm proposed in this paper. We compare the convergence speed and accuracy of the algorithm proposed in this paper with those of Adam and diffGrad on the MNIST dataset.

The structure of this paper is as follows: The second section introduces the preliminary work of optimizing the algorithm. The third section introduces the adaptive learning rate factor related to the current and recent gradients and introduces the method proposed in this paper. In the fourth section, the convergence of the algorithm is analyzed. The fifth section uses test functions to evaluate the performance of the proposed algorithm. The sixth section uses public datasets for empirical testing and gives the experimental results, comparison, and analysis. The seventh section is the discussion of the paper.

2. Preliminaries

The basic algorithm for CNN model optimization is GD. In GD, parameters are iterated as follows:

$$\theta_{t+1} = \theta_t - \alpha \, g_t,$$

where $\theta_{t+1}$ is the updated parameter value, $\theta_t$ is the parameter value before the update, $\alpha$ is the learning rate, and $g_t$ is the gradient of the cost function of the parameter, which is defined as follows:

$$g_t = \nabla_{\theta} J(\theta_t),$$

where $J(\theta)$ is the cost function. When the output contains multiple parameters, it is usually expressed by the mean squared error of the output, which is defined as follows:

$$J(\theta) = \frac{1}{2} \sum_{k} \left( \hat{y}_k - y_k \right)^2,$$

where $\hat{y}_k$ is the $k$-th output of the network and $y_k$ is the corresponding target value.
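As an illustration of the update rule above, the following minimal Python sketch performs GD steps; `grad_fn` is a placeholder for any routine that returns the gradient of the cost function, and the listing is illustrative rather than part of the original implementation:

```python
def gd_step(theta, grad_fn, lr=0.01):
    """One vanilla gradient-descent update: theta <- theta - lr * grad."""
    return theta - lr * grad_fn(theta)

# Example: minimizing J(theta) = theta**2, whose gradient is 2 * theta.
theta = 5.0
for _ in range(100):
    theta = gd_step(theta, lambda th: 2.0 * th)   # theta approaches 0
```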

In each iterative calculation, the gradient contains a relatively large amount of noise. Referring to the idea of momentum in physics, the momentum method introduces a variable $v$. By adding up all gradients after varying degrees of attenuation, the gradients at all times are involved in the iterative operation of the parameters, which effectively alleviates the noise problem. In the momentum algorithm, the variable $v$ and the parameters are iterated by the following rules:

$$v_t = \gamma \, v_{t-1} + \alpha \, g_t, \qquad \theta_{t+1} = \theta_t - v_t,$$

where $\gamma$ is the attenuation coefficient and $\alpha$ is the learning rate.
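A minimal sketch of one common formulation of the momentum update (illustrative only; other equivalent formulations exist):

```python
def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Momentum update: v accumulates exponentially decayed past gradients."""
    v = gamma * v + lr * grad_fn(theta)   # decayed accumulation of gradients
    return theta - v, v
```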

In addition to the noise of the gradient, GD uses the same global learning rate for all parameters in the CNN training process. Ignoring the differences between individual parameters in this way is clearly not ideal. Therefore, adaptive learning rate algorithms arose.

The AdaGrad algorithm sets a global learning rate. The actual learning rate corresponding to each parameter is inversely proportional to the accumulation of the square of the respective gradient. The specific iterative rules are as follows:

$$r_t = r_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{r_t} + \epsilon} \odot g_t,$$

where $g_t \odot g_t$ indicates the elementwise square and $\epsilon$ is a small value included to avoid division by zero.
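The AdaGrad rule above can be sketched in NumPy as follows (illustrative only, not the reference implementation):

```python
import numpy as np

def adagrad_step(theta, r, grad_fn, lr=0.01, eps=1e-8):
    """AdaGrad: the effective step shrinks as squared gradients accumulate in r."""
    g = grad_fn(theta)
    r = r + g * g                                  # elementwise squared-gradient sum
    return theta - lr * g / (np.sqrt(r) + eps), r
```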

The accumulation of gradient squares causes the denominator in the middle and later stages to become too large, which leads to the premature end of the iterative updating procedure. To solve this problem, the RMSprop algorithm was developed, which no longer performs a simple summation of gradient squares. By introducing an attenuation coefficient $\rho$, the accumulator $r$ is attenuated by a certain proportion during each iteration. The specific iterative rules are as follows:

$$r_t = \rho \, r_{t-1} + (1 - \rho) \, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{r_t} + \epsilon} \odot g_t,$$

where $g_t \odot g_t$ indicates the elementwise square and $\epsilon$ is a small value included to avoid division by zero.
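The corresponding RMSprop sketch differs from AdaGrad only in how the squared gradients are accumulated (again illustrative; the default decay follows the 0.999 setting used later in the experiments):

```python
import numpy as np

def rmsprop_step(theta, r, grad_fn, lr=0.01, rho=0.999, eps=1e-8):
    """RMSprop: r is an exponentially decayed average of squared gradients."""
    g = grad_fn(theta)
    r = rho * r + (1.0 - rho) * g * g              # decayed accumulation
    return theta - lr * g / (np.sqrt(r) + eps), r
```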

Adam is a good combination of momentum and the RMSprop algorithm. The first- and second-moment estimators of the gradient are used to assign the corresponding adaptive learning rate to each parameter. At the same time, a bias correction is introduced creatively, which improves the training speed and makes the parameters more stable. The specific iterative method is as follows:

$$m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t,$$
$$v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t \odot g_t,$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}},$$
$$\theta_{t+1} = \theta_t - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $m_t$ and $v_t$ are the first-order and second-order moment estimates of the gradient, respectively. $g_t \odot g_t$ indicates the elementwise square. $\hat{m}_t$ and $\hat{v}_t$ are the first-order and second-order moment estimates of the gradient after the bias correction, respectively. $\beta_1$ and $\beta_2$ are the attenuation coefficients. $\alpha$ is the learning rate. $\epsilon$ is a small value included to avoid division by zero.
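For reference, one Adam step under the rules above can be sketched as follows (the iteration counter `t` starts from 1; the hyperparameter defaults follow the recommendations of [21]):

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first- and second-moment estimates of the gradient."""
    g = grad_fn(theta)
    m = beta1 * m + (1.0 - beta1) * g              # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g          # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                 # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```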

Then, according to the gradient difference, a new learning rate factor is introduced to improve Adam. Shiv Ram Dubey proposed the diffGrad method. According to their algorithm, the new learning rate factor and parameter iteration rules are as follows:

$$\Delta g_t = g_{t-1} - g_t, \qquad \xi_t = \frac{1}{1 + e^{-\left| \Delta g_t \right|}},$$
$$\theta_{t+1} = \theta_t - \frac{\alpha \, \xi_t \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\Delta g_t$ is the difference between two successive gradients, and $\hat{m}_t$ and $\hat{v}_t$ are the first-order and second-order moment estimates of the gradient after the bias correction, respectively. $\alpha$ is the learning rate. $\epsilon$ is a small value included to avoid division by zero.
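A sketch of the diffGrad step as described above, in which the friction coefficient is a sigmoid of the absolute gradient difference (illustrative only):

```python
import numpy as np

def diffgrad_step(theta, m, v, g_prev, t, grad_fn,
                  lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """diffGrad: the Adam step scaled by a friction factor built from the gradient change."""
    g = grad_fn(theta)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    xi = 1.0 / (1.0 + np.exp(-np.abs(g_prev - g)))          # friction in [0.5, 1)
    return theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps), m, v, g
```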

All of the algorithms above contain the learning rate $\alpha$; $\alpha = 0.001$ is a good choice in most cases. Attenuation coefficients $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are recommended in Diederik P. Kingma’s paper [21].

3. Proposed Optimization

It can be seen from the second section that the learning rate plays an important role in the iteration of the parameters in GD. A larger learning rate yields a larger step size for parameter iteration, as well as a larger risk of missing the optimal solution. A smaller learning rate can achieve more refined learning. However, a smaller learning rate results in a slower convergence speed [27]. Therefore, the main developmental trend of GD is the adaptive learning rate. By giving each parameter dimension an adaptive learning rate, each parameter can find the optimal solution more quickly and realize the convergence of the cost function.

Beginning with AdaGrad, numerous adaptive algorithms have been developed. Almost all of these algorithms produce the adaptive learning rate factor by global estimation of the gradient, but they ignore the local gradient to some extent. In this paper, we introduce an adaptive learning rate factor to further improve the attention given to the current and recent gradients. At the same time, the underlying Adam algorithm ensures the control of the global gradient.

The essence of GD used in CNN model training is its idea of multiobjective optimization. The sign of the gradient carries much information. As shown in Figure 1, during parameter iteration, when the learning rate is too small, successive gradients of the objective function keep the same sign and the parameter is located on a falling edge or rising edge; therefore, we can use a larger learning rate for the parameter iteration. Similarly, when the learning rate is too large, the gradients alternate between positive and negative, which means that the parameter may be hovering around a minimum of the objective function; therefore, we can use a smaller learning rate for the parameter iteration.

Based on the above idea, we introduce a new adaptive learning rate factor, which can be defined as follows:

$$s_t = \mathrm{sig}\!\left( g_{t-1} \odot g_t \right),$$

where $g_{t-1}$ and $g_t$ are two successive gradients and the sig function is applied elementwise.

The sig function is similar to the sigmoid function. By using this adaptive learning rate factor, we can increase the learning rate when two successive gradients possess the same sign and decrease the learning rate when they possess different signs. Through the introduction of this operator, we can effectively reduce the severe oscillation of the parameters caused by a large step size during the iterative process; when the step size is small, the factor can also effectively accelerate convergence. Finally, the degree of step size adjustment is also affected by the magnitudes of the gradients themselves, which gives the factor stronger adaptability. The specific definition of the sig function is as follows:

$$\mathrm{sig}(x) = \frac{2}{1 + a^{-x}},$$

where $a > 1$ is the base.

The characteristic curve of the sig function is shown in Figure 2. We can see that the sig function is a monotonically increasing function whose rising edge becomes steeper as the base $a$ increases. When the value of $x$ is negative, the value range of the sig function is $(0, 1)$; when the value of $x$ is positive, the value range of the sig function is $(1, 2)$; and when the value of $x$ is zero, the value of the sig function is 1.
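Under the reconstruction above, the sig function and the resulting learning rate factor can be sketched as follows (the closed form is an assumption consistent with the stated properties, not a verbatim copy of the original implementation):

```python
import numpy as np

def sig(x, a=np.e):
    """Base-a sigmoid scaled to (0, 2): sig(0) = 1; its rise is steeper for larger a."""
    return 2.0 / (1.0 + np.exp(-x * np.log(a)))   # equals 2 / (1 + a**(-x))

def lr_factor(g_prev, g_curr, a=np.e):
    """Elementwise factor: same-sign gradients -> (1, 2); opposite signs -> (0, 1)."""
    return sig(g_prev * g_curr, a)
```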

We use the above adaptive factor to adjust the learning rate for the parameter. Then, we can obtain the effect shown in Figure 3.

In this paper, the proposed method is combined with Adam. We still use the adaptive learning rate factor constructed from the bias-corrected first-order moment and the bias-corrected second-order moment to control the global gradient. However, to account for the local gradient, the new adaptive learning rate factor is introduced. Therefore, in the proposed method, the final iteration of the parameters is realized according to the following rule:

$$\theta_{t+1} = \theta_t - \frac{\alpha \, s_t \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\theta_{t+1}$ is the parameter value of the $(t+1)$-th iteration, $\theta_t$ is the parameter value of the $t$-th iteration, $\alpha$ is the learning rate, $s_t$ is the new adaptive learning rate factor introduced in this paper, $\epsilon$ is a small value included to avoid division by zero, and $\hat{m}_t$ and $\hat{v}_t$ are the first-order and second-order moment estimates of the gradient after the bias correction, respectively. When $s_t$ is equal to 1, the iterative rule of the algorithm proposed in this article is the same as that of Adam.
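Putting the pieces together, one step of the proposed rule can be sketched as follows; the name `gs_adam_step` follows the GS-RMSprop naming used in Section 5, and the closed form of the sig function is the reconstruction assumed above:

```python
import numpy as np

def gs_adam_step(theta, m, v, g_prev, t, grad_fn,
                 lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, a=np.e):
    """Adam combined with the sign-based local factor s_t (sketch of the proposed rule)."""
    g = grad_fn(theta)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    s = 2.0 / (1.0 + np.exp(-(g_prev * g) * np.log(a)))     # local factor in (0, 2)
    return theta - lr * s * m_hat / (np.sqrt(v_hat) + eps), m, v, g
```

If the previous gradient is initialized to zero, then $s_1 = 1$ at the first iteration and the step coincides with a plain Adam step.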

4. Convergence Analysis

For the establishment of our method, we suppose that the cost function possesses a convex property [28]. Then, we use an online learning framework to analyze the convergence of the proposed method. Since an unknown and arbitrary sequence of convex cost functions $f_1(\theta), f_2(\theta), \ldots, f_T(\theta)$ is given, $f_t(\theta)$ is the cost function at the $t$-th iteration, and $\theta_t$ is the parameter at the $t$-th iteration. We attempt to predict the parameter $\theta_t$ and evaluate it on the corresponding cost function $f_t$ at each time $t$. To solve the above problem, we use a regret function. We can assume that a globally optimal parameter $\theta^*$ exists. Then, for a feasible set $\chi$, we can use the globally optimal parameter $\theta^*$ and the predicted parameter $\theta_t$ to calculate the difference between the sums of $f_t(\theta_t)$ and $f_t(\theta^*)$ over all $T$ iterations. Thus, the regret function can be defined as

$$R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^*) \right],$$

where $\theta^* = \arg\min_{\theta \in \chi} \sum_{t=1}^{T} f_t(\theta)$. Then, we can find that our algorithm possesses a regret bound of $O(\sqrt{T})$. The proof is shown in the appendix. Normally, we analyze multivariate functions instead of a univariate function. To simplify the process of our proof, we use the following definitions: $g_t \triangleq \nabla f_t(\theta_t)$, $g_{t,i}$ is the $i$-th element of $g_t$, and $g_{1:t,i} \triangleq [g_{1,i}, g_{2,i}, \ldots, g_{t,i}]$. Our theorem below holds when $\beta_{1,t}$ decays with the attenuation coefficient $\lambda$, i.e., $\beta_{1,t} = \beta_1 \lambda^{t-1}$ with $\lambda \in (0, 1)$.

Theorem 1. In a smooth convex function $f_t$, for all $t \in \{1, \ldots, T\}$, we make the following assumptions:
(1) The gradients of the function are bounded ($\| \nabla f_t(\theta) \|_2 \leq G$, $\| \nabla f_t(\theta) \|_\infty \leq G_\infty$).
(2) The distances between any $\theta_n$ and $\theta_m$ calculated by the proposed algorithm are bounded ($\| \theta_n - \theta_m \|_2 \leq D$, $\| \theta_n - \theta_m \|_\infty \leq D_\infty$ for any $m, n \in \{1, \ldots, T\}$).
(3) $\beta_1, \beta_2 \in [0, 1)$ and $\gamma \triangleq \beta_1^2 / \sqrt{\beta_2} < 1$.
(4) $\alpha_t = \alpha / \sqrt{t}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$ for $\lambda \in (0, 1)$.
Then, the algorithm achieves the following guarantee for all $T \geq 1$:

$$R(T) \leq \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha (1+\beta_1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}\,(1-\gamma)^2} \sum_{i=1}^{d} \left\| g_{1:T,i} \right\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}.$$

Since the data feature of each dimension $d$ is sparse and bounded, we can obtain $\sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} \leq d G_\infty \sqrt{T}$ and $\sum_{i=1}^{d} \| g_{1:T,i} \|_2 \leq d G_\infty \sqrt{T}$. Since we use the decay $\beta_{1,t} = \beta_1 \lambda^{t-1}$, the third term of the above formula is irrelevant to the parameter $T$. Finally, we can achieve

$$R(T) \leq \frac{d \, G_\infty D^2 \sqrt{T}}{2\alpha(1-\beta_1)} + \frac{d \, \alpha (1+\beta_1) G_\infty^2 \sqrt{T}}{(1-\beta_1)\sqrt{1-\beta_2}\,(1-\gamma)^2} + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}.$$

All parameters except $T$ are given constants, so we can obtain $R(T) = O(\sqrt{T})$ and thus $\lim_{T \to \infty} R(T)/T = 0$.

5. Test Functions

The test function is used to test the algorithm’s ability to find the global minimum, which is consistent with the CNN model training process. In this section, we use several different forms of test functions to test the performance of the algorithm proposed in this paper. These algorithms share the same parameters: $\beta_1$ and $\beta_2$ are fixed at 0.9 and 0.999, respectively. The global learning rate $\alpha$ is fixed at 0.01. The base $a$ of the sig function is the natural constant $e$. The initial point of the test is set as the origin.

5.1. Booth’s Function

Booth’s function is a test function whose shape is similar to that of a plate. The specific expression of the function is as follows:

$$f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2.$$

The test interval of the function is $x, y \in [-10, 10]$. The function possesses a global minimum $f(1, 3) = 0$. Figure 4 shows Booth’s function.
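As a concrete illustration of this test, the following sketch iterates the `gs_adam_step` routine sketched in Section 3 on Booth’s function with an analytic gradient, starting from the origin; it is illustrative only, and the reported results in Figure 5 and Table 1 come from the actual experiments:

```python
import numpy as np

def booth(p):
    x, y = p
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

def booth_grad(p):
    x, y = p
    return np.array([2*(x + 2*y - 7) + 4*(2*x + y - 5),
                     4*(x + 2*y - 7) + 2*(2*x + y - 5)])

theta = np.zeros(2)                                   # start at the origin
m, v, g_prev = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 1001):                              # 1,000 iterations
    theta, m, v, g_prev = gs_adam_step(theta, m, v, g_prev, t, booth_grad)
print(theta, booth(theta))                            # expected to approach (1, 3) and 0
```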

The parameters are iterated 1,000 times. The results are shown in Figure 5 and Table 1.

Although AdaGrad, Adam, and diffGrad did not find the global minimum after 1,000 iterations, their parameter iteration trajectories are similar to those of the algorithms that have already found the global minimum, and more iterations would allow them to complete the task. Compared to the original Adam, Adam improved by the algorithm proposed in this article performs better: after 1,000 iterations, its result is close to the global minimum. Because there is little noise in the test functions, RMSprop performs better than the other existing algorithms. As a result, RMSprop improved by the algorithm proposed in this article has the best performance.

5.2. Beale’s Function

The Beale function is another test function. The specific expression of the function is as follows:

$$f(x, y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2.$$

The test interval of the function is $x, y \in [-4.5, 4.5]$. The function possesses a global minimum $f(3, 0.5) = 0$. Figure 6 shows the Beale function.

Similar to the test on Booth’s function, Adam and RMSprop improved by the algorithm proposed in this article have better performance. In the comparison of RMSprop and GS-RMSprop, as we expected, we can see that the parameter iteration using GS-RMSprop is more stable.

5.3. Styblinski–Tang Function

The Styblinski–Tang function can be used for multidimensional testing. Its two-dimensional expression is as follows:

$$f(x, y) = \frac{1}{2} \left[ \left( x^4 - 16x^2 + 5x \right) + \left( y^4 - 16y^2 + 5y \right) \right].$$

The test interval of the function is $x, y \in [-5, 5]$. The function possesses a global minimum $f(-2.903534, -2.903534) \approx -78.332$. Figure 8 shows the Styblinski–Tang function.

The parameters are iterated 1,000 times. The results are shown in Figure 9 and Table 3.

In the test of the Styblinski–Tang function, the parameter iteration trajectories are almost identical, but they proceed at different speeds. Among them, Adam and RMSprop improved by the algorithm proposed in this paper still perform best.

6. Testing Dataset

In the test on the dataset, we use the simplest but most classical CNN architecture, LeNet-5 [29]. The network consists of two convolution layers, two pooling layers, and one fully connected layer. The convolution layers use 5×5 convolution kernels, and the pooling layers use the 2×2 max pooling method. The MNIST dataset is composed of 60,000 training samples and 10,000 testing samples, all of which are 28×28-pixel handwritten digits [30]. The excellent classification ability of CNNs on the MNIST dataset has drawn wide attention from the public, and we use this dataset for testing. To speed up the training process, the test is based on minibatch gradient descent [31, 32]. We test AdaGrad, RMSprop, Adam, diffGrad, and AdaHMG, as well as Adam and RMSprop improved by the algorithm proposed in this paper.
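For reference, a network matching the description above (two 5×5 convolution layers, two 2×2 max-pooling layers, and one fully connected layer) can be sketched in PyTorch as follows; the channel counts (6 and 16) and the ReLU activations are assumptions made for illustration and are not taken from the original implementation:

```python
import torch.nn as nn

class LeNetVariant(nn.Module):
    """Two conv layers, two max-pooling layers, one fully connected layer (28x28 input)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 28x28 -> 24x24
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 24x24 -> 12x12
            nn.Conv2d(6, 16, kernel_size=5),   # 12x12 -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 8x8 -> 4x4
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```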

Unlike the test functions, the MNIST dataset is more realistic and complex. Different bases may have different effects on the results. Therefore, we choose 5, 10, and 15 as the base $a$ to test the effect of different values of $a$ on the algorithm. The results are shown in Figure 10.

We can see in the figure that the convergence effect improves with increasing base $a$. However, when $a$ increases from 10 to 15, the improvement in the convergence effect becomes less obvious. Therefore, when comparing with other algorithms, we choose 15 as the base $a$.

The convergence curves of the cost function under these algorithms and the classification error rates of the models on the verification set after training are compared. Since there are many random numbers in the initialization of the CNN, five groups of tests are performed for each algorithm in this paper.

The convergence curves of the cost function in the CNN model training process are shown in Figure 11. In these tests, we use the following parameters: the global learning rate is 0.01, the batch size is 50, and the number of epochs is 3. Every 100 iterations, we average the value of the cost function.

Since the core of our algorithm is a new adaptive learning rate factor, the algorithm proposed in this article can be combined with other gradient descent algorithms to improve their performance. Combined with the algorithm proposed in this paper, the performance of Adam and RMSprop is improved. In the training process of the CNN, the convergence curves of the improved algorithms lie below those of the original algorithms, indicating that the improved algorithms have a faster convergence speed. Among them, RMSprop improved by our algorithm has the best performance.

Table 4 and Figure 12 show the error rates of sample classification for each model on the testing data after training.

The classification error rates of the CNN models trained by the improved Adam and improved RMSprop are lower than those of the models trained by the original algorithms. Adam and RMSprop improved by the algorithm proposed in this article achieve the lowest error rates in terms of test set classification, which is superior to the other existing algorithms.

7. Discussion and Conclusions

In this paper, we propose a new optimization method. The CNN architecture we use is LeNet-5. Our method does not modify the CNN architecture itself; therefore, the model complexity is the same as that of LeNet-5. We also compared the computation time: due to the introduction of a new adaptive learning rate factor, several additional parameter operations are required, and the time consumption of each iteration was approximately 15% higher than that of Adam and RMSprop.

By introducing a new adaptive learning rate factor based on the current and recent gradients, we not only control global gradient changes well but also respond to local gradient changes. When the sign of the gradient remains the same, the parameter is located on a falling or rising edge of the objective function, and a larger step size helps find the minimum value of the objective function more quickly. In this case, the current and recent gradient values yield an adaptive learning rate adjustment factor greater than 1 through our algorithm, which increases the learning rate of the corresponding parameters. When the sign of the gradient alternates between positive and negative, the parameter oscillates near a minimum of the objective function, and a smaller learning step helps the objective function converge to the minimum. In this case, the current and recent gradient values yield an adaptive learning rate adjustment factor between 0 and 1 through our algorithm, which decreases the learning rate of the corresponding parameters.

Our algorithm is a supplement to the existing algorithms and can be combined with many of them. The results show that Adam and RMSprop combined with our algorithm have better performance: our method not only improves the convergence speed of the original algorithms but also achieves higher accuracy in the classification of the test sets. Although our algorithm performs well in the experiments, it has a flaw. As explained in the introduction of our algorithm, when the gradient alternates between positive and negative, we use a smaller learning rate for training, which may increase the risk of falling into a local minimum. For further research, the current and recent gradients could be used to construct other adaptive learning rate factor calculation functions so that local gradients can play their proper role in the CNN training process.

Appendix

A. Convergence Proof

From the updating rules, we can obtain

Focusing on the $i$-th dimension of the parameter vector $\theta_t$, we assume that $\theta^*$ is the optimal parameter. We obtain

Since ,

Using Young’s inequality for the second term of the above formula, we obtain

Since , we obtain

We use Lemma 10.2 of Adam [21], and we obtain

Since , we obtain

We use Lemma 10.4 of Adam [21], and we obtain

Assuming that , , we obtain

Since , we obtain

Bringing in the last item of the above formula, we obtain

Data Availability

The data used to support the findings of this study are available at http://yann.lecun.com/exdb/mnist/.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Project administration was carried out by X.W. and W.W.; methodology was carried out by Z.L. and W.W.; convergence proof was carried out by Z.L.; data curation was carried out by Z.L. and X.L.; software was provided by Z.L.; visualization was carried out by Z.L. and R.F.; supervision was carried out by X.W. and R.F.; validation was carried out by X.W.; Z.L. wrote the original draft; Z.L., W.W., and X.W. reviewed and edited the manuscript.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2017YFB1303203) and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (JX12413673).