Online Learning for DNN Training: A Stochastic Block Adaptive Gradient Algorithm
Adaptive algorithms are widely used for training deep neural networks (DNNs) because of their fast convergence. However, the training cost becomes prohibitively expensive due to the computation of the full gradient when training complicated DNNs. To reduce this computational cost, we present a stochastic block adaptive gradient online training algorithm, called SBAG, which utilizes stochastic block coordinate descent and an adaptive learning rate at each iteration. We also prove that SBAG achieves a regret bound of $O(\sqrt{T})$, where $T$ is the time horizon. In addition, we use SBAG to train ResNet-34 and DenseNet-121 on CIFAR-10. The results demonstrate that SBAG has better training speed and generalization ability than other existing training methods.
1. Introduction
Benefitting from abundant data samples and expressive training models, deep learning has attracted great interest in recent years and has been applied to resource allocation [1–4], signal estimation [5, 6], computer vision [7–9], and so on. However, the computing cost of the deep learning training process is very high, since large amounts of training data and many iterative updates are needed to obtain good model parameters. Speeding up model training and improving model performance are therefore key concerns: besides proposing new training architectures, designing effective training algorithms is also important. This study focuses on the design of efficient training algorithms for deep neural networks (DNNs). In fact, many practical questions can in general be modeled as optimization problems [11–13], which can be solved by employing gradient-based methods. The stochastic gradient descent (SGD) method is an effective optimization algorithm; it is easy to implement because of its simplicity and is frequently used in the training of DNNs.
Despite the simplicity of stochastic gradient descent, slow convergence remains a problem. The same learning rate is not suitable for all parameter updates across the training process, especially in the case of sparse training data. For this reason, a number of training methods have been presented to address this issue, for instance, AdaGrad, RMSProp, AdaDelta, and Adam. These methods are referred to as Adam-type algorithms since they employ adaptive learning rates. Among them, Adam has seen the widest application in deep learning training tasks, such as the optimization of convolutional neural networks and recurrent neural networks [19, 20]. Despite its popularity, Adam suffers from a convergence issue; for this reason, AMSGrad was presented, introducing a nonincreasing learning rate. Besides, the learning rates of Adam can be either too large or too small, which results in poor generalization performance. To avoid such extreme learning rates, Padam, a variant of Adam, was presented by employing a partially adaptive parameter. SWATS switches from Adam to SGD during training. AdaBound limits the learning rate to a dynamic bound that tightens over time at each iteration.
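To make the Adam-type update concrete, the following is a minimal NumPy sketch of the generic Adam rule discussed above, applied to a toy quadratic. It is an illustration only, not the code of any of the cited works; the step size, iteration count, and test function are arbitrary choices.

```python
import numpy as np

def adam_step(x, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, with bias correction, then a per-coordinate adaptive step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias-corrected first moment
    v_hat = v / (1 - beta2**t)            # bias-corrected second moment
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# minimize f(x) = ||x||^2 (gradient 2x) for a few hundred steps
x = np.array([1.0, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, alpha=0.05)
```

Note that the effective per-coordinate step $\alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$ is exactly the quantity that AdaBound later constrains to a bounded interval.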
In deep learning, gradient-based methods are used to optimize the model parameters, which requires calculating the gradients of all coordinates of the decision vector at each iteration; huge amounts of data and complex models thus lead to expensive computation. Randomized block coordinate descent is an efficient method for high-dimensional optimization problems and has been successfully utilized in large-scale problems arising in machine learning. It divides the set of variables into blocks and, at each iteration, carries out a gradient update step on a randomly selected block of coordinates while holding the remaining ones fixed. In this way, the computational expense of each iteration is effectively reduced.
In this study, we propose a stochastic block adaptive gradient online learning (SBAG) algorithm to rapidly train DNNs, which incorporates an adaptive learning rate and a stochastic block coordinate approach to improve generalization ability and reduce computation cost. Our key contributions are as follows:(i)We present the SBAG algorithm, based on the stochastic block coordinate descent method and the AdaBound optimization algorithm, to solve high-dimensional optimization problems.(ii)We provide a theoretical convergence analysis for SBAG. Moreover, we show that SBAG converges in the convex setting under common assumptions and that its regret is bounded by $O(\sqrt{T})$, where $T$ is the time horizon.(iii)We demonstrate the performance of SBAG on a public dataset. The simulation results show that the algorithm takes less time to achieve the best accuracy on the training and test sets and that it outperforms other methods.
The rest of this study is organized as follows. In the next two sections, we review the extant literature and introduce related background. In Section 4, we present SBAG in detail. In Sections 5 and 6, we describe our convergence analysis and performance evaluation. Finally, we present the conclusion of this study in Section 7.
2. Related Work
SGD is one of the most popular algorithms used in DNN training because of its ease of implementation. However, it applies the same learning rate to all parameters at each iteration across the training process, so parameters are updated to the same extent no matter how different their feature frequencies are, which results in a slow convergence rate and poor performance. Hence, variants of SGD were proposed to improve its convergence rate, either by making the learning rate adaptive or by using historical gradient information for the descent direction. Ghadimi et al. used the heavy-ball method to combine first-order historical gradients with current gradients for updates. Sutskever et al. presented Nesterov's accelerated gradient (NAG) method. Duchi et al. proposed AdaGrad, the first method to use an adaptive learning rate; however, AdaGrad performs worse in the case of dense gradients because all historical gradients are used in the updates, a limitation that is more severe for the high-dimensional data of deep learning. Hinton proposed RMSProp, which utilizes an exponential moving average to address the sharply decaying learning rate of AdaGrad. Zeiler proposed AdaDelta, which prevents the learning rate from decaying and the gradient from vanishing over time. Further research combined adaptive learning rates with historical gradient information, as in Adam and AMSGrad. Adam has a good convergence rate in many scenarios; however, it was found that Adam may fail to converge in the later stage of training on account of an oscillating learning rate. Reddi et al. presented AMSGrad, but its experimental results were not much better than Adam's. In general, Adam-type algorithms converge well but often do not generalize as well as SGD out of sample. To address this issue, Keskar and Socher proposed the SWATS algorithm.
SWATS utilizes Adam in the early part of training and switches to SGD in the later stage. It thus enjoys the quick convergence of Adam and the good generalization of SGD, but the switching time is difficult to determine in practice. Huang et al. presented NosAdam, which increases the effect of past gradients on the parameter update to avoid being trapped in local minima or diverging; nevertheless, it depends strongly on the initial conditions. Padam introduced a parameter that controls the level of adaptivity of the update process. Luo et al. proposed the AdaBound algorithm, which places a dynamic bound on the learning rate; evaluated on a public dataset, AdaBound was shown to converge as fast as Adam and to perform as well as SGD. However, all of the aforementioned methods need to calculate every coordinate of the gradient of the decision vector at each iteration, and the computation cost is aggravated by high-dimensional data and complex model structures.
The randomized block coordinate descent method is a powerful and effective approach to high-dimensional optimization problems. It employs a randomized strategy to pick a block of variables to update per iteration. General gradient descent algorithms must calculate all coordinates of the gradient vector each time, which clearly incurs significant computing cost for high-dimensional data. In contrast, the randomized block coordinate method calculates only one block of gradient coordinates, which serves as the descent direction. In particular, it selects a block of coordinates according to a probability distribution and updates the corresponding decision variables along the descent direction, while the other coordinates of the decision vector remain unchanged from the previous iteration. Although the randomized block coordinate method can save significant computing cost for the learner, especially in optimization problems with high-dimensional data, it uses a fixed learning rate that scales the entries of the gradient equally; an adaptive learning rate has not been applied in this method.
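The block-selection scheme described above can be sketched in a few lines of NumPy. This is a generic illustration on a toy separable quadratic, under assumed block sizes and step size, not an implementation from any cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbcd(grad_block, x, blocks, lr=0.1, iters=200):
    """Randomized block coordinate descent: each iteration picks one block
    uniformly at random and updates only those coordinates, while all the
    remaining coordinates stay fixed."""
    for _ in range(iters):
        idx = blocks[rng.integers(len(blocks))]  # randomly chosen block
        x[idx] -= lr * grad_block(x, idx)        # partial gradient only
    return x

# toy objective f(x) = 0.5 * ||x - b||^2, so the block gradient is x[idx] - b[idx]
b = np.arange(6, dtype=float)
grad_block = lambda x, idx: x[idx] - b[idx]
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
x = rbcd(grad_block, np.zeros(6), blocks)
```

Each iteration touches only one third of the coordinates here, which is exactly the per-iteration saving the text describes.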
Compared with existing work, this study combines the randomized block coordinate descent method with an adaptive learning rate. At each iteration, a block of coordinates is picked at random and the corresponding entries of the decision vector are updated; the gradient is thus calculated on the chosen block of coordinates instead of in full. Moreover, extreme learning rates are restricted to a suitable range. Our method not only enjoys good generalization performance but also saves computation cost.
3. Preliminaries
In this section, we first introduce the optimization problem in detail and then provide background on the randomized block coordinate method.
3.1. The Online Optimization Problem
In this work, the analysis of the sequential iterative optimization problem is based on the online learning framework, which can be seen as a game between a learner (the algorithm) and an opponent. In this online convex setting, at each time step $t$ the learner selects a decision point $x_t \in \mathcal{K}$ produced by the algorithm, where $\mathcal{K}$ is a convex and compact subset of $\mathbb{R}^n$. At the same time, the opponent responds to the learner's decision with a loss function $f_t$, which is convex and unknown in advance, and the algorithm suffers a loss $f_t(x_t)$. Repeating this process, we have a sequence of loss functions $f_1, f_2, \ldots, f_T$ that vary with time $t$. In general, the online learner's prediction problem can be represented as follows:
$$\min_{x_t \in \mathcal{K}} \sum_{t=1}^{T} f_t(x_t). \tag{1}$$
For online learning tasks, the goal is to minimize the regret of the online learner's predictions against the optimal decision in hindsight, defined as the difference between the total loss accumulated over $T$ rounds of online learning and the minimum total loss attained by a fixed decision point. In particular, we define the regret as follows:
$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x), \tag{2}$$
where $x_t \in \mathcal{K}$ for all $t$. If the regret of the online optimization algorithm is a sublinear function of $T$, i.e., $\lim_{T \to \infty} R(T)/T = 0$, then on average the online learner performs as well as the fixed optimal decision in hindsight. In other words, the proposed algorithm converges when its regret is sublinearly bounded. Throughout this study, the diameter of the convex compact set $\mathcal{K}$ is assumed to be bounded, and $\|\nabla f_t(x)\|$ is bounded for all $t$. Hereafter, $\|\cdot\|$ denotes the $\ell_2$ norm.
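The regret definition in equation (2) can be illustrated numerically. Below is a toy sketch (our own illustration, with an assumed loss sequence and step-size schedule) in which projected online gradient descent plays against quadratic losses $f_t(x) = (x - b_t)^2$ over $\mathcal{K} = [-1, 1]$; the average regret stays small, consistent with sublinearity:

```python
import numpy as np

# Opponent's sequence b_t and learner's projected online gradient descent.
rng = np.random.default_rng(1)
b = rng.choice([-0.5, 0.5], size=2000)
x, losses = 0.0, []
for t, bt in enumerate(b, start=1):
    losses.append((x - bt) ** 2)                 # loss suffered at round t
    grad = 2 * (x - bt)                          # gradient of f_t at x_t
    x = float(np.clip(x - grad / np.sqrt(t), -1.0, 1.0))  # projected step

# Best fixed decision in hindsight, found on a fine grid over K = [-1, 1].
best_fixed = min(float(np.sum((u - b) ** 2)) for u in np.linspace(-1, 1, 201))
regret = float(np.sum(losses)) - best_fixed
avg_regret = regret / len(b)
```

The quantity `avg_regret` corresponds to $R(T)/T$, which an $O(\sqrt{T})$ regret bound drives to zero as $T$ grows.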
3.2. Relevant Definitions
Now, we will describe the relevant definitions that are used in the next sections.
Definition 1. A function $f: \mathcal{K} \to \mathbb{R}$ is $L$-Lipschitz, where $L > 0$ is the Lipschitz constant, if, for all $x, y \in \mathcal{K}$,
$$|f(x) - f(y)| \le L \|x - y\|.$$
Definition 2. (Equation (3.2) of Section 3 in ) A function $f$ is convex and differentiable on a convex set $\mathcal{K}$ if, for all $x, y \in \mathcal{K}$,
$$f(y) \ge f(x) + \nabla f(x)^{\top}(y - x).$$
Definition 3. A function $f$ is $\mu$-strongly convex and differentiable, with $\mu > 0$, if, for all $x, y \in \mathcal{K}$,
$$f(y) \ge f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2.$$
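The strong-convexity inequality of Definition 3 can be checked numerically on a concrete function. The sketch below (our own example, not from the source) uses $f(x) = \|x\|^2$ with gradient $2x$, which is $2$-strongly convex; for this quadratic the inequality in fact holds with equality:

```python
import numpy as np

# Numerical check of the mu-strong convexity inequality
#   f(y) >= f(x) + <grad f(x), y - x> + (mu/2) * ||y - x||^2
# for f(x) = ||x||^2, which is 2-strongly convex.
f = lambda x: float(x @ x)
grad = lambda x: 2 * x
mu = 2.0

rng = np.random.default_rng(2)
worst = -np.inf
for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lower = f(x) + float(grad(x) @ (y - x)) + 0.5 * mu * float((y - x) @ (y - x))
    worst = max(worst, lower - f(y))   # a positive value would violate the bound
```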
4. SBAG Algorithm and Assumptions
This section presents the proposed algorithm, followed by the common assumptions for convergence analysis of the algorithm.
4.1. Algorithm Design
In this study, we consider high-dimensional online learning problems and aim to solve the optimization problem (1) by incorporating the stochastic block coordinate method and an adaptive learning rate. Because the dimensionality of the decision variable is high, computing full gradients is prohibitive; in addition, tuning the learning rate is challenging. For these reasons, a stochastic block coordinate adaptive optimization algorithm, dubbed SBAG, is proposed for solving the online problem (1). In our algorithm, the objective functions at different times satisfy the conditions displayed in Assumption 1.
SBAG is described in Algorithm 1, whose inputs include the step size, the momentum parameters $\beta_1, \beta_2 \in [0, 1)$, and the sampling probability $p \in (0, 1]$. At each round $t$, an $n$-th order diagonal matrix $M_t$ is generated, whose diagonal entries $m_{t,i}$ are independent random variables with $\Pr(m_{t,i} = 1) = p$ and $\Pr(m_{t,i} = 0) = 1 - p$, for $i = 1, \ldots, n$. In particular, the gradient is computed as follows:
$$g_t = M_t \nabla f_t(x_t), \tag{6}$$
where the diagonal elements of $M_t$ consist of 0 and 1. When $m_{t,i} = 1$, the $i$-th coordinate of the decision vector is selected to calculate the gradient at time $t$. From (6), one can observe that the computation cost is greatly reduced at each iteration. In addition, let $\mathcal{F}_t$ denote the $\sigma$-algebra consisting of all random variables before time $t$.
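The key statistical property behind equation (6) is that the Bernoulli mask is unbiased up to the known factor $p$: $\mathbb{E}[M_t g] = p\,g$ for any fixed gradient $g$. A quick empirical check (our own illustration, with arbitrary values of $p$ and $g$):

```python
import numpy as np

# With a Bernoulli(p) diagonal mask M_t applied to a fixed gradient g,
# the masked gradient satisfies E[M_t g] = p * g, so block sampling is
# unbiased up to the known factor p.
rng = np.random.default_rng(3)
p = 0.3
g = np.array([2.0, -1.0, 4.0])
mask = (rng.random((200_000, 3)) < p).astype(float)  # rows are diag(M_t) draws
avg = (mask * g).mean(axis=0)                        # approximates p * g
```

This factor $p$ is exactly what appears when expectations are taken in the convergence analysis of Section 5.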
Using $g_t$, the first- and second-moment terms $m_t$ and $v_t$ are obtained as follows, respectively:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{7}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t. \tag{8}$$
Furthermore, SBAG introduces a bound on the learning rate as follows:
$$\hat{\eta}_t = \mathrm{Clip}\!\left(\frac{\alpha}{\sqrt{v_t}},\, \eta_l(t),\, \eta_u(t)\right), \tag{9}$$
where each element of the learning rate is clipped so that it is constrained to an interval at time $t$ whose lower and upper bounds are $\eta_l(t)$ and $\eta_u(t)$, respectively. That is, the output of equation (9) is constrained to $[\eta_l(t), \eta_u(t)]$; this technique was also used in [23, 24]. Moreover, let
$$\eta_t = \hat{\eta}_t / \sqrt{t}. \tag{10}$$
Then, SBAG updates $x_t$ as follows:
$$x_{t+1} = \Pi_{\mathcal{K}}\!\left(x_t - \eta_t \odot m_t\right), \tag{11}$$
where $\odot$ is the coordinate-wise product operator and $\Pi_{\mathcal{K}}$ denotes the projection onto $\mathcal{K}$. Furthermore, the projection step of equation (11) is equivalent to the following:
$$x_{t+1} = \arg\min_{x \in \mathcal{K}} \left\| x - \left(x_t - \eta_t \odot m_t\right) \right\|^2. \tag{12}$$
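Putting the steps together, one SBAG-style iteration can be sketched as follows. This is our own reading of the update sequence described in the text (Bernoulli coordinate mask, Adam-style moments, AdaBound-style clipping, projection), not the authors' code; the feasible set is assumed to be a Euclidean ball, and all hyper-parameter values are placeholders:

```python
import numpy as np

def sbag_step(x, full_grad, m, v, t, rng, p=0.5, alpha=0.1,
              beta1=0.9, beta2=0.999, lo=0.01, hi=1.0, radius=10.0):
    """One SBAG-style iteration: a Bernoulli(p) mask stands in for the
    diagonal matrix M_t, first/second moments accumulate Adam-style, the
    per-coordinate rate is clipped to [lo, hi] before the 1/sqrt(t) decay,
    and the iterate is projected back onto a ball of the given radius."""
    mask = (rng.random(x.shape) < p).astype(float)  # diagonal of M_t
    g = mask * full_grad                            # block gradient
    m = beta1 * m + (1 - beta1) * g                 # first moment
    v = beta2 * v + (1 - beta2) * g * g             # second moment
    eta = np.clip(alpha / (np.sqrt(v) + 1e-12), lo, hi) / np.sqrt(t)
    x = x - eta * m                                 # coordinate-wise step
    norm = float(np.linalg.norm(x))
    if norm > radius:                               # projection onto the ball
        x = x * (radius / norm)
    return x, m, v

# toy run: minimize f(x) = ||x - b||^2, whose gradient is 2 * (x - b)
rng = np.random.default_rng(0)
b = np.ones(4)
x = 5.0 * np.ones(4)
m = np.zeros(4)
v = np.zeros(4)
for t in range(1, 301):
    x, m, v = sbag_step(x, 2 * (x - b), m, v, t, rng)
```

Only the masked coordinates contribute fresh gradient information at each step, which is where the per-iteration saving over full-gradient Adam-type methods comes from.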
Before presenting the convergence analysis of SBAG, we now introduce the following common assumptions.
Assumption 1. The loss functions $f_t$, $t = 1, \ldots, T$, are convex, differentiable, and $L$-Lipschitz over $\mathcal{K}$.
Assumption 2. In this study, $\mathcal{K}$ is a bounded feasible set; i.e., $\|x - y\| \le D$ for all $x, y \in \mathcal{K}$, where $D > 0$.
Assumption 3. In this study, $\nabla f_t(x)$ is bounded for all $t$ over $\mathcal{K}$; i.e., $\|\nabla f_t(x)\| \le G$ for all $x \in \mathcal{K}$, where $G > 0$.
Assumptions 1–3 are standard in the literature; see, for example, [18, 21, 24]. The convergence of SBAG is analyzed under these assumptions in the following section.
5. Convergence Analysis
We now analyze the convergence of SBAG by considering the regret, equation (2), of the online optimization problem (a typical scenario). The proposed algorithm samples each coordinate of the gradient with probability $p$ at time $t$, so $g_t$ is a random variable; moreover, $m_t$ is calculated from $m_{t-1}$ and $g_t$ at time $t$. Since the iterates are random, expectations must be taken. Therefore, we define the regret of SBAG as follows:
$$\overline{R}(T) = \mathbb{E}\!\left[\sum_{t=1}^{T} f_t(x_t)\right] - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x). \tag{13}$$
From the convexity of $f_t$ (Definition 2), it follows that
$$f_t(x_t) - f_t(x^*) \le \nabla f_t(x_t)^{\top}(x_t - x^*), \tag{14}$$
where $x^* \in \arg\min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)$.
Moreover, by the definition of the matrix $M_t$, $M_t$ is a sparse diagonal matrix. Therefore, applying equation (14) leads to
Taking the conditional expectation (conditioned on $\mathcal{F}_t$) of both sides of equation (15) implies that
By equation (1.1f) of Section 4 in , and taking the unconditional expectation of equation (16), it follows that
From equations (13) and (17), the following equation holds
To bound the regret in equation (18), we should consider the two terms on its right-hand side. Thus, we first propose the following lemmata to estimate the first term.
Lemma 1. If Assumptions 1 to 3 are satisfied, sequences , and are generated by SBAG with . Moreover, is a convex and compact set. , and for . In addition, suppose , and , where . Let , , and . Then, we have the following relation:
Proof. From equations (9) and (10), it follows that

and

From equations (20) and (21), and by the properties of expectation, it can be verified that

Plugging equation (7) into equation (22) yields

By the Cauchy–Schwarz inequality, we further bound term (a) of equation (23) and have

The second inequality of equation (24) follows from the fact that for all . In addition, the third inequality of equation (24) is due to the inequality . Moreover, plugging equation (24) into equation (23) leads to

Moreover, since , and by equation (25), it follows that

Therefore, the proof of Lemma 1 is completed. Next, we introduce Lemma 2 to estimate the second term.
Lemma 2. If Assumptions 1 to 3 are satisfied, sequences , and are generated by SBAG with . Moreover, is a convex and compact set. , and , for . In addition, suppose , and , where . Let , , and . Then, we have the following:
Proof. Let  with . By equations (11) and (12), the following holds:

Using Lemma 3 of , it can be proved that

Substituting equation (7) into equation (29) yields

Rearranging the terms of equation (30), and by , it follows that

Applying Young's inequality and the Cauchy–Schwarz inequality to equation (31) leads to

Summing equation (32) over  and taking the expectation of the obtained relation imply that

By Lemma 1 and equation (33), it follows that

Since  and , we have . Therefore, we further obtain . Then, from equation (34), it can be proved that

Applying Assumption 2 and the properties of expectation yields

Therefore, the proof of Lemma 2 is completed. Next, we estimate the last term in (18).
Lemma 3. If Assumptions 1 to 3 are satisfied, sequences , and are generated by SBAG with . Moreover, is a convex and compact set. and , for . In addition, suppose , and , where . Let and . Then, we attain the following inequality:
Proof. For the original full gradient, we have . Let , , and , which are generated by AdaBound .
The proof of Lemma 3 is similar to that of Theorem 4 in . The proof starts from the following inequality:

Therefore, the proof of Lemma 3 is finished.
To attain the bound of regret in equation (18), we establish Theorem 1 as follows.
Theorem 1. Suppose that Assumptions 1 to 3 are satisfied, and sequences , and are generated by SBAG with . Moreover, is a convex and compact set. , and for . In addition, suppose , and , where . Let , , and . We obtain the bound of regret as follows:
Proof. Applying Lemmata 1, 2, and 3 to equation (18) yields

Therefore, we complete the proof of Theorem 1.
From Theorem 1, we obtain $\overline{R}(T) = O(\sqrt{T})$, and hence $\overline{R}(T)/T \to 0$ as $T \to \infty$. This shows that SBAG is convergent. In addition, since the regret bound is $O(\sqrt{T})$, for a given accuracy $\epsilon$ it requires at least on the order of $O(1/\epsilon^2)$ iterations to achieve that accuracy.
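The last claim can be made explicit. Writing the bound of Theorem 1 as $\overline{R}(T) \le C\sqrt{T}$ for a constant $C$ that collects the problem constants, the average regret satisfies

```latex
\frac{\overline{R}(T)}{T} \;\le\; \frac{C\sqrt{T}}{T} \;=\; \frac{C}{\sqrt{T}} \;\le\; \epsilon
\quad\Longleftrightarrow\quad
T \;\ge\; \frac{C^{2}}{\epsilon^{2}},
```

so an average regret of at most $\epsilon$ is guaranteed after $T = O(1/\epsilon^2)$ iterations.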
6. Performance Evaluation
In this section, we perform experiments on a public dataset to objectively evaluate the performance of the algorithm. We consider a standard machine learning problem, multi-class classification with DNNs, for the experiments.
To assess our SBAG algorithm, we study its performance on a classification task. We use the CIFAR-10 dataset, which is widely used for classification problems; it consists of 10 classes, 50,000 training samples, and 10,000 test samples.
For the experiments, we use convolutional neural networks, which are effective for image classification and object recognition, to solve classification tasks on the CIFAR-10 image dataset; specifically, we implement ResNet-34 and DenseNet-121.
To study the performance of our proposed algorithm, we compare SBAG with SGD , AdaGrad , and AdaBound . The hyper-parameters of these algorithms are initialized as follows.
For SGD, the learning rate is selected from the set . AdaGrad uses an initial learning rate from the set , and its initial accumulator value is set to 0. The hyper-parameters of AdaBound are set to the same values as Adam's. We directly use AdaBound's initialized hyper-parameter values in our algorithm. In addition, the probability of choosing a coordinate is selected from the values in the set .
In addition, following , we define the dynamic bound functions for our simulation experiments as
$$\eta_l(t) = 0.1 - \frac{0.1}{(1 - \beta_2)t + 1}$$
and
$$\eta_u(t) = 0.1 + \frac{0.1}{(1 - \beta_2)t}.$$
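These bound functions are easy to inspect numerically. The sketch below uses the default bound functions published with AdaBound (an assumption on our part, since the concrete expressions follow that reference); both bounds converge to the final rate 0.1, so the clipping interval shrinks to a point as training proceeds:

```python
import numpy as np

# AdaBound-style dynamic bounds (values assumed from the AdaBound defaults):
# the lower bound rises from near 0 and the upper bound falls from a large
# value, both approaching the final learning rate 0.1 as t grows.
beta2 = 0.999
eta_l = lambda t: 0.1 - 0.1 / ((1 - beta2) * t + 1)
eta_u = lambda t: 0.1 + 0.1 / ((1 - beta2) * t)

t = np.array([1, 10, 100, 1000, 10000])
gap = eta_u(t) - eta_l(t)   # width of the clipping interval, shrinking to 0
```

Early in training the interval is wide, so the update behaves like Adam; late in training it is narrow, so the update behaves like SGD with rate 0.1.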
We consider the image multi-class classification problem on the CIFAR-10 dataset using ResNet-34 and DenseNet-121 and run 200 epochs in each experiment. First, we run a group of experiments measuring runtime over epochs for ResNet-34 and DenseNet-121 on CIFAR-10. The findings are reported in Figure 1: when completing the same 200 epochs, our method takes the least time, and AdaBound takes the most. The main reason is that our algorithm calculates only several blocks of coordinates in the gradient descent step at each iteration, while the compared algorithms calculate full gradients at each iteration. Moreover, AdaBound combines first- and second-order momentum, while SGD and AdaGrad use only first-order gradients; thus, SGD and AdaGrad take less time than AdaBound. The same results can be seen for DenseNet-121 in Figure 1(b).
We present another group of experiments measuring average loss against running time for ResNet-34 and DenseNet-121 on CIFAR-10; the findings are shown in Figure 2. At about 150 epochs, SGD has the largest average loss and decreases sharply after that point, while the average loss of SBAG is smaller than the others' and finally reaches the minimum value in the shortest running time. The fast descent rate of SBAG is due to the randomized block method, which chooses one block of coordinates of the decision vector to calculate the gradient; in other words, SBAG processes more samples than the compared algorithms in the same running time. Therefore, the convergence of SBAG is verified by the findings presented in Figure 2.
In Figures 3 and 4, the training and test accuracy of the four algorithms are evaluated against running time. As we can see, at about 150 epochs, AdaBound achieves the highest accuracy, while AdaGrad and our algorithm have nearly the same accuracy, 92.36% and 93.99%, respectively. As the running time increases, AdaBound and SBAG reach accuracies of 99.96% and 99.93%, respectively. Similar results can be seen on DenseNet-121. In short, SBAG works well on both the training and test sets and has good generalization ability on both ResNet-34 and DenseNet-121.
From the experiments above, we observe that SBAG performs very well on both ResNet-34 and DenseNet-121. It incurs less computation cost per iteration in the experiments, which is consistent with the theory.
7. Conclusion
In this study, we proposed a randomized block adaptive gradient online learning algorithm, SBAG, designed to reduce the gradient computation cost for high-dimensional decision vectors. The convergence analysis showed that the regret bound of SBAG is $O(\sqrt{T})$ when the loss functions are convex, and the evaluations on CIFAR-10 demonstrated significant computation cost savings without adversely affecting the performance of the optimizer. Over the same 200 epochs, the proposed algorithm has the least running time and a slightly lower final average loss. The training accuracy of SBAG on ResNet-34 and DenseNet-121 is 99.93% and 99.72%, slightly below AdaBound's 99.96%, but our method reaches higher test accuracy than AdaBound, SGD, and AdaGrad. That is, SBAG is the fastest of the four methods, and its curves are smoother than SGD's.
Data Availability
The data that support the findings of this study are the CIFAR-10 dataset, which is publicly available.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.
Authors' Contributions
Jianghui Liu and Baozhu Li contributed equally.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC), under Grant nos. 61976243 and 61871430, the Leading Talents of Science and Technology in the Central Plain of China, under Grant no. 214200510012, the Scientific and Technological Innovation Team of Colleges and Universities in Henan Province, under Grant no. 20IRTSTHN018, the Basic Research Projects in the University of Henan Province, under Grant no. 19zx010, the Key Scientific Research Projects of Colleges and Universities in Henan Province, under Grant no. 22A520005, the National Natural Science Foundation of China, under Grant no. 61901191, and the Shandong Provincial Natural Science Foundation, under Grant no. ZR2020LZH005.
References
[1] D. Huang, Y. Gao, Y. Li et al., “Deep learning based cooperative resource allocation in 5g wireless networks,” Mobile Networks and Applications, Springer, Berlin, Germany, 2018.
[2] R. Dong, C. She, W. Hardjawana, Y. Li, and B. Vucetic, “Deep learning for radio resource allocation with diverse quality-of-service requirements in 5g,” 2020, https://arxiv.org/abs/2004.00507.
[3] J. Qian, K. Zhu, R. Wang, and Y. Zhao, “Optimal auction for resource allocation in wireless virtualization: a deep learning approach,” in Proceedings of the 25th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2019, pp. 535–538, IEEE, Tianjin, China, December 2019.
[4] J. Chen, K. Li, K. Li, P. S. Yu, and Z. Zeng, “Dynamic planning of bicycle stations in dockless public bicycle-sharing system using gated graph neural network,” ACM Transactions on Intelligent Systems and Technology, vol. 12, no. 2, pp. 1–22, 2021.
[5] A. Saha, V. Minz, S. Bonela, S. R. Sreeja, R. Chowdhury, and D. Samanta, “Classification of EEG signals for cognitive load estimation using deep learning architectures,” in Proceedings of the Intelligent Human Computer Interaction - 10th International Conference, IHCI 2018, U. S. Tiwary, Ed., pp. 59–68, Springer, Allahabad, India, December 2018.
[6] X. Chang, G. Li, L. Tu, G. Xing, and T. Hao, “Deepheart: accurate heart rate estimation from PPG signals based on deep learning,” in Proceedings of the 16th IEEE International Conference on Mobile Ad Hoc and Sensor Systems, MASS, pp. 371–379, IEEE, Monterey, CA, USA, November 2019.
[7] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, “Deep learning advances in computer vision with 3d data: a survey,” ACM Computing Surveys, vol. 50, pp. 21–38, 2017.
[8] J. Guo, H. He, T. He et al., “Gluoncv and gluonnlp: deep learning in computer vision and natural language processing,” Journal of Machine Learning Research, vol. 21, no. 23, pp. 1–7, 2020.
[9] B. Pu, K. Li, S. Li, and N. Zhu, “Automatic fetal ultrasound standard plane recognition based on deep learning and IIoT,” IEEE Transactions on Industrial Informatics, vol. 17, no. 11, pp. 7771–7780, 2021.
[10] J. Chen, K. Li, K. Bilal, X. Zhou, K. Li, and P. S. Yu, “A Bi-layered parallel training architecture for large-scale convolutional neural networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 5, pp. 965–976, 2019.
[11] F. Song, Y. Zhou, L. Chang, and H. Zhang, “Modeling space-terrestrial integrated networks with smart collaborative theory,” IEEE Network, vol. 33, pp. 51–57, 2018.
[12] F. Song, M. Zhu, Y. Zhou, I. You, and H. Zhang, “Smart collaborative tracking for ubiquitous power iot in edge-cloud interplay domain,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6046–6055, 2020.
[13] F. Song, Z. Ai, Y. Zhou, I. You, K.-K. R. Choo, and H. Zhang, “Smart collaborative automation for receive buffer control in multipath industrial networks,” IEEE Transactions on Industrial Informatics, vol. 16, no. 2, pp. 1385–1394, 2020.
[14] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
[15] J. C. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[16] G. Hinton, “Divide the gradient by a running average of its recent magnitude,” Neural Networks for Machine Learning, Coursera, California, USA, 2011.
[17] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” 2012, https://arxiv.org/abs/1212.5701.
[18] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 2015.
[19] J. Zhang, L. Cui, and F. B. Gouza, “GADAM: genetic-evolutionary ADAM for deep neural network optimization,” 2018, https://arxiv.org/abs/1805.07500.
[20] J. Bai and J. Zhang, “Bgadam: boosting based genetic-evolutionary Adam for convolutional neural network optimization,” 2019, https://arxiv.org/abs/1908.08015.
[21] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, OpenReview.net, Vancouver, BC, Canada, April 2018.
[22] J. Chen, D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu, “Closing the generalization gap of adaptive gradient methods in training deep neural networks,” in Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020, C. Bessiere, Ed., pp. 3267–3275, ijcai.org, 2020.
[23] N. S. Keskar and R. Socher, “Improving generalization performance by switching from Adam to SGD,” 2017, https://arxiv.org/abs/1712.07628.
[24] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methods with dynamic bound of learning rate,” in Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, OpenReview.net, New Orleans, LA, USA, May 2019.
[25] S. Shalev-Shwartz and A. Tewari, “Stochastic methods for l1-regularized loss minimization,” Journal of Machine Learning Research, vol. 12, pp. 1865–1892, 2011.
[26] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson, “Global convergence of the heavy-ball method for convex optimization,” in Proceedings of the European Control Conference, ECC 2015, pp. 310–315, IEEE, Linz, Austria, July 2015.
[27] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning, pp. 1139–1147, ICML, Atlanta, GA, USA, June 2013.
[28] H. Huang, C. Wang, and B. Dong, “Nostalgic Adam: weighting more of the past gradients when designing the adaptive learning rate,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, S. Kraus, Ed., pp. 2556–2562, ijcai.org, Macao, China, August 2019.
[29] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
[30] P. Embrechts, Probability: Theory and Examples, University in Cambridge, Cambridge, UK, 3rd edition, 2007.
[31] H. B. McMahan and M. J. Streeter, “Adaptive bound optimization for online convex optimization,” in Proceedings of COLT 2010 - The 23rd Conference on Learning Theory, A. T. Kalai and M. Mohri, Eds., pp. 244–256, Omnipress, Haifa, Israel, June 2010.
[32] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical Report TR-2009, University of Toronto, Toronto, Canada, 2009.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778, IEEE Computer Society, Las Vegas, NV, USA, June 2016.
[34] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” 2016, https://arxiv.org/abs/1608.06993.