Abstract
Adaptive algorithms are widely used for training deep neural networks (DNNs) because of their fast convergence. However, the training cost becomes prohibitively expensive due to the computation of the full gradient when training complicated DNNs. To reduce the computational cost, we present a stochastic block adaptive gradient online training algorithm, called SBAG, in this study. In this algorithm, stochastic block coordinate descent and an adaptive learning rate are utilized at each iteration. We also prove that a regret bound of $O(\sqrt{T})$ can be achieved via SBAG, in which $T$ is the time horizon. In addition, we use SBAG to train ResNet34 and DenseNet121 on CIFAR-10, respectively. The results demonstrate that SBAG has better training speed and generalization ability than other existing training methods.
1. Introduction
Benefitting from a great many data samples and complex training models, deep learning has gained great interest in recent years and has been applied in resource allocation [1–4], signal estimation [5, 6], computer vision [7–9], and so on. However, the computing cost of the training process of deep learning is very high, since it requires large amounts of training data and many iterative updates to obtain good model parameters. It is therefore key to speed up the model training process and improve model performance. Besides proposing new training architectures [10], designing effective training algorithms is also important. This study focuses on the design of efficient training algorithms for deep neural networks (DNNs). In fact, many practical problems can be modeled as optimization problems [11–13], which can be solved by employing gradient-based methods. The stochastic gradient descent (SGD) method is an effective optimization algorithm [14]. Moreover, it is easy to implement because of its simplicity and is frequently used in the training process of DNNs.
Despite the simplicity of stochastic gradient descent, the problem of a slow convergence rate always exists. The same learning rate is not suitable for all parameter updates across the training process, especially in the case of sparse training data. For this reason, a number of training methods have been presented to address this issue, for instance, AdaGrad [15], RMSProp [16], AdaDelta [17], and Adam [18]. These methods are referred to as Adam-type algorithms since they employ adaptive learning rates. Further, Adam has attained the widest application in many deep learning training tasks, such as the optimization of convolutional neural networks and recurrent neural networks [19, 20]. Despite its popularity, Adam incurs a convergence issue. For this reason, AMSGrad [21] was presented, introducing a non-increasing learning rate. Besides, the learning rates of the Adam algorithm can be either too big or too small, which results in poor generalization performance. To avoid such extreme learning rates, a variant of Adam, Padam [22], was presented by employing a partially adaptive parameter. SWATS [23] used a switching method from Adam to SGD. AdaBound [24] limited the learning rate to a dynamic bound over time at each iteration.
In deep learning, gradient-based methods are used to optimize the model parameters, which requires calculating the gradients of all coordinates of the decision vectors at each iteration; huge amounts of data and complex models lead to expensive computation cost. Randomized block coordinate descent is an efficient method for high-dimensional optimization problems and has been successfully utilized in the large-scale problems generated in machine learning [25]. It divides the set of variables into different blocks and carries out a gradient update step on a randomly selected block of coordinates at each iteration, while holding the remaining ones fixed. In this way, the computational expense of each iteration can be effectively reduced.
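As an illustration, a randomized block coordinate descent step on a simple quadratic can be sketched as follows (a minimal NumPy sketch; the function names and the toy objective are our own, not from the paper or a specific library):

```python
import numpy as np

def rbcd_step(x, grad_fn, lr, block_size, rng):
    """One randomized block coordinate descent step: pick `block_size`
    coordinates uniformly at random and update only those, leaving the
    remaining coordinates fixed."""
    block = rng.choice(x.size, size=block_size, replace=False)
    g = grad_fn(x)            # in practice, only g[block] would be computed
    x = x.copy()
    x[block] -= lr * g[block]
    return x

# Toy objective: f(x) = 0.5 * ||x - c||^2, with gradient x - c.
rng = np.random.default_rng(0)
c = np.arange(10, dtype=float)
x = np.zeros(10)
for _ in range(2000):
    x = rbcd_step(x, lambda v: v - c, lr=0.5, block_size=3, rng=rng)
print(np.allclose(x, c, atol=1e-2))  # True: each selected update contracts the error
```

Each coordinate is selected with probability 3/10 per step, and every selected update halves that coordinate's error, so the iterates converge to the minimizer despite touching only a block per iteration.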
In this study, we propose a stochastic block adaptive gradient online learning (SBAG) algorithm to rapidly train DNNs, which incorporates an adaptive learning rate and a stochastic block coordinate approach to improve generalization ability and reduce computation cost. Our key contributions are as follows:
(i) We present the SBAG algorithm, based on the stochastic block coordinate descent method and the AdaBound optimization algorithm, to solve high-dimensional optimization problems.
(ii) We provide a theoretical analysis of the convergence of SBAG. Moreover, we show that SBAG is convergent in the convex setting under common assumptions and that its regret is bounded by $O(\sqrt{T})$, where $T$ is the time horizon.
(iii) We demonstrate the performance of SBAG on a public dataset. The simulation results show that the algorithm takes less time to achieve the best accuracy on the training and test sets, and that it outperforms other methods.
The rest of this study is arranged as follows. In the next two sections, we will review the extant literature and introduce related background. In Section 4, we will present SBAG in detail. In Sections 5 and 6, we will describe our convergence analysis and performance evaluation. Finally, we present the conclusion of this paper in Section 7.
2. Related Work
SGD is one of the most popular algorithms used in DNNs because it is easy to implement. However, it uses the same learning rate for all parameter updates at each iteration across the training process, and the parameters are updated to the same extent no matter how different the feature frequencies are, which consequently results in a slow convergence rate and poor performance. Hence, some variants of SGD were proposed to improve its convergence rate by either making the learning rate adaptive or using historical gradient information for the descent direction. Ghadimi et al. [26] used the Heavy-ball method to combine first-order historical gradients and current gradients for updates. Sutskever et al. [27] presented Nesterov's accelerated gradient (NAG) method. Duchi et al. [15] proposed AdaGrad, which first used an adaptive learning rate; however, AdaGrad performs worse in the case of dense gradients because all historical gradients are used in the updates, and this limitation is more severe when dealing with high-dimensional data in deep learning. Hinton [16] proposed RMSProp, which utilizes an exponential moving average to solve the problem that the learning rate drops sharply in AdaGrad. Zeiler [17] proposed AdaDelta, which prevents learning rate decay and gradient disappearance over time. Further research combined the adaptive learning rate with historical gradient information, as in Adam [18] and AMSGrad [21]. Adam has a good convergence rate in many scenarios. However, it was found that Adam may not converge in the later stage of the training process on account of an oscillating learning rate. Reddi et al. [21] presented AMSGrad, but its experimental results were not much better than Adam's. In general, Adam-type algorithms have better convergence performance but often do not generalize as well as SGD out of sample. To address this issue, Keskar and Socher [23] proposed the SWATS algorithm.
SWATS utilizes Adam in the early part of training and switches to SGD in the later stage. In this case, it enjoys the quick convergence rate of Adam and the good generalization of SGD, but the switching time is difficult to determine in practice. Huang et al. [28] presented NosAdam, which increases the effect of past gradients on the parameter update to avoid becoming trapped in local minima or diverging. Nevertheless, it depends heavily on the initial conditions. Padam [22] introduced a parameter that controls the level of adaptivity of the update process. Luo et al. [24] proposed the AdaBound algorithm, which provides a dynamic bound for the learning rate; AdaBound was evaluated on public datasets and shown to converge as fast as Adam and perform as well as SGD. However, the aforementioned methods need to calculate all coordinates of the gradients of the decision vectors at each iteration, and the computation cost is aggravated by high-dimensional data and complex model structures.
The randomized block coordinate descent method is a powerful and effective approach for high-dimensional optimization problems. It employs randomized strategies to pick a block of variables to update per iteration. For general gradient descent algorithms, all the coordinates of the gradient vector must be calculated each time. One can easily observe that this incurs significant computing cost when dealing with high-dimensional data. In contrast, the randomized block coordinate method only calculates one block of coordinates of the gradient vector, which is taken as the descent direction. In particular, the randomized block method selects a block of coordinates according to a probability distribution and updates the corresponding decision variables along the descent direction, while the other coordinates of the decision vector remain the same as at the previous step. Although the randomized block coordinate method can save significant computing cost for the learner, especially in optimization problems with high-dimensional data, it uses a fixed learning rate that scales the entries of the gradient equally; an adaptive learning rate had not been applied in this method.
Compared with the existing work, we combine the randomized block coordinate descent method with an adaptive learning rate in this study. At each iteration, a block of coordinates is picked at random, and the corresponding entries of the decision vector are updated. In this way, the gradient is calculated only on the chosen block of coordinates instead of the full gradient. Moreover, extreme learning rates are restricted to a suitable range. Our method not only enjoys good generalization performance but also saves computation cost.
3. Preliminaries
In this section, we first introduce the optimization problem in detail. Then, we present the background on the randomized block coordinate method.
3.1. The Online Optimization Problem
In this work, the analysis of the sequential iterative optimization problem is based on the online learning framework, which can be seen as a repeated game between a learner (the algorithm) and an opponent. In such an online convex setting, the learner selects a decision point $x_t \in \mathcal{X}$ produced by the algorithm at each time step $t$, where $\mathcal{X}$ is a convex and compact subset of $\mathbb{R}^d$. At the same time, the opponent responds to the decision of the learner with a loss function $f_t: \mathcal{X} \to \mathbb{R}$, which is convex and unknown in advance, and the algorithm suffers a loss $f_t(x_t)$. Repeating the process, we obtain a sequence of loss functions $f_1, f_2, \ldots, f_T$, which vary with time $t$. In general, the online learner's prediction problem can be represented as follows:
$$\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \tag{1}$$
For online learning tasks, the goal is to minimize the regret of the online learner's predictions against the optimal decision in hindsight, which is defined as the difference between the total sum of losses suffered by the online learner over $T$ rounds and the minimum total loss attainable at a fixed decision point. In particular, we define the regret as follows:
$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \tag{2}$$
and we write $x^* \in \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$ for the best fixed decision in hindsight. It is desired that the regret of an online optimization algorithm be a sublinear function of $T$, i.e., $\lim_{T \to \infty} R(T)/T = 0$; then, on average, the online learner performs as well as the fixed optimal decision in hindsight. In other words, the proposed algorithm converges when its regret is sublinearly bounded. Throughout this study, the diameter of the convex compact set $\mathcal{X}$ is assumed to be bounded, and $\|\nabla f_t(x)\|$ is bounded for all $t$. Hereafter, $\|\cdot\|$ denotes the $\ell_2$ norm.
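As a concrete illustration of the regret notion above, the following sketch runs projected online gradient descent on simple quadratic losses and measures the regret against the best fixed decision in hindsight (the learner, losses, and step-size schedule are illustrative choices of ours, not the paper's algorithm):

```python
import numpy as np

# Online gradient descent on losses f_t(x) = (x - z_t)^2 over X = [0, 1].
rng = np.random.default_rng(1)
T = 5000
z = rng.uniform(0.0, 1.0, T)
x = 0.0
losses = []
for t in range(T):
    losses.append((x - z[t]) ** 2)           # loss suffered at round t
    g = 2.0 * (x - z[t])                     # gradient revealed after committing
    x = float(np.clip(x - 0.5 / np.sqrt(t + 1) * g, 0.0, 1.0))  # projected step

# Best fixed decision in hindsight (grid search over X).
best = min(float(np.sum((z - u) ** 2)) for u in np.linspace(0.0, 1.0, 1001))
regret = sum(losses) - best
print(regret / T)  # small average regret, consistent with sublinear R(T)
```

The average regret $R(T)/T$ shrinks as $T$ grows, which is exactly the sublinearity property discussed above.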
3.2. Relevant Definitions
Now, we will describe the relevant definitions that are used in the next sections.
Definition 1. A function $f: \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz, where $L > 0$ is the Lipschitz constant, if, for all $x, y \in \mathcal{X}$,
$$|f(x) - f(y)| \le L\|x - y\|. \tag{3}$$
Definition 2. (Equation (3.2) of Section 3 in [29]) A function $f: \mathcal{X} \to \mathbb{R}$ is convex and differentiable, where $\mathcal{X}$ is a convex set, if, for all $x, y \in \mathcal{X}$,
$$f(y) \ge f(x) + \nabla f(x)^{\top}(y - x). \tag{4}$$
Definition 3. A function $f: \mathcal{X} \to \mathbb{R}$ is $\mu$-strongly convex and differentiable, with $\mu > 0$, if, for all $x, y \in \mathcal{X}$,
$$f(y) \ge f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2. \tag{5}$$
4. SBAG Algorithm and Assumptions
This section presents the proposed algorithm, followed by the common assumptions for convergence analysis of the algorithm.
4.1. Algorithm Design
In this study, we consider high-dimensional online learning problems and aim to solve the optimization problem (1) by incorporating the stochastic block coordinate method and an adaptive learning rate. Because the dimensionality of the decision variable is high, the computing cost of the gradients is prohibitive. In addition, the tuning of the learning rate is challenging. For these reasons, a stochastic block coordinate adaptive optimization algorithm, dubbed SBAG, is proposed for settling the online problem (1). In our algorithm, the objective functions at different times satisfy some conditions, which are displayed in Assumption 1.
SBAG is described in Algorithm 1, whose input includes the initial point $x_1$, the step size $\alpha$, and the block selection probability $p$. The parameters of SBAG are $\beta_1$, $\beta_2$, $\eta_l$, and $\eta_u$, where $\beta_1, \beta_2 \in [0, 1)$. At each round $t$, a $d$th-order diagonal matrix $P_t$ is generated whose diagonal entries $p_{t,i} \in \{0, 1\}$ are independent random variables with $\Pr(p_{t,i} = 1) = p$, for $i = 1, \ldots, d$ and $t = 1, \ldots, T$. In particular, the gradient is computed as follows:
$$\hat{g}_t = P_t \nabla f_t(x_t), \tag{6}$$
where the elements of $P_t$ consist of 0 and 1. When $p_{t,i} = 1$, the $i$th coordinate of the decision vector is selected to calculate the gradient at time $t$. From (6), one can observe that the computation cost is greatly reduced at each iteration. In addition, let $\mathcal{F}_t$ denote the $\sigma$-algebra consisting of all variables before time $t$; more explicitly, $\mathcal{F}_t = \sigma(x_1, P_1, \ldots, P_{t-1})$.
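The masked gradient computation can be sketched as follows (a minimal NumPy illustration; in a real implementation only the selected coordinates' partial derivatives would be computed, whereas this sketch masks a precomputed full gradient):

```python
import numpy as np

def block_gradient(full_grad, p, rng):
    """Apply a random diagonal 0/1 mask: each coordinate of the gradient is
    kept independently with probability p and zeroed otherwise."""
    mask = rng.random(full_grad.size) < p    # diagonal entries of P_t
    return np.where(mask, full_grad, 0.0)    # hat{g}_t = P_t g_t

rng = np.random.default_rng(0)
g = np.array([0.5, -1.0, 2.0, 0.1])          # a full gradient at time t
g_hat = block_gradient(g, p=0.5, rng=rng)
print(g_hat)  # each entry is either the original value or 0
```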

Using $\hat{g}_t$, the first- and second-order momentum terms $m_t$ and $v_t$ are obtained as follows, respectively:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\hat{g}_t, \tag{7}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\hat{g}_t^2. \tag{8}$$
Furthermore, SBAG introduces a bound on the learning rate as follows:
$$\hat{\eta}_t = \mathrm{Clip}\left(\frac{\alpha}{\sqrt{v_t}}, \eta_l(t), \eta_u(t)\right), \tag{9}$$
where each element of the learning rate is clipped so as to be constrained in an interval at time $t$, and the lower and upper bounds of the interval are $\eta_l(t)$ and $\eta_u(t)$, respectively. That is, the output of equation (9) is constrained in $[\eta_l(t), \eta_u(t)]$; this technique was also used in [23, 24]. Moreover, let
$$\eta_t = \frac{\hat{\eta}_t}{\sqrt{t}}. \tag{10}$$
Then, SBAG updates $x_t$ as follows:
$$x_{t+1} = \Pi_{\mathcal{X}}\left(x_t - \eta_t \odot m_t\right), \tag{11}$$
where $\odot$ is the coordinate-wise product operator. Furthermore, the projection step of equation (11) is equivalent to the following:
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\|x - (x_t - \eta_t \odot m_t)\right\|^2. \tag{12}$$
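The whole update step can be sketched as follows: a minimal NumPy sketch of one SBAG-style iteration, combining the random block mask, the two momentum terms, the clipped learning rate, and the projection. The function name, default values, and box-shaped feasible set are illustrative assumptions of ours, not the paper's exact specification:

```python
import numpy as np

def sbag_step(x, grad, state, t, p, rng, alpha=0.1, beta1=0.9, beta2=0.999,
              eta_l=lambda t: 0.0, eta_u=lambda t: np.inf, lo=-10.0, hi=10.0):
    """One SBAG-style iteration: random block mask, Adam-style moments,
    AdaBound-style clipped step size, and projection onto the box [lo, hi]^d."""
    m, v = state
    mask = (rng.random(x.size) < p).astype(float)   # diagonal of P_t
    g_hat = mask * grad                             # masked gradient
    m = beta1 * m + (1 - beta1) * g_hat             # first-order moment
    v = beta2 * v + (1 - beta2) * g_hat ** 2        # second-order moment
    eta = np.clip(alpha / (np.sqrt(v) + 1e-8), eta_l(t), eta_u(t)) / np.sqrt(t)
    x_new = np.clip(x - eta * m, lo, hi)            # projected update
    return x_new, (m, v)

# Toy run: minimize f(x) = ||x - 1||^2 over the box [-10, 10]^5.
rng = np.random.default_rng(0)
x = np.full(5, 5.0)
state = (np.zeros(5), np.zeros(5))
for t in range(1, 3001):
    grad = 2.0 * (x - 1.0)
    x, state = sbag_step(x, grad, state, t, p=0.5, rng=rng)
print(np.round(x, 2))  # near the minimizer at 1
```

Even though only about half of the coordinates are updated per round, the adaptive, clipped step sizes drive all coordinates toward the minimizer.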
4.2. Assumptions
Before presenting the convergence analysis of SBAG, we now introduce the following common assumptions.
Assumption 1. The loss functions $f_t$, $t = 1, \ldots, T$, are convex, differentiable, and Lipschitz over $\mathcal{X}$.
Assumption 2. In this study, $\mathcal{X}$ is a bounded feasible set; i.e., $\|x - y\| \le D$ for all $x, y \in \mathcal{X}$, where $D > 0$.
Assumption 3. In this study, $\nabla f_t(x)$ is bounded for all $t$ over $\mathcal{X}$; i.e., $\|\nabla f_t(x)\|_\infty \le G_\infty$, where $G_\infty > 0$.
Assumptions 1–3 are standard assumptions in the literature; see, for example, [18, 21, 24]. The convergence of SBAG is analyzed based on these assumptions in the following.
5. Convergence Analysis
Now, we will analyze the convergence of SBAG. We consider the regret, equation (2), in the online optimization problem (a typical scenario). The proposed algorithm generates the gradient $\hat{g}_t$ with probability $p$ at time $t$; therefore, $\hat{g}_t$ is a random variable. Moreover, $x_{t+1}$ is calculated from $\hat{g}_t$ and $x_t$ at time $t$. Since the iterates are thus random, expectations must be taken. Therefore, we define the regret of SBAG as follows:
$$\overline{R}(T) = \mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t)\right] - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \tag{13}$$
From the convexity of $f_t$, it follows that
$$f_t(x_t) - f_t(x^*) \le \nabla f_t(x_t)^{\top}(x_t - x^*), \tag{14}$$
where $x^*$ denotes the best fixed decision in hindsight.
Moreover, by the definition of the matrix $P_t$, we know that $P_t$ is a sparse matrix. Therefore, applying equation (14) leads to
Taking the conditional expectation (conditioned on $\mathcal{F}_t$) of both sides of equation (15) implies that
By equation (1.1f) of Section 4 in [30], and taking the unconditional expectation of equation (16), it follows that
From equations (13) and (17), the following holds:
To obtain the bound of the regret, we consider the two terms on the right side of equation (18). Thus, we first propose the following lemma to estimate the first term.
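The key step above, passing from the masked gradient to $p$ times the full gradient under the conditional expectation, can be sanity-checked with a quick Monte Carlo experiment (an illustration under the assumption that each diagonal entry of $P_t$ is an independent Bernoulli($p$) variable):

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([1.0, -2.0, 0.5])               # a fixed gradient vector
p, N = 0.3, 200_000
masks = rng.random((N, g.size)) < p          # N independent diagonals of P_t
est = (masks * g).mean(axis=0)               # Monte Carlo estimate of E[P_t g]
print(est)  # approximately p * g = [0.3, -0.6, 0.15]
```

Since $\mathbb{E}[P_t \mid \mathcal{F}_t] = pI$, the expectation of the masked gradient is exactly $p$ times the full gradient, which is what the estimate above reproduces.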
Lemma 1. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we have the following relation:
Proof. From equations (9) and (10), it follows that

and

From equations (20) and (21), and by the property of expectation, it can be verified that

Plugging equation (7) into equation (22) yields

By the Cauchy–Schwarz inequality, we further bound term (a) of equation (23) and have

The second inequality of equation (24) follows from the fact that  holds for all . In addition, the third inequality of equation (24) is due to the inequality . Moreover, plugging equation (24) into equation (23) leads to

Moreover, since , by equation (25), it follows that

Therefore, the proof of Lemma 1 is completed. Next, we introduce Lemma 2 to estimate the second term.
Lemma 2. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we have the following:
Proof. Let  with . By equations (11) and (12), the following holds:

Using Lemma 3 of [31], it can be proved that

Substituting equation (7) into equation (29) yields

Rearranging the terms of equation (30), it follows that

Applying Young's inequality and the Cauchy–Schwarz inequality to equation (31) leads to

Summing equation (32) over  and taking the expectation of the obtained relation imply that

By Lemma 1 and equation (33), it follows that

Since  and , we have . Therefore, we further obtain . Then, from equation (34), it can be proved that

Applying Assumption 2 and the property of expectation yields

Therefore, the proof of Lemma 2 is completed. Next, we estimate the last term in (18).
Lemma 3. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we attain the following inequality:
Proof. For the original full gradient, we have . Let , , and , which are generated by AdaBound [24].
The proof of Lemma 3 is similar to that of Theorem 4 in [24]. Starting from the following inequality, we obtain

Therefore, the proof of Lemma 3 is finished.
To attain the bound of regret in equation (18), we establish Theorem 1 as follows.
Theorem 1. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we obtain the following bound on the regret:
Proof. Applying Lemmas 1, 2, and 3 to (18) yields

Therefore, we complete the proof of Theorem 1.
From Theorem 1, we obtain $\lim_{T \to \infty} \overline{R}(T)/T = 0$. This suggests that SBAG is convergent. In addition, the bound of the regret is $O(\sqrt{T})$; i.e., given some accuracy $\epsilon$, at least on the order of $O(1/\epsilon^2)$ iterations are required to achieve the given accuracy.
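The iteration-complexity statement follows from equating the average regret with the target accuracy; under a sublinear bound $\overline{R}(T) \le C\sqrt{T}$ for some constant $C$ (a generic constant standing in for the constants of Theorem 1):

```latex
\frac{\overline{R}(T)}{T} \le \frac{C\sqrt{T}}{T} = \frac{C}{\sqrt{T}} \le \epsilon
\quad \Longleftrightarrow \quad
T \ge \frac{C^{2}}{\epsilon^{2}},
```

so achieving an average regret of at most $\epsilon$ requires $T = O(1/\epsilon^2)$ iterations.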
6. Performance Evaluation
In this section, we perform experiments on a public dataset to evaluate the performance of the algorithm objectively. We consider multiclass classification, a typical machine learning problem, using DNNs for the experiments.
6.1. Setup
To assess our SBAG algorithm, we study its performance on classification tasks. We use the CIFAR-10 [32] dataset for our experiments, which is widely used for classification problems. It consists of 10 classes, with 50,000 training samples and 10,000 test samples.
For the experiments, we use convolutional neural networks, which perform well on image classification and object recognition, to solve classification tasks on the CIFAR-10 image dataset; specifically, we implement ResNet34 [33] and DenseNet121 [34].
6.2. Parameters
To study the performance of our proposed algorithm, we compare SBAG with SGD [14], AdaGrad [15], and AdaBound [24]. The hyperparameters of these algorithms are initialized as follows.
For SGD, the scale of the learning rate is selected from a set of candidate values. AdaGrad uses an initial learning rate chosen from a candidate set, and the initial accumulator value of AdaGrad is set to 0. The hyperparameters of AdaBound are set to the same values as those of Adam. We directly use the initialized hyperparameter values of AdaBound in our algorithm. In addition, the probability $p$ of choosing a coordinate is selected from a set of candidate values.
In addition, following [24], we define the dynamic bound functions for our simulation experiments as
$$\eta_l(t) = 0.1 - \frac{0.1}{(1 - \beta_2)t + 1}$$
and
$$\eta_u(t) = 0.1 + \frac{0.1}{(1 - \beta_2)t}.$$
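The two bound functions can be sketched in a few lines (assuming, as in the AdaBound paper [24], a final learning rate of 0.1; the function and parameter names are our own):

```python
import numpy as np

def eta_l(t, beta2=0.999, final_lr=0.1):
    """Lower bound: rises from near 0 toward final_lr as t grows."""
    return final_lr - final_lr / ((1 - beta2) * t + 1)

def eta_u(t, beta2=0.999, final_lr=0.1):
    """Upper bound: falls from a large value toward final_lr as t grows."""
    return final_lr + final_lr / ((1 - beta2) * t)

for t in [1, 100, 10000]:
    print(t, eta_l(t), eta_u(t))
# Both bounds squeeze toward final_lr = 0.1, so the clipped step size
# gradually transitions from Adam-like to SGD-like behavior.
```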
6.3. Results
We consider the image multiclass classification problem on the CIFAR-10 dataset using ResNet34 and DenseNet121 and run 200 epochs in this experiment. First, we run a group of experiments measuring epochs against runtime for ResNet34 and DenseNet121 on CIFAR-10. The findings are reported in Figure 1: when completing the same 200 epochs, our method takes the least time, and AdaBound takes the most. The main reason is that only a few blocks of coordinates are calculated in the gradient descent process of our algorithm at each iteration, while the compared algorithms calculate the full gradients at each iteration. Moreover, AdaBound combines first- and second-order momentum, while SGD and AdaGrad only use first-order gradients; thus, SGD and AdaGrad incur less time than AdaBound. The same results can be seen for DenseNet121 in Figure 1(b).
We present another group of experiments measuring average loss against running time, executed for ResNet34 and DenseNet121 on CIFAR-10. The findings are shown in Figure 2. At about 150 epochs, SGD has the largest average loss among the four algorithms and decreases sharply after that point, while the average loss of SBAG is smaller than that of the others and finally reaches the minimum value in the shortest running time. The fast descent rate of SBAG is due to the randomized block method, which chooses one block of coordinates of the decision vector to calculate the gradient. In other words, SBAG processes more samples than the compared algorithms in the same running time. Therefore, the convergence of SBAG is verified by the findings presented in Figure 2.
In Figures 3 and 4, the training and test accuracy against running time of the four algorithms are evaluated. As we can see, at about 150 epochs, AdaBound achieves the highest accuracy, while AdaGrad and our algorithm have accuracies of 92.36% and 93.99%, respectively. As training proceeds, AdaBound and SBAG reach accuracies of 99.96% and 99.93%, respectively. Similar results can be seen for DenseNet121. In a word, SBAG works well on both the training and test sets and, at the same time, has good generalization ability on both ResNet34 and DenseNet121.
From the experiments above, we observe that SBAG shows very good performance on both ResNet34 and DenseNet121. It incurs less computation cost per iteration in the experiments, which is consistent with the theory.
7. Conclusion
In this study, we proposed a randomized block adaptive gradient online learning algorithm, SBAG, designed to reduce the gradient computation cost for high-dimensional decision vectors. The convergence analysis of SBAG and the evaluations on CIFAR-10 demonstrated that the regret bound of SBAG is $O(\sqrt{T})$ when the loss functions are convex and that significant computation cost savings are achieved without adversely affecting the performance of the optimizer. Over the same 200 epochs, the proposed algorithm has the least running time and a slightly lower average loss in the end. The training accuracy for ResNet34 and DenseNet121 is 99.93% and 99.72%, respectively, slightly less than the 99.96% of AdaBound, but our method reaches higher test accuracy than AdaBound, SGD, and AdaGrad; i.e., SBAG is the fastest of the four methods, and its curves are smoother than those of SGD.
Data Availability
The data that support the findings of this study are from CIFAR-10, which is available from [32].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.
Authors’ Contributions
Jianghui Liu and Baozhu Li contributed equally.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC), under Grant nos. 61976243 and 61871430, the Leading Talents of Science and Technology in the Central Plain of China, under Grant no. 214200510012, the Scientific and Technological Innovation Team of Colleges and Universities in Henan Province, under Grant no. 20IRTSTHN018, the Basic Research Projects in the University of Henan Province, under Grant no. 19zx010, the Key Scientific Research Projects of Colleges and Universities in Henan Province, under Grant no. 22A520005, the National Natural Science Foundation of China, under Grant no. 61901191, and the Shandong Provincial Natural Science Foundation, under Grant no. ZR2020LZH005.