Abstract
Adaptive algorithms are widely used for training deep neural networks (DNNs) because of their fast convergence. However, the training cost becomes prohibitively expensive due to the computation of the full gradient when training complicated DNNs. To reduce the computational cost, we present a stochastic block adaptive gradient online training algorithm, called SBAG, in this study. In this algorithm, stochastic block coordinate descent and an adaptive learning rate are utilized at each iteration. We also prove that a regret bound of $O(\sqrt{T})$ can be achieved via SBAG, in which $T$ is the time horizon. In addition, we use SBAG to train ResNet34 and DenseNet121 on CIFAR-10, respectively. The results demonstrate that SBAG has better training speed and generalization ability than other existing training methods.
1. Introduction
Benefitting from a great many data samples and complex training models, deep learning has gained great interest in recent years and has been applied in resource allocation [1–4], signal estimation [5, 6], computer vision [7–9], and so on. However, the computing cost of the training process of deep learning is very high, since it requires large amounts of training data and many iterative updates to obtain good model parameters. It is therefore key to speed up the model training process and improve model performance. Besides proposing new training architectures [10], designing effective training algorithms is also important. This study focuses on the design of efficient training algorithms for deep neural networks (DNNs). In fact, many practical problems can be modeled as optimization problems [11–13], which can be solved by employing gradient-based methods. The stochastic gradient descent (SGD) method is an effective optimization algorithm [14]. Moreover, it is easy to implement because of its simplicity and is frequently used in the training process of DNNs.
Despite the simplicity of stochastic gradient descent, the problem of a slow convergence rate always exists. The same learning rate is not suitable for all parameter updates across the training process, especially in the case of sparse training data. For this reason, a number of training methods have been presented to address this issue, for instance, AdaGrad [15], RMSProp [16], AdaDelta [17], and Adam [18]. These methods are referred to as Adam-type algorithms since they employ adaptive learning rates. Further, Adam has attained the widest application in many deep learning training tasks, such as the optimization of convolutional neural networks and recurrent neural networks [19, 20]. Despite its popularity, Adam incurs a convergence issue. For this reason, AMSGrad [21] was presented, introducing a non-increasing learning rate. Besides, the learning rates of the Adam algorithm can be either too big or too small, which results in poor generalization performance. To avoid such extreme learning rates, a variant of Adam, Padam [22], was presented by employing a partially adaptive parameter. SWATS [23] used a switching method from Adam to SGD. AdaBound [24] limited the learning rate to a dynamic bound over time at each iteration.
In deep learning, gradient-based methods are used to optimize the model parameters, which requires calculating the gradients of all coordinates of the decision vectors at each iteration; huge amounts of data and complex models lead to expensive computation cost. Randomized block coordinate descent is an efficient method for high-dimensional optimization problems and has been successfully utilized in the large-scale problems generated in machine learning [25]. It divides the set of variables into different blocks and carries out a gradient update step on a randomly selected block of coordinates at each iteration, while holding the remaining ones fixed. In this way, the computational expense of each iteration can be effectively reduced.
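As an illustration, a randomized block coordinate descent step on a simple quadratic can be sketched as follows (a minimal NumPy sketch; the function names and the toy objective are our own, not from the paper or a specific library):

```python
import numpy as np

def rbcd_step(x, grad_fn, lr, block_size, rng):
    """One randomized block coordinate descent step: pick `block_size`
    coordinates uniformly at random and update only those, leaving the
    remaining coordinates fixed."""
    block = rng.choice(x.size, size=block_size, replace=False)
    g = grad_fn(x)            # in practice, only g[block] would be computed
    x = x.copy()
    x[block] -= lr * g[block]
    return x

# Toy objective: f(x) = 0.5 * ||x - c||^2, with gradient x - c.
rng = np.random.default_rng(0)
c = np.arange(10, dtype=float)
x = np.zeros(10)
for _ in range(2000):
    x = rbcd_step(x, lambda v: v - c, lr=0.5, block_size=3, rng=rng)
print(np.allclose(x, c, atol=1e-2))  # True: each selected update contracts the error
```

Each coordinate is selected with probability 3/10 per step, and every selected update halves that coordinate's error, so the iterates converge to the minimizer despite touching only a block per iteration.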
In this study, we propose a stochastic block adaptive gradient online learning (SBAG) algorithm to rapidly train DNNs, which incorporates an adaptive learning rate and a stochastic block coordinate approach to improve generalization ability and reduce computation cost. Our key contributions are as follows:
(i) We present the SBAG algorithm, based on the stochastic block coordinate descent method and the AdaBound optimization algorithm, to solve high-dimensional optimization problems.
(ii) We provide a theoretical analysis of the convergence of SBAG. Moreover, we show that SBAG is convergent in the convex setting under common assumptions and that its regret is bounded by $O(\sqrt{T})$, where $T$ is the time horizon.
(iii) We demonstrate the performance of SBAG on a public dataset. The simulation results show that the algorithm takes less time to achieve the best accuracy on the training and test sets, and that it outperforms other methods.
The rest of this study is arranged as follows. In the next two sections, we will review the extant literature and introduce related background. In Section 4, we will present SBAG in detail. In Sections 5 and 6, we will describe our convergence analysis and performance evaluation. Finally, we present the conclusion of this paper in Section 7.
2. Related Work
SGD is one of the most popular algorithms used in DNNs because it is easy to implement. However, it uses the same learning rate for all parameter updates at each iteration across the training process, and the parameters are updated to the same extent no matter how different the feature frequencies are, which consequently results in a slow convergence rate and poor performance. Hence, some variants of SGD were proposed to improve its convergence rate by either making the learning rate adaptive or using historical gradient information for the descent direction. Ghadimi et al. [26] used the Heavy-ball method to combine first-order historical gradients and current gradients for updates. Sutskever et al. [27] presented Nesterov's accelerated gradient (NAG) method. Duchi et al. [15] proposed AdaGrad, which first used an adaptive learning rate; however, AdaGrad performs worse in the case of dense gradients because all historical gradients are used in the updates, and this limitation is more severe when dealing with high-dimensional data in deep learning. Hinton [16] proposed RMSProp, which utilizes an exponential moving average to solve the problem that the learning rate drops sharply in AdaGrad. Zeiler [17] proposed AdaDelta, which prevents learning rate decay and gradient disappearance over time. Further research combined the adaptive learning rate with historical gradient information, as in Adam [18] and AMSGrad [21]. Adam has a good convergence rate in many scenarios. However, it was found that Adam may not converge in the later stage of the training process on account of an oscillating learning rate. Reddi et al. [21] presented AMSGrad, but its experimental results were not much better than Adam's. In general, Adam-type algorithms have better convergence performance but often do not generalize as well as SGD out of sample. To address this issue, Keskar and Socher [23] proposed the SWATS algorithm.
SWATS utilizes Adam in the early part of training and switches to SGD in the later stage. In this case, it enjoys the quick convergence rate of Adam and the good generalization of SGD, but the switching time is difficult to determine in practice. Huang et al. [28] presented NosAdam, which increases the effect of past gradients on the parameter update to avoid becoming trapped in local minima or diverging. Nevertheless, it depends heavily on the initial conditions. Padam [22] introduced a parameter that controls the level of adaptivity of the update process. Luo et al. [24] proposed the AdaBound algorithm, which provides a dynamic bound for the learning rate; AdaBound was evaluated on public datasets and shown to converge as fast as Adam and perform as well as SGD. However, the aforementioned methods need to calculate all coordinates of the gradients of the decision vectors at each iteration, and the computation cost is aggravated by high-dimensional data and complex model structures.
The randomized block coordinate descent method is a powerful and effective approach for high-dimensional optimization problems. It employs randomized strategies to pick a block of variables to update per iteration. For general gradient descent algorithms, all the coordinates of the gradient vector must be calculated each time. One can easily observe that this incurs significant computing cost when dealing with high-dimensional data. In contrast, the randomized block coordinate method only calculates one block of coordinates of the gradient vector, which is taken as the descent direction. In particular, the randomized block method selects a block of coordinates according to a probability distribution and updates the corresponding decision variables along the descent direction, while the other coordinates of the decision vector remain the same as at the previous step. Although the randomized block coordinate method can save significant computing cost for the learner, especially in optimization problems with high-dimensional data, it uses a fixed learning rate that scales the entries of the gradient equally; an adaptive learning rate had not been applied in this method.
Compared with the existing work, we combine the randomized block coordinate descent method with an adaptive learning rate in this study. At each iteration, a block of coordinates is picked at random, and the corresponding entries of the decision vector are updated. In this way, the gradient is calculated only on the chosen block of coordinates instead of the full gradient. Moreover, extreme learning rates are restricted to a suitable range. Our method not only enjoys good generalization performance but also saves computation cost.
3. Preliminaries
In this section, we first introduce the optimization problem in detail. Then, we present the background on the randomized block coordinate method.
3.1. The Online Optimization Problem
In this work, the analysis of the sequential iterative optimization problem is based on the online learning framework, which can be seen as a repeated game between a learner (the algorithm) and an opponent. In such an online convex setting, the learner selects a decision point $x_t \in \mathcal{X}$ produced by the algorithm at each time step $t$, where $\mathcal{X}$ is a convex and compact subset of $\mathbb{R}^d$. At the same time, the opponent responds to the decision of the learner with a loss function $f_t: \mathcal{X} \to \mathbb{R}$, which is convex and unknown in advance, and the algorithm suffers a loss $f_t(x_t)$. Repeating the process, we obtain a sequence of loss functions $f_1, f_2, \ldots, f_T$, which vary with time $t$. In general, the online learner's prediction problem can be represented as follows:
$$\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \tag{1}$$
For online learning tasks, the goal is to minimize the regret of the online learner's predictions against the optimal decision in hindsight, which is defined as the difference between the total sum of losses suffered by the online learner over $T$ rounds and the minimum total loss attainable at a fixed decision point. In particular, we define the regret as follows:
$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \tag{2}$$
and we write $x^* \in \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$ for the best fixed decision in hindsight. It is desired that the regret of an online optimization algorithm be a sublinear function of $T$, i.e., $\lim_{T \to \infty} R(T)/T = 0$; then, on average, the online learner performs as well as the fixed optimal decision in hindsight. In other words, the proposed algorithm converges when its regret is sublinearly bounded. Throughout this study, the diameter of the convex compact set $\mathcal{X}$ is assumed to be bounded, and $\|\nabla f_t(x)\|$ is bounded for all $t$. Hereafter, $\|\cdot\|$ denotes the $\ell_2$ norm.
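As a concrete illustration of the regret notion above, the following sketch runs projected online gradient descent on simple quadratic losses and measures the regret against the best fixed decision in hindsight (the learner, losses, and step-size schedule are illustrative choices of ours, not the paper's algorithm):

```python
import numpy as np

# Online gradient descent on losses f_t(x) = (x - z_t)^2 over X = [0, 1].
rng = np.random.default_rng(1)
T = 5000
z = rng.uniform(0.0, 1.0, T)
x = 0.0
losses = []
for t in range(T):
    losses.append((x - z[t]) ** 2)           # loss suffered at round t
    g = 2.0 * (x - z[t])                     # gradient revealed after committing
    x = float(np.clip(x - 0.5 / np.sqrt(t + 1) * g, 0.0, 1.0))  # projected step

# Best fixed decision in hindsight (grid search over X).
best = min(float(np.sum((z - u) ** 2)) for u in np.linspace(0.0, 1.0, 1001))
regret = sum(losses) - best
print(regret / T)  # small average regret, consistent with sublinear R(T)
```

The average regret $R(T)/T$ shrinks as $T$ grows, which is exactly the sublinearity property discussed above.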
3.2. Relevant Definitions
Now, we will describe the relevant definitions that are used in the next sections.
Definition 1. A function $f: \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz, where $L > 0$ is the Lipschitz constant, if, for all $x, y \in \mathcal{X}$,
$$|f(x) - f(y)| \le L\|x - y\|. \tag{3}$$
Definition 2. (Equation (3.2) of Section 3 in [29]) A function $f: \mathcal{X} \to \mathbb{R}$ is convex and differentiable, where $\mathcal{X}$ is a convex set, if, for all $x, y \in \mathcal{X}$,
$$f(y) \ge f(x) + \nabla f(x)^{\top}(y - x). \tag{4}$$
Definition 3. A function $f: \mathcal{X} \to \mathbb{R}$ is $\mu$-strongly convex and differentiable, with $\mu > 0$, if, for all $x, y \in \mathcal{X}$,
$$f(y) \ge f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2. \tag{5}$$
4. SBAG Algorithm and Assumptions
This section presents the proposed algorithm, followed by the common assumptions for convergence analysis of the algorithm.
4.1. Algorithm Design
In this study, we consider high-dimensional online learning problems and aim to solve the optimization problem (1) by incorporating the stochastic block coordinate method and an adaptive learning rate. Because the dimensionality of the decision variable is high, the computing cost of the gradients is prohibitive. In addition, the tuning of the learning rate is challenging. For these reasons, a stochastic block coordinate adaptive optimization algorithm, dubbed SBAG, is proposed for settling the online problem (1). In our algorithm, the objective functions at different times satisfy some conditions, which are displayed in Assumption 1.
SBAG is described in Algorithm 1, whose input includes the initial point $x_1$, the step size $\alpha$, and the block selection probability $p$. The parameters of SBAG are $\beta_1$, $\beta_2$, $\eta_l$, and $\eta_u$, where $\beta_1, \beta_2 \in [0, 1)$. At each round $t$, a $d$th-order diagonal matrix $P_t$ is generated whose diagonal entries $p_{t,i} \in \{0, 1\}$ are independent random variables with $\Pr(p_{t,i} = 1) = p$, for $i = 1, \ldots, d$ and $t = 1, \ldots, T$. In particular, the gradient is computed as follows:
$$\hat{g}_t = P_t \nabla f_t(x_t), \tag{6}$$
where the elements of $P_t$ consist of 0 and 1. When $p_{t,i} = 1$, the $i$th coordinate of the decision vector is selected to calculate the gradient at time $t$. From (6), one can observe that the computation cost is greatly reduced at each iteration. In addition, let $\mathcal{F}_t$ denote the $\sigma$-algebra consisting of all variables before time $t$; more explicitly, $\mathcal{F}_t = \sigma(x_1, P_1, \ldots, P_{t-1})$.
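The masked gradient computation can be sketched as follows (a minimal NumPy illustration; in a real implementation only the selected coordinates' partial derivatives would be computed, whereas this sketch masks a precomputed full gradient):

```python
import numpy as np

def block_gradient(full_grad, p, rng):
    """Apply a random diagonal 0/1 mask: each coordinate of the gradient is
    kept independently with probability p and zeroed otherwise."""
    mask = rng.random(full_grad.size) < p    # diagonal entries of P_t
    return np.where(mask, full_grad, 0.0)    # hat{g}_t = P_t g_t

rng = np.random.default_rng(0)
g = np.array([0.5, -1.0, 2.0, 0.1])          # a full gradient at time t
g_hat = block_gradient(g, p=0.5, rng=rng)
print(g_hat)  # each entry is either the original value or 0
```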

Using $\hat{g}_t$, the first- and second-order momentum terms $m_t$ and $v_t$ are obtained as follows, respectively:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\hat{g}_t, \tag{7}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\hat{g}_t^2. \tag{8}$$
Furthermore, SBAG introduces a bound on the learning rate as follows:
$$\hat{\eta}_t = \mathrm{Clip}\left(\frac{\alpha}{\sqrt{v_t}}, \eta_l(t), \eta_u(t)\right), \tag{9}$$
where each element of the learning rate is clipped so as to be constrained in an interval at time $t$, and the lower and upper bounds of the interval are $\eta_l(t)$ and $\eta_u(t)$, respectively. That is, the output of equation (9) is constrained in $[\eta_l(t), \eta_u(t)]$; this technique was also used in [23, 24]. Moreover, let
$$\eta_t = \frac{\hat{\eta}_t}{\sqrt{t}}. \tag{10}$$
Then, SBAG updates $x_t$ as follows:
$$x_{t+1} = \Pi_{\mathcal{X}}\left(x_t - \eta_t \odot m_t\right), \tag{11}$$
where $\odot$ is the coordinate-wise product operator. Furthermore, the projection step of equation (11) is equivalent to the following:
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\|x - (x_t - \eta_t \odot m_t)\right\|^2. \tag{12}$$
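The whole update step can be sketched as follows: a minimal NumPy sketch of one SBAG-style iteration, combining the random block mask, the two momentum terms, the clipped learning rate, and the projection. The function name, default values, and box-shaped feasible set are illustrative assumptions of ours, not the paper's exact specification:

```python
import numpy as np

def sbag_step(x, grad, state, t, p, rng, alpha=0.1, beta1=0.9, beta2=0.999,
              eta_l=lambda t: 0.0, eta_u=lambda t: np.inf, lo=-10.0, hi=10.0):
    """One SBAG-style iteration: random block mask, Adam-style moments,
    AdaBound-style clipped step size, and projection onto the box [lo, hi]^d."""
    m, v = state
    mask = (rng.random(x.size) < p).astype(float)   # diagonal of P_t
    g_hat = mask * grad                             # masked gradient
    m = beta1 * m + (1 - beta1) * g_hat             # first-order moment
    v = beta2 * v + (1 - beta2) * g_hat ** 2        # second-order moment
    eta = np.clip(alpha / (np.sqrt(v) + 1e-8), eta_l(t), eta_u(t)) / np.sqrt(t)
    x_new = np.clip(x - eta * m, lo, hi)            # projected update
    return x_new, (m, v)

# Toy run: minimize f(x) = ||x - 1||^2 over the box [-10, 10]^5.
rng = np.random.default_rng(0)
x = np.full(5, 5.0)
state = (np.zeros(5), np.zeros(5))
for t in range(1, 3001):
    grad = 2.0 * (x - 1.0)
    x, state = sbag_step(x, grad, state, t, p=0.5, rng=rng)
print(np.round(x, 2))  # near the minimizer at 1
```

Even though only about half of the coordinates are updated per round, the adaptive, clipped step sizes drive all coordinates toward the minimizer.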
4.2. Assumptions
Before presenting the convergence analysis of SBAG, we now introduce the following common assumptions.
Assumption 1. The loss functions $f_t$, $t = 1, \ldots, T$, are convex, differentiable, and Lipschitz over $\mathcal{X}$.
Assumption 2. In this study, $\mathcal{X}$ is a bounded feasible set; i.e., $\|x - y\| \le D$ for all $x, y \in \mathcal{X}$, where $D > 0$.
Assumption 3. In this study, $\nabla f_t(x)$ is bounded for all $t$ over $\mathcal{X}$; i.e., $\|\nabla f_t(x)\|_\infty \le G_\infty$, where $G_\infty > 0$.
Assumptions 1–3 are standard assumptions in the literature; see, for example, [18, 21, 24]. The convergence of SBAG is analyzed based on these assumptions in the following.
5. Convergence Analysis
Now, we will analyze the convergence of SBAG. We consider the regret, equation (2), in the online optimization problem (a typical scenario). The proposed algorithm generates the gradient $\hat{g}_t$ with probability $p$ at time $t$; therefore, $\hat{g}_t$ is a random variable. Moreover, $x_{t+1}$ is calculated from $\hat{g}_t$ and $x_t$ at time $t$. Since the iterates are thus random, expectations must be taken. Therefore, we define the regret of SBAG as follows:
$$\overline{R}(T) = \mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t)\right] - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \tag{13}$$
From the convexity of $f_t$, it follows that
$$f_t(x_t) - f_t(x^*) \le \nabla f_t(x_t)^{\top}(x_t - x^*), \tag{14}$$
where $x^*$ denotes the best fixed decision in hindsight.
Moreover, by the definition of the matrix $P_t$, we know that $P_t$ is a sparse matrix. Therefore, applying equation (14) leads to
Taking the conditional expectation (conditioned on $\mathcal{F}_t$) of both sides of equation (15) implies that
By equation (1.1f) of Section 4 in [30], and taking the unconditional expectation of equation (16), it follows that
From equations (13) and (17), the following holds:
To obtain the bound of the regret, we consider the two terms on the right side of equation (18). Thus, we first propose the following lemma to estimate the first term.
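The key step above, passing from the masked gradient to $p$ times the full gradient under the conditional expectation, can be sanity-checked with a quick Monte Carlo experiment (an illustration under the assumption that each diagonal entry of $P_t$ is an independent Bernoulli($p$) variable):

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([1.0, -2.0, 0.5])               # a fixed gradient vector
p, N = 0.3, 200_000
masks = rng.random((N, g.size)) < p          # N independent diagonals of P_t
est = (masks * g).mean(axis=0)               # Monte Carlo estimate of E[P_t g]
print(est)  # approximately p * g = [0.3, -0.6, 0.15]
```

Since $\mathbb{E}[P_t \mid \mathcal{F}_t] = pI$, the expectation of the masked gradient is exactly $p$ times the full gradient, which is what the estimate above reproduces.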
Lemma 1. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we have the following relation:
Proof. From equations (9) and (10), it follows that

and

From equations (20) and (21), and by the property of expectation, it can be verified that

Plugging equation (7) into equation (22) yields

By the Cauchy–Schwarz inequality, we further bound term (a) of equation (23) and have

The second inequality of equation (24) follows from the fact that  holds for all . In addition, the third inequality of equation (24) is due to the inequality . Moreover, plugging equation (24) into equation (23) leads to

Moreover, since , by equation (25), it follows that

Therefore, the proof of Lemma 1 is completed. Next, we introduce Lemma 2 to estimate the second term.
Lemma 2. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we have the following:
Proof. Let  with . By equations (11) and (12), the following holds:

Using Lemma 3 of [31], it can be proved that

Substituting equation (7) into equation (29) yields

Rearranging the terms of equation (30), it follows that

Applying Young's inequality and the Cauchy–Schwarz inequality to equation (31) leads to

Summing equation (32) over  and taking the expectation of the obtained relation imply that

By Lemma 1 and equation (33), it follows that

Since  and , we have . Therefore, we further obtain . Then, from equation (34), it can be proved that

Applying Assumption 2 and the property of expectation yields

Therefore, the proof of Lemma 2 is completed. Next, we estimate the last term in (18).
Lemma 3. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we attain the following inequality:
Proof. For the original full gradient, we have . Let , , and , which are generated by AdaBound [24].
The proof of Lemma 3 is similar to that of Theorem 4 in [24]. Starting from the following inequality, we obtain

Therefore, the proof of Lemma 3 is finished.
To attain the bound of regret in equation (18), we establish Theorem 1 as follows.
Theorem 1. Suppose that Assumptions 1 to 3 are satisfied and that the sequences $\{x_t\}$, $\{m_t\}$, and $\{v_t\}$ are generated by SBAG with $\beta_1, \beta_2 \in [0, 1)$, where $\mathcal{X}$ is a convex and compact set and $p$ is the block selection probability. Then, we obtain the following bound on the regret:
Proof. Applying Lemmas 1, 2, and 3 to (18) yields

Therefore, we complete the proof of Theorem 1.
From Theorem 1, we obtain $\lim_{T \to \infty} \overline{R}(T)/T = 0$. This suggests that SBAG is convergent. In addition, the bound of the regret is $O(\sqrt{T})$; i.e., given some accuracy $\epsilon$, at least on the order of $O(1/\epsilon^2)$ iterations are required to achieve the given accuracy.
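The iteration-complexity statement follows from equating the average regret with the target accuracy; under a sublinear bound $\overline{R}(T) \le C\sqrt{T}$ for some constant $C$ (a generic constant standing in for the constants of Theorem 1):

```latex
\frac{\overline{R}(T)}{T} \le \frac{C\sqrt{T}}{T} = \frac{C}{\sqrt{T}} \le \epsilon
\quad \Longleftrightarrow \quad
T \ge \frac{C^{2}}{\epsilon^{2}},
```

so achieving an average regret of at most $\epsilon$ requires $T = O(1/\epsilon^2)$ iterations.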
6. Performance Evaluation
In this section, we perform experiments on a public dataset to evaluate the performance of the algorithm objectively. We consider multiclass classification, a typical machine learning problem, using DNNs for the experiments.
6.1. Setup
To assess our SBAG algorithm, we study its performance on classification tasks. We use the CIFAR-10 [32] dataset for our experiments, which is widely used for classification problems. It consists of 10 classes, with 50,000 training samples and 10,000 test samples.
For the experiments, we use convolutional neural networks, which perform well on image classification and object recognition, to solve classification tasks on the CIFAR-10 image dataset; specifically, we implement ResNet34 [33] and DenseNet121 [34].
6.2. Parameters
To study the performance of our proposed algorithm, we compare SBAG with SGD [14], AdaGrad [15], and AdaBound [24]. The hyperparameters of these algorithms are initialized as follows.
For SGD, the scale of the learning rate is selected from a set of candidate values. AdaGrad uses an initial learning rate chosen from a candidate set, and the initial accumulator value of AdaGrad is set to 0. The hyperparameters of AdaBound are set to the same values as those of Adam. We directly use the initialized hyperparameter values of AdaBound in our algorithm. In addition, the probability $p$ of choosing a coordinate is selected from a set of candidate values.
In addition, following [24], we define the dynamic bound functions for our simulation experiments as
$$\eta_l(t) = 0.1 - \frac{0.1}{(1 - \beta_2)t + 1}$$
and
$$\eta_u(t) = 0.1 + \frac{0.1}{(1 - \beta_2)t}.$$
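The two bound functions can be sketched in a few lines (assuming, as in the AdaBound paper [24], a final learning rate of 0.1; the function and parameter names are our own):

```python
import numpy as np

def eta_l(t, beta2=0.999, final_lr=0.1):
    """Lower bound: rises from near 0 toward final_lr as t grows."""
    return final_lr - final_lr / ((1 - beta2) * t + 1)

def eta_u(t, beta2=0.999, final_lr=0.1):
    """Upper bound: falls from a large value toward final_lr as t grows."""
    return final_lr + final_lr / ((1 - beta2) * t)

for t in [1, 100, 10000]:
    print(t, eta_l(t), eta_u(t))
# Both bounds squeeze toward final_lr = 0.1, so the clipped step size
# gradually transitions from Adam-like to SGD-like behavior.
```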
6.3. Results
We consider the image multiclass classification problem on the CIFAR-10 dataset using ResNet34 and DenseNet121 and run 200 epochs in this experiment. First, we run a group of experiments measuring epochs against runtime for ResNet34 and DenseNet121 on CIFAR-10. The findings are reported in Figure 1: when completing the same 200 epochs, our method takes the least time, and AdaBound takes the most. The main reason is that only a few blocks of coordinates are calculated in the gradient descent process of our algorithm at each iteration, while the compared algorithms calculate the full gradients at each iteration. Moreover, AdaBound combines first- and second-order momentum, while SGD and AdaGrad only use first-order gradients; thus, SGD and AdaGrad incur less time than AdaBound. The same results can be seen for DenseNet121 in Figure 1(b).
We present another group of experiments measuring average loss against running time, executed for ResNet34 and DenseNet121 on CIFAR-10. The findings are shown in Figure 2. At about 150 epochs, SGD has the largest average loss among the four algorithms and decreases sharply after that point, while the average loss of SBAG is smaller than that of the others and finally reaches the minimum value in the shortest running time. The fast descent rate of SBAG is due to the randomized block method, which chooses one block of coordinates of the decision vector to calculate the gradient. In other words, SBAG processes more samples than the compared algorithms in the same running time. Therefore, the convergence of SBAG is verified by the findings presented in Figure 2.
In Figures 3 and 4, the training and test accuracy against running time of the four algorithms are evaluated. As we can see, at about 150 epochs, AdaBound achieves the highest accuracy, while AdaGrad and our algorithm have accuracies of 92.36% and 93.99%, respectively. As training proceeds, AdaBound and SBAG reach accuracies of 99.96% and 99.93%, respectively. Similar results can be seen for DenseNet121. In a word, SBAG works well on both the training and test sets and, at the same time, has good generalization ability on both ResNet34 and DenseNet121.
From the experiments above, we observe that SBAG shows very good performance on both ResNet34 and DenseNet121. It incurs less computation cost per iteration in the experiments, which is consistent with the theory.
7. Conclusion
In this study, we proposed a randomized block adaptive gradient online learning algorithm, SBAG, designed to reduce the gradient computation cost for high-dimensional decision vectors. The convergence analysis of SBAG and the evaluations on CIFAR-10 demonstrated that the regret bound of SBAG is $O(\sqrt{T})$ when the loss functions are convex and that significant computation cost savings are achieved without adversely affecting the performance of the optimizer. Over the same 200 epochs, the proposed algorithm has the least running time and a slightly lower average loss in the end. The training accuracy for ResNet34 and DenseNet121 is 99.93% and 99.72%, respectively, slightly less than the 99.96% of AdaBound, but our method reaches higher test accuracy than AdaBound, SGD, and AdaGrad; i.e., SBAG is the fastest of the four methods, and its curves are smoother than those of SGD.
Data Availability
The data that support the findings of this study are from CIFAR-10, which is available from [32].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.
Authors’ Contributions
Jianghui Liu and Baozhu Li contributed equally.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC), under Grant nos. 61976243 and 61871430, the Leading Talents of Science and Technology in the Central Plain of China, under Grant no. 214200510012, the Scientific and Technological Innovation Team of Colleges and Universities in Henan Province, under Grant no. 20IRTSTHN018, the Basic Research Projects in the University of Henan Province, under Grant no. 19zx010, the Key Scientific Research Projects of Colleges and Universities in Henan Province, under Grant no. 22A520005, the National Natural Science Foundation of China, under Grant no. 61901191, and the Shandong Provincial Natural Science Foundation, under Grant no. ZR2020LZH005.