A Stochastic Trust Region Method for Unconstrained Optimization Problems
In this paper, a stochastic trust region method is proposed to solve unconstrained minimization problems with stochastic objectives. In particular, this method can deal with nonconvex problems. At each iteration, we construct a quadratic model of the objective function. In the model, stochastic gradients are used in place of deterministic gradients, both for the determination of descent directions and for the approximation of the Hessians of the objective function. The behavior and the convergence properties of the proposed method are discussed under some reasonable conditions. Some preliminary numerical results show that our method is potentially efficient.
The computational complexity of learning algorithms becomes the critical limiting factor when one envisions very large datasets. This has motivated many stochastic optimization algorithms for large-scale problems. To a certain extent, stochastic optimization is effective for solving problems arising in stochastic systems. A number of industrial, biological, engineering, and economic problems can be modeled as stochastic systems, for example, in communications, genetics, signal processing, geography, civil engineering, aerospace, and banking. In particular, stochastic optimization algorithms are used to optimize an objective function over a set of feasible values in situations where the objective function is defined as an expectation over a set of random functions. To be more precise, consider an optimization variable x ∈ Rⁿ and a random variable θ ∈ Θ ⊆ Rᵖ that determines the choice of a function f(x, θ): Rⁿ × Θ → R. The stochastic optimization problems considered in this paper entail determination of the argument x* that minimizes the expected value F(x) := E_θ[f(x, θ)]:
x* := argmin_x E_θ[f(x, θ)] = argmin_x F(x). (1)
We refer to f(x, θ) as the random or instantaneous functions and to F(x) as the average function. We assume that the instantaneous function f(x, θ) is continuously differentiable in x for all θ, from which it follows that the average function F is also continuously differentiable. Problems having the form in (1) are often used in machine learning [1–3] as well as in optimal resource allocation in wireless systems [4, 5].
Many descent algorithms can be used for solving problems such as (1). However, descent methods require the exact gradient of the objective function, i.e., ∇F(x), which is generally impracticable as a result of the very large dataset. Stochastic gradient descent (SGD) methods overcome this obstacle by using a gradient approximation based on small data samples and are regarded as the workhorse methodology for large-scale stochastic optimization [3, 6–9]. But the practical appeal of SGD remains limited, since the number of iterations required to approximate optimal arguments can be prohibitive in high-dimensional problems. In fact, SGD inherits slow convergence from its use of gradients, which is exacerbated by their substitution with stochastic approximations; consecutive stochastic gradients may vary considerably or even point in opposite directions. Several accelerations of SGD have been presented, and many recent works have focused on reducing the randomness in SGD by combining gradients with stochastic gradients or by updating the descent direction so that it gradually approaches the true gradient. For example, the Stochastic Average Gradient (SAG) method achieves a faster convergence rate than SGD by incorporating a memory of previous gradient values; it often outperforms existing SGD methods and has been shown to exhibit linear convergence. The Semistochastic Gradient (S2GD) method runs for one or several epochs, in each of which a single full gradient and a random number of stochastic gradients are computed, following a geometric law. These algorithms end up showing a faster asymptotic convergence rate than SGD; however, practical numerical experiments show that reducing randomness is of little use for problems with a challenging curvature profile.
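As a concrete illustration of the SGD baseline discussed above, the following is a minimal sketch (not the exact method of any cited work) on a toy stochastic quadratic; the function names and the parameters `lr` and `batch` are illustrative choices.

```python
import numpy as np

def sgd(grad_sample, x0, n_iters=1000, lr=0.05, batch=10, seed=0):
    """Minimal SGD: each step averages `batch` sampled gradients and moves
    against that stochastic gradient with a fixed step size `lr`."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = np.mean([grad_sample(x, rng) for _ in range(batch)], axis=0)
        x = x - lr * g
    return x

# Toy instance: f(x, theta) = 0.5 * ||x - theta||^2 with E[theta] = (1, ..., 1),
# so the average function F is minimized at x = (1, ..., 1).
def grad_sample(x, rng):
    theta = 1.0 + 0.1 * rng.standard_normal(x.shape)
    return x - theta

x_out = sgd(grad_sample, np.zeros(3))
```

Because the stochastic gradients are noisy, the iterates hover around the minimizer rather than converging exactly, which is the slow-convergence behavior described above.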
To overcome problems with the objective function’s curvature, one may consider a stochastic version of Newton’s method. But computing approximate Newton steps is difficult except in problems with specific structures, precisely because of that challenging curvature. As a recourse, a quasi-Newton scheme, the regularized stochastic BFGS (RES) method, has been proposed. RES utilizes stochastic gradients in lieu of deterministic gradients both for determining descent directions and for approximating the objective function’s curvature. RES has been shown to outperform SGD on large-dimensional problems and ill-conditioned functions. However, RES inherits the requirement that the Hessian matrices of the objective function be positive definite, which means that strong convexity of the objective function is indispensable to guarantee provable convergence. Thus, RES can only be used to solve strongly convex stochastic optimization problems, whereas many practical applications give rise to nonconvex problems. So, in this paper, we develop an effective method for solving more general stochastic optimization problems, i.e., problems whose objective functions can be nonconvex.
Trust region methods are deemed invaluable tools for solving nonlinear and nonconvex optimization problems [15–17]. The main idea of a typical trust region method is to construct a quadratic model of the objective function and to choose a step as the approximate minimizer of this model within the trust region around the current point, i.e., the region in which the model is trusted to be an adequate approximation of the objective function. During the iterations, the trust region radii are adjusted depending on the agreement between the model functions and the objective functions.
In this paper, we develop a stochastic version of the trust region method, where the Hessian matrices are approximated by the regularized stochastic BFGS update. Similar to classical trust region methods, we use a suitable quadratic model to replace the complex objective function. In the model, deterministic gradients are replaced by stochastic gradients, and the exact Hessian matrices of the objective function are approximated by stochastic approximations. Since stochastic gradients are computable at manageable cost, the stochastic trust region method is feasible in practice. Convergence theory and numerical results verify the reliability of our method for both convex and nonconvex optimization problems.
The rest of the paper is organized as follows. The new algorithm is illustrated in Section 2. In Section 3, convergence of the algorithm is established under suitable conditions. Some preliminary numerical results are reported in Section 4. Finally, some conclusions are given in Section 5.
2. Stochastic Trust Region Algorithm
Considering that the objective function F is twice continuously differentiable and further assuming that the instantaneous functions f(·, θ) have finite gradients, it follows that the gradients of F are given by
∇F(x) = E_θ[∇f(x, θ)]. (2)
When the number of functions is large, as is the case in most problems of practical interest, exact evaluation of the gradient is impractical. This motivates the use of stochastic gradients in lieu of exact gradients. In addition, the Hessian matrices of F are given by
∇²F(x) = E_θ[∇²f(x, θ)]. (3)
More precisely, consider a given set of L realizations θ̃ = [θ₁; ...; θ_L] and define the stochastic gradient of F at x given samples θ̃ as
s(x, θ̃) := (1/L) Σ_{l=1}^{L} ∇f(x, θ_l). (4)
Consider the trust region subproblem
min_{d ∈ Rⁿ} m_k(d) := s(x_k, θ̃_k)ᵀd + (1/2) dᵀB_k d,  s.t. ‖d‖ ≤ Δ_k, (5)
where s(x_k, θ̃_k) is the stochastic gradient in (4), Δ_k > 0 is the trust region radius, and B_k is a stochastic approximation of the Hessian matrix of F at x_k (see Remark 2 for details). The model is minimized approximately to produce a step d_k whose norm is less than or equal to Δ_k. In other words, the solution of this subproblem represents a step toward minimizing the model of the objective function at x_k. The ratio ρ_k of the actual reduction of the objective function to the predicted reduction is then calculated. If ρ_k ≥ η, where η ∈ (0, 1), the trial step d_k is accepted, which is called a successful step; we let x_{k+1} = x_k + d_k, and in this case the trust region radius is increased or remains unchanged. Otherwise, d_k is rejected, which is called an unsuccessful step: x_{k+1} = x_k remains unchanged and the trust region radius is reduced.
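A standard inexpensive way to solve such a subproblem approximately is the Cauchy point, i.e., the minimizer of the model along the steepest-descent direction within the radius. The following sketch assumes `g` is the stochastic gradient and `B` the Hessian approximation; it is a generic textbook construction, not code from the paper.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Approximate minimizer of m(d) = g@d + 0.5*d@B@d over ||d|| <= delta.

    Steps along -g; the step length follows the standard Cauchy-point
    formula (tau = 1 when the curvature along g is nonpositive)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gBg = float(g @ B @ g)
    if gBg <= 0.0:
        tau = 1.0
    else:
        tau = min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g
```

When the radius is large and `B` is the identity, the Cauchy point reduces to the full gradient step; when the radius is small, the step is clipped to the trust region boundary.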
Algorithm 1 (stochastic trust region). Initialization. Choose an initial point x₀ ∈ Rⁿ, a symmetric positive definite matrix B₀ ∈ Rⁿˣⁿ, and an initial radius Δ₀ ∈ (0, Δ̄], where Δ̄ > 0 is the maximal radius; choose constants η ∈ (0, 1) and 0 < γ₁ < 1 < γ₂. Set k = 0.
Step 1. Acquire L independent samples θ̃_k = [θ_{k1}; ...; θ_{kL}] and calculate the stochastic gradient s(x_k, θ̃_k) as in (4).
Step 2. Solve trust region subproblem (5), giving a trial step d_k.
Step 3. Compute ρ_k = ared_k / pred_k, where ared_k = F̂(x_k, θ̃_k) − F̂(x_k + d_k, θ̃_k) and pred_k = m_k(0) − m_k(d_k), with F̂(x, θ̃_k) := (1/L) Σ_{l=1}^{L} f(x, θ_{kl}) the sampled (instantaneous) objective. If ρ_k ≥ η, set x_{k+1} = x_k + d_k; otherwise, set x_{k+1} = x_k.
Step 4. Update the trust region radius:
If ρ_k < η, set Δ_{k+1} = γ₁Δ_k;
If ρ_k ≥ η and ‖d_k‖ = Δ_k, set Δ_{k+1} = min{γ₂Δ_k, Δ̄};
otherwise, set Δ_{k+1} = Δ_k.
Step 5. When ρ_k ≥ η, update the Hessian approximation matrix to acquire B_{k+1}; otherwise, keep the matrix unchanged, B_{k+1} = B_k. Set k = k + 1; go to Step 1.
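The loop of Algorithm 1 can be sketched as follows. The constants (`eta`, `gamma1`, `gamma2`), the Cauchy-point subproblem solver, and the toy quadratic instance at the bottom are all illustrative stand-ins rather than the paper's exact choices, and the Hessian approximation is kept as the identity for brevity (the regularized BFGS update of Remark 2 is omitted here).

```python
import numpy as np

def stochastic_trust_region(f_hat, grad_hat, x0, n_iters=200, delta0=1.0,
                            delta_max=10.0, eta=0.1, gamma1=0.5, gamma2=2.0,
                            seed=0):
    """Sketch of Algorithm 1. f_hat(x, rng) and grad_hat(x, rng) return the
    sampled objective value and stochastic gradient for the drawn samples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    delta, B = delta0, np.eye(x.size)
    for _ in range(n_iters):
        # Step 1: draw one batch of samples, reused for gradient and ratio.
        seed_k = int(rng.integers(2**32))
        g = grad_hat(x, np.random.default_rng(seed_k))
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-12:
            break
        # Step 2: approximate subproblem solution via the Cauchy point.
        gBg = float(g @ B @ g)
        tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
        d = -tau * (delta / gnorm) * g
        # Step 3: actual vs. predicted reduction on the sampled objective.
        ared = (f_hat(x, np.random.default_rng(seed_k))
                - f_hat(x + d, np.random.default_rng(seed_k)))
        pred = -(g @ d + 0.5 * d @ B @ d)
        rho = ared / pred if pred > 0 else -1.0
        # Step 4: accept/reject the step and update the radius.
        if rho >= eta:
            x = x + d
            if np.isclose(np.linalg.norm(d), delta):
                delta = min(gamma2 * delta, delta_max)
        else:
            delta = gamma1 * delta
    return x

# Toy stochastic quadratic with average minimizer at (1, ..., 1).
def f_hat(x, rng):
    thetas = 1.0 + 0.1 * rng.standard_normal((20, x.size))
    return float(np.mean(np.sum(0.5 * (x - thetas) ** 2, axis=1)))

def grad_hat(x, rng):
    thetas = 1.0 + 0.1 * rng.standard_normal((20, x.size))
    return x - thetas.mean(axis=0)

x_opt = stochastic_trust_region(f_hat, grad_hat, np.zeros(3))
```

Reusing the same sample seed for the gradient and for both objective evaluations in a single iteration mirrors the algorithm's use of one sample set θ̃_k per iteration.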
Remark 2. The matrix B_k can be updated by the regularized stochastic BFGS formula as follows:
B_{k+1} = B_k − (B_k v_k v_kᵀB_k)/(v_kᵀB_k v_k) + (r_k r_kᵀ)/(v_kᵀr_k) + δI,
where δ > 0 is a constant, and v_k = x_{k+1} − x_k and r_k = s(x_{k+1}, θ̃_k) − s(x_k, θ̃_k) − δv_k denote the variable variation and the corrected stochastic gradient variation at time k, respectively. The addition of the regularization term δI and the corrected stochastic gradient variation r_k avoids the near-singularity problems of more straightforward extensions.
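A sketch of this update, assuming the RES-style form B⁺ = B − Bvvᵀ B/(vᵀBv) + rrᵀ/(vᵀr) + δI with corrected variation r = s(x⁺) − s(x) − δv; the function name and the quadratic check at the bottom are illustrative.

```python
import numpy as np

def res_update(B, x_old, x_new, g_old, g_new, delta_reg=1e-2):
    """Regularized stochastic BFGS update (RES-style sketch).

    g_old, g_new are stochastic gradients at x_old, x_new computed with the
    SAME samples; delta_reg is the regularization constant delta."""
    v = x_new - x_old                    # variable variation
    r = g_new - g_old - delta_reg * v    # corrected gradient variation
    Bv = B @ v
    return (B
            - np.outer(Bv, Bv) / (v @ Bv)
            + np.outer(r, r) / (v @ r)
            + delta_reg * np.eye(B.shape[0]))

# Example: on a quadratic with Hessian diag(2, 3) (so gradients are H x),
# the updated matrix satisfies the modified secant condition B1 @ v = g_new - g_old.
B1 = res_update(np.eye(2), np.zeros(2), np.ones(2),
                np.zeros(2), np.array([2.0, 3.0]), delta_reg=0.01)
```

Note that the regularization δI and the correction −δv cancel in the secant direction, so the update still reproduces the observed gradient variation along v while keeping the matrix bounded away from singularity.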
3. Convergence Analysis
In this section, we prove that the iterative sequence generated by Algorithm 1 is convergent. For the subsequent analysis, we define the instantaneous objective function associated with the samples θ̃ = [θ₁; ...; θ_L] as
F̂(x, θ̃) := (1/L) Σ_{l=1}^{L} f(x, θ_l). (9)
Note that the stochastic gradient in (4) satisfies s(x, θ̃) = ∇F̂(x, θ̃). The definition of the instantaneous objective function, in association with the fact that the samples are drawn independently from the distribution of θ, implies
E_θ̃[F̂(x, θ̃)] = F(x). (10)
In order to prove the global convergence, we make the following assumptions.
A1 The instantaneous objective function F̂(x, θ̃) is twice continuously differentiable in x.
A2 The level set Ω = {x ∈ Rⁿ : F(x) ≤ F(x₀)} is bounded. Moreover, the function F is bounded below in Ω.
A3 The stochastic gradient is an unbiased estimator of the gradient, in the sense that E_θ̃[s(x, θ̃)] = ∇F(x); moreover, there exists a positive constant M such that, for all x ∈ Ω, it holds that E_θ̃[‖s(x, θ̃)‖²] ≤ M².
A4 There exists a constant M_B > 0 such that ‖B_k‖ ≤ M_B for all k.
As a consequence of A1, the function F is also twice continuously differentiable, owing to the linearity of the expectation operator and the expression in (10).
Lemma 3. Assume that A1, A3, and A4 hold; if d_k is a solution or approximate solution of subproblem (5), then we have
E[m_k(0) − m_k(d_k) | x_k] ≥ c‖∇F(x_k)‖ min{Δ_k, ‖∇F(x_k)‖/M_B},
where c > 0 is a constant and E[· | x_k] denotes the conditional expectation given x_k.
Proof. Considering subproblem (5) and writing s_k = s(x_k, θ̃_k), the Cauchy point can be expressed by
d_k^C = −τ_k (Δ_k/‖s_k‖) s_k, (13)
where
τ_k = 1 if s_kᵀB_k s_k ≤ 0, and τ_k = min{‖s_k‖³/(Δ_k s_kᵀB_k s_k), 1} otherwise. (14)
Case 1. When s_kᵀB_k s_k ≤ 0, it follows from (14) that τ_k = 1; thus, we have
m_k(0) − m_k(d_k^C) = Δ_k‖s_k‖ − (Δ_k²/(2‖s_k‖²)) s_kᵀB_k s_k ≥ Δ_k‖s_k‖ ≥ ‖s_k‖ min{Δ_k, ‖s_k‖/M_B}. (15)
Case 2. When s_kᵀB_k s_k > 0 and ‖s_k‖³/(Δ_k s_kᵀB_k s_k) ≤ 1, it follows from (14) that τ_k = ‖s_k‖³/(Δ_k s_kᵀB_k s_k); then, we have
m_k(0) − m_k(d_k^C) = ‖s_k‖⁴/(2 s_kᵀB_k s_k) ≥ ‖s_k‖²/(2‖B_k‖) ≥ (1/2)‖s_k‖ min{Δ_k, ‖s_k‖/M_B}. (16)
Case 3. When s_kᵀB_k s_k > 0 and ‖s_k‖³/(Δ_k s_kᵀB_k s_k) > 1, it follows from (14) that τ_k = 1, and we have
m_k(0) − m_k(d_k^C) = Δ_k‖s_k‖ − (Δ_k²/(2‖s_k‖²)) s_kᵀB_k s_k ≥ Δ_k‖s_k‖ − (1/2)Δ_k‖s_k‖ = (1/2)Δ_k‖s_k‖ ≥ (1/2)‖s_k‖ min{Δ_k, ‖s_k‖/M_B}. (17)
With the observation of these cases above, we can conclude that, for all k, the Cauchy point of (5) satisfies
m_k(0) − m_k(d_k^C) ≥ (1/2)‖s_k‖ min{Δ_k, ‖s_k‖/M_B}. (18)
Assume first that d_k is the exact solution of subproblem (5); then m_k(d_k) ≤ m_k(d_k^C), and in combination with inequality (18) above, we can write
m_k(0) − m_k(d_k) ≥ (1/2)‖s_k‖ min{Δ_k, ‖s_k‖/M_B}.
Further assume that d_k is an approximate solution; then there exists a constant β ∈ (0, 1] satisfying
m_k(0) − m_k(d_k) ≥ β(m_k(0) − m_k(d_k^C)) ≥ (β/2)‖s_k‖ min{Δ_k, ‖s_k‖/M_B}.
Taking the conditional expectation given x_k and using A3 together with Jensen’s inequality (the map t ↦ t min{Δ_k, t/M_B} is convex and increasing for t ≥ 0, and E[‖s_k‖ | x_k] ≥ ‖∇F(x_k)‖ by unbiasedness), we obtain
E[m_k(0) − m_k(d_k) | x_k] ≥ (β/2)‖∇F(x_k)‖ min{Δ_k, ‖∇F(x_k)‖/M_B}.
If we set c = β/2, then we have the claimed bound. Lemma 3 implies that a sufficient decrease of the model on average is guaranteed under assumptions A1, A3, and A4.
Lemma 4. If A1, A3, and A4 hold true, we have
E[|ared_k − pred_k| | x_k] ≤ φ(Δ_k),
where ared_k and pred_k are the actual and predicted reductions of Step 3, and φ(Δ_k) is a function of Δ_k that decreases as Δ_k decreases.
Lemma 5. Suppose the same assumptions as in Lemma 3 hold, and let ε > 0 be a small tolerance. Then there exist infinitely many k which satisfy ‖∇F(x_k)‖ < ε.
Proof. From the definition of F̂ and the properties of expectation, we can consider the conditional expectation E[F(x_{k+1}) | x_k]. Assume, to the contrary, that ‖∇F(x_k)‖ ≥ ε for all sufficiently large k, and that there exists a positive constant κ such that Δ_k ≥ κ. It follows from the assumptions and from Lemmas 3 and 4 that
E[F(x_k) − F(x_{k+1}) | x_k] ≥ ηcε min{κ, ε/M_B} − φ(Δ_k).
Since φ decreases with Δ_k, we can select Δ_k small enough to satisfy φ(Δ_k) < ηcε min{κ, ε/M_B}, and then
E[F(x_{k+1}) | x_k] < F(x_k).
Thus, E[F(x_k)] decreases by a fixed positive amount at every such iteration, which contradicts the fact that F is bounded below in Ω (A2). Hence, by the mechanism of Algorithm 1, there are infinitely many k which satisfy ‖∇F(x_k)‖ < ε.
Theorem 6. Considering the new algorithm defined above, suppose that A1–A4 hold; then the sequence of iterates generated by Algorithm 1 satisfies
liminf_{k→∞} ‖∇F(x_k)‖ = 0 (30)
with probability 1 over realizations of the random samples {θ̃_k}.
Proof. Here we use a contradiction with Lemma 5 to prove (30). For that purpose, assume that there exist ε > 0 and an infinite index set K such that for every k ∈ K we have ‖∇F(x_k)‖ ≥ ε. First assume that there are infinitely many successful iterations; then from Algorithm 1 we obtain the inequality
F̂(x_k, θ̃_k) − F̂(x_k + d_k, θ̃_k) ≥ η(m_k(0) − m_k(d_k)). (31)
Taking expectation given x_k on both sides of (31) and applying Lemma 3, we have
E[F(x_k) − F(x_{k+1}) | x_k] ≥ ηc‖∇F(x_k)‖ min{Δ_k, ‖∇F(x_k)‖/M_B}. (32)
Considering assumption A2, the function F is bounded below, which implies that the right side of (32) tends to zero as k approaches infinity; thus, we have
lim_{k→∞, k∈K} Δ_k = 0. (33)
But the result in (33) is in contradiction with Lemma 5. Now assume that there are only finitely many successful iterations; then the iterations are unsuccessful for all large k, the trust region radius is repeatedly reduced, i.e., Δ_k → 0, while the iterates remain fixed, and we again obtain a contradiction with Lemma 5. Hence, we obtain (30).
Remark 7. We note that the stochastic gradient s(x_k, θ̃_k) is an unbiased estimator of ∇F(x_k), i.e., E_θ̃[s(x_k, θ̃_k)] = ∇F(x_k). Thus, we can also obtain liminf_{k→∞} ‖E_θ̃[s(x_k, θ̃_k)]‖ = 0 from the result in Theorem 6.
4. Numerical Experiments
In this section, we apply the proposed stochastic trust region method to solve convex and nonconvex problems with stochastic objectives and also compare it with SGD in terms of convergence time and central processing unit (CPU) runtime. Both methods are tested on the following problems: Problems 1 and 2 are convex stochastic optimization problems, while Problem 3 is nonconvex.
4.1. Example 1: Standard Quadratic Function
We use a stochastic quadratic objective function as a test case. In particular, consider a positive definite diagonal matrix A ∈ Rⁿˣⁿ, a vector b ∈ Rⁿ, a random vector θ ∈ Rⁿ, and the diagonal matrix diag(θ) with the components of θ on its diagonal. The function f(x, θ) is defined as
f(x, θ) := (1/2) xᵀ(A + A diag(θ))x + bᵀx. (34)
In (34), the vector θ is chosen uniformly at random from the n-dimensional box Θ = [−θ₀, θ₀]ⁿ for some given constant θ₀ < 1. The linear term bᵀx is added so that the instantaneous functions have different minima which are (almost surely) different from the minimum of the average function F(x). The quadratic term is chosen so that the condition number of A is the condition number of F. Indeed, since E[θ] = 0, the average function in (34) can be written as F(x) = (1/2)xᵀAx + bᵀx. The parameter θ₀ controls the variability of the instantaneous functions f(x, θ). For small θ₀, the instantaneous functions are close to each other and to the average function; for large θ₀, the instantaneous functions vary over a large range. Observe that we can write the optimum argument as x* = −A⁻¹b for comparison against the iterates x_k. Further consider a given ε > 0 and study the convergence metric
N_ε := min{k : ‖x_k − x*‖ ≤ ε‖x*‖}, (35)
which represents the time needed to achieve a given relative distance to optimality. To study the effect of the problem’s condition number, we generate instances of (34) by choosing b uniformly at random from the box [0, 1]ⁿ and the matrix A as diagonal with elements a_ii uniformly drawn from the discrete set {1, 10⁻¹, ..., 10⁻ᶻ}. This choice of a_ii yields problems with condition number 10ᶻ. Representative runs of the stochastic trust region method and SGD are shown in Figure 1. For both the stochastic trust region method and the SGD run, the stochastic gradients in (4) are computed as an average of L realizations. Moreover, the parameters of our method are set to account for the randomness of the instantaneous functions and the large condition number.
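The construction in (34) can be instantiated as follows; the sizes n = 10, ξ = 2, θ₀ = 0.5 are hypothetical choices for illustration, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, xi, theta0 = 10, 2, 0.5  # hypothetical sizes, not the paper's

# Diagonal A with entries from {1, 1e-1, ..., 10**-xi}: condition number up to 10**xi.
A = np.diag(10.0 ** (-rng.integers(0, xi + 1, size=n)))
b = rng.uniform(0.0, 1.0, size=n)  # b drawn uniformly from [0, 1]^n

def grad_sample(x, theta):
    """Gradient of f(x, theta) = 0.5 x^T (A + A diag(theta)) x + b^T x."""
    return (A + A @ np.diag(theta)) @ x + b

def stochastic_grad(x, L=50):
    """Average of L sampled gradients, theta ~ Uniform[-theta0, theta0]^n, cf. (4)."""
    thetas = rng.uniform(-theta0, theta0, size=(L, n))
    return np.mean([grad_sample(x, t) for t in thetas], axis=0)

# Since E[theta] = 0, the average function is F(x) = 0.5 x^T A x + b^T x,
# with closed-form optimum x* = -A^{-1} b for comparison against iterates.
x_star = -b / np.diag(A)
```

At the optimum x* the stochastic gradient is zero only in expectation; its sample average shrinks as L grows, which is the variability controlled by θ₀.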
As expected for a problem with large condition number 10ᶻ, the stochastic trust region method is much faster than SGD. The distance to optimality that SGD attains after processing a given number of random functions is reached by the stochastic trust region method after far fewer iterations; conversely, upon processing the same number of random functions, our method achieves a much higher accuracy.
4.2. Example 2: Extended Powell Singular Function
The Powell singular function (PSF) is also known as the Powell quartic function. Its Hessian matrix at the minimizer is doubly singular; thus, PSF is a severe test problem. To analyse the numerical performance of our algorithm, for a random vector θ and a variable x ∈ Rⁿ with n a multiple of 4, we develop a stochastic Extended PSF as the test function, where θ is selected uniformly at random from the box [−θ₀, θ₀]ⁿ with θ₀ < 1. Indeed, observe that since E[θ] = 0, the average function can be written as the deterministic Extended PSF
F(x) = Σ_{i=1}^{n/4} [(x_{4i−3} + 10x_{4i−2})² + 5(x_{4i−1} − x_{4i})² + (x_{4i−2} − 2x_{4i−1})⁴ + 10(x_{4i−3} − x_{4i})⁴]. (36)
The standard starting point is x₀ = (3, −1, 0, 1, ..., 3, −1, 0, 1), and the Hessian matrix at the standard starting point is nonsingular. The average function in (36) is convex, and the unique unconstrained minimizer is x* = 0 with F(x*) = 0. In Figure 2, since the minimizer is x* = 0, we plot the variation of the distance to optimality ‖x_k − x*‖, instead of the relative distance, against the number of iterations.
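The deterministic Extended PSF average function and its standard starting point can be sketched as follows (group structure per the usual test-set definition; the stochastic perturbation is omitted here).

```python
import numpy as np

def extended_psf(x):
    """Deterministic Extended Powell Singular Function (n divisible by 4)."""
    x = np.asarray(x, dtype=float).reshape(-1, 4)  # one row per 4-variable group
    return float(np.sum(
        (x[:, 0] + 10.0 * x[:, 1]) ** 2
        + 5.0 * (x[:, 2] - x[:, 3]) ** 2
        + (x[:, 1] - 2.0 * x[:, 2]) ** 4
        + 10.0 * (x[:, 0] - x[:, 3]) ** 4))

def psf_start(n):
    """Standard starting point: (3, -1, 0, 1) repeated n/4 times."""
    return np.tile([3.0, -1.0, 0.0, 1.0], n // 4)
```

Each group of four variables contributes four residual terms; the two quartic residuals are what make the Hessian doubly singular at the minimizer x* = 0.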
As the number of iterations increases, we observe that SGD converges slowly to optimality, while our method needs far fewer iterations to achieve the same accuracy and continues to reduce the distance ‖x_k − x*‖ thereafter. The results indicate that our method attains adequate accuracy and a faster convergence rate than SGD on severe convex problems.
4.3. Example 3: Extended Rosenbrock Function
In mathematical optimization, the Rosenbrock function, introduced by Howard H. Rosenbrock in 1960, is frequently used as a performance test problem for nonconvex optimization algorithms. The global minimum lies in a narrow, long, parabolic, flat valley. It is trivial to find the valley; convergence to the global minimum, however, is difficult. Here we use a stochastic version of the Extended Rosenbrock function as a test case. Specifically, consider a random vector θ and a variable x ∈ Rⁿ with n even, and let the instantaneous function f(x, θ) be the Extended Rosenbrock function perturbed by θ, where θ is chosen uniformly at random from the box [−θ₀, θ₀]ⁿ. In fact, from the observation E[θ] = 0, we can write the average function as
F(x) = Σ_{i=1}^{n/2} [100(x_{2i} − x_{2i−1}²)² + (1 − x_{2i−1})²]. (39)
The function in (39) has the minimizer x* = (1, ..., 1) with F(x*) = 0. The relative distances ‖x_k − x*‖/‖x*‖ after runs of the stochastic trust region method and SGD are provided in Figure 3. The stochastic trust region method achieves a given accuracy after many fewer iterations than SGD.
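The Extended Rosenbrock average function, together with a hypothetical stochastic perturbation (the paper's exact stochastic form is not reproduced here), can be sketched as:

```python
import numpy as np

def extended_rosenbrock(x):
    """Deterministic Extended Rosenbrock (n even); minimum 0 at x = (1, ..., 1)."""
    x = np.asarray(x, dtype=float)
    odd, even = x[0::2], x[1::2]
    return float(np.sum(100.0 * (even - odd**2) ** 2 + (1.0 - odd) ** 2))

def stochastic_rosenbrock(x, theta):
    """Hypothetical stochastic variant: theta perturbs the two residual families.
    With E[theta] = 0 the average function equals extended_rosenbrock up to an
    additive constant, so it has the same minimizer (1, ..., 1)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(theta, dtype=float)
    odd, even = x[0::2], x[1::2]
    return float(np.sum(100.0 * (even - odd**2 + t[0::2]) ** 2
                        + (1.0 - odd + t[1::2]) ** 2))
```

Because the perturbation enters the residuals linearly, taking expectations only adds a constant proportional to the variance of θ, leaving the valley geometry and the minimizer unchanged.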
Since each iteration of the stochastic trust region method is more complex than an SGD iteration, we also compare the performances in terms of central processing unit (CPU) runtime required to achieve a given accuracy. We report the average runtimes of the stochastic trust region method and SGD on the problems above in Table 1; the parameter values are the same as above. In the table, the stochastic trust region method is abbreviated as STR. As we can see, our method enjoys a significant improvement in CPU runtime over SGD for both convex and nonconvex problems.
5. Conclusions
In this paper, we propose a stochastic trust region method and show that it is convergent for unconstrained minimization problems with stochastic objectives. Based on the trust region framework and the regularized stochastic BFGS update, our method can deal with convex optimization problems with ill-conditioned objective functions as well as with nonconvex optimization problems. Numerical results illustrate that the method efficiently solves the given test problems. Therefore, the new method is potentially efficient and paves the way towards developing concrete algorithms for specific applications.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors would like to thank the editor and the reviewers for their very careful reading and constructive comments, which led to great improvement of the paper. This work was supported by the National Natural Science Foundation of China [Grant nos. 11601252 and 11571178].
References
A. Mokhtari and A. Ribeiro, “A quasi-Newton method for large scale support vector machines,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), pp. 8302–8306, Italy, May 2014.
A. Mokhtari and A. Ribeiro, “A dual stochastic DFP algorithm for optimal resource allocation in wireless systems,” in Proceedings of the 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC 2013), pp. 21–25, Darmstadt, Germany, June 2013.
S. Shalev-Shwartz and N. Srebro, “SVM optimization: inverse dependence on training set size,” in Proceedings of the 25th International Conference on Machine Learning, pp. 928–935, Finland, July 2008.
L. Zhang, M. Mahdavi, and R. Jin, “Linear convergence with condition number independent access of full gradients,” Advances in Neural Information Processing Systems, pp. 980–988, 2013.
J. R. Birge, X. Chen, L. Qi, and Z. Wei, “A stochastic Newton method for stochastic quadratic programs with recourse,” Tech. Rep., University of Michigan, Ann Arbor, Mich, USA, 1995.
W. Sun and Y. Yuan, Optimization Theory and Methods: Nonlinear Programming, Springer, New York, NY, USA, 2006.
J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, NY, USA, 1999.