Special Issue: Applications of Methods of Numerical Linear Algebra in Engineering
A Modified Conjugacy Condition and Related Nonlinear Conjugate Gradient Method
The conjugate gradient (CG) method has played a special role in solving large-scale nonlinear optimization problems due to the simplicity of its iterations and its very low memory requirements. In this paper, we propose a new conjugacy condition similar to that of Dai and Liao (2001). Based on this condition, a related nonlinear conjugate gradient method is given. Under some mild conditions, the given method is globally convergent with the strong Wolfe-Powell line search for general functions. Numerical experiments show that the proposed method is robust and efficient.
The conjugate gradient (CG) method has played a special role in solving large-scale nonlinear optimization problems due to the simplicity of its iterations and its very low memory requirements. In fact, the CG method is not among the fastest or most robust optimization methods for nonlinear problems available today, but it remains very popular with engineers and mathematicians who are interested in solving large-scale problems. The conjugate gradient method is designed to solve the following unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^n} f(x), \tag{1}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth, nonlinear function whose gradient will be denoted by $g(x)$. The iterative formula of the conjugate gradient method is given by
$$x_{k+1} = x_k + \alpha_k d_k, \tag{2}$$
where $\alpha_k$ is a step-length computed by carrying out a line search, and $d_k$ is the search direction defined by
$$d_k = \begin{cases} -g_k, & k = 1, \\ -g_k + \beta_k d_{k-1}, & k \ge 2, \end{cases} \tag{3}$$
where $\beta_k$ is a scalar and $g_k$ denotes the gradient $\nabla f(x_k)$. The different conjugate gradient methods correspond to different ways of computing $\beta_k$. If $f$ is a strictly convex quadratic function, namely,
$$f(x) = \tfrac{1}{2} x^T H x + b^T x, \tag{4}$$
where $H$ is a positive definite matrix, and if $\alpha_k$ is the exact one-dimensional minimizer along the direction $d_k$, then the method with (2) and (3) is called the linear conjugate gradient method. Otherwise, it is called a nonlinear conjugate gradient method. The most important feature of the linear conjugate gradient method is that the search directions satisfy the following conjugacy condition:
$$d_i^T H d_j = 0, \quad i \ne j. \tag{5}$$
For nonlinear conjugate gradient methods, (5) does not hold, since the Hessian changes at different iterations.
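The conjugacy property of the linear method can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code: the matrix `H`, the vector `b`, and the helper names are illustrative assumptions, and the scalar in the direction update is taken to be the Fletcher-Reeves ratio of squared gradient norms, one of the classical choices. It runs the iteration with exact step lengths on a small strictly convex quadratic and records the products of distinct directions through `H`.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(H, v):
    return [dot(row, v) for row in H]


def linear_cg(H, b, x):
    """Minimize 0.5*x^T H x + b^T x for symmetric positive definite H.

    Returns the final iterate and the list of search directions used.
    """
    g = [hx + bi for hx, bi in zip(matvec(H, x), b)]  # gradient H x + b
    d = [-gi for gi in g]                             # first direction
    directions = []
    for _ in range(len(b)):
        if dot(g, g) == 0.0:
            break
        Hd = matvec(H, d)
        alpha = -dot(g, d) / dot(d, Hd)               # exact minimizer along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [gi + alpha * hdi for gi, hdi in zip(g, Hd)]
        beta = dot(g_new, g_new) / dot(g, g)          # Fletcher-Reeves ratio
        directions.append(d)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x, directions


# Illustrative data (assumed, not from the paper).
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [-1.0, -2.0, -3.0]
x_min, dirs = linear_cg(H, b, [0.0, 0.0, 0.0])
# Products d_i^T H d_j for i != j; these should vanish up to rounding error.
off_diag = [dot(dirs[i], matvec(H, dirs[j]))
            for i in range(len(dirs)) for j in range(i)]
```

For a general nonlinear function the same loop no longer yields conjugate directions, because the matrix playing the role of `H` changes from one iterate to the next.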
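Some well-known formulae for $\beta_k$ are the Fletcher-Reeves (FR), Polak-Ribière (PR), Hestenes-Stiefel (HS), and Dai-Yuan (DY) formulae, which are given, respectively, by
$$\beta_k^{FR} = \frac{\|g_k\|^2}{\|g_{k-1}\|^2}, \quad \beta_k^{PR} = \frac{g_k^T (g_k - g_{k-1})}{\|g_{k-1}\|^2}, \quad \beta_k^{HS} = \frac{g_k^T (g_k - g_{k-1})}{d_{k-1}^T (g_k - g_{k-1})}, \quad \beta_k^{DY} = \frac{\|g_k\|^2}{d_{k-1}^T (g_k - g_{k-1})}, \tag{6}$$
where $\|\cdot\|$ denotes the Euclidean norm. The corresponding conjugate gradient methods are abbreviated as the FR, PR, HS, and DY methods. In the past two decades, the convergence properties of these methods have been intensively studied by many researchers (e.g., [1–9]). Although all these methods are equivalent in the linear case, namely, when $f$ is a strictly convex quadratic function and the $\alpha_k$ are determined by exact line search, their behaviors for general objective functions may be far different.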
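The four classical parameters differ only in which inner products form the numerator and the denominator. The sketch below, with hypothetical function and argument names, computes all four from the current gradient `g`, the previous gradient `g_prev`, and the previous direction `d_prev`, using the standard textbook definitions of the FR, PR, HS, and DY formulae.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def classic_betas(g, g_prev, d_prev):
    """Return the FR, PR, HS, and DY parameters as a dictionary."""
    y = [a - b for a, b in zip(g, g_prev)]   # gradient change g_k - g_{k-1}
    gg = dot(g, g)                           # ||g_k||^2
    gg_prev = dot(g_prev, g_prev)            # ||g_{k-1}||^2
    gy = dot(g, y)                           # g_k^T y_{k-1}
    dy = dot(d_prev, y)                      # d_{k-1}^T y_{k-1}
    return {
        "FR": gg / gg_prev,
        "PR": gy / gg_prev,
        "HS": gy / dy,
        "DY": gg / dy,
    }


# A general step: the four values differ.
general = classic_betas([1.0, 2.0], [3.0, -1.0], [-3.0, 1.0])

# A step with g^T d_prev = 0 and g^T g_prev = 0, as produced by an exact
# line search on a quadratic: the four values coincide.
exact = classic_betas([1.0, 3.0], [3.0, -1.0], [-3.0, 1.0])
```

The second call illustrates the equivalence in the linear case: once the new gradient is orthogonal to both the previous direction and the previous gradient, all four quotients reduce to the same number.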
For general functions, Zoutendijk proved the global convergence of the FR method with exact line search. (Here and throughout this paper, by global convergence we mean that the sequence generated by the corresponding method either terminates after finitely many steps or contains a subsequence that converges to a stationary point of the objective function from a given initial point.) Although its global convergence properties are satisfactory, the FR method performs much worse than the PR and HS methods in real computations. Powell analyzed a major numerical drawback of the FR method; namely, if a small step is generated away from the solution point, the subsequent steps may also be very short. On the other hand, in practical computation, the HS method resembles the PR method, and both are generally believed to be the most efficient conjugate gradient methods, since they essentially perform a restart if a bad direction occurs. However, Powell constructed a counterexample showing that the PR and HS methods without restarts can cycle infinitely without approaching the solution. This example suggests that these two methods are not globally convergent for general functions. Therefore, over the past few years, much effort has been devoted to finding new formulae for conjugate gradient methods that are not only globally convergent for general functions but also robust and efficient in numerical performance.
Recently, using a new conjugacy condition, Dai and Liao proposed two new methods. Interestingly, one of their methods is not only globally convergent for general functions but also performs better than the HS and PR methods. In this paper, similar to Dai and Liao's approach, we propose a new conjugacy condition. Based on the proposed condition, a new formula for computing $\beta_k$ is given. We then analyze the convergence properties of the given method and carry out numerical experiments, which show that the given method is robust and efficient.
The remainder of this paper is organized as follows. In Section 2, after a short description of Dai and Liao's conjugacy condition and related methods, the motivations of this paper are presented. According to these motivations, we propose a new conjugacy condition and the related method at the end of Section 2. In Section 3, the convergence analysis for the given method is presented. In Section 4, we perform numerical experiments on a set of large-scale test problems and compare the results with those of some existing methods.
2. Motivations, New Conjugacy Condition, and Related Method
2.1. Dai-Liao’s Methods
It is well known that the linear conjugate gradient methods generate a sequence of search directions $\{d_k\}$ such that conjugacy condition (5) holds. Denote by $y_{k-1}$ the gradient change, which means that
$$y_{k-1} = g_k - g_{k-1}. \tag{7}$$
For a general nonlinear function $f$, we know by the mean value theorem that there exists some $\theta \in (0, 1)$ such that
$$d_k^T y_{k-1} = \alpha_{k-1} d_k^T \nabla^2 f(x_{k-1} + \theta \alpha_{k-1} d_{k-1}) d_{k-1}. \tag{8}$$
Therefore, it is reasonable to replace (5) with the following conjugacy condition:
$$d_k^T y_{k-1} = 0. \tag{9}$$
Recently, an extension of (9) has been studied by Dai and Liao. Their approach is based on quasi-Newton techniques. Recall that, in the quasi-Newton method, an approximation matrix $B_{k-1}$ of the Hessian $\nabla^2 f(x_{k-1})$ is updated such that the new matrix $B_k$ satisfies the following quasi-Newton equation:
$$B_k s_{k-1} = y_{k-1}, \tag{10}$$
where $s_{k-1} = x_k - x_{k-1} = \alpha_{k-1} d_{k-1}$. The search direction $d_k$ in the quasi-Newton method is calculated by
$$d_k = -B_k^{-1} g_k. \tag{11}$$
Combining these two equations, we obtain
$$d_k^T y_{k-1} = d_k^T (B_k s_{k-1}) = -g_k^T s_{k-1}. \tag{12}$$
The above relation implies that (9) holds if the line search is exact, since in this case $g_k^T s_{k-1} = 0$. However, practical numerical algorithms normally adopt inexact line searches instead of exact line searches. For this reason, it seems more reasonable to replace conjugacy condition (9) with the condition
$$d_k^T y_{k-1} = -t g_k^T s_{k-1}, \tag{13}$$
where $t \ge 0$ is a scalar.
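The step that combines the quasi-Newton equation with the quasi-Newton direction can be written out explicitly. The lines below are a sketch under assumed notation: $B_k$ stands for the updated Hessian approximation, taken to be symmetric and nonsingular, $s_{k-1} = x_k - x_{k-1}$ is the step, and $y_{k-1} = g_k - g_{k-1}$ is the gradient change. These symbols are chosen here for illustration and need not match the authors' own.

```latex
\begin{aligned}
d_k^{T} y_{k-1}
  &= \left(-B_k^{-1} g_k\right)^{T} \left(B_k s_{k-1}\right)
     && \text{(direction and quasi-Newton equation)} \\
  &= -g_k^{T} B_k^{-T} B_k\, s_{k-1} \\
  &= -g_k^{T} s_{k-1}
     && \text{(symmetry of } B_k\text{)}.
\end{aligned}
```

With an exact line search the new gradient is orthogonal to the previous step, so the right-hand side vanishes and the plain conjugacy condition is recovered.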
To ensure that the search direction $d_k$ satisfies conjugacy condition (13), one only needs to multiply (3) by $y_{k-1}^T$ and use (13), yielding
$$\beta_k^{DL} = \frac{g_k^T (y_{k-1} - t s_{k-1})}{d_{k-1}^T y_{k-1}}. \tag{14}$$
It is obvious that
$$\beta_k^{DL} = \beta_k^{HS} - t \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}. \tag{15}$$
For simplicity, we call the method with (2), (3), and (14) the DL method. Dai and Liao also proved that the conjugate gradient method with $\beta_k^{DL}$ is globally convergent for uniformly convex functions. For general functions, Powell constructed an example showing that the PR method may cycle without approaching any solution point if the step-length $\alpha_k$ is chosen to be the first local minimizer along $d_k$. Since the DL method reduces to the PR method in the case that $g_k^T d_{k-1} = 0$ holds, this implies that the method with (14) need not converge for general functions. To get global convergence, Dai and Liao made a restriction on $\beta_k^{HS}$ as follows:
$$\beta_k^{HS+} = \max\left\{ \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, 0 \right\}. \tag{16}$$
Dai and Liao replaced (14) by
$$\beta_k^{DL+} = \max\left\{ \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, 0 \right\} - t \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}. \tag{17}$$
We call the method with (2), (3), and (17) the DL+ method; Dai and Liao showed that the DL+ method is globally convergent for general functions under sufficient descent condition (31) and some suitable conditions. Besides, some numerical experiments indicate the efficiency of this method.
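The two Dai-Liao parameters can be sketched in a few lines. The code below assumes the published Dai-Liao formulae: the unrestricted parameter is the HS quotient minus `t` times the quotient of `g^T s_prev` by `d_prev^T y`, and the restricted version truncates only the HS quotient at zero. Function and argument names are hypothetical. The example also checks numerically that the direction built from the unrestricted parameter satisfies the conjugacy condition `d^T y = -t * g^T s_prev`.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beta_dl(g, g_prev, d_prev, s_prev, t):
    """Dai-Liao parameter: g^T (y - t*s_prev) / (d_prev^T y)."""
    y = [a - b for a, b in zip(g, g_prev)]
    return (dot(g, y) - t * dot(g, s_prev)) / dot(d_prev, y)


def beta_dl_plus(g, g_prev, d_prev, s_prev, t):
    """Restricted variant: max{HS, 0} - t * g^T s_prev / (d_prev^T y)."""
    y = [a - b for a, b in zip(g, g_prev)]
    dy = dot(d_prev, y)
    return max(dot(g, y) / dy, 0.0) - t * dot(g, s_prev) / dy


# Illustrative data (assumed): previous step s_prev = 0.5 * d_prev.
g, g_prev = [2.0, -1.0], [3.0, -1.0]
d_prev, s_prev, t = [-3.0, 1.0], [-1.5, 0.5], 0.1

b_dl = beta_dl(g, g_prev, d_prev, s_prev, t)
b_dl_plus = beta_dl_plus(g, g_prev, d_prev, s_prev, t)

# Direction from the unrestricted parameter and both sides of the condition.
y = [a - b for a, b in zip(g, g_prev)]
d = [-gi + b_dl * di for gi, di in zip(g, d_prev)]
lhs = dot(d, y)
rhs = -t * dot(g, s_prev)
```

In this example the HS quotient is negative, so the two parameters differ: the restricted one keeps only the second term, which is why it can still be negative in general.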
Similar to Dai and Liao's approach, Li et al. proposed another conjugacy condition and related conjugate gradient methods. They also proved that the proposed methods are globally convergent under some assumptions.
Recently, based on a modified secant condition given by Zhang et al. , Yabe and Takano  derive an update parameter and show that the YT+ scheme is globally convergent under some conditions: where is a constant
From the above discussions, Dai and Liao's approach is effective; the main reason is that the search directions generated by the DL method or the DL+ method contain not only gradient information but also some Hessian information. From (15) and (17), $\beta_k^{DL}$ and $\beta_k^{DL+}$ are formed by two parts; the first part is $\beta_k^{HS}$ and the second part is $-t\, g_k^T s_{k-1} / d_{k-1}^T y_{k-1}$. So we also regard the DL and DL+ methods as modified forms of the HS method, obtained by adding some Hessian information, which is contained in the second part.
From the structure of (17), we know that the parameter $\beta_k^{DL+}$ may be negative, since the second part may be less than zero. In conjugate gradient methods, if $\beta_k < 0$ and $|\beta_k|$ is large, then the generated directions $d_k$ and $d_{k-1}$ may tend to be opposite. Methods of this type are susceptible to jamming.
On the other hand, in conjugate gradient methods, the following strong Wolfe-Powell line search is often used to determine the step size $\alpha_k$:
$$f(x_k + \alpha_k d_k) \le f(x_k) + \delta \alpha_k g_k^T d_k, \tag{21}$$
$$|g(x_k + \alpha_k d_k)^T d_k| \le -\sigma g_k^T d_k, \tag{22}$$
where $0 < \delta < \sigma < 1$; a typical choice of $\sigma$ is $\sigma = 0.1$. From the structure of (17), we know that $\beta_k^{DL+}$ depends on the directional derivative $g_k^T d_{k-1}$, which is determined by the line search. For the PRP+ algorithm with the strong Wolfe-Powell line search, in order to make sufficient descent condition (31) hold, people often use the strategies of Lemarechal, Fletcher, or Moré and Thuente to make the directional derivative $g_k^T d_{k-1}$ sufficiently small. Under this strategy, the second part of $\beta_k^{DL+}$ will tend to vanish. This means that the DL method is strongly line-search-dependent.
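Whether a trial step satisfies the strong Wolfe-Powell conditions can be tested directly. The sketch below uses the standard form of the two conditions: a sufficient decrease test with parameter `delta` and a bound on the absolute value of the new directional derivative with parameter `sigma`. The function name and the default values `delta=1e-4` and `sigma=0.1` are assumptions made for illustration, not settings taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def satisfies_strong_wolfe(f, grad, x, d, alpha, delta=1e-4, sigma=0.1):
    """Check both strong Wolfe-Powell conditions for the step alpha along d."""
    slope0 = dot(grad(x), d)                  # g_k^T d_k, negative for descent
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    decrease_ok = f(x_new) <= f(x) + delta * alpha * slope0
    curvature_ok = abs(dot(grad(x_new), d)) <= -sigma * slope0
    return decrease_ok and curvature_ok


# Illustrative problem (assumed): f(x) = ||x||^2 with a steepest descent step.
def f(x):
    return dot(x, x)


def grad(x):
    return [2.0 * xi for xi in x]


x0, d0 = [1.0, 0.0], [-2.0, 0.0]
at_minimizer = satisfies_strong_wolfe(f, grad, x0, d0, 0.5)   # exact minimizer
too_long = satisfies_strong_wolfe(f, grad, x0, d0, 1.0)       # no decrease
too_short = satisfies_strong_wolfe(f, grad, x0, d0, 0.01)     # slope still steep
```

A smaller `sigma` forces the accepted step closer to a one-dimensional stationary point, which is the mechanism by which the line search drives the directional derivative toward zero.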
The above discussions motivate us to propose a modified conjugacy condition and the related conjugate gradient method, which should possess the following properties:
(1) Nonnegative property $\beta_k \ge 0$.
(2) The new formula contains not only the gradient information but also some Hessian information.
(3) The formula should be less line-search-dependent.
2.3. The Modified Conjugacy Condition and Related Method
From the above discussion, it seems reasonable to replace conjugacy condition (13) with the following modified conjugacy condition:
To ensure that the search direction $d_k$ satisfies condition (23), one only needs to multiply (3) by $y_{k-1}^T$ and use (23), yielding It is obvious that For simplicity, we call the method with (2), (3), and (25) the MDL method. Similar to Gilbert and Nocedal's approach, we propose the following restricted parameter $\beta_k^{MDL+}$: We call the method with (2), (3), and (26) the MDL+ method and give the nonlinear conjugate gradient algorithm below.
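Algorithm 1 (MDL+). Step 1. Given $x_1 \in \mathbb{R}^n$ and $\varepsilon \ge 0$, set $d_1 = -g_1$ and $k = 1$; if $\|g_1\| \le \varepsilon$, then stop.
Step 2. Compute $\alpha_k$ by the strong Wolfe-Powell line search.
Step 3. Let $x_{k+1} = x_k + \alpha_k d_k$ and $g_{k+1} = g(x_{k+1})$; if $\|g_{k+1}\| \le \varepsilon$, then stop.
Step 4. Compute $\beta_{k+1}$ by (26) and generate $d_{k+1}$ by (3).
Step 5. Set $k = k + 1$ and go to Step 2.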
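The five steps above can be sketched as a generic loop in which the line search and the update parameter are supplied as functions. This is an illustrative skeleton, not the authors' Fortran code. Formula (26) is not reproduced here, so the stand-in rule `beta_hs_plus` (the HS quotient truncated at zero) is used in its place, and the usage example replaces the strong Wolfe-Powell search by the exact step on a convex quadratic, which satisfies those conditions whenever the sufficient decrease parameter is at most 1/2. All names, the test matrix `A`, and the vector `b` are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beta_hs_plus(g_new, g, d):
    """Stand-in rule max{beta_HS, 0}; NOT the paper's formula (26)."""
    y = [a - b for a, b in zip(g_new, g)]
    return max(dot(g_new, y) / dot(d, y), 0.0)


def nonlinear_cg(grad, line_search, beta_rule, x, eps=1e-8, max_iter=1000):
    """Generic loop following Steps 1-5; returns (x, iterations used)."""
    g = grad(x)
    d = [-gi for gi in g]                               # Step 1
    if math.sqrt(dot(g, g)) <= eps:
        return x, 0
    for k in range(1, max_iter + 1):
        alpha = line_search(x, d)                       # Step 2
        x = [xi + alpha * di for xi, di in zip(x, d)]   # Step 3
        g_new = grad(x)
        if math.sqrt(dot(g_new, g_new)) <= eps:
            return x, k
        beta = beta_rule(g_new, g, d)                   # Step 4
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new                                       # Step 5
    return x, max_iter


# Usage on f(x) = 0.5*x^T A x - b^T x (illustrative data).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def grad(x):
    return [dot(row, x) - bi for row, bi in zip(A, b)]


def exact_step(x, d):
    Ad = [dot(row, d) for row in A]
    return -dot(grad(x), d) / dot(d, Ad)


x_star, iters = nonlinear_cg(grad, exact_step, beta_hs_plus, [0.0, 0.0])
```

Keeping the parameter rule as an argument makes it straightforward to compare several update formulae on the same problems with an identical line search, which is the kind of comparison reported in the numerical section.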
3. Convergence Analysis
In the convergence analysis of conjugate gradient methods, we often make the following basic assumptions on the objective functions.
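Assumption A. (i) The level set $\Omega = \{x \in \mathbb{R}^n : f(x) \le f(x_1)\}$ is bounded; namely, there exists a constant $B > 0$ such that
$$\|x\| \le B \quad \text{for all } x \in \Omega. \tag{27}$$
(ii) In some neighborhood $N$ of $\Omega$, $f$ is continuously differentiable, and its gradient is Lipschitz continuous; namely, there exists a constant $L > 0$ such that
$$\|g(x) - g(y)\| \le L \|x - y\| \quad \text{for all } x, y \in N. \tag{28}$$
Under the above assumptions on $f$, there exists a constant $\bar{\gamma} \ge 0$ such that
$$\|g(x)\| \le \bar{\gamma} \quad \text{for all } x \in \Omega. \tag{29}$$
We say the descent condition holds if, for each search direction $d_k$,
$$g_k^T d_k < 0. \tag{30}$$
In addition, we say the sufficient descent condition holds if there exists a constant $c > 0$ such that, for each search direction $d_k$, we have
$$g_k^T d_k \le -c \|g_k\|^2. \tag{31}$$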
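Lemma 2. Suppose that Assumption A holds. Consider any conjugate gradient method in the form (2)-(3), where $d_k$ is a descent direction and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. If
$$\sum_{k \ge 1} \frac{1}{\|d_k\|^2} = \infty, \tag{32}$$
we have that
$$\liminf_{k \to \infty} \|g_k\| = 0. \tag{33}$$
Theorem 3. Suppose that Assumption A holds. Consider the MDL method, where $d_k$ is a descent direction and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. If the objective function is uniformly convex, namely, there exists a constant $\mu > 0$ such that
$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge \mu \|x - y\|^2 \tag{34}$$
for all $x$, $y$ in the neighborhood from Assumption A (ii), we have that
$$\lim_{k \to \infty} \|g_k\| = 0. \tag{35}$$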
Proof. It follows from (34) that By (3), (24), (28), (29), and (36), we have that which implies the truth of (32). Therefore, by Lemma 2, we have (33), which is equivalent to (35) for uniformly convex functions. The proof is completed.
Proof. By SWP condition (22), we have and . So we have The proof is completed.
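Theorem 5. Suppose that Assumption A holds. Consider the MDL+ method, where $d_k$ is a sufficient descent direction and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. If there exists a constant $\gamma > 0$ such that
$$\|g_k\| \ge \gamma \quad \text{for all } k \ge 1, \tag{40}$$
then $d_k \ne 0$ and
$$\sum_{k \ge 2} \|u_k - u_{k-1}\|^2 < \infty, \tag{41}$$
where $u_k = d_k / \|d_k\|$.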
Proof. First, note that ; otherwise (31) is false. Therefore is well defined. In addition, by relation (40) and Lemma 2, we have that
Now, we divide formula into two parts as follows:
Then by (3) we have, for all , Using the identity and (45), we can obtain Using the condition , the triangle inequality, and (48), we obtain On the other hand, line search condition (22) gives Equations (22), (31), and (50) imply that It follows from the definition of , (27), (29), and (51) that So we have and the proof is completed.
Gilbert and Nocedal introduced property (*), which is very important for the convergence analysis of conjugate gradient methods. In fact, with Assumption A, (40), and (50), if (31) holds with some constant $c > 0$, the method with $\beta_k^{MDL+}$ possesses property (*).
Property (*). Consider a method of the form (2) and (3). Suppose that
$$0 < \gamma \le \|g_k\| \le \bar{\gamma}.$$
We say that the method has property (*) if there exist constants $b > 1$ and $\lambda > 0$ such that, for all $k$, $|\beta_k| \le b$, and if $\|s_{k-1}\| \le \lambda$, we have $|\beta_k| \le 1/(2b)$.
In fact, by (31), (40), and (50), we have Combining (55) with (27) and (28) and (29), we obtain Note that can be defined such that . Therefore we can say that . As a result, we define and we get from the first inequality in (56) that if , then
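Let $\mathbb{N}^*$ denote the set of positive integers. For $\lambda > 0$ and a positive integer $\Delta$, denote
$$K_{k,\Delta}^{\lambda} := \{ i \in \mathbb{N}^* : k \le i \le k + \Delta - 1, \ \|s_{i-1}\| > \lambda \}.$$
Let $|K_{k,\Delta}^{\lambda}|$ denote the number of elements in $K_{k,\Delta}^{\lambda}$. From the above property (*), we can prove the following theorem.
Theorem 6. Suppose that Assumption A holds. Consider the MDL+ method, where $d_k$ satisfies condition (31) with $c > 0$, and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. Then if (40) holds, there exists $\lambda > 0$ such that, for any $\Delta \in \mathbb{N}^*$ and any index $k_0$, there is an index $k \ge k_0$ such that
$$|K_{k,\Delta}^{\lambda}| > \frac{\Delta}{2}.$$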
The proof of this theorem is similar to the proof of Lemma 3.5 in . So, we omit the proof.
According to the above lemmas and theorems, we can prove the following convergence result for the MDL+ method.
Theorem 7. Suppose that Assumption A holds. Consider the MDL+ method, where $d_k$ satisfies condition (31) with $c > 0$, and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. Then we have $\liminf_{k \to \infty} \|g_k\| = 0$.
Proof. We proceed by contradiction. If , then (40) must hold. Then the conditions of Theorem 6 hold. Defining , we have, for any indices , , with ,
Consider ; (27) and (61) give that
Let be given by Theorem 6 and define to be the smallest integer not less than . By Theorem 6, we can find an index such that With this and , Theorem 6 gives an index such that For any index , by Cauchy-Schwarz, the geometric inequalities, and (63), From relations (64) and (65), by taking in (62), we get Thus , which contradicts the definition of . The proof is completed.
4. Numerical Results
In this section, we report the performance of Algorithm 1 (MDL+) on a set of test problems. The codes were written in Fortran 77 in double precision arithmetic. All the tests were performed on the same PC (Intel Core i3 CPU M370 @ 2.4 GHz, 2 GB RAM). The experiments were performed on a set of 73 nonlinear unconstrained problems collected by Neculai Andrei. Some of the problems are from the CUTE library. For each test problem, we performed 10 numerical experiments with the number of variables $n = 1000, 2000, \ldots, 10000$.
In order to assess the reliability of the MDL+ algorithm, we also tested this method against the DL method and the HS method using the same problems. All these algorithms terminate when . We also force the routines to stop if the iterations exceed 1000 or the number of function evaluations reaches 2000. The parameters $\delta$ and $\sigma$ in the Wolfe-Powell line search conditions (21) and (22) are set to be and respectively. For DL method, , which is the same as in . We also test the MDL+ algorithm with different parameters to see that is the best choice.
The data compared comprise the iterations, the function and gradient evaluations, and the CPU time. To assess the performance of the MDL+, HS, and DL methods approximately, we use the performance profile of Dolan and Moré as an evaluation tool.
Dolan and Moré gave a new tool to analyze the efficiency of algorithms. They introduced the notion of a performance profile as a means to evaluate and compare the performance of a set of solvers $S$ on a test set $P$. Assuming that there are $n_s$ solvers and $n_p$ problems, for each problem $p$ and solver $s$, they defined $t_{p,s}$ as the cost (iterations, function and gradient evaluations, or CPU time) required to solve problem $p$ by solver $s$.
Requiring a baseline for comparisons, they compared the performance on problem $p$ by solver $s$ with the best performance by any solver on this problem; that is, using the performance ratio
$$r_{p,s} = \frac{t_{p,s}}{\min\{ t_{p,s} : s \in S \}}.$$
Suppose that a parameter $r_M \ge r_{p,s}$ for all $p$, $s$ is chosen. Set $r_{p,s} = r_M$ if and only if solver $s$ does not solve problem $p$. Then they defined
$$\rho_s(\tau) = \frac{1}{n_p} \operatorname{size}\{ p \in P : r_{p,s} \le \tau \}.$$
Thus $\rho_s(\tau)$ is the probability for solver $s \in S$ that a performance ratio $r_{p,s}$ is within a factor $\tau \in \mathbb{R}$ of the best possible ratio. The function $\rho_s$ is then the distribution function for the performance ratio. The performance profile $\rho_s : \mathbb{R} \to [0, 1]$ is a nondecreasing, piecewise constant function. That is, for the subset of methods being analyzed, we plot the fraction of the problems for which any given method is within a factor $\tau$ of the best.
For the test problems, if none of the three methods terminates successfully on a problem, then we discard that problem. If one method fails but another method terminates successfully, then the performance ratio of the failed method is set to $r_M$ ($r_M$ is the maximum of the performance ratios). The performance profiles based on iterations, function and gradient evaluations, and CPU time of the three methods are plotted in Figures 1, 2, and 3, respectively.
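The performance profile described above is simple to compute from a table of costs. The following sketch follows the Dolan and Moré definition as the fraction of problems whose performance ratio is at most a given factor. The function name, the data layout, and the use of `math.inf` to mark a failed run are assumptions made for illustration. An infinite ratio behaves like the maximal ratio for every finite factor, and problems on which every solver fails are dropped, as in the text. Costs are assumed to be strictly positive.

```python
import math


def performance_profile(costs, taus):
    """Return {solver: [rho_s(tau) for tau in taus]}.

    costs maps each solver name to a list of costs, one per problem,
    with math.inf marking a failure on that problem.
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    # Discard problems that no solver managed to solve.
    kept = [p for p in range(n_problems) if math.isfinite(best[p])]
    profile = {}
    for s in solvers:
        ratios = [costs[s][p] / best[p] for p in kept]
        profile[s] = [sum(r <= tau for r in ratios) / len(kept)
                      for tau in taus]
    return profile


# Illustrative iteration counts for two hypothetical solvers on four problems.
costs = {
    "A": [10.0, 20.0, math.inf, math.inf],
    "B": [20.0, 10.0, 30.0, math.inf],
}
profile = performance_profile(costs, taus=[1.0, 2.0])
```

The value at a factor of one is the fraction of problems on which a solver is the best, while the limit for large factors is the fraction it solves at all, so the left end of a profile measures efficiency and the right end measures robustness.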
From Figure 1, which plots the performance profile based on iterations, when , the HS method performs better than the MDL+ and DL methods. As $\tau$ increases, when , the MDL+ method outperforms the HS and DL methods. This means that, from the point of view of iterations, the HS method is better than the MDL+ and DL methods on a subset of problems, but over all the test problems, the MDL+ method is much more robust than the HS and DL methods.
From Figure 2, which plots the performance profile based on function and gradient evaluations, it is easy to see that, for all $\tau$, the MDL+ method performs much better than the HS and DL methods. This is an interesting phenomenon, since, when , the profile of HS based on iterations outperforms that of the MDL+ method. This means that, during the iteration process, the MDL+ method requires far fewer function and gradient evaluations than the HS and DL methods. From this point of view, the CPU time consumed by the MDL+ method should be much less than that of the HS and DL methods, since the CPU time depends mainly on function and gradient evaluations. Figure 3 confirms that the CPU time consumed by the MDL+ method is much less than that of the HS and DL methods.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by Guangxi High School Foundation Grant (Grant no. 2013BYB210) and Guangxi University of Finance and Economics Science Foundation Grant no. 2013A015.
Z. X. Wei, G. Y. Li, and L. Q. Qi, “Global convergence of the Polak-Ribière-Polyak conjugate gradient method with an Armijo-type inexact line search for nonconvex unconstrained optimization problems,” Mathematics of Computation, vol. 77, no. 264, pp. 2173–2193, 2008.
R. Fletcher, Practical Methods of Optimization Vol. 1: Unconstrained Optimization, John Wiley & Sons, New York, NY, USA, 1987.
J. J. Moré and D. J. Thuente, On Line Search Algorithms with Guaranteed Sufficient Decrease, Mathematics and Computer Science Division Preprint MCS-P330-1092, Argonne National Laboratory, 1990.