Abstract

This paper proposes diagonal matrices that approximate the (inverse) Hessian by parts, using a variational principle analogous to the one employed in constructing quasi-Newton updates. Our derivation is inspired by the least change secant updating approach: we let the diagonal approximation be the sum of two diagonal matrices, where the first carries information about the local Hessian and the second is chosen so as to induce positive definiteness of the diagonal approximation as a whole. Numerical results are also presented to illustrate the effectiveness of our approximating matrices when incorporated within the L-BFGS algorithm.

1. Introduction

Our investigation begins by seeking effective ways to diagonally scale the identity matrix, which is often used to initialize the L-BFGS method. For this purpose, it is useful to state the weak quasi-Newton equation: $$s_k^T B_{k+1} s_k = s_k^T y_k, \tag{1}$$ where the $n$-dimensional vector $s_k = x_{k+1} - x_k$ denotes the step corresponding to two different points $x_k$ and $x_{k+1}$, $y_k = g_{k+1} - g_k$ denotes the gradient change corresponding to the gradients $g_k$ and $g_{k+1}$ at the two points, and $B_{k+1}$ is a full matrix that approximates the Hessian of $f$. Both $s_k$ and $y_k$ are used explicitly, and $O(n^2)$ storage is required for the matrix $B_{k+1}$. If $B_{k+1}$ is further chosen to be a diagonal matrix, say $D_{k+1}$, then we obtain the so-called quasi-Cauchy relation [1]: $$s_k^T D_{k+1} s_k = s_k^T y_k. \tag{2}$$ Unlike the standard quasi-Newton update, which requires $O(n^2)$ storage, only $O(n)$ storage is required to store a diagonal update that satisfies the quasi-Cauchy relation. In addition, we require $D_{k+1}$ to be positive definite, so that it defines a metric.

The first example of diagonal updating that satisfies the quasi-Cauchy relation is the well-known Oren-Luenberger scaling matrix (Oren, 1974), given by $$D_{k+1} = \frac{s_k^T y_k}{s_k^T s_k} I, \tag{3}$$ where $I$ is the identity matrix and $s_k$ and $y_k$ are the step and the gradient change, respectively. Expression (3) is obtained from the quasi-Cauchy relation under the further restriction that the diagonal matrix be a scalar multiple of the identity matrix. Therefore, scaling matrices that are derived from the quasi-Cauchy relation are a natural generalization of Oren-Luenberger scaling.
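As a small numerical illustration of the statement above, the following sketch computes the Oren-Luenberger scalar and checks that the scaled identity satisfies the quasi-Cauchy relation. The function name `oren_luenberger_scale` and the toy quadratic are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def oren_luenberger_scale(s, y):
    """Return theta such that D = theta * I satisfies s^T D s = s^T y."""
    sts = float(s @ s)
    if sts == 0.0:
        raise ValueError("the step s must be nonzero")
    return float(s @ y) / sts

# Hypothetical toy data: for f(x) = 0.5 x^T A x the gradient change is y = A s.
A = np.diag([1.0, 4.0, 9.0])
s = np.array([1.0, -2.0, 0.5])
y = A @ s

theta = oren_luenberger_scale(s, y)
lhs = float(s @ (theta * np.eye(3)) @ s)  # s^T D s with D = theta * I
rhs = float(s @ y)                        # s^T y
print(theta, lhs, rhs)
```

For a convex quadratic such as this one, the scalar is a Rayleigh quotient of the Hessian, so it lies between the smallest and largest eigenvalues.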

In general, a procedure to obtain diagonal updating formulae via the quasi-Cauchy relation can be summarized as follows. Suppose that $D_k$ is a positive definite diagonal matrix and $D_{k+1}$ is the updated version of $D_k$, which is also diagonal. Then the updated $D_{k+1}$ is required to satisfy the quasi-Cauchy relation. It is also essential that the deviation between $D_k$ and $D_{k+1}$ be minimized under some variational principle, which in turn encourages numerical stability. Very often the Frobenius matrix norm is used to measure this deviation. As noted earlier, a diagonal matrix uses the same computer storage as a vector, which makes it attractive for limited memory algorithms.
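To make the procedure concrete, the following worked sketch carries it out for the plain Frobenius-norm objective. The notation ($D_k$ for the current diagonal matrix with entries $d_k^{(i)}$, $s_k^{(i)}$ for the components of the step, and $\lambda$ for the multiplier) is assumed here for illustration. Positive definiteness is not enforced in this sketch.

```latex
\begin{align*}
&\min_{D_{k+1}\ \text{diagonal}} \ \tfrac12 \| D_{k+1} - D_k \|_F^2
  \quad \text{s.t.} \quad s_k^T D_{k+1} s_k = s_k^T y_k, \\
&L(D_{k+1}, \lambda) = \tfrac12 \sum_{i=1}^n \big(d_{k+1}^{(i)} - d_k^{(i)}\big)^2
  + \lambda \Big( \sum_{i=1}^n d_{k+1}^{(i)} \big(s_k^{(i)}\big)^2 - s_k^T y_k \Big), \\
&\frac{\partial L}{\partial d_{k+1}^{(i)}} = 0
  \ \Longrightarrow\ d_{k+1}^{(i)} = d_k^{(i)} - \lambda \big(s_k^{(i)}\big)^2, \\
&s_k^T D_k s_k - \lambda \sum_{i=1}^n \big(s_k^{(i)}\big)^4 = s_k^T y_k
  \ \Longrightarrow\ \lambda = \frac{s_k^T D_k s_k - s_k^T y_k}{\sum_{i=1}^n \big(s_k^{(i)}\big)^4}, \\
&D_{k+1} = D_k + \frac{s_k^T y_k - s_k^T D_k s_k}{\sum_{i=1}^n \big(s_k^{(i)}\big)^4}\,
  \operatorname{diag}\!\Big(\big(s_k^{(1)}\big)^2, \ldots, \big(s_k^{(n)}\big)^2\Big).
\end{align*}
```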

This paper is structured as follows. In Section 2, we briefly review diagonal updating via the quasi-Cauchy relation and the preconditioning strategy. A new diagonal initial approximation for the L-BFGS method is introduced in this section as well. In Section 3, the convergence properties of the proposed L-BFGS methods are investigated. Numerical results are presented in Section 4 on a large set of unconstrained minimization problems, mainly from the collections of Moré et al. [2], CUTE [3], and Toint [4]. In Section 5, we use the numerical results to discuss whether the new diagonal initial approximating matrix is more effective at improving the performance of the L-BFGS update than the standard preconditioners in the literature. Finally, Section 6 ends the paper with a summary and conclusion.

2. Diagonal Updating via Quasi-Cauchy Relation

The performance of the L-BFGS method depends on a good approximation of the actual Hessian. In the basic implementation of the L-BFGS method, the most recent correction pairs $\{s_i, y_i\}$ are used to correct an initial matrix $H_k^0$. The choice of $H_k^0$ often influences the behaviour of the method, so it is worth investigating.

Throughout this section, when we mention a direct initial matrix, we mean a matrix that is a rough approximation to the Hessian; otherwise an initial matrix is an approximation to the inverse of Hessian.

Our approach is inspired by [5, 6], which employed a variational technique analogous to the one used to derive the Powell symmetric Broyden (PSB) quasi-Newton update (see, e.g., Dennis and Schnabel [7]). The resulting update for approximating the Hessian matrix diagonally is $$D_{k+1} = D_k + \frac{s_k^T y_k - s_k^T D_k s_k}{\operatorname{tr}(E_k^2)} E_k, \tag{4}$$ where $E_k = \operatorname{diag}\big((s_k^{(1)})^2, \ldots, (s_k^{(n)})^2\big)$, $s_k^{(i)}$ denotes the $i$th component of the vector $s_k$, and $\operatorname{tr}$ is the trace operator.

Note that when $s_k^T y_k < s_k^T D_k s_k$, the resulting $D_{k+1}$ is not necessarily positive definite. Thus, like its PSB counterpart in the quasi-Newton setting, the foregoing update may lose positive definiteness, and it is not appropriate for use within a quasi-Newton-based algorithm.
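The loss of positive definiteness can be seen on a two-dimensional example. The sketch below implements a least-change diagonal update of the kind described above, storing the diagonal as a vector; the function name `diagonal_weak_secant_update` and the data are hypothetical and chosen only to trigger the failure.

```python
import numpy as np

def diagonal_weak_secant_update(d, s, y):
    """Least-change diagonal update; d holds the diagonal of the current matrix."""
    e = s ** 2                      # diagonal of E = diag(s_i^2)
    tr_e2 = float(np.sum(e ** 2))   # tr(E^2) = sum_i s_i^4
    if tr_e2 == 0.0:
        return d.copy()             # zero step: nothing to update
    gap = float(s @ y - np.sum(d * e))  # s^T y - s^T D s
    return d + (gap / tr_e2) * e

# Hypothetical data with s^T y > 0 but s^T y < s^T D s.
d = np.array([1.0, 10.0])
s = np.array([1.0, 1.0])
y = np.array([0.5, 0.5])

d_new = diagonal_weak_secant_update(d, s, y)
print(d_new)                                 # one entry becomes negative
print(float(np.sum(d_new * s ** 2)), float(s @ y))  # weak secant equation holds
```

The updated diagonal still satisfies the weak secant equation, yet it has a negative entry, so it no longer defines a descent-producing scaling.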

In this study, we seek an efficient diagonal Hessian approximation by letting the diagonal approximating matrix be a combination of two diagonal matrices. This gives us the freedom to incorporate curvature information into one of these diagonal matrices, while the property of hereditary positive definiteness is carried by the second matrix.

To begin, suppose that the Hessian matrix of an objective function has positive diagonal elements. Let us divide the Hessian matrix into two parts: where is a diagonal matrix consisting of the diagonal entries of the Hessian and would resemble the actual Hessian except that its diagonal entries are all zero. Thus, we intend to form two diagonal approximating matrices to approximate each part of the Hessian, respectively; that is, Since it is assumed that the diagonal entries of the actual Hessian are all positive, a natural choice would be to let a positive definite diagonal matrix, say , approximate . Meanwhile, is expected to be dense and expensive to compute and would be approximated by . To preserve positive definiteness, we introduce an additional term in the form of , that is, a Levenberg-Marquardt-like step to maintain positive definiteness of in a way that is expressed as . Thus, the technique for calculating subject to the weak quasi-Newton relation is as follows.
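The splitting and the Levenberg-Marquardt-like shift can be sketched in a few lines. The helper names `split_hessian` and `shift_to_positive`, the `margin` parameter, and the shift rule itself are assumptions made for illustration; in particular, the shift rule below is a generic one and not the specific choice derived later in this section.

```python
import numpy as np

def split_hessian(G):
    """Split G into its diagonal part and its zero-diagonal remainder."""
    diag_part = np.diag(np.diag(G))
    off_part = G - diag_part
    return diag_part, off_part

def shift_to_positive(d, margin=1e-8):
    """Add omega * I to a diagonal matrix (stored as the vector d) so that
    every entry is at least margin. Generic illustrative rule only."""
    omega = max(0.0, margin - float(d.min()))
    return d + omega, omega

G = np.array([[2.0, 1.0], [1.0, 3.0]])
diag_part, off_part = split_hessian(G)

shifted, omega = shift_to_positive(np.array([1.0, -0.5]))
print(diag_part, off_part, shifted, omega)
```

Because the shift is a multiple of the identity, it costs no extra storage beyond a single scalar.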

Theorem 1. Suppose that ; then the optimal solution of the following minimization problem: is given by

Proof. Since the objective function in (8) is convex and the feasible set is also convex, then (8) has a unique solution. Its Lagrangian function is given by where is the Lagrange multiplier associated with the constraint. Differentiating (10) with respect to each of elements of and setting the results to zero yields, Pre- and postmultiplying (11) by and invoking the constraint, We have Finally, by substituting (13) into (11) and using the fact that yields (9),

A direct result of Theorem 1 leads to the following diagonal preconditioning formulation: An analogue of the above approach can be used to derive the initial inverse Hessian approximation , which is more useful for algorithmic purposes. Once again, letting the inverse Hessian of an objective function be separated by parts, the initial approximating matrix can be expressed as Since it is our intention to derive an initial approximation for the L-BFGS method, a natural choice of for approximating the diagonal entries of the inverse Hessian would be the diagonalized inverse BFGS formula of Gilbert and Lemaréchal [8] updated from a multiple of the identity matrix; that is, where we choose as the Oren-Luenberger scalar. Thus, the diagonal matrix can be obtained from (18). Note that, to safeguard against very small or very large , we impose the following additional condition: if , for some small and large positive and , we set . Here, we used and .

By interchanging the role of and in Theorem 1, one can obtain the formula for at step as follows: where is given by (18) and with is the th component of the vector .

For purposes of numerical illustrations, the latter diagonal formula (19) is used to initialize the L-BFGS method, although the potential use of (15) should not be neglected. Furthermore, one can observe that involved in the solution of the variational problem is isolated from the solution and its value does not affect the quality of the solution. This allows us to choose the value of freely to ensure that are positive definite while satisfying the weak-secant equation.

Note that maintaining positive definiteness of is crucial for the L-BFGS method to generate descent directions. For this purpose, the following lemma suggests a possible choice of .

Lemma 2. Assume that for all . Then is positive definite if

Proof. Note that to keep positive definiteness of , we should choose such that which also implies
Therefore, by our choice of as in (20), we have the following.
Case 1. If , we let , and thus,
Since , then would be positive definite.
Case 2. If , is set, which leads to It is clear that in both cases, would maintain positive definiteness.

Hence, can be expressed in the following form: Now, we can set up the basic algorithm of our L-BFGS methods using as (25).

2.1. LBFGS-USD Algorithm

Step 1. Choose an initial point $x_0$ and a positive definite matrix $H_0$. Let $k = 0$.

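Step 2. Compute $d_k = -H_k g_k$; then $x_{k+1} = x_k + \alpha_k d_k$, where $\alpha_k$ satisfies the Wolfe conditions given by $$f(x_k + \alpha_k d_k) \le f(x_k) + \beta' \alpha_k g_k^T d_k, \qquad g(x_k + \alpha_k d_k)^T d_k \ge \beta\, g_k^T d_k,$$ with $0 < \beta' < 1/2$ and $\beta' < \beta < 1$ (we always try the step length $\alpha_k = 1$ first).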

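Step 3. Let $\hat{m} = \min\{k, m - 1\}$, where $m$ is the number of stored correction pairs. Update $H_{k+1}^0$ $\hat{m} + 1$ times using the pairs $\{y_j, s_j\}_{j = k - \hat{m}}^{k}$; that is, let $$H_{k+1} = (V_k^T \cdots V_{k-\hat{m}}^T) H_{k+1}^0 (V_{k-\hat{m}} \cdots V_k) + \rho_{k-\hat{m}} (V_k^T \cdots V_{k-\hat{m}+1}^T) s_{k-\hat{m}} s_{k-\hat{m}}^T (V_{k-\hat{m}+1} \cdots V_k) + \cdots + \rho_k s_k s_k^T,$$ where $\rho_k = 1/(y_k^T s_k)$ and $V_k = I - \rho_k y_k s_k^T$.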

Step 4. Set $k := k + 1$ and return to Step 2.

Remark 3. The LBFGS-USD algorithm is exactly the L-BFGS algorithm of Liu and Nocedal [9], except that the initial matrix is computed by (25).
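In practice the L-BFGS matrix is rarely formed explicitly. A common way to apply it is the two-loop recursion, which needs only a matrix-vector product with the initial matrix; with a diagonal initial matrix this product is an elementwise multiplication. The sketch below is a minimal illustration under that assumption, not the implementation used for the experiments; the name `lbfgs_direction` and the argument `h0_diag` (the diagonal of the initial inverse Hessian approximation, however it is computed) are hypothetical.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list, h0_diag):
    """Return d = -H g, where H is the L-BFGS inverse Hessian approximation
    built from the stored pairs (oldest first) and a diagonal initial matrix."""
    q = g.copy()
    rhos = [1.0 / float(y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest pair.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * float(s @ q)
        alphas.append(a)
        q = q - a * y
    # Apply the diagonal initial matrix elementwise: O(n) work and storage.
    r = h0_diag * q
    # Second loop: oldest pair to newest pair.
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * float(y @ r)
        r = r + (a - beta) * s
    return -r

# Usage with one hypothetical correction pair and a positive diagonal.
g = np.array([1.0, -1.0])
s = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
h0 = np.array([0.5, 2.0])
d = lbfgs_direction(g, [s], [y], h0)
print(d, float(g @ d))  # g^T d < 0 indicates a descent direction
```

The direction is a descent direction whenever the initial diagonal is positive and every stored pair satisfies $y^T s > 0$, which is why positive definiteness of the initial matrix matters.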

3. Convergence Analysis

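We now establish the convergence of the LBFGS-USD algorithm. The following standard assumptions are made on the objective function.
(a) The objective function $f$ is twice continuously differentiable.
(b) The level set $\mathcal{D} = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is convex, and there exist positive constants $M_1$ and $M_2$ such that $$M_1 \|z\|^2 \le z^T \nabla^2 f(x) z \le M_2 \|z\|^2$$ for all $z \in \mathbb{R}^n$ and $x \in \mathcal{D}$.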

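Theorem 4. Let $x_0$ be a starting point for which $f$ satisfies the assumptions above, and assume that the initial matrices $H_k^0$ are chosen so that $\{\|H_k^0\|\}$ and $\{\|(H_k^0)^{-1}\|\}$ are bounded. Then, for any positive definite $H_0$ (most often, $H_0 = I$ is chosen), the LBFGS-USD algorithm generates a sequence $\{x_k\}$ which converges to the minimizer $x^*$. Moreover, there is a constant $0 \le r < 1$ such that $$f(x_k) - f(x^*) \le r^k \big(f(x_0) - f(x^*)\big),$$ which implies that $\{x_k\}$ converges $R$-linearly.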

Proof. See [9].

Theorem 4 suggests that as long as the initial matrix is chosen such that it remains bounded for all $k$, the corresponding L-BFGS algorithm restarted with this initial matrix generates a sequence that converges globally and $R$-linearly. For this purpose, we give the following result, which ensures that the initial matrix given by (25) is bounded above and below by some constants.

Lemma 5. Let be a starting point for which satisfies assumptions (a)-(b). Consider the sequence generated by L-BFGS algorithms subject to the diagonal initial approximation, given by (25). If for all , then is upper and lower bounded for all .

Proof. Let , then , and assumption (a) implies that Subsequently, (28) also gives and therefore,
First, we begin by showing that every component of in (18) is bounded, so that is bounded. It is more convenient to show that each part of , namely, (i), (ii), and (iii), is bounded, such that is bounded as a whole. (i) Since then, we obtain (ii) Note that where is the th component of . Then, On the other hand, Therefore, overall, it gives (iii) In a similar way, we can establish and subsequently, Then, we have
Following that, the component of , namely , is bounded as follows: Note that, by the definition of , we have Hence, satisfies Whereas, when , we have where is the largest component among all . Therefore, we can conclude that is upper and lower bounded for all .

4. Numerical Experiences

Our tests used a set of 50 unconstrained minimization problems, listed in Table 1. These test problems are selected from Moré et al. [2], CUTE [3], Toint [4], and various other test function collections such as [10]. The subroutines for the test problems are available at http://camo.ici.ro/forum/SCALCG/evalfg.for (accessed January 2012). The methods tested are as follows:
(1) LBFGS-I: L-BFGS method with the identity matrix as the initial matrix,
(2) LBFGS-I: L-BFGS method with a multiple of the identity matrix as the initial matrix, where the multiplier is the Oren-Luenberger scaling,
(3) LBFGS-USD: L-BFGS method with the initial matrix given by (19).

5. Discussion

In general, Figure 1 indicates that the new diagonal initial approximating matrix performs substantially better than both standard initializations of the L-BFGS method in terms of number of iterations, function/gradient calls, and CPU time. To better study the effect of our initial approximation, Tables 2, 3, and 4 give the ratio of function/gradient calls to iteration counts for the L-BFGS method with the standard initial matrices and with our diagonal initial approximating matrix. Our diagonal initial approximation, LBFGS-USD, performs better in that its ratio is close to one, and hence it is more likely to accept the unit step length than the LBFGS-I method (without any preconditioning).

Moreover, the LBFGS-USD method requires, in general, as much as 38% and 19% fewer iterations than LBFGS-I and LBFGS-I, respectively. In terms of function/gradient evaluations, LBFGS-USD needs 46% and 16% fewer, respectively. Finally, LBFGS-USD requires 33% and 6% less CPU time (in seconds), respectively. In conclusion, the numerical results for a broad class of test problems show that the LBFGS-USD algorithm is efficient and outperforms both standard initializations on small- to large-size problems.

6. Concluding Remarks

We proposed a technique that exploits the presence of the Hessian in diagonal matrix form. Under some standard assumptions on the objective function, we showed that the L-BFGS scheme with the diagonal initial approximation converges $R$-linearly. Based on our numerical results, we believe that the following conclusions can be drawn about the diagonal approximation derived in this study. (i) Our diagonal approximating matrix maintains positive definiteness in a very simple way and requires only $O(n)$ storage. (ii) The numerical experiments show that our diagonal initial approximating matrix is generally more effective for the L-BFGS method than the standard initial matrices in the literature.

Conflict of Interests

The author(s) declare(s) that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the Malaysian MOHE-FRGS Grant no. 01-11-09-722FR.