An Accelerated Proximal Gradient Algorithm for Singly Linearly Constrained Quadratic Programs with Box Constraints
Proximal gradient algorithms have recently been used to solve nonsmooth convex optimization problems. As a special nonsmooth convex problem, the singly linearly constrained quadratic program with box constraints appears in a wide range of applications. Hence, we propose an accelerated proximal gradient algorithm for singly linearly constrained quadratic programs with box constraints. At each iteration, the subproblem, whose Hessian matrix is diagonal and positive definite, is an easy model that can be solved efficiently by searching for a root of a piecewise linear function. It is proved that the new algorithm terminates at an $\varepsilon$-optimal solution within $O(1/\sqrt{\varepsilon})$ iterations. Moreover, no line search is needed in this algorithm, and global convergence can be proved under mild conditions. Numerical results are reported for quadratic programs arising from the training of support vector machines, which show that the new algorithm is efficient.
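In this paper, we mainly consider the following quadratic programming problem:
$$\min_{x \in \Omega} f(x) = \tfrac{1}{2} x^T G x - c^T x, \tag{1}$$
where $G \in \mathbb{R}^{n \times n}$ is symmetric and positive semidefinite, $x, c \in \mathbb{R}^n$, and the feasible region $\Omega$ is defined by
$$\Omega = \{x \in \mathbb{R}^n : l \le x \le u,\ a^T x = b\}, \tag{2}$$
where $l, u, a \in \mathbb{R}^n$ and $b$ is a scalar.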
Singly linearly constrained quadratic programs with box constraints appear in a wide range of applications such as image processing, biological information, and machine learning. Specifically, the support vector machine (SVM) is one of the most classical models of (1). It is a promising technique for solving a variety of machine learning and function estimation problems. The SVM learning methodology has been shown to give good performance in a wide variety of problems such as face detection, text categorization, and handwritten character recognition. The number of variables in an SVM problem is so large that traditional optimization methods cannot be applied directly. Decomposition methods [1–4], whose subproblems are special cases of (1), are the main approach for large-scale SVM problems. Solving the subproblem by the generalized variable projection method (GVPM) and by the projected gradient method (PGM) is introduced in [5, 6], respectively. Moreover, Zanni et al. also propose parallel decomposition algorithms based on these two methods in [7, 8]. For more general large-scale models, some parallel methods have recently been proposed in [9, 10], but these methods have not so far been applied specifically to SVM.
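In this work, we give an accelerated proximal gradient algorithm for (1). First, we consider the following nonsmooth convex optimization problem:
$$\min_{x \in \mathbb{R}^n} F(x) = f(x) + g(x), \tag{3}$$
where $g : \mathbb{R}^n \to (-\infty, +\infty]$ is a proper, lower semicontinuous (lsc), convex function and $f$ is convex and smooth (i.e., continuously differentiable) on an open subset of $\mathbb{R}^n$ containing $\operatorname{dom} g$, and $\nabla f$ is Lipschitz continuous on $\operatorname{dom} g$. That is, for some $L_f > 0$,
$$\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\| \quad \text{for all } x, y \in \operatorname{dom} g,$$
where, and in what follows, $\|\cdot\|$ denotes the spectral norm. Obviously, problem (1) is a special case of (3) with $f$ being the quadratic objective of (1) and $g$ being the indicator function of the feasible region $\Omega$, defined by
$$g(x) = \begin{cases} 0, & x \in \Omega, \\ +\infty, & \text{otherwise}. \end{cases}$$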
Recently, great attention has been paid to the solution of (3). Nesterov and Nemirovski [11, 12] study the accelerated proximal gradient method for (3), with an attractive iteration complexity of $O(1/\sqrt{\varepsilon})$ for achieving $\varepsilon$-optimality. Almost at the same time, Beck and Teboulle give a fast iterative shrinkage-thresholding algorithm (FISTA) which achieves the same convergence rate. After that, Tseng summarizes these algorithms and presents a unified treatment of them. All these algorithms perform well on large-scale problems, such as linear inverse problems, matrix game problems, and matrix completion. Motivated by the successful use of the accelerated proximal gradient method for (3), we extend Beck and Teboulle's algorithm to solve (1). In particular, the subproblem is solved by searching for a root of a piecewise linear continuous function. Numerical results show that the new algorithm is efficient.
The paper is organized as follows. In Section 2, the proximal gradient algorithm and its accelerated version are presented for (1). The convergence of these algorithms is also discussed. In Section 3, the method used to solve the subproblem efficiently is introduced. Numerical results on SVM classification problems generated from random and real world data sets are shown in Section 4. Section 5 is devoted to conclusions and further study.
2. Proximal Gradient Algorithms
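In this section, we introduce the proximal gradient algorithm and its accelerated version, which can be applied to solve (1). For any $y \in \mathbb{R}^n$, consider the following quadratic approximation of $f$ at $y$:
$$Q_L(x, y) = f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|^2, \tag{6}$$
where $L > 0$ is a given parameter. Since (6) is a strongly convex function of $x$, $Q_L(\cdot, y)$ has a unique minimizer in $\Omega$, which we denote by
$$p_L(y) = \arg\min_{x \in \Omega} Q_L(x, y). \tag{8}$$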
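Because the nonsmooth term in (1) is the indicator function of the feasible region, the minimizer of the quadratic approximation reduces to a Euclidean projection. The short derivation below is a sketch under assumed notation: $Q_L$ for the quadratic model, $p_L$ for its minimizer over $\Omega$, and $P_\Omega$ for the projection onto $\Omega$ (the last symbol is introduced here for illustration only).

```latex
\begin{aligned}
p_L(y) &= \arg\min_{x \in \Omega}
  \Big\{ f(y) + \langle \nabla f(y), x - y \rangle + \tfrac{L}{2}\|x - y\|^2 \Big\} \\
&= \arg\min_{x \in \Omega}
  \tfrac{L}{2}\Big\| x - \Big( y - \tfrac{1}{L}\nabla f(y) \Big) \Big\|^2
  \qquad \text{(the two objectives differ by a constant independent of } x\text{)} \\
&= P_\Omega\Big( y - \tfrac{1}{L}\nabla f(y) \Big).
\end{aligned}
```

In other words, each evaluation of the operator is a gradient step of length $1/L$ followed by a projection onto the box intersected with the hyperplane.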
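First, we present a simple result which characterizes the optimality of $p_L(y)$.
Lemma 1. For any $y \in \mathbb{R}^n$, $z = p_L(y)$ if and only if there exists $\gamma \in N_\Omega(z)$ such that
$$\nabla f(y) + L(z - y) + \gamma = 0,$$
where $N_\Omega(z)$ is the normal cone of $\Omega$ at the point $z$, denoted by
$$N_\Omega(z) = \{\gamma \in \mathbb{R}^n : \langle \gamma, x - z \rangle \le 0 \ \text{for all } x \in \Omega\}.$$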
Proof. This result can be obtained directly from the optimality conditions of (8), which is a strongly convex problem.
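Then, similarly to Lemma 2.3 of Beck and Teboulle, we have the following key result.
Lemma 2. Let $y \in \mathbb{R}^n$ and $L > 0$ be such that
$$f(p_L(y)) \le Q_L(p_L(y), y). \tag{11}$$
Then for any $x \in \Omega$,
$$f(x) - f(p_L(y)) \ge \frac{L}{2}\|p_L(y) - y\|^2 + L\langle y - x, p_L(y) - y \rangle. \tag{12}$$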
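Algorithm 3. We have the following steps.
Step 1. Choose $x_0 \in \mathbb{R}^n$ and a constant $L \ge L_f$; set $k = 1$.
Step 2. While ($x_{k-1}$ does not satisfy the terminal conditions), compute $x_k = p_L(x_{k-1})$ and set $k = k + 1$.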
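Theorem 4. Let $\{x_k\}$ and $\{f(x_k)\}$ be the sequences generated by Algorithm 3 with $L \ge L_f$; then, for any optimal solution $x^*$ of (1), the following results hold: (a) $\{f(x_k)\}$ is nonincreasing and $f(x_{k-1}) - f(x_k) \ge \frac{L}{2}\|x_k - x_{k-1}\|^2$; (b) $f(x_k) - f(x^*) \le \frac{L\|x_0 - x^*\|^2}{2k}$, for all $k \ge 1$; (c) if $\{x_{k_j}\}$ is a convergent subsequence of $\{x_k\}$ and $x_{k_j} \to \bar{x}$, then $\bar{x}$ is an optimal solution of (1); (d) if $G$ is positive definite, then $\|x_k - x^*\|^2 \le \frac{L\|x_0 - x^*\|^2}{\lambda_{\min} k}$, where $\lambda_{\min}$ denotes the minimum eigenvalue of $G$.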
Proof. (a) Let , ; then from (11) and the optimality of we have
Furthermore, take in (12); we get
(b) We have where the first inequality uses the fact that is nonincreasing and the second inequality is established by taking , in (12). Dividing both sides by in the last inequality, we get
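(c) It follows from (b) that, for all $j$,
$$f(x_{k_j}) - f(x^*) \le \frac{L\|x_0 - x^*\|^2}{2k_j}.$$
Let $j \to \infty$ in the above inequality; we have $f(\bar{x}) \le f(x^*)$. On the other hand, by the definition of $x^*$, we have $f(\bar{x}) \ge f(x^*)$. Hence, $f(\bar{x}) = f(x^*)$, which implies that $\bar{x}$ is an optimal solution of (1).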
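(d) It follows from the definition of $f$ that, for all $k$,
$$f(x_k) - f(x^*) = \nabla f(x^*)^T (x_k - x^*) + \tfrac{1}{2}(x_k - x^*)^T G (x_k - x^*) \ge \tfrac{1}{2}(x_k - x^*)^T G (x_k - x^*) \ge \frac{\lambda_{\min}}{2}\|x_k - x^*\|^2, \tag{23}$$
where the first inequality is established by the definition of $x^*$ and the second inequality, with $\lambda_{\min}$ denoting the minimum eigenvalue of $G$, is obtained from the positive definite property of $G$.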
From (23) and (b), we get
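Algorithm 5. We have the following steps.
Step 1. Choose $x_0 \in \mathbb{R}^n$, set $y_1 = x_0$, $t_1 = 1$.
Step 2. While ($x_{k-1}$ does not satisfy the terminal conditions), compute
$$x_k = p_L(y_k), \tag{24}$$
$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \tag{25}$$
$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}), \tag{26}$$
and set $k = k + 1$.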
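The two iterations can be compared on a toy instance. The following is a minimal sketch, not the authors' implementation: it assumes the objective $f(x) = \frac{1}{2}x^T G x - c^T x$ (the sign of the linear term is an assumption), uses the FISTA update of Beck and Teboulle for the accelerated scheme, and replaces the root search of Section 3 with plain bisection on an assumed bracket. All function names and the 2-variable test problem are hypothetical.

```python
import math

def project(z, a, b, l, u, iters=100):
    """Project z onto {x : l <= x <= u, a^T x = b} by bisection on the multiplier."""
    def x_of(lam):
        # componentwise median of (l_i, z_i + lam * a_i, u_i)
        return [min(max(zi + lam * ai, li), ui)
                for zi, ai, li, ui in zip(z, a, l, u)]
    lo, hi = -1e6, 1e6  # assumed to bracket the root (illustrative choice)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(ai * xi for ai, xi in zip(a, x_of(mid))) - b > 0:
            hi = mid
        else:
            lo = mid
    return x_of(0.5 * (lo + hi))

def p_L(y, L, G, c, a, b, l, u):
    """Gradient step of length 1/L on the assumed f, then projection."""
    grad = [sum(gij * yj for gij, yj in zip(row, y)) - ci
            for row, ci in zip(G, c)]
    return project([yi - gi / L for yi, gi in zip(y, grad)], a, b, l, u)

def proximal_gradient(x0, L, n_iter, G, c, a, b, l, u):
    """Unaccelerated scheme: the operator is applied to the previous iterate."""
    x = list(x0)
    for _ in range(n_iter):
        x = p_L(x, L, G, c, a, b, l, u)
    return x

def accelerated_proximal_gradient(x0, L, n_iter, G, c, a, b, l, u):
    """Accelerated scheme: the operator is applied to an extrapolated point y."""
    x_prev, y, t = list(x0), list(x0), 1.0
    for _ in range(n_iter):
        x = p_L(y, L, G, c, a, b, l, u)
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [xi + (t - 1.0) / t_next * (xi - xpi)
             for xi, xpi in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev

# Hypothetical toy problem: minimize x1^2 + 0.5*x2^2 - x1 - x2
# subject to x1 + x2 = 1 and 0 <= x <= 1; the minimizer is (1/3, 2/3).
G = [[2.0, 0.0], [0.0, 1.0]]
c = [1.0, 1.0]
a, b = [1.0, 1.0], 1.0
l, u = [0.0, 0.0], [1.0, 1.0]
x_pg = proximal_gradient([0.5, 0.5], 2.0, 100, G, c, a, b, l, u)
x_acc = accelerated_proximal_gradient([0.5, 0.5], 2.0, 100, G, c, a, b, l, u)
```

The constant `L = 2.0` equals the largest eigenvalue of the toy `G`, so no line search is needed. The only extra work in the accelerated version is the scalar update of `t` and one vector extrapolation per iteration.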
Remark 6. In Algorithm 5, the operator $p_L(\cdot)$ is employed on the point $y_k$, which is a specific linear combination of the previous two points $\{x_{k-1}, x_{k-2}\}$. However, in Algorithm 3, the operator only uses the previous point $x_{k-1}$. On the other hand, the computational effort of these two algorithms is almost the same, except for the computation of (26) in Algorithm 5, which is negligible.
Now we give the promising improved complexity result for Algorithm 5.
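Theorem 7. Let $\{x_k\}$ and $\{y_k\}$ be the sequences generated by Algorithm 5 with $L \ge L_f$. Then, for any optimal solution $x^*$ of (1), the following results hold: (a) $f(x_k) - f(x^*) \le \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$, for all $k \ge 1$; (b) if $\{x_{k_j}\}$ is a converging subsequence of $\{x_k\}$ and $x_{k_j} \to \bar{x}$, then $\bar{x}$ is an optimal solution of (1); (c) if $G$ is positive definite, then $\|x_k - x^*\|^2 \le \frac{4L\|x_0 - x^*\|^2}{\lambda_{\min}(k+1)^2}$, where $\lambda_{\min}$ denotes the minimum eigenvalue of $G$.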
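To see what the two rates mean in practice, the worked computation below assumes the standard bounds for the unaccelerated and accelerated schemes with a constant $L$. The shorthand $R = \|x_0 - x^*\|$ and the sample values of $L$, $R$, and $\varepsilon$ are illustrative assumptions, not data from the experiments.

```latex
\begin{aligned}
\text{Algorithm 3:}\quad & \frac{L R^2}{2k} \le \varepsilon
  \iff k \ge \frac{L R^2}{2\varepsilon}, \\
\text{Algorithm 5:}\quad & \frac{2 L R^2}{(k+1)^2} \le \varepsilon
  \iff k \ge R\sqrt{\frac{2L}{\varepsilon}} - 1, \\
\text{with } L = 2,\ R = 1,\ \varepsilon = 10^{-4}:\quad
  & k \ge 10^{4} \ \text{(Algorithm 3)}, \qquad k \ge 199 \ \text{(Algorithm 5)}.
\end{aligned}
```

These are worst-case guarantees, so the iteration counts observed on a particular problem can be much smaller for both algorithms.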
3. How to Solve $p_L(y)$
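In this section, we extend the method studied by Dai and Fletcher to solve $p_L(y)$. In order to obtain $p_L(y)$ in (8), we need to solve the following problem:
$$\min_{x \in \Omega} \tfrac{1}{2} x^T D x - z^T x, \tag{29}$$
where $D = LI$ and $z = Ly - \nabla f(y)$. For a given value of $\lambda$, we consider the following box constrained QP problem:
$$\min_{l \le x \le u} \tfrac{1}{2} x^T D x - z^T x - \lambda(a^T x - b), \tag{30}$$
and denote the minimizer of (30) as $x(\lambda)$. Then,
$$x(\lambda) = \operatorname{mid}(l, h(\lambda), u),$$
where $h(\lambda)$ has components $h_i = (z_i + \lambda a_i)/L$ ($i = 1, \dots, n$) and $\operatorname{mid}$ is the componentwise operation that supplies the median of its three arguments. That is,
$$x_i(\lambda) = \max\{l_i, \min\{h_i, u_i\}\}, \quad i = 1, \dots, n.$$
Define
$$r(\lambda) = a^T x(\lambda) - b;$$
then from the expression of $x(\lambda)$, we know that $r(\lambda)$ is a piecewise linear, continuous, and monotonically nondecreasing function of $\lambda$. Furthermore, $x(\lambda^*)$ is the optimal solution of (29) if $\lambda^*$ is located such that $r(\lambda^*) = 0$. Hence, the main task of solving (29) is to find a $\lambda^*$ such that $r(\lambda^*) = 0$. For this purpose, we adopt the algorithm of Dai and Fletcher, which consists of a bracketing phase and a secant phase. Numerical experiments in the next section show that this algorithm achieves high efficiency.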
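The root search can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm used in the paper: the subproblem is rescaled so that it becomes a Euclidean projection of a point `z` onto the feasible region, the bracketing phase doubles the step until the sign of the residual changes, and plain bisection is swapped in for the secant phase. The function name and its tolerances are hypothetical.

```python
def project_onto_omega(z, a, b, l, u, tol=1e-12, max_iter=200):
    """Project z onto {x : l <= x <= u, a^T x = b} via a root of r(lam)."""
    def x_of(lam):
        # componentwise median of (l_i, z_i + lam * a_i, u_i)
        return [min(max(zi + lam * ai, li), ui)
                for zi, ai, li, ui in zip(z, a, l, u)]

    def r(lam):
        # piecewise linear, continuous, nondecreasing residual of a^T x = b
        return sum(ai * xi for ai, xi in zip(a, x_of(lam))) - b

    # Bracketing phase: find lo, hi with r(lo) <= 0 <= r(hi).
    lo, hi, step = -1.0, 1.0, 1.0
    for _ in range(60):
        if r(lo) <= 0.0:
            break
        hi, lo = lo, lo - step
        step *= 2.0
    else:
        raise ValueError("no bracket found; the constraints may be infeasible")
    for _ in range(60):
        if r(hi) >= 0.0:
            break
        lo, hi = hi, hi + step
        step *= 2.0
    else:
        raise ValueError("no bracket found; the constraints may be infeasible")

    # Bisection phase (used here in place of the secant phase).
    lam = 0.5 * (lo + hi)
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        val = r(lam)
        if abs(val) <= tol:
            break
        if val > 0.0:
            hi = lam
        else:
            lo = lam
    return x_of(lam)

# Hypothetical example: project (2, 0) onto {x1 + x2 = 1, 0 <= x <= 1}.
x_proj = project_onto_omega([2.0, 0.0], [1.0, 1.0], 1.0, [0.0, 0.0], [1.0, 1.0])
```

Each evaluation of the residual costs one pass over the variables, which is why the subproblem is cheap compared with the matrix-vector product needed for the gradient. A secant step typically needs fewer evaluations than bisection because the residual is piecewise linear.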
4. Numerical Experiments
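As we know, one of the important applications of SVM is classification. Given a training set $\{(z_i, y_i)\}_{i=1}^{n}$ with $z_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$, we need to find a hyperplane to separate the two classes of points. This can be done by solving the following convex quadratic programming problem:
$$\min \tfrac{1}{2} x^T G x - e^T x \quad \text{s.t. } y^T x = 0,\ 0 \le x \le Ce, \tag{35}$$
where $x \in \mathbb{R}^n$, $e$ is the vector of all ones, and $C$ is a positive scalar. $G$ is a symmetric and positive semidefinite matrix with entries $G_{ij} = y_i y_j K(z_i, z_j)$, $i, j = 1, \dots, n$, where $K$ is some kernel function.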
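To make the connection with problem (1) concrete, the sketch below assembles the data of the SVM dual from a small labelled sample. It assumes the standard dual form with labels in $\{-1, 1\}$ and a Gaussian kernel parameterized as $\exp(-\|p - q\|^2 / (2\sigma^2))$; this parameterization, the function names, and the two sample points are assumptions made for illustration.

```python
import math

def rbf_kernel(p, q, sigma):
    """Gaussian radial basis function (assumed parameterization)."""
    dist2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return math.exp(-dist2 / (2.0 * sigma ** 2))

def svm_dual_data(points, labels, C, sigma):
    """Return (G, c, a, b, l, u) for the singly constrained box QP."""
    n = len(points)
    # G_ij = y_i * y_j * K(z_i, z_j): symmetric and positive semidefinite
    G = [[labels[i] * labels[j] * rbf_kernel(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]
    c = [1.0] * n                    # vector of all ones in the linear term
    a = [float(v) for v in labels]   # single linear constraint: y^T x = 0
    b = 0.0
    l = [0.0] * n                    # box constraints 0 <= x <= C
    u = [float(C)] * n
    return G, c, a, b, l, u

# Hypothetical two-point sample with opposite labels.
G, c, a, b, l, u = svm_dual_data([[0.0, 0.0], [1.0, 0.0]], [1, -1],
                                 C=10.0, sigma=1.0)
```

Forming `G` explicitly needs storage quadratic in the number of samples, which is one reason decomposition methods are preferred for large training sets.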
In this section, we illustrate the performance of Algorithms 3 and 5 with MATLAB 7.10 on a Windows XP Professional computer (Pentium(R), 2.79 GHz). We test SVM classification problems on random data sets and real world data sets.
The generation of the random data sets is based on four parameters , , , and , where is the number of samples and is the dimension of . Each element of is randomly generated in , and −1 or 1 randomly emerges in the entry of .
We have generated three random data sets with , , , and = 200, 600, and 1000, respectively. The two real world data sets are the UCI adult data set and the heart data set. For the UCI adult, the versions with = 1605, 2265, and 3185, are considered. The Gauss radial basis function is used in our tests. The parameters in (35) and in (36) are set to , , and for random data sets, heart, and UCI adult data sets respectively.
For all test problems, the initial point is chosen as and the const . The terminal rules for solving are .
For the test problem with random data set () and heart data set, the running steps are 8000 and 1000, respectively. In order to make a visual comparison between Algorithms 3 and 5, we plot the values of with iterations in Figure 1. Furthermore, we compute where denotes the number of calculating in step . In Figure 2, we plot the values of with , and , , respectively.
Figure 1: (a) Random data set; (b) Heart data set.
Figure 2: (a) Random data set; (b) Heart data set.
For the test problems with the remaining data sets, the terminal condition is the fulfilment of the KKT conditions within a tolerance of 0.001 (see ). The numerical results are shown in Tables 1 and 2, where the meanings of the column headings are as follows: (i) the scale of the data sets; (ii) sec: the computing time (in seconds); (iii) the number of iterations; (iv) raver: the average number of evaluations in each step.
From Figure 2, we can see that the total number of evaluations per 100 steps, as defined in (27), is smaller in Algorithm 3 than in Algorithm 5. However, Figure 1 and Tables 1 and 2 show that Algorithm 5 needs far fewer iterations and much less time than Algorithm 3 to reach an approximate solution. For the subproblem, fewer than 4 iterations are needed on average to obtain a very accurate solution, which shows that the subproblem can be solved efficiently. Moreover, from the column raver in Tables 1 and 2, we can also see that the average number of evaluations with real world data sets is slightly less than that with random data sets.
5. Conclusion and Future Work
We have extended the accelerated proximal gradient method to the solution of singly linearly constrained quadratic programs with box constraints. The new algorithm is proved to be globally convergent. Numerical results also show that the new algorithm performs well on medium-scale quadratic programs. In addition, solving the subproblem by searching for a root of a piecewise linear continuous function is very cheap. Considering the good performance of the new Algorithm 5, we can apply it to the solution of the subproblems in decomposition methods for large-scale SVM problems. Moreover, a parallel version of this algorithm combined with the theory in [9, 10] is also a direction of our future research.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (11101420, 71271204) and the Excellent Young Scientist Foundation of Shandong Province (2010BSE06047).
T. Serafini, G. Zanghirati, and L. Zanni, “Gradient projection methods for large quadratic programs and applications in training support vector machines,” Optimization Methods and Software, vol. 20, pp. 353–378, 2003.
L. Zanni, T. Serafini, and G. Zanghirati, “Parallel software for training large scale support vector machines on multiprocessor systems,” Journal of Machine Learning Research, vol. 7, pp. 1467–1492, 2006.
H. Congying, W. Yongli, and H. Guoping, “On the convergence of asynchronous parallel algorithm for large-scale linearly constrained minimization problem,” Applied Mathematics and Computation, vol. 2, pp. 434–441, 2009.
A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, pp. 183–202, 2009.
P. Tseng, “On accelerated proximal gradient methods for convex-concave optimization,” SIAM Journal on Optimization. In press.
T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. J. C. Burges, and A. J. Smola, Eds., pp. 169–184, MIT Press, Cambridge, Mass, USA, 1998.