Dynamics, Control, and Optimization with Applications 2014View this Special Issue
Research Article | Open Access
An Implementable First-Order Primal-Dual Algorithm for Structured Convex Optimization
Many application problems of practical interest can be posed as structured convex optimization models. In this paper, we study a new first-order primaldual algorithm. The method can be easily implementable, provided that the resolvent operators of the component objective functions are simple to evaluate. We show that the proposed method can be interpreted as a proximal point algorithm with a customized metric proximal parameter. Convergence property is established under the analytic contraction framework. Finally, we verify the efficiency of the algorithm by solving the stable principal component pursuit problem.
In this paper, we consider the following separable optimization problem: where and are given matrices, is a given vector, and and are nonempty closed convex sets. and are convex functions (not necessarily smooth). Throughout this paper, we assume that the solution set of (1) is nonempty. Problem of this type arises in many applications, ranging from machine learning to compressed sensing. We refer to, for example, [1–9], for a few examples of applications.
To solve (1), one can use the classical augmented Lagrangian method (ALM). Starting from any initial iterate , ALM iterates via the following procedure:where the augmented Lagrangian function associated with the problem (1) is given by and is the Lagrangian multiplier vector associated with the linear constraint and is a penalty parameter for the violation of the linear constraint. ALM enjoys very nice convergence and has been shown to be equivalent to a proximal point algorithm applied to the dual of (1) . A noticeable feature of ALM is that it treats (1) as a generic minimization problem and ignores completely the nice separable structure emerging in the objective function. The minimizations of the two functions and in (2a) are strongly coupled because of the quadratic term . Hence, the implementation of ALM ((2a) and (2b)) can be computationally challenging. To utilize the separable structure of the problem, the well-known alternating direction method (ADM) essentially splits the ALM subproblem into two subproblems with respect to and in Gauss-Seidel manner. More specifically, at each iteration, ADM takes the following form:
ADM can be interpreted as Douglas-Rachford splitting method applied to the dual problem  and proximal point method . We refer to, for example, [13–17] for recent study of ADM and its variants. Comparing to ALM, ADM minimizes (3) with respect to and alternatingly at each iteration, rather than with respect to both and simultaneously. This splitting procedure makes it possible to exploit the special separable structure of the objective functions. Thus, ADM appears to be a natural fit for solving very large scale distributed machine learning and big-data related optimization problems.
When and in (1) are both identity matrices, ADM can be very efficient. The first two subproblems ((4a) and (4b)) correspond to evaluating the resolvent operators of component functions and , respectively. Here, the proximal operator of the function is defined by where and . In most popular applications of sparse optimization, the proximal operator can be computed exactly and efficiently (e.g., ). While when (resp., ) is not an identity matrix, the resulting ADM subproblems may not be easily solvable, since they involve inverting of (resp., ), and there is no efficient way of doing so directly. This difficulty could result in inefficiency of the ADM greatly. As a result, first-order algorithms that preserves both the alternating computation feature of ADM and the advantages of not involving the inverse of and are highly desirable.
In [18, 19], an alternating proximal gradient method (APGM) for solving (1) was given. APGM is based on the framework of ADM, which solves the first two subproblems ((4a) and (4b)) inexactly by taking one proximal gradient step. More specifically, APGM generates the new iterate via the following procedure:where , , and .
In this paper, we also focus on the specific scenario of (1), where both and are not identity matrices. We propose a new first-order primal-dual algorithm for (1). We show that our algorithm can be viewed as a customized proximal point method with a special proximal regularization parameter, and it is within the framework of the contraction type methods. The proposed algorithm has three main advantages: first, it can be easily implementable, the main computational effort of each iteration is to evaluate the proximal operators of the component objective functions; second, the involved subproblems are solved consecutively in the ADM manner, which makes the algorithm amenable to distributed optimization. Finally, it only uses the first-order information and does not require any matrix inversion or solving linear systems. Hence, the method is well suited for solving large scale distributed optimization problems.
The paper is organized as follows. In Section 2, we review some preliminaries. In Sections 3 and 4, we present the new method and analyze its convergence. In Section 5 we conduct numerical experiments to compare the proposed method with APGM for solving the stable principal component pursuit problem. We conclude the paper in Section 6.
2.1. Variational Characterization of (1)
In this section, we reformulate (1) as a variational form, which is useful for succedent algorithmic illustration and convergence analysis.
The Lagrangian function associated with (1) is where is a Lagrangian multiplier. According to the previous convex assumption of (1), finding optimal solutions of (1) and its dual form is equivalent to finding a saddle point of . More precisely, let be a saddle point of . We have Then we can directly read off the optimality conditions with variational characterization. More specifically, is a saddle point of if and only if it satisfies the following mixed variational inequality (VI): with Hence, a solution of ((9a)–(9c)) yields a solution of (1).
Note that the mapping is said to be monotone with respect to if Consequently, it can be easily verified that is monotone. Under the aforementioned nonempty assumption on the solution set of (1), the solution set of ((9a)–(9c)), denoted by , is also nonempty.
2.2. Proximal Point Algorithmic Framework
PPA, which was proposed by Martinet  and further studied by Rockafellar , plays a fundamental rule in optimization. For given iterate , PPA generates the new iterative via the following procedure: where is a positive definite matrix, playing the role of proximal regularization parameter. A simple choice of is that where and is the identity matrix, regularizing the proximal terms in the uniform way. We refer the reader to example [21–24] for some special choices of in different scenarios.
3. The Main Algorithm
Before we present our new algorithm, we need to make the following assumption:
Assumption 1. For any give and , the proximal operator of (see (5)) has a closed-form solution or it can be efficiently solved up to a high precision.
Whenever the assumption holds, we say that the proximal operator of is “easy” to evaluate. Note that is separable across two variables, that is, ; according to the definition (5), we have where and . Hence, under the assumption, the proximal operators of the component objective functions and are also “easy” to evaluate.
The motivation for our algorithm is directly related to the linearized augmented Lagrangian method proposed in  and the customized proximal point algorithm proposed in  for convex problems with linear constraints.
In order to obtain a closed-form solution of , we first add a proximal regularization parameter to the variational characterization of the problem ((9a)–(9c)). Then, like the algorithm in , we get the following proximal point algorithm: However, (13b) is still not implementable. Inspired in , in order to alleviate the computation required by -subproblems, we try to linearize the quadratic term in (13b) by where is a proximal parameter, and is the gradient of at . With a simple manipulation, we get the following approximation to (13b): In this case, the closed-form solutions of the resulting subproblems can be easily obtained.
3.3. Description of the Algorithm
In this section, we formally present Algorithm 1.
Remark 2. From the algorithm, we can see that the minimizations (() and ()) each requires the evaluation of the proximal operators for the component objective functions and . It is clear that the implementation of the proposed method is simple under our assumption.
4. Global Convergence
In this section, we show that the proposed method is in some sense equivalent to the proximal point algorithm with a special proximal regularization parameter. Then its convergence can be easily established under the analytic framework of contraction type methods.
Lemma 3. Let be generated by (()–()) from a given and . Then
Proof. First, by deriving the optimality condition for the subproblem (), we have It can be further rewritten as Similarly, the optimality condition for the subproblem () is It can be further rewritten as Substituting () into (19), we have In addition, it follows from () that Combining (18), (21), and (22) together, we getwith . Using the notation of , the assertion (16) is proved.
Note that when and , , is a positive definite matrix. Lemma 3 implies that the proposed algorithm can be viewed as a customized version of the PPA, where is the metric proximal parameter. Henceforth, we denote the proposed algorithm as CPPA. Now we state and prove the contractive property of our algorithm.
Lemma 4. Suppose the condition holds. Then, for any , the sequence generated by (()–()) satisfies the following inequality: where the norm is defined as .
Proof. Note that ; it follows from (16) that which can be further written as On the other hand, using the fact that is a monotone operator, we have Combining (28) and (27), we have Since is a solution of (9a) and , we have Thus, we get Since is positive definite under the requirements of the parameters (24), (31) can be further written as On the other hand, Inserting (32) into (33), we obtain This completes the proof.
Corollary 5. Let be an arbitrary point in , and let the sequence be generated by (()–()). Then(1)the sequence is bounded;(2)the sequence is nonincreasing;(3).
We are now ready to prove the global convergence of the new method.
Proof. It follows from Corollary 5 that
Since is bounded, it has at least one cluster point. Suppose has a subsequence that converges to . It follows from (23) that
Consequently, we have
which implies that .
Because , for any given , there exists such that Since , for the given above, there is an integer , such that, Therefore, for any , it follows from (38) and (39) that This implies that the sequence converges to .
5. Application to the Stable Principal Component Pursuit Problem
In this section, we apply the proposed method to solve the stable principal component pursuit problem with nonnegative constraints (SPCP). The problem tested is from Example 2 of . Our code was written in Matlab R2009b and all experiments were performed on a laptop with Intel Core 2 Duo @ 2.0 GHz CPU and 2 GB of memory.
Let be a given observation matrix, where is a low-rank and non-negative matrix, is a sparse matrix, is a noise matrix. SPCP arising from image processing seeks to recover and by solving the following nonsmooth convex optimization problem: where is the nuclear norm (defined as the sum of all singular values), and denote the norm and the Frobenius norm of a matrix, respectively, and is the indicator function for the nonnegative orthant . In the above model, denotes the noise matrix and and are some fixed parameters.
Following the procedure described in , by introducing an auxiliary variable and grouping and as one big block and grouping and as another big block , model (41) can be easily reformulated as which fits the setting of (1). The proposed algorithm (()–()) is therefore applicable for (42), and we obtain the following iterative scheme: The main advantage of CPPA applied to SPCP is that the generated minimizations in ((43)–(46)) are all simple enough to have closed-form solutions. For completeness, we elaborate on the strategy of solving the resulted subproblems at each iteration.(i)The -subproblem (43) amounts to evaluate the proximal operator of the nuclear norm function and is given by the matrix shrinkage operation: where the matrix shrinkage operator is defined as and is the SVD of matrix .(ii)The closed-form solution of -subproblem (44) can be given by the shrinkage operation: where the shrinkage operator is defined as (iii)The -subproblem (45) amounts to projecting the matrix onto the Euclidean ball , whose closed-form solution is given by (iv)The -subproblem (46) amounts to projecting the matrix onto the the nonnegative orthant, whose closed-form solution is given by: For detailed analytical methods of ((43)–(46)), the reader is referred to, for example, [19, 26]. Thus, the simplicity assumption (5) holds for the application (41).
For the numerical experiments, we follow  to randomly generate the data of (41). We consider the scenario of . For given , , we generate , where and are independent full row rank matrices whose elements are independently and identically (i.i.d.) uniformly distributed in . Note that in this experiment, is a component-wise nonnegative and low-rank matrix to be recovered. The support of the sparse matrix was chosen uniformly and randomly, and the nonzero entries in are generated i.i.d. uniformly in . The entries of matrix for noise were generated as i.i.d. Gaussian with standard deviation ; we set .
In the following, we compare CPPA with APGM in [18, 19], since they are all designed based on the “simple” assumption and share the same conditions for convergence. For the penalty parameter , we take . For other individual parameters required by these methods, we choose , , and . To implement all the compared methods, the initial iterate for both CPPA and APGM is chosen , , . The stopping criterion is set as where is the tolerance set as . We denoted so that the rank of is and so that the cardinality of is .
Some preliminary numerical results are reported in Table 1. Since they are synthetic examples with random date, for each scenario we test for 10 times, and the results were averaged over ten runs. Specifically, we reported the number of iteration (Iter.), relative error of the low-rank matrix , relative error of the sparse matrix , and CPU times in seconds (CPU(s)), where the relative errors are defined as