Research Article | Open Access
Stochastic Block-Coordinate Gradient Projection Algorithms for Submodular Maximization
We consider a stochastic continuous submodular huge-scale optimization problem, which arises naturally in many applications such as machine learning. Because the data are high-dimensional, computing the whole gradient vector can become prohibitively expensive. To reduce the complexity and memory requirements, we propose a stochastic block-coordinate gradient projection algorithm for maximizing continuous submodular functions, which chooses a random subset of the coordinates of the gradient vector and updates the estimates along the positive gradient direction. We prove that the estimates of all nodes generated by the algorithm converge to some stationary points with probability 1. Moreover, we show that the proposed algorithm achieves the tight approximation guarantee after iterations for DR-submodular functions by choosing appropriate step sizes. Furthermore, we also show that the algorithm achieves the tight approximation guarantee after iterations for weakly DR-submodular functions with parameter by choosing diminishing step sizes.
In this paper, we focus on submodular function maximization, which has recently attracted significant attention in academia because submodularity is a crucial concept in combinatorial optimization. Submodular functions arise in a variety of areas, such as the social sciences, algorithmic game theory, signal processing, machine learning, and computer vision. They have also found many applications in applied mathematics and computer science, such as probabilistic models [1, 2], crowd teaching [3, 4], representation learning , data summarization , document summarization , recommender systems , product recommendation [9, 10], sensor placement , network monitoring [12, 13], the design of structured norms , clustering , dictionary learning , active learning , and utility maximization in sensor networks .
In submodular optimization, many polynomial-time algorithms exist for exactly minimizing submodular functions, such as combinatorial algorithms [19–21]. Many polynomial-time algorithms also exist for approximately maximizing submodular functions with approximation guarantees, such as local search and greedy algorithms [22–25]. Despite this progress, these methods rely on combinatorial techniques, which have some limitations . For this reason, a new approach based on the multilinear relaxation , which lifts submodular optimization problems into the continuous domain, has been proposed. Continuous optimization techniques can then be used to minimize submodular functions exactly or to maximize them approximately in polynomial time. Much of the recent literature is devoted to continuous submodular optimization [28–31]. The algorithms cited above need to compute all the (sub)gradients.
However, the computation of all (sub)gradients can become prohibitively expensive when dealing with huge-scale optimization problems, where the decision vectors are high-dimensional. For this reason, the coordinate descent method and its variants have been proposed for efficiently solving convex optimization problems . At each iteration, coordinate descent methods choose only one block of variables when updating their decision vectors. Thus, they reduce the memory and complexity requirements of each iteration when dealing with high-dimensional data. Furthermore, coordinate descent methods have been applied to support vector machines , large-scale optimization problems [34–37], protein loop closure , regression , compressed sensing , etc. In coordinate descent methods, the main search strategies are cyclic coordinate search [41–43] and random coordinate search [44–46]. In addition, asynchronous coordinate descent methods have also been proposed in recent years [47, 48].
Despite this progress, however, stochastic block-coordinate gradient projection methods for maximizing submodular functions have barely been investigated. To fill this gap, we propose a stochastic block-coordinate gradient projection algorithm to solve stochastic continuous submodular optimization problems, which are introduced in . In order to reduce the complexity and memory requirements at each iteration, we incorporate the block-coordinate decomposition into the stochastic gradient projection in the proposed algorithm. The main contributions of this paper are as follows:
(i) We propose a stochastic block-coordinate gradient projection algorithm for maximizing continuous submodular functions. In the proposed algorithm, each node chooses a random subset of the whole approximate gradient vector and updates its decision vector along the gradient ascent direction.
(ii) We show that each node asymptotically converges to some stationary points under the stochastic block-coordinate gradient projection algorithm; i.e., the estimates of all nodes converge to some stationary points with probability 1.
(iii) We investigate the convergence rate of the stochastic block-coordinate gradient projection algorithm with approximation guarantee. When the submodular functions are DR-submodular, we prove that the convergence rate of is achieved with approximation guarantee. More generally, we show that the convergence rate of is achieved with approximation guarantee for weakly DR-submodular functions with parameter .
The remainder of this paper is organized as follows. We describe mathematical background in Section 2. We formulate the problem of our interest and propose a stochastic block-coordinate gradient projection algorithm in Section 3. In Section 4, the main results of this paper are stated. The detailed proofs of the main results of the paper are provided in Section 5. The conclusion of the paper is presented in Section 6.
2. Mathematical Background
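Given a ground set $V$, which consists of $n$ elements. If a set function $f: 2^V \to \mathbb{R}_+$ satisfies
$$f(A) + f(B) \ge f(A \cup B) + f(A \cap B) \tag{1}$$
for all subsets $A, B \subseteq V$, then the set function $f$ is called submodular. The notion of submodularity is mostly used in the discrete domain, but it can be extended to the continuous domain . Given a subset $\mathcal{X}$ of $\mathbb{R}_+^n$, $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, where each set $\mathcal{X}_i$ is a subset of $\mathbb{R}_+$ and is compact. A continuous function $F: \mathcal{X} \to \mathbb{R}_+$ is called a submodular continuous function if, for all $x, y \in \mathcal{X}$, the following inequality
$$F(x) + F(y) \ge F(x \vee y) + F(x \wedge y) \tag{2}$$
holds, where $x \vee y := \max(x, y)$ (coordinate-wise) and $x \wedge y := \min(x, y)$ (coordinate-wise). Moreover, if $x \le y$ implies $F(x) \le F(y)$ for all $x, y \in \mathcal{X}$, then the submodular continuous function is called monotone on $\mathcal{X}$. Furthermore, a differentiable submodular continuous function $F$ is called DR-submodular if, for all $x, y \in \mathcal{X}$ such that $x \le y$, we have $\nabla F(x) \ge \nabla F(y)$; i.e., $\nabla F(\cdot)$ is an antitone mapping . When the continuous function $F$ is twice differentiable, it is submodular if and only if all off-diagonal components of its Hessian matrix are nonpositive ; i.e., for all $x \in \mathcal{X}$,
$$\frac{\partial^2 F(x)}{\partial x_i \, \partial x_j} \le 0, \quad \forall i \ne j. \tag{3}$$
Furthermore, if the submodular function $F$ is DR-submodular, then all second derivatives are nonpositive ; i.e., for all $x \in \mathcal{X}$,
$$\frac{\partial^2 F(x)}{\partial x_i \, \partial x_j} \le 0, \quad \forall i, j. \tag{4}$$
In addition, twice differentiability implies that the submodular function is smooth . Moreover, we say that a submodular function $F$ is $L$-smooth if, for all $x, y \in \mathcal{X}$, we have
$$\|\nabla F(x) - \nabla F(y)\| \le L \|x - y\|. \tag{5}$$
Note that the above definition implies
$$F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle - \frac{L}{2}\|y - x\|^2. \tag{6}$$
Furthermore, a function $F$ is called a weakly DR-submodular function with parameter $\gamma$ if
$$\gamma = \inf_{x, y \in \mathcal{X},\, x \le y} \; \inf_{i \in \{1, \dots, n\}} \frac{[\nabla F(x)]_i}{[\nabla F(y)]_i}. \tag{7}$$
More details about weakly DR-submodular functions are available in .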
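As an illustration of the second-order conditions above, the following minimal sketch checks them numerically for a quadratic function F(x) = ½ xᵀHx + hᵀx. The quadratic form, the matrix `H`, the vector `h`, the test points, and all helper names are assumptions chosen for illustration; they do not come from the paper.

```python
import numpy as np

def is_submodular_hessian(H, tol=1e-12):
    """True if all off-diagonal Hessian entries are nonpositive."""
    off_diag = H - np.diag(np.diag(H))
    return bool(np.all(off_diag <= tol))

def is_dr_submodular_hessian(H, tol=1e-12):
    """True if all Hessian entries (diagonal included) are nonpositive."""
    return bool(np.all(H <= tol))

# Hypothetical example: a symmetric, entrywise nonpositive Hessian.
H = np.array([[-1.0, -0.5], [-0.5, -2.0]])
h = np.array([3.0, 4.0])

def F(x):
    return 0.5 * x @ H @ x + h @ x

def grad(x):
    return H @ x + h

x = np.array([0.2, 0.7])
y = np.array([0.6, 0.1])
hi, lo = np.maximum(x, y), np.minimum(x, y)

# Lattice inequality: F(x) + F(y) >= F(max(x, y)) + F(min(x, y)).
lattice_gap = F(x) + F(y) - F(hi) - F(lo)

# Antitone gradient: lo <= hi coordinate-wise implies grad(lo) >= grad(hi).
antitone = bool(np.all(grad(lo) >= grad(hi)))
```

For a quadratic, the Hessian is the constant matrix `H`, so the entrywise sign test is exact; for a general twice-differentiable function the same test would have to hold at every point of the domain.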
3. Problem Formulation and Algorithm Design
In this section, we first describe the problem of our interest, and then we design an algorithm to efficiently solve the problem.
In this paper, we focus on the following constrained optimization problem: where denotes the constraint set, denotes an unknown distribution, and is a submodular continuous function for all . Moreover, we assume that the constraint set, , is convex, where each is a convex and closed set for all . The problem has recently been introduced in . In addition, we use the notation to denote the optimal value of for all , i.e., . Furthermore, the function is submodular because each function is a submodular continuous function for all .
To solve problem (8), projected stochastic gradient methods are a class of efficient algorithms . However, in this work we focus on the case in which the decision vectors are high-dimensional; i.e., the dimensionality of the vectors is large. Full gradient computations are then prohibitively expensive and become a computational bottleneck. Therefore, we propose a stochastic block-coordinate gradient method that combines the advantages of block-coordinate updates and stochastic gradients. We assume that the components of the decision variables are arbitrarily chosen but fixed for each processor. Furthermore, at each iteration, each processor randomly chooses a subset of the (stochastic) gradients rather than all of them. The detailed description of the proposed algorithm is as follows. Starting from an initial value , for , each updates its decision variable as where is the step size, denotes the Euclidean projection of on the set , are independent and identically distributed Bernoulli random variables with for all and , and denotes the unbiased estimate of the gradient , which denotes the -th coordinate in .
We introduce the following matrix. Therefore, we can write relation (9) more compactly aswhere , and . Note that the -th coordinate of is missing when at each iteration , and then the -th coordinate of is not updated. Therefore, a random subset of is updated at each iteration . In addition, we use the notation to denote a diagonal matrix with size ; i.e., , where .
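The update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the constraint set is taken to be the box [0, 1]^n so that the Euclidean projection reduces to coordinate-wise clipping, every coordinate is selected independently with the same probability `p`, and the function names (`sbcgp`, `stoch_grad`, `project_box`) and the noisy quadratic objective are hypothetical.

```python
import numpy as np

def project_box(x, lower=0.0, upper=1.0):
    """Euclidean projection onto the box [lower, upper]^n (coordinate-wise clip)."""
    return np.clip(x, lower, upper)

def sbcgp(stoch_grad, x0, steps, step_size, p, rng):
    """Stochastic block-coordinate gradient projection (ascent) on a box.

    stoch_grad(x, rng) returns an unbiased estimate of the gradient at x.
    step_size(t) returns the step size used at iteration t.
    p is the probability that a coordinate is selected at an iteration.
    """
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(steps):
        # For clarity the full estimate is formed and then masked; a practical
        # implementation would only compute the selected coordinates.
        g = stoch_grad(x, rng)
        # Bernoulli mask: q[i] is True when coordinate i is updated at this step.
        q = rng.random(x.shape[0]) < p
        x = project_box(x + step_size(t) * q * g)
        iterates.append(x.copy())
    return iterates

# Hypothetical test problem: F(x) = 0.5 x^T H x + h^T x with noisy gradients.
H = np.array([[-1.0, -0.5], [-0.5, -2.0]])
h = np.array([3.0, 4.0])

def stoch_grad(x, rng):
    return H @ x + h + 0.1 * rng.standard_normal(x.shape[0])

rng = np.random.default_rng(0)
iterates = sbcgp(stoch_grad, [0.0, 0.0], steps=500,
                 step_size=lambda t: 0.1, p=0.5, rng=rng)
x_final = iterates[-1]
```

Coordinates whose mask entry is zero are left unchanged, because a point already inside the box is its own projection. In this toy problem the gradient is positive throughout the box, so the iterates move to the upper corner.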
Let denote the history information of all random variables generated by the proposed algorithm (11) up to time . In this paper, we adopt the following assumption on the random variables , which is stated as follows.
Assumption 1. For all , the random variables and are independent of each other. Furthermore, the random variables are independent of and for any decision variables .
In addition, we assume that the function and the sets satisfy the following conditions.
Assumption 2. Assume that the following properties hold:
(a) The constraint set is convex, and each set is convex and closed for all .
(b) The function is monotone and weakly DR-submodular with parameter over .
(c) The function is differentiable and -smooth with respect to norm .
Next, we make the following assumption about stochastic oracle .
Assumption 3. Assume that the stochastic oracle satisfies the following conditions: and The above assumption implies that the stochastic oracle is an unbiased estimate of .
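A consequence of combining an unbiased oracle with an independent Bernoulli selection is that the masked gradient has mean equal to the true gradient scaled coordinate-wise by the selection probabilities. The following Monte Carlo sketch illustrates this; the Gaussian noise model, the quadratic objective, the probabilities in `p`, and the function name `masked_oracle_samples` are assumptions made for illustration only.

```python
import numpy as np

def masked_oracle_samples(grad_x, p, noise_std, n_samples, rng):
    """Draw n_samples of q * g for a fixed point.

    g is the true gradient grad_x plus zero-mean Gaussian noise (unbiased),
    and q is a Bernoulli(p) mask drawn independently of g.
    """
    d = grad_x.shape[0]
    g = grad_x + noise_std * rng.standard_normal((n_samples, d))
    q = rng.random((n_samples, d)) < p
    return q * g

# Hypothetical quadratic objective with gradient H x + h.
H = np.array([[-1.0, -0.5], [-0.5, -2.0]])
h = np.array([3.0, 4.0])
x = np.array([0.3, 0.6])
grad_x = H @ x + h

p = np.array([0.3, 0.8])  # per-coordinate selection probabilities
rng = np.random.default_rng(0)
draws = masked_oracle_samples(grad_x, p, 1.0, 200_000, rng)

empirical_mean = draws.mean(axis=0)
expected_mean = p * grad_x  # independence gives E[q_i g_i] = p_i * dF/dx_i
```

The masked direction is therefore a scaled, not an exact, estimate of the gradient, which is why selection probabilities typically appear in guarantees for block-coordinate methods.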
In summary, this section formulated the optimization problem, designed an optimization method to solve it, and stated the standard assumptions used to analyze the performance of the proposed method.
4. Main Results
In this section, we first establish the convergence of the proposed algorithm. To this end, we introduce the definition of a stationary point, which is defined as in .
Definition 4. For a vector and a function , if , then is a stationary point of over .
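A stationarity condition of this kind can be checked numerically. The sketch below assumes the usual first-order condition for maximization over a convex set, namely that the maximum of ⟨∇F(x), y − x⟩ over feasible y is at most zero, and specializes it to the box [0, 1]^n, where the maximizing y has a closed form. Whether this matches the exact form used in Definition 4 is an assumption, and the function name and test objective are hypothetical.

```python
import numpy as np

def stationarity_gap_box(g, x, lower=0.0, upper=1.0):
    """Return the max over y in the box of <g, y - x>.

    The value is nonnegative for feasible x and equals zero exactly when x
    satisfies the first-order stationarity condition.
    """
    # A linear function over a box is maximized at the upper bound where the
    # coefficient is positive and at the lower bound otherwise.
    y = np.where(g > 0, upper, lower)
    return float(g @ (y - x))

# Hypothetical quadratic objective with gradient H x + h.
H = np.array([[-1.0, -0.5], [-0.5, -2.0]])
h = np.array([3.0, 4.0])

def grad(x):
    return H @ x + h

corner = np.array([1.0, 1.0])
origin = np.array([0.0, 0.0])
gap_corner = stationarity_gap_box(grad(corner), corner)
gap_origin = stationarity_gap_box(grad(origin), origin)
```

In this example the gradient at the upper corner is positive in both coordinates, so no feasible direction improves the objective to first order and the gap vanishes there, whereas the gap at the origin is strictly positive.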
From Definition 4, the convergence of our proposed algorithm is given in the following theorem.
Theorem 5. Let Assumptions 1–3 hold. Assume that the set of stationary points is nonempty and . Moreover, the sequence is generated by the stochastic block-coordinate gradient projection algorithm (11). Then, the iterative sequence converges to some stationary point with probability 1.
The proof can be found in the next section. The above result shows that the iterates converge to some stationary point with probability 1.
Furthermore, when the function is differentiable and DR-submodular, we have the following result.
Theorem 6. Let Assumptions 1–3 hold. Moreover, assume in (7) and . The sequence is generated by the stochastic block-coordinate gradient projection algorithm (11). Furthermore, the random decision variable is picked by choosing , with probability and the other variables with probability . Then, for any random variable for , we havewhere , .
The proof can be found in the next section. From the above result, we can see that an objective value in expectation can be obtained after iterations of the stochastic block-coordinate gradient projection algorithm (11) for any initial value. Moreover, the objective value is at least for any DR-submodular function.
In addition, when the function is a weakly DR-submodular function with parameter , we also obtain the following result.
Theorem 7. Let Assumptions 1–3 hold. The sequence is generated by the stochastic block-coordinate gradient projection algorithm (11) with . Furthermore, the random decision variable is picked by choosing in with probability . Then, for any for , we havewhere , .
The proof can be found in the next section. Note that the stochastic block-coordinate gradient projection algorithm yields an objective value after iterations from any initial value. Furthermore, the expectation of the objective value is at least for any weakly DR-submodular function.
5. Performance Analysis
In this section, the detailed proofs of main results are provided. We first analyze the convergence performance of the stochastic block-coordinate gradient projection algorithm.
Proof of Theorem 5. By the Projection Theorem , we have for all . Therefore, let and in inequality (16); we obtain where we have used relation (11). By simple algebraic manipulations, we obtain Furthermore, when for any , at each iteration . Therefore, we have From the above relation, we also obtain Plugging relation (20) into inequality (18), we have Taking conditional expectation in (21), we have where we have used in the last inequality. In addition, since the function is -smooth, we have Taking conditional expectation on in (23) and using relation (22), we obtain for step-size . For brevity, let . Inequality (24) implies that From the definition of , we have for ; i.e., the sequence of random variables is nonnegative for all . Therefore, according to the Supermartingale Convergence Theorem , we can see that the sequence is convergent with probability 1. Furthermore, we also have with probability 1. From relation (11), inequality (26) implies that with probability 1, where . Therefore, we obtain that with probability 1. Thus, there exists a subsequence , which converges to . Then, we have Since the gradient projection operation is continuous, we have with probability 1. The above relation implies that with probability 1. Then, relation (31) implies that . Therefore, is a stationary point of over with probability 1. This completes the proof of the theorem.
Lemma 8. For all , we havefor any diagonal matrix .
The next lemma is due to , which is stated as follows.
Lemma 9. Assume that a function is submodular and monotone. Then, we havefor any points .
In addition, we also have the following lemma.
Proof. In inequality (21), we let , where denotes the diagonal matrix with the -th entry equal to 1 and the other entries equal to 0. Then, we have Furthermore, the above relation implies that Therefore, for any , we obtain where in the last inequality we have used . Since for all , setting and following from relation (23), we have where the last inequality is obtained by using inequality (37), Young’s inequality, and the fact that when for all . Moreover, for all . Rearranging the terms in (38) yields the lemma.
Proof. From the result in Lemma 10, we have In addition, it follows from Lemma 8 that which implies that Combining inequalities (40) and (42), we obtain where the last inequality is due to . Taking the conditional expectation of the above inequality on , we obtain Thus, by some algebraic manipulations, inequality (39) is obtained.
Next, we start to prove Theorem 6.
Proof of Theorem 6. Setting in Lemma 11, where is the globally optimal solution for problem (8), i.e., , we have Since taking conditional expectation of (46) with respect to , we have which implies that Setting and in Lemma 9 and taking conditional expectation on , we obtain Thus, plugging inequality (49) into relation (48), we get Taking expectation in (50) and using some algebraic manipulations, we have where we have used the relation to obtain the first inequality. Summing both sides of (51) for , we obtain where in the last inequality we have used the fact that and for all . On the other hand, we also have where and the last inequality is due to (52). Since , we have Plugging the above inequality into (53) and dividing both sides by , where we have used the fact that in the last inequality. Furthermore, the above inequality implies that In addition, the sample is obtained for by choosing , with probability and the other decision vectors with probability ; we have