Distributed Stochastic Subgradient Projection Algorithms Based on Weight-Balancing over Time-Varying Directed Graphs
We consider a distributed constrained optimization problem over graphs, where the cost function of each agent is private. Moreover, we assume that the graphs are time-varying and directed. To address this problem, we propose a fully decentralized stochastic subgradient projection algorithm over time-varying directed graphs. Because the graphs are directed, the weight matrix may not be doubly stochastic; we overcome this difficulty by using a weight-balancing technique. By choosing appropriate step-sizes, we show that the iterations of all agents asymptotically converge to some optimal solution. Further, by our analysis, the convergence rate of our proposed algorithm is under local strong convexity, where is the number of iterations. In addition, under local convexity, we prove that our proposed algorithm can converge with rate . Finally, we verify the theoretical results through simulations.
In this paper, we focus on distributed constrained optimization problems, which have arisen in many applications, for instance, large-scale machine learning [1–4], resource allocation [5, 6], sensor networks [7–10], and multiagent systems . To address such problems, distributed optimization algorithms must be designed. The goal is to minimize the sum of the cost functions of all agents over a network, where each agent knows only its own information and can receive information from its neighbors.
Distributed optimization algorithms were originally introduced in the seminal work  and have attracted a vast literature in recent years [13–19]. In this literature, distributed (sub)gradient methods are used over networks, and the performance limitations and convergence rates of these algorithms are well understood. In addition, distributed Newton methods and other descent methods are used to solve distributed optimization problems [20, 21], and their rates of convergence are also analyzed.
However, the methods cited above assume that information exchange among agents takes place over either fixed or undirected graphs. Nevertheless, in some communication networks such as mobile sensor networks, the communication between agents is unidirectional because different agents have different interference and noise patterns and information is broadcast at different power levels, so the communication links between agents are directed in these networks [22–24]. Hence, a directed network topology is a natural assumption. In addition, a time-varying communication network topology is also a valid assumption in wireless networks, where each agent can move or the communication links may be destroyed randomly. For these reasons, we assume that these network topologies can be modeled as time-varying directed graphs. In this paper, we assume that each agent knows its out-degree at each round; this assumption cannot be removed . To obtain knowledge of the out-degree, agents can bidirectionally exchange “Hello” messages in a single communication process.
The recent works [22, 23] provide subgradient-push distributed algorithms to minimize the cost function over time-varying directed networks in discrete time. Similarly,  proposes a distributed subgradient algorithm that uses weight-balancing in discrete time. In addition,  considers a continuous-time optimization algorithm over time-varying directed networks. The distributed algorithms in [22, 24] converge with rate , whereas in , the algorithm converges with rate under the assumption that the local cost functions are differentiable and strongly convex. Nevertheless, these works consider the unconstrained optimization problem. In , the authors propose the D-DPS algorithm over a directed network with convergence rate . In contrast, we assume in this paper that the graph is both time-varying and directed. To overcome the asymmetry caused by directed graphs, we employ the weight-balancing technique and propose a distributed stochastic subgradient projection algorithm based on it. When each local cost function is strongly convex, our proposed algorithm is asymptotically convergent with rate , even if each agent has access only to its own noisy subgradient. Besides, our proposed algorithm also asymptotically converges with rate for generally convex cost functions. The best previous convergence rate of is achieved in a centralized way, where the local functions are strongly convex and differentiable and the variance of the noisy subgradient is bounded [28, 29]; our convergence results are quite close to it. However, we need not assume that the local cost functions are differentiable, nor do we assume that they have Lipschitz continuous gradients.
Our goal is to design a distributed optimization algorithm based on weight-balancing over time-varying directed networks and to analyze its properties. This work makes the following contributions:
(i) We propose a distributed stochastic subgradient projection algorithm based on weight-balancing over time-varying directed networks. Each local cost function is private to its agent, so each agent utilizes only its own private information. Moreover, a noisy subgradient of the local cost function is known to each agent for . In addition, the algorithm is implemented without any centralized control: no agent needs to know the network topology, and each agent only needs to know its out-degree at each round.
(ii) Under some standard assumptions, we show that our proposed algorithm asymptotically converges to some optimal solution.
(iii) For strongly convex cost functions, we prove that the convergence rate of is achieved. In addition, we also show that the convergence rate is for locally convex cost functions.
Organization. In Section 2, the constrained optimization problem is described, we also give some assumptions, and then a distributed stochastic subgradient projection algorithm is proposed over time-varying directed networks. We state the main results of this paper in Section 3. We provide the proofs of the main results in Section 4. In Section 5, simulations are also presented. The conclusion of the paper is provided in Section 6.
Notation. We use lowercase boldface to denote the vectors in and use lowercase normal font to denote scalars or vectors, which are not -dimensional vectors. For instance, denotes a vector in at agent at round , while the notation is a scalar in . The vector such as in is obtained by stacking all scalars for . In addition, we use the notation “” to denote the transpose operation. is the Euclidean norm of . Besides, the notation is the -norm of . The notation denotes a -dimensional vector whose elements are all 1, and is the identity matrix of size . denotes the expectation operator. Besides, the notation means that a vector is projected onto the constraint set . The notation denotes the Kronecker product operator.
2. Problem Setup, Algorithm Description, and Assumptions
We use a graph to model a network, which consists of agents. Moreover, each agent represents a node. Further, we also consider the case that network topology is time-varying and directed. Hence, we use the notation to denote network topology at each round , where denotes the agent set and denotes the directed edge set. means that agent can send message to agent at round . If two agents can directly exchange information, then we say the agents are neighboring. Furthermore, we use and to represent the set of out-neighbors and the set of in-neighbors of agent at round , respectively. Formally, and , respectively. Besides, we denote the out-degree and in-degree by and , respectively.
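The neighbor sets and degrees described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `neighbor_sets`, the reading of an edge `(i, j)` as "agent i can send a message to agent j", and the example edge list are all hypothetical, and no self-loops are added.

```python
def neighbor_sets(num_agents, edges):
    """Return (out_neighbors, in_neighbors) for one round's directed edge list.

    Assumption: an edge (i, j) means agent i can send a message to agent j.
    """
    out_nbrs = {i: set() for i in range(num_agents)}
    in_nbrs = {i: set() for i in range(num_agents)}
    for i, j in edges:
        out_nbrs[i].add(j)  # j hears from i, so j is an out-neighbor of i
        in_nbrs[j].add(i)   # and i is an in-neighbor of j
    return out_nbrs, in_nbrs

# Hypothetical edge set for a single round on three agents.
edges_t = [(0, 1), (1, 2), (2, 0), (0, 2)]
out_nbrs, in_nbrs = neighbor_sets(3, edges_t)
out_degree = {i: len(s) for i, s in out_nbrs.items()}
in_degree = {i: len(s) for i, s in in_nbrs.items()}
```

In a time-varying network the edge list changes with the round, so the sets would be recomputed each round.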
In this paper, the constrained optimization problem is described as follows:where denotes local cost function of agent and denotes constraint set.
Our goal is to solve problem (1) in a cooperative and fully decentralized way over time-varying directed networks. Each local cost function is known only to the corresponding agent, while all agents know the constraint set . Moreover, each agent can share its own iteration with its out-neighbors.
In this paper, we assume that the network topology is time-varying and directed. A directed graph may cause asymmetry; to overcome it, we employ the weight-balancing technique. Following , we define balancing weights over time-varying directed networks as follows.
Definition 1 (balancing weights). The weight of agent , , balances a time-varying directed network at round if, for any agent , the agent weight satisfies where is the index of relationships of neighbors at time .
From Definition 1, the total weight incoming from agent (which is ) is equal to the total outgoing weight of agent (which is ) at round over a time-varying directed network.
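A small numerical check may make the balancing condition concrete. The sketch below assumes the condition reads "weight of agent i times its out-degree equals the sum of the weights of its in-neighbors", which is how the sentence above is interpreted here; the function name, the edge convention `(i, j)` = "i sends to j", and the example graphs are hypothetical.

```python
def is_weight_balanced(num_agents, edges, weights, tol=1e-9):
    """Check w_i * d_i^out == sum of w_j over in-neighbors j, for every agent i.

    Assumption: this is the balancing condition of Definition 1.
    """
    out_deg = [0] * num_agents
    incoming = [0.0] * num_agents
    for i, j in edges:            # edge (i, j): agent i sends to agent j
        out_deg[i] += 1
        incoming[j] += weights[i]
    return all(
        abs(weights[i] * out_deg[i] - incoming[i]) <= tol
        for i in range(num_agents)
    )

# On a directed 3-cycle, uniform weights already balance the graph.
cycle = [(0, 1), (1, 2), (2, 0)]
cycle_ok = is_weight_balanced(3, cycle, [1.0, 1.0, 1.0])

# Adding the edge (0, 2) breaks uniform weights; [1, 1, 2] restores balance.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
uniform_ok = is_weight_balanced(3, edges, [1.0, 1.0, 1.0])
rebalanced_ok = is_weight_balanced(3, edges, [1.0, 1.0, 2.0])
```

The check only verifies a given weight vector; how agents compute such weights in a distributed way is the subject of the update rule given later in this section.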
In order to solve problem (1) over a time-varying directed graph , we first give some standard assumptions as follows.
Assumption 2. Assume that the time-varying directed network sequence is strongly connected at each round .
Assumption 2 ensures that each agent can receive information from the other agents at round in network .
Assumption 3. Let the constraint set be closed and convex. In addition, assume that each local cost function is convex with , for all , i.e., for , each local cost function satisfies where denotes a (sub)gradient of at . If , is -strongly convex. Otherwise, is convex.
Assumption 4. We assume that the subgradient of is uniformly bounded over for . Namely, , for all .
Next, we describe our proposed optimization algorithm, which is executed over a time-varying directed network. Let be the iteration of agent at round . The iteration is updated as follows: for all , where denotes a step-size sequence and we use to abbreviate the notation , which represents a noisy subgradient of at . According to (4)-(5), each agent first linearly fuses its own estimate with the estimates of its in-neighbors and then moves the result in the direction opposite to its own noisy subgradient. Finally, a new estimate of agent is obtained by projecting the updated estimate onto the constraint set . Moreover, the above update equations can be executed by simple broadcast communication.
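The three steps just described (fuse, step against a noisy subgradient, project) can be illustrated with a minimal NumPy sketch. It is not the paper's update (4)-(5): the mixing matrix `A` is a hypothetical uniform averaging matrix rather than one built from balancing weights, the constraint set is assumed to be a box, the subgradient is evaluated at the fused point by assumption, and the quadratic local costs, noise model, and step-size 1/(t+1) are chosen only for illustration.

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d, a closed convex set."""
    return np.clip(x, lo, hi)

def one_round(X, A, noisy_subgrad, alpha, lo, hi):
    """One fuse / subgradient / project round; row i of X is agent i's estimate."""
    fused = A @ X  # linear fusion of own and neighbors' estimates
    G = np.stack([noisy_subgrad(i, fused[i]) for i in range(X.shape[0])])
    return project_box(fused - alpha * G, lo, hi)

def run(noise_scale, rounds, seed=0):
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0], [1.0], [2.0]])  # f_i(x) = 0.5 * (x - c_i)^2
    A = np.full((3, 3), 1.0 / 3.0)             # hypothetical mixing matrix
    X = np.array([[-1.0], [0.0], [3.0]])       # initial estimates inside the box

    def noisy_subgrad(i, x):
        # Zero-mean, bounded noise added to the exact gradient x - c_i.
        noise = rng.uniform(-noise_scale, noise_scale, size=x.shape)
        return (x - centers[i]) + noise

    for t in range(rounds):
        X = one_round(X, A, noisy_subgrad, 1.0 / (t + 1), lo=-1.0, hi=3.0)
    return X

X_final = run(noise_scale=0.1, rounds=500)
```

In this toy problem the sum of the local costs is minimized at 1, so with the noise switched off the three estimates approach 1 as the step-size decays.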
From (5), we need to make some assumptions about the noisy subgradient. Specifically, we describe the noisy subgradient as follows:where denotes a stochastic subgradient error and denotes a subgradient of at . Let denote all the information generated by the distributed stochastic subgradient projection algorithms (4)-(5) for all . Hence, the assumption for stochastic subgradient error is as follows.
Assumption 5. For every , , we assume that the stochastic subgradient error at round is a random variable with . Moreover, assume that are independent of each other. Assume that the noise-norm at round is uniformly bounded, i.e., , where is a positive scalar.
In our proposed algorithm, is weight of agent at round . As with , every agent updates its weight over time-varying directed networks as follows:
In this paper, let the optimal set of the proposed algorithms be nonempty, which can be defined as follows: where .
First, we formally introduce the constrained optimization problem in this section, and then we also give some valid assumptions. Simultaneously, a distributed stochastic subgradient projection algorithm is proposed to solve constrained optimization problem (1) over time-varying directed networks. The main results of the paper are presented in the next section.
3. Main Results
We first present the asymptotic convergence of the distributed stochastic subgradient projection algorithms (4)-(5) with appropriately chosen step-sizes. Specifically, the result is as follows.
Theorem 6. Under Assumptions 2–5, let the optimal set be nonempty. Moreover, for the positive step-size , which satisfies decay conditions (26), let sequences , , be generated by algorithms (4)-(5). Then, each sequence converges to some optimal solutions in with probability 1 for all . Namely, for all , holds with probability 1.
By Theorem 6, the iterations asymptotically converge to some optimal solutions over time-varying directed networks. Namely, by our proposed algorithm, we can obtain some optimal solutions with probability 1.
We now state the convergence rate of our proposed algorithm. Under different assumptions about the local cost functions , we establish different convergence rates. To this end, we first introduce the weighted average of the estimate sequence , which is defined as for all . Hence, for all , we have the following recursive relation: for all , where , for all , and .
From Theorem 7, we establish the convergence rate when local cost functions are strongly convex. Moreover, converges to for any agent with probability 1. Further, following from (12), the numerator has an or a constant and the denominator has a . Therefore, our algorithm converges to some optimal solutions with probability 1 at rate of .
Theorem 8 gives the convergence rate of our proposed algorithm for local convex functions, which is . Further, by Theorems 7 and 8, we can see that the time-varying and directed network topology does not affect the convergence rate.
In this section, we show that our proposed algorithm is asymptotically convergent. For strongly convex functions, we derive a convergence rate for our algorithm over time-varying directed networks. Furthermore, we also show that our algorithm converges with rate under general convexity. In the next section, we give the detailed proofs of the main results.
4. Analysis of Convergence Results
For the sake of analysis, we first describe the scalar version of (4)-(5), where the variables and are scalar variables, for all . Thus, the estimate of agent is updated byfor and all , where and denotes a constraint set.
To facilitate analysis, we can rewrite algorithms (14)-(15) in a compact form, namely,where the matrix is defined as follows: for any and , , at each round . Moreover, we also define the following variables and :
Hence, let , which can be referred to as the perturbation. Then, following from (19)-(20), we have for any and with . Furthermore, we introduce the following matrix: for any and with , and . Hence, from (21), we have for any and with . Moreover, let for convenience.
Since the matrices and are crucial in the convergence analysis of our proposed algorithm, we first provide some properties of these matrices.
Lemma 9. We assume that Assumption 2 holds. Then, for any , the matrix is column stochastic. Moreover, for all , there exist some positive constants and , and the matrix satisfiesfor any and with .
In addition, we also give the properties of the projection operator . Following from , we have the following.
Lemma 10. Assume that constraint set is closed and convex in . Furthermore, we assume that is nonempty. Thus, we conclude that for any ,
(1) , for all
(2) for any
(3) , for all
(4) , for all .
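As a numerical illustration of the kind of property projections onto closed convex sets enjoy, the sketch below checks nonexpansiveness, a standard fact about Euclidean projection. Whether it corresponds to one of the items listed in Lemma 10 cannot be confirmed from the text here; the unit-ball constraint set and the function name `project_ball` are assumptions made only for this example.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the closed ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

# Empirically compare ||P(x) - P(y)|| with ||x - y|| on random pairs.
rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    x = rng.normal(size=3) * 3.0
    y = rng.normal(size=3) * 3.0
    projected_gap = np.linalg.norm(project_ball(x) - project_ball(y))
    original_gap = np.linalg.norm(x - y)
    worst_ratio = max(worst_ratio, projected_gap / original_gap)
```

The ratio never exceeds 1, which is what nonexpansiveness predicts: projecting two points cannot move them farther apart.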
We also present some auxiliary results, which are used in the subsequent proofs.
Lemma 11 (see ). We assume that is a scalar for every positive integer .
(i) If , then we have for any .
(ii) If and , then, for , we have .
(a) For , we conclude that
(b) If for , we obtain
(c) If , then, we obtainfor all .
Proof. (a) From (23) and (24), we have where we use Hölder’s inequality to obtain the first inequality and Lemma 9 to obtain the last inequality. Hence, the conclusion of part (a) is obtained.
(b) Since , following from the preceding relation in part (a) and letting , thus Since for , so as . Hence, by Lemma 11(i), we obtain Thus, we conclude the statement of (b).
(c) By the decay conditions of , we obtain thatSince , we can see that is finite. Hence, following from Lemma 11(ii), the sum is finite. Hence, the relation in part (c) is obtained.
Informally, the distributed stochastic subgradient algorithm based on weight-balancing ensures that all track the running average with a geometric rate .
By arguments similar to those for (19)-(20), we rewrite algorithms (4)-(5) as where and Moreover, let , for all , which is referred to as the perturbation vector. Hence, we introduce the following vector: Note that the vectors , , , , and are in . Besides, the following corollary is obtained immediately from Lemma 12(a).
Corollary 13. For every agent , , the vector variables and in are given in (4)-(5); then, we have For any vector, the 1-norm is greater than or equal to the standard Euclidean norm. Therefore, from Lemma 12(a), which is applied to each coordinate of , Corollary 13 holds immediately.
Proof. Following from Lemma 10(4), we have Furthermore, by Assumptions 4 and 5, we have for all . Hence, we obtain with probability 1 Further, following from the inequality , we obtain According to the definition of the perturbation vector , we have Therefore, the lemma is proved.
To establish the convergence properties of our algorithm, an auxiliary variable is first defined, for all , as follows:
We now establish the recurrence relation for . Hence, for all and , where , we introduce a vector such that . Following from algorithms (4)-(5), we obtain for where is a vector in , which is defined as , for all and .
Since the matrix is column stochastic, we have, for all ,We follow from the definition of ; then,
We also establish the following lemma, which is crucial in our proofs.
Lemma 15 (basic iteration relation). Let the estimate sequence , , be generated by (4)-(5). Under Assumptions 2–5, for all vectors and , we obtain with probability 1 where and are defined in Assumptions 2–3, respectively. The constant is defined in (3).
Proof. By (48), we have Taking the expectation of both sides of (50) with respect to and using (6) and the fact , we have that Next, we bound the term in (52) as follows: where we use the inequality to obtain the above relation. In order to obtain the upper bound of , we need to bound the term in (53). By arguments similar to those in Lemma 14, we have for all . Hence, combining (53) and (54), we obtain Substituting (55) into (52), we have We next estimate the term in (56). By (3) in Assumption 3, we have that Further, we also obtain where we use the Cauchy-Schwarz inequality to obtain the last inequality. Moreover, the term in (57) can be rewritten as By using Assumptions 3 and 4, we have that Hence, combining (57), (58), and (60), we obtain Substituting (61) into (56), we have The lemma then follows from the definition of the global cost function and (62).
Lemma 16. We assume that , , , and are random sequences. Moreover, we further suppose that elements of these sequences are nonnegative random variables and with probability 1 satisfy the following relation: for . Further, if and hold, then, asymptotically converges to , where is a nonnegative random variable, i.e., . Moreover, we also have with probability 1.
Using the conclusion of Lemma 16, we have the following lemma.
Lemma 17. Consider an optimization problem . Assume that the set is nonempty, where is the optimal set. Let the function be continuous. Suppose that , , and are random sequences with nonnegative elements. Moreover, these sequences satisfy the following relation: for all and with probability 1, where for all . Further, we assume that and hold. Then, the random sequence converges to some optimal solution with probability 1.
Proof. Let and ; then, we have Thus, the conditions of Lemma 16 are satisfied. Hence, the sequence is convergent, and we have Since , with probability 1 we have Since the sequence is convergent for , the sequence is bounded. Therefore, there exists a convergent subsequence of the sequence that converges to some . Moreover, the subsequence satisfies By using continuity of , we have Following from (68), we can see that the random sequence asymptotically converges to some . Therefore, the random sequence converges to by setting .
Proof of Theorem 6. Since as , we obtain which follows from Lemma 10(2). Therefore, using (70), we have We also have where Assumptions 2 and 3 are used. Thus, following from the decay conditions of , we have for . Following from Lemma 10(3), we obtain Hence, we obtain Let for in Lemma 15, and we have According to the conclusion of Lemma 10(1), we obtain Hence, we have with probability 1 By using (75), we have Since and hold by the assumption of , the conditions of Lemma 17 hold. Following from Lemma 17, we can see that asymptotically converges to . Further, following from (71), also asymptotically converges to for all , i.e., with probability 1. Therefore, Theorem 6 is proved.
Now, we establish the following lemma, which is an important relation for the proof of Theorem 7.