Abstract
Distributed optimization is an important concept with applications in control theory and many related fields, as it is highly fault-tolerant and far more scalable than centralized optimization. Centralized solution methods are not suitable for many application domains that consist of a large number of networked systems. In general, these large-scale networked systems cooperatively find an optimal solution to a common global objective during the optimization process. This motivates an analysis of the distributed optimization techniques that such settings demand. This paper presents an overview of decomposition methods as well as the existing distributed methods and techniques employed in large-scale networked systems. A detailed analysis of gradient-like methods, subgradient methods, and methods of multipliers, including the alternating direction method of multipliers, is presented. These methods are analyzed empirically using numerical examples. Moreover, an example showing that the gradient method fails to solve distributed problems in some circumstances is discussed under the numerical results. A numerical implementation demonstrates that the alternating direction method of multipliers can solve this particular problem, revealing its robustness compared with the gradient method. Finally, we conclude the paper with possible future research directions.
1. Introduction
Optimization is a mathematical discipline that determines the best possible solution corresponding to the optimum performance of a quantitatively well-defined system. The theory of optimization has been established as a valuable tool used in a wide range of disciplines, such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, statistics, and finance [1–3]. In the recent study [4], novelty search, a tool used in evolutionary and swarm robotics, was developed for use in global optimization. Formally, a mathematical optimization problem can be posed as follows:
$$\text{minimize } f(x) \quad \text{subject to } x \in X, \tag{1}$$
where $f:\mathbb{R}^n \to \mathbb{R}$ is a real-valued objective function of the decision variables $x$ and $X \subseteq \mathbb{R}^n$.
However, in reality, it may be difficult or impossible to find analytic solutions to certain optimization problems. As a result, researchers have introduced iterative methods that provide approximate solutions. Algorithms for solving optimization problems have been extensively analyzed, mainly under centralized and decentralized architectures [5, 6]. Centralized solution methods are not suitable for many communication networking problems, such as large-scale and data-intensive problems that demand distributed solutions. Consequently, the application of distributed optimization techniques, where subsystems coordinate to find a solution to the original problem, is of utmost importance. Intranets, the Internet, telecommunication networks, aircraft control systems, sensor networks, and electronic banking are some important examples of distributed systems. These systems consist of a large number of smaller subsystems that work together to reach an optimal state of the process. This optimal state in large-scale networked systems needs to be achieved without incurring errors and without exceeding the time limits set for the expected outcomes. Therefore, the study of well-established theoretical concepts together with empirical implementations of distributed optimization is critical. This gives us an opportunity to analyze existing distributed techniques and methods. In general, we may have many subsystems in a distributed optimization setting. We consider the following optimization problem with five subsystems as an example to provide a deeper explanation of distributed optimization:where are subsets of . In this problem, we can observe that there are three complicating variables , and . The variable is shared by subsystems , and 4, the variable is shared by subsystems 2 and 5, while the variable is shared among subsystems 4 and 5.
Figure 1 shows the associated decomposition structure of (2), and the related distributed problem can be stated as follows:
Here, we can observe that problem (3) is minimized by multiple users cooperatively. Hence, a distributed method is required to find a solution.
Many networked systems cannot communicate exact information between subsystems due to unavoidable errors that may occur as a result of limited communication bandwidths and sometimes due to measurement errors [7, 8]. Therein lies the importance of analysing quantized distributed methods in real life situations [9–13]. Although many quantized distributed methods have been analyzed, deeper investigation of quantization methods is still required.
The rest of the paper is organized as follows. In Section 2, we discuss the preliminaries related to distributed optimization and to primal and dual decomposition. Section 3 provides a general literature review of existing well-known distributed optimization methods. Next, in Sections 4, 5, and 6, we discuss the gradient method, the subgradient method, and the alternating direction method of multipliers (ADMM), respectively. In those sections, we discuss the theoretical concepts of the relevant methods as well as previous studies performed on them. In Section 7, we continue our discussion with distributed optimization under noise, to emphasize the importance of accounting for errors in distributed optimization methods. In Section 8, we provide our numerical results to discuss the convergence of the aforementioned distributed methods. Finally, in Section 9, we conclude the paper with possible future research directions.
2. Preliminaries
In this section, we discuss the concept of distributed optimization and introduce primal decomposition and dual decomposition, which play an important role in distributed optimization. Our introduction to primal and dual decomposition is mainly inspired by the lecture notes on decomposition methods by Boyd et al. [14]. Throughout the paper, we use the following notation.
Notation. We let $\mathbb{R}$, $\mathbb{R}^n$, and $\mathbb{R}^n_{+}$ represent the set of real numbers, the $n$-dimensional Euclidean space, and the positive orthant in the $n$-dimensional Euclidean space, respectively. For $x \in \mathbb{R}^n$, $\|x\|$ denotes the Euclidean norm and $P_X(x)$ denotes the projection of $x$ onto the set $X$. The set of $m \times n$ matrices is denoted by $\mathbb{R}^{m \times n}$. The transpose of a matrix $A$ is given by $A^T$. $\nabla f$ represents the gradient of a scalar-valued function $f$.
2.1. Distributed Optimization
Distributed optimization is an optimization process that is used in networked systems with a large number of users. This process enables the system to solve a global problem cooperatively even if there is no central controller available in the system. When compared with centralized techniques, distributed optimization has many considerable advantages. In distributed algorithms, nodes or users in the network share information only with necessary parties. This fact improves cyber security and reduces communication cost. Furthermore, distributed techniques have the ability to handle problems even if the problem size is very large. These techniques also have the potential to increase the solution speed [15].
Distributed optimization algorithms solve large-scale and data-intensive problems in a wide range of application areas, such as communications [16–19], electricity grids [20, 21], large-scale multiagent systems [22, 23], smart grids, wireless sensor networks [24], and statistical learning. Zhang and Sahraei-Ardakani have developed a fully distributed DC optimal power flow method that incorporates flexible transmission and have discussed the effect of communication limitations on its convergence properties [25, 26]. In [27], the authors have presented a study on finite-time consensus opinion dynamics and an application to distributed optimization over digraphs.
Many distributed optimization algorithms are built on decomposition methods. Decomposition is an approach to solving a global problem by breaking it up into smaller subproblems and solving each of them separately. These subproblems are solved either in parallel or sequentially [6, 14, 28–30]. Decomposition in optimization appears in early work on large-scale linear programs from the 1960s [31]. The simplest decomposition structure arises in block separable problems. For example, a block separable problem can be given as follows:
$$\text{minimize } f_1(x_1) + f_2(x_2) \quad \text{subject to } x_1 \in C_1,\ x_2 \in C_2. \tag{4}$$
In this form, we can minimize $f_1$ and $f_2$ separately, in parallel, and obtain the optimal value and optimal solution. However, this case is trivial, as many real-life problems appear in a more complex form [14]. The problem becomes more complicated, and more interesting, when the subvectors $x_1$ and $x_2$ are coupled. This situation can be handled by primal decomposition and dual decomposition, which are the best-known decomposition methods currently available.
2.2. Primal Decomposition
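Primal decomposition deals with complicating variables. Here, we consider a constrained minimization problem that consists of $N$ users as follows:
$$\text{minimize } \sum_{i=1}^{N} f_i(x_i, y) \quad \text{subject to } x_i \in X_i, \tag{5}$$
where $i = 1, \ldots, N$, and the $f_i$s represent real-valued objective functions of individual users. Here, the variable $y$ is called the complicating variable, since it couples the users. When $y$ is fixed, problem (5) decomposes into $N$ smaller subproblems.
The subproblems are as follows:
$$\underset{x_i \in X_i}{\text{minimize}}\ f_i(x_i, y), \quad i = 1, \ldots, N. \tag{6}$$
Then, with $\phi_i(y)$ denoting the optimal value of the $i$th subproblem in (6), the original problem (5) is equivalent to the problem
$$\underset{y}{\text{minimize}}\ \sum_{i=1}^{N} \phi_i(y), \tag{7}$$
and this is called the master problem in primal decomposition [14]. The original problem (5) can therefore be solved by solving the master problem (7) with a distributed algorithm, under some well-defined assumptions on the individual primal objective functions $f_i$.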
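As a rough illustration of the master/subproblem structure described above, the following Python sketch applies primal decomposition to a small two-user instance. The objective functions, the absence of constraints, the step size, and all function names are assumptions chosen for illustration and are not taken from [14]. Each user solves its own subproblem in closed form for a fixed complicating variable y and reports the derivative of its optimal value; the master problem is then updated by a plain gradient step.

```python
# Illustrative two-user instance (an assumption, not from the paper):
#   f1(x1, y) = (x1 - y)^2 + (x1 - 1)^2
#   f2(x2, y) = (x2 - y)^2 + (x2 + 3)^2
# There are no constraints, so each subproblem has a closed-form solution.


def solve_subproblem_1(y):
    """Return argmin over x1 of f1(x1, y) and the derivative of phi_1 at y."""
    x1 = (y + 1.0) / 2.0        # from d f1 / d x1 = 0
    dphi = -2.0 * (x1 - y)      # d f1 / d y evaluated at the minimizer
    return x1, dphi


def solve_subproblem_2(y):
    """Return argmin over x2 of f2(x2, y) and the derivative of phi_2 at y."""
    x2 = (y - 3.0) / 2.0        # from d f2 / d x2 = 0
    dphi = -2.0 * (x2 - y)      # d f2 / d y evaluated at the minimizer
    return x2, dphi


def primal_decomposition(y0=0.0, step=0.25, iterations=60):
    """Gradient method on the master problem: minimize phi_1(y) + phi_2(y)."""
    y = y0
    for _ in range(iterations):
        _, d1 = solve_subproblem_1(y)   # the two subproblems are independent
        _, d2 = solve_subproblem_2(y)   # and could be solved in parallel
        y -= step * (d1 + d2)
    x1, _ = solve_subproblem_1(y)
    x2, _ = solve_subproblem_2(y)
    return x1, x2, y


x1_opt, x2_opt, y_opt = primal_decomposition()
```

For this instance the master objective is $(y-1)^2/2 + (y+3)^2/2$, so the iterates approach $y = -1$, with $x_1 = 0$ and $x_2 = -2$.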
2.3. Dual Decomposition
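Here, we consider the same problem (5) discussed under primal decomposition, but with only two users. Then, the objective function is $f_1(x_1, y) + f_2(x_2, y)$. Next, the problem can be rearranged by introducing new variables $y_1$ and $y_2$ as follows [14]:
$$\text{minimize } f_1(x_1, y_1) + f_2(x_2, y_2) \quad \text{subject to } x_1 \in X_1,\ x_2 \in X_2,\ y_1 = y_2. \tag{8}$$
According to this new arrangement, the objective function is separable. Next, we can apply decomposition to its dual problem. The Lagrangian of (8) is given by
$$L(x_1, y_1, x_2, y_2, \lambda) = f_1(x_1, y_1) + f_2(x_2, y_2) + \lambda^T (y_1 - y_2). \tag{9}$$
Next, the related dual function is given by
$$g(\lambda) = g_1(\lambda) + g_2(\lambda), \tag{10}$$
which is accompanied by the subproblems
$$g_1(\lambda) = \inf_{x_1 \in X_1,\, y_1} \left( f_1(x_1, y_1) + \lambda^T y_1 \right), \qquad g_2(\lambda) = \inf_{x_2 \in X_2,\, y_2} \left( f_2(x_2, y_2) - \lambda^T y_2 \right). \tag{11}$$
Then, the dual problem of (8) is given by
$$\text{maximize } g(\lambda) = g_1(\lambda) + g_2(\lambda). \tag{12}$$
This is called the master problem in dual decomposition. It can be solved by an iterative method such as the subgradient method, which is discussed in Section 5. Although we are able to solve the dual problem and find dual optimal values, we still cannot guarantee that we can recover primal optimal values without imposing suitable conditions on the primal objective function. For example, if $f_1$ and $f_2$ are strictly convex, then the primal variables $x_1$, $y_1$, $x_2$, and $y_2$ found by solving the two subproblems in (11) are guaranteed to converge to the optimal solution of the primal problem (8) [14].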
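The following Python sketch illustrates dual decomposition on a small two-user instance. The objective functions, the lack of constraints on $x_1$ and $x_2$, the sign convention $\lambda (y_1 - y_2)$ for the coupling term, the step size, and the function names are all assumptions made for illustration. Each user minimizes its own part of the Lagrangian for a given price $\lambda$, and the master problem performs gradient ascent on the dual function using the disagreement $y_1 - y_2$.

```python
# Illustrative instance (an assumption, not from the paper):
#   f1(x1, y1) = (x1 - y1)^2 + (x1 - 1)^2
#   f2(x2, y2) = (x2 - y2)^2 + (x2 + 3)^2
# coupled through the consistency constraint y1 = y2.


def user_1(lam):
    """argmin over (x1, y1) of f1(x1, y1) + lam * y1, in closed form."""
    x1 = 1.0 - lam / 2.0
    y1 = x1 - lam / 2.0
    return x1, y1


def user_2(lam):
    """argmin over (x2, y2) of f2(x2, y2) - lam * y2, in closed form."""
    x2 = lam / 2.0 - 3.0
    y2 = x2 + lam / 2.0
    return x2, y2


def dual_decomposition(lam0=0.0, step=0.25, iterations=60):
    """Gradient ascent on the dual function g(lam) = g1(lam) + g2(lam)."""
    lam = lam0
    for _ in range(iterations):
        _, y1 = user_1(lam)
        _, y2 = user_2(lam)
        lam += step * (y1 - y2)   # y1 - y2 is the gradient of g at lam
    x1, y1 = user_1(lam)
    x2, y2 = user_2(lam)
    return lam, x1, y1, x2, y2


lam_opt, x1_d, y1_d, x2_d, y2_d = dual_decomposition()
```

Because both functions in this instance are strictly convex, the primal values recovered from the subproblems agree at the dual optimum: $\lambda = 2$, $y_1 = y_2 = -1$, $x_1 = 0$, and $x_2 = -2$.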
3. A General Literature Review on Distributed Methods for Solving Optimization Problems
In this section, we provide a general overview of existing distributed optimization methods. A detailed analysis with more technical details is given in later sections. Most existing studies of distributed optimization analyze the problem, and discuss the related solution methods, under the assumption that the optimization problem is convex. Convex optimization problems can be solved very reliably and efficiently using interior-point methods, and much of the theory of convex optimization is already well developed. Therefore, recognizing or formulating a problem as a convex optimization problem gives us a great advantage. The texts [5, 6] provide readers with a solid background for developing a working knowledge of convex optimization in order to recognize, formulate, and solve convex optimization problems. For example, even for a nonconvex constrained optimization problem, the negative of the associated dual function is always convex. Hence, in some situations, the original problem can be solved through the dual problem, which is easier to work with because of this convexity.
We have observed that the currently available state-of-the-art distributed methods for solving optimization problems are gradient-based algorithms, subgradient-based algorithms, and their variants, such as ADMM [30, 32–38]. The gradient method is generally applied to unconstrained optimization problems. In 1970, Ramsay studied gradient methods for optimizing nonlinear functions of several variables that cause difficulties when second-derivative approaches are used [39]. In the recent study [40], Nedić et al. have focused on solving a distributed convex optimization problem using "push-pull gradient methods." The name reflects the fact that the agents in the network push gradient information to their neighbors, while the decision variable information is pulled by the neighbors. In [41], Calamai and Moré have studied the convergence properties of the projected gradient method for linearly constrained problems, which are useful in large-scale problems. The projected gradient method is a variant of the gradient method that is used in constrained optimization.
The subgradient method can be considered a generalization of the gradient method and is useful for optimizing nondifferentiable functions. In [9–12, 22, 42], subgradient methods are used to solve large-scale distributed problems that deal with the sum of a large number of convex local objective functions. References [24, 43–45] focus on the effects of constraints and present projected subgradient algorithms for solving constrained optimization problems. In [44], Amini and Yousefian have studied an important class of bilevel convex optimization problems that are often used for large-scale data processing in machine learning and neural networks. The authors in [45] have studied the binary iterative hard thresholding algorithm, a state-of-the-art recovery algorithm in one-bit compressive sensing that makes use of the projected subgradient method.
ADMM is also well suited to distributed convex optimization over large-scale networked systems arising in statistics and machine learning. ADMM was first proposed by Gabay, Mercier, Glowinski, and Marrocco [46] in the mid-1970s. In the recent study [47], Xiao et al. have presented a distributed and scalable algorithm for managing residential demand response programs using ADMM. They have shown through simulation studies that the proposed method can reduce customers' electricity bills and peak load. The authors in [48] have presented a distributed ADMM for solving the direct current dynamic optimal power flow problem with carbon emission trading. In [49], Hajinezhad and Shi have proposed an ADMM-related algorithm to study a class of nonconvex nonsmooth optimization problems with bilinear constraints, which are widely used in machine learning and signal processing. The study [50] has presented a modified distributed ADMM to handle nonconvex optimization problems with discrete control variables.
4. The Gradient Method
Let us consider an unconstrained minimization problem as follows:
$$\text{minimize } f(x), \tag{13}$$
where $f:\mathbb{R}^n \to \mathbb{R}$ is differentiable and $x \in \mathbb{R}^n$. Then, the gradient method for solving optimization problems of form (13) can be expressed by the following iterative process, which starts from some initial point $x_0 \in \mathbb{R}^n$:
$$x_{k+1} = x_k - \gamma_k \nabla f(x_k), \tag{14}$$
where $\gamma_k > 0$ is the step size. The convergence of method (14) can be discussed under various assumptions, using the theorems presented in [51].
Theorem 1 (see [51]). Suppose that $\gamma_k = \gamma$ (a constant step size) in (14). Let $f$ be differentiable on $\mathbb{R}^n$, let $\nabla f$ be Lipschitz continuous with constant $L$, and let $f$ be strongly convex with constant $l$. Then, method (14) converges to the unique global minimum point $x^\star$ with the rate of a geometric progression when $0 < \gamma < 2/L$:
$$\|x_k - x^\star\| \leq c\, q^k, \quad 0 \leq q < 1. \tag{15}$$
Next, the following theorem shows the convergence of (14) for an even smaller class of functions.
Theorem 2 (see [51]). Let $f$ be strongly convex and twice differentiable. Suppose that
$$l I \preceq \nabla^2 f(x) \preceq L I, \quad l > 0, \quad \text{for all } x. \tag{16}$$
Then, for $0 < \gamma < 2/L$,
$$\|x_k - x^\star\| \leq \|x_0 - x^\star\|\, q^k, \quad q = \max\{|1 - \gamma l|,\ |1 - \gamma L|\} < 1. \tag{17}$$
Moreover, when $\gamma = 2/(L + l)$, $q$ is minimal and equal to $(L - l)/(L + l)$. The proofs of Theorems 1 and 2 are given in [51], and the convergence to a local minimum point of $f$ is also discussed in the same text under Theorem 4 of Section 1.4. We discuss the convergence of the gradient method using a numerical example in the numerical results section (Section 8). In Section 8.1, our discussion focuses on the convergence results obtained with primal decomposition.
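To make the constant-step results above concrete, the following Python sketch runs the gradient method on a small strongly convex quadratic. The test function, the starting point, and the names are assumptions chosen for illustration. Its Hessian has eigenvalues $l = 1$ and $L = 10$, so with step size $2/(L + l)$ the distance to the minimizer should shrink by a factor of at most $(L - l)/(L + l)$ per iteration, which is the rate stated for Theorem 2 in [51].

```python
import math

l_const, L_const = 1.0, 10.0     # smallest and largest Hessian eigenvalues
x_star = [1.0, 1.0]              # known minimizer of the illustrative f


def grad_f(x):
    # f(x) = 0.5 * (l * (x1 - 1)^2 + L * (x2 - 1)^2)
    return [l_const * (x[0] - 1.0), L_const * (x[1] - 1.0)]


def gradient_method(x0, step, iterations):
    """Constant-step gradient method; also records the distance to x_star."""
    x = list(x0)
    errors = [math.dist(x, x_star)]
    for _ in range(iterations):
        g = grad_f(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        errors.append(math.dist(x, x_star))
    return x, errors


step = 2.0 / (L_const + l_const)                  # step size 2 / (L + l)
q = (L_const - l_const) / (L_const + l_const)     # predicted contraction factor
x_final, errors = gradient_method([5.0, -3.0], step, 100)
```

Comparing `errors[k]` with `q ** k * errors[0]` gives a quick numerical check of the geometric rate on this instance.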
There are many early studies of gradient methods [39, 41, 52, 53]. The authors in [53] combined gradient methods with backpropagation for neural networks to discuss the optimization of the weights of multilayer neural networks. In the study [52], the authors have proposed two new step sizes for the classical steepest descent method, in which $\gamma_k$ in method (14) is chosen as $\gamma_k = \arg\min_{\gamma \geq 0} f(x_k - \gamma \nabla f(x_k))$. The most interesting fact about these new step sizes is that they require less computational effort than the classical steepest descent method. However, these studies have not paid enough attention to distributed optimization techniques, whose analysis has become crucial in many application domains.
Some recent work that relies on gradient methods can be found in [8, 40, 54, 55]. In these studies, the gradient method has been applied together with distributed techniques. In [8], the authors have investigated fundamental properties of distributed optimization based on gradient methods, where gradient information is communicated using a limited number of bits. Message exchange between subsystems is a common feature of distributed optimization settings. However, perfect message exchange is not possible because of the limited communication bandwidth between subsystems. Therefore, quantized information tends to be exchanged between users in networked systems, which has led to new findings on quantized distributed techniques. The study [8] is a very good initiative in this regard. It has studied a general class of quantized gradient methods, in which the gradient direction is approximated by a finite quantization set, to solve a constrained convex optimization problem. The optimization problems considered are of the form
$$\text{minimize } f(x) \quad \text{subject to } x \in X, \tag{18}$$
where $f:\mathbb{R}^n \to \mathbb{R}$ is convex and differentiable with an $L$-Lipschitz continuous gradient, $X \subseteq \mathbb{R}^n$ is a closed and convex set, and the optimal solution set is nonempty and bounded.
To solve problem (18), they have used the projected gradient method as follows:
$$x_{k+1} = P_X\left(x_k - \gamma_k d_k\right), \tag{19}$$
where $d_k$ is quantized gradient information coded using a limited number of bits and $P_X$ denotes the projection onto $X$. In this paper, the authors have proposed two types of quantization schemes, namely, binary quantization and proper quantization.
(a) Binary Quantization. In this quantization scheme, the quantization set is taken as $D = \{-1, 1\}^n$, where $d_k = \operatorname{sign}(\nabla f(x_k))$ is taken componentwise. A convergence proof of method (19) was given under this binary quantization when $X = \mathbb{R}^n$ and when $X = \mathbb{R}^n_{+}$. These convergence results are very important, as they cover dual problems of form (18) associated with equality and inequality constrained primal problems.
(b) Proper Quantization. When the binary quantization discussed above is used to solve TCP problems, the related quantized gradients are transmitted using $n$ bits. There are many applications where the dual problem is maintained by an individual coordinator [18, 19]. Therefore, it is worth analyzing whether fewer than $n$ bits can be used when an individual coordinator maintains the problem. This motivates the authors of [8] to discuss proper quantization. Here, we highlight the following two definitions that they have used to establish their results.
Definition 1 (see [8]). A finite set is a proper quantization for problem (18); if for every initialization in iterates (19), we can choose and .
Definition 2 (see [8]). The finite set is a if and , s.t , where represents the unit sphere in . It has been proved that is a proper quantization for the problem class (18), and the minimal proper quantization is [8].
The authors in [54] have introduced two measures of the communication complexity of dual decomposition, which help to identify the communication overhead required in limited communication networks. The first measure determines the smallest number of bits needed to find a solution within a given accuracy, while the second measure quantifies the best possible solution accuracy when a fixed number of bits is communicated. Furthermore, in the same work, the authors have studied a quantization scheme (introduced as the Primal-Feasible quantization scheme) that guarantees primal feasibility at each iteration of their method.
5. The Subgradient Method
The subgradient method is mainly used to minimize nondifferentiable convex functions. Nondifferentiable, or nonsmooth, functions form an important class that arises in many applications of mathematical programming, such as game theory, multicriteria models, nonlinear programming problems, optimal control problems with continuous or discrete time, and integer and mixed integer programming problems [56]. Subgradient methods are first-order methods. Their performance depends strongly on problem scaling and conditioning, whereas Newton's method and interior-point methods do not depend on problem scaling [57].
Before turning to subgradient methods, we discuss subgradients, which can be introduced as a generalization of gradients. When a function is nondifferentiable, its gradient at the nondifferentiable points cannot be determined uniquely. Therefore, a well-defined way to express the slope of the function at those points is required, mainly in optimization theory, and a good understanding of subgradients is essential in that field. Reference [56] gives a very good exposition of the concept of subgradients and provides many important theoretical results related to them. Polyak's text [51] and the text [6] of Bertsekas are two other good references that discuss subgradients and subgradient methods. Next, we define a subgradient of a convex function.
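Definition 3. A vector $g \in \mathbb{R}^n$ is a subgradient of $f:\mathbb{R}^n \to \mathbb{R}$ at $x$ if, for all $z \in \mathbb{R}^n$,
$$f(z) \geq f(x) + g^T (z - x). \tag{20}$$
The set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$ and is denoted by $\partial f(x)$. If $f$ is differentiable, then its subgradient at $x$ is unique and it is the gradient of $f$ at $x$.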
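As a small sanity check of the definition above, the following Python sketch tests the subgradient inequality $f(z) \geq f(x) + g\,(z - x)$ for the scalar function $f(x) = |x|$. The helper name and the grid of test points are assumptions for illustration. Passing the check on a finite grid is only a necessary condition, so the sketch is meant to build intuition rather than to certify a subgradient.

```python
def satisfies_subgradient_inequality(f, g, x, test_points, tol=1e-12):
    """Check f(z) >= f(x) + g * (z - x) at every test point z (scalar case)."""
    return all(f(z) >= f(x) + g * (z - x) - tol for z in test_points)


test_points = [i / 10.0 for i in range(-50, 51)]   # grid on [-5, 5]

# At the kink x = 0, every slope in [-1, 1] is a subgradient of |x|.
kink_inside = satisfies_subgradient_inequality(abs, 0.5, 0.0, test_points)
kink_outside = satisfies_subgradient_inequality(abs, 1.5, 0.0, test_points)

# At the smooth point x = 2, the only subgradient is the derivative, 1.
smooth_derivative = satisfies_subgradient_inequality(abs, 1.0, 2.0, test_points)
smooth_other = satisfies_subgradient_inequality(abs, 0.5, 2.0, test_points)
```

Here `kink_inside` and `smooth_derivative` evaluate to `True`, while `kink_outside` and `smooth_other` evaluate to `False`.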
5.1. The Basic Subgradient Method
We consider the same form of the unconstrained optimization problem (13) considered in Section 4. The objective function $f$ is still convex but not necessarily differentiable. Then, the subgradient method used to solve this problem can be given by the following iterative sequence starting at some initial point $x_0$:
$$x_{k+1} = x_k - \gamma_k g_k, \tag{21}$$
where $x_k$ is the $k$th iterate, $g_k$ is any subgradient of $f$ at $x_k$, and $\gamma_k > 0$ is the step size of the $k$th iteration. The subgradient method (21) can be considered an extension of the gradient method (14). The difference is that, in each iteration, we use a subgradient of the function at $x_k$ instead of $\nabla f(x_k)$ in (14). Moreover, the step size selection in the subgradient method is quite different from that in the gradient method. In [57], Boyd has given five basic step size rules, namely, constant step size, constant step length, square summable but not summable, nonsummable diminishing, and nonsummable diminishing step length. From these five step size rules, we present three common ones as follows:
(1) Constant step size: $\gamma_k = \gamma$ is a positive constant, independent of $k$.
(2) Square summable but not summable: the step sizes satisfy
$$\gamma_k \geq 0, \quad \sum_{k=0}^{\infty} \gamma_k^2 < \infty, \quad \sum_{k=0}^{\infty} \gamma_k = \infty. \tag{22}$$
For example, $\gamma_k = a/(b + k)$, where $a > 0$ and $b \geq 0$.
(3) Nonsummable diminishing: the step sizes satisfy
$$\gamma_k \geq 0, \quad \lim_{k \to \infty} \gamma_k = 0, \quad \sum_{k=0}^{\infty} \gamma_k = \infty. \tag{23}$$
For example, $\gamma_k = a/\sqrt{k}$, where $a > 0$.
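The three step size rules listed above can be compared on a toy problem. The following Python sketch is an illustration under assumptions: the scalar objective $f(x) = |x - 2| + 2|x + 1|$ (minimized at $x = -1$ with optimal value $3$), the rule parameters, the starting point, and all names are choices made here and are not taken from [57]. Since the subgradient method is not a descent method, the sketch keeps track of the best objective value found so far.

```python
import math


def f(x):
    return abs(x - 2.0) + 2.0 * abs(x + 1.0)


def sign(v):
    return (v > 0) - (v < 0)


def subgradient(x):
    # sign(0) = 0 lies in [-1, 1], so this is a valid subgradient at the kinks
    return sign(x - 2.0) + 2.0 * sign(x + 1.0)


STEP_RULES = {
    "constant": lambda k: 0.1,
    "square_summable": lambda k: 1.0 / (k + 1),
    "diminishing": lambda k: 1.0 / math.sqrt(k + 1),
}


def subgradient_method(step_rule, x0=5.0, iterations=2000):
    """Run the subgradient iteration and return the best value found."""
    x, f_best = x0, f(x0)
    for k in range(iterations):
        x = x - step_rule(k) * subgradient(x)
        f_best = min(f_best, f(x))
    return f_best


results = {name: subgradient_method(rule) for name, rule in STEP_RULES.items()}
```

Inspecting `results` shows how close each rule gets to the optimal value $3$ within the same iteration budget on this instance.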
The three step size rules above do not depend on quantities computed during the subgradient algorithm. This differs from the step size rules found in standard descent methods, which use the current point and the search direction. Good discussions of descent methods can be found in Chapter 9 of [5] and Chapter 8 of [58]. Many other choices of step size exist in addition to those mentioned above. In [51], Polyak has shown that the subgradient method (21) cannot converge rapidly under a diminishing nonsummable step size rule. Therefore, the author has described another variant of the subgradient method by introducing a different step size rule that depends on $f^\star$, the optimal value of $f$. We introduce this step size in Theorem 4.
Next, we discuss the convergence of the subgradient method (21) under Boyd's step size rules mentioned above. We use the following assumptions to discuss the convergence:
Assumption 1. The optimal set $X^\star$, the set of minimizers of problem (13), is nonempty.
Assumption 2. The norm of the subgradients, $\|g_k\|$, is bounded.
Assumption 3. A number $R$ s.t. $R \geq \|x_0 - x^\star\|$ is known, where $x^\star \in X^\star$ and $x_0$ is the initial point of the algorithm.
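Theorem 3 (see [57]). Let Assumptions 1, 2, and 3 hold and let $f_{\text{best}}^{(k)} = \min\{f(x_0), \ldots, f(x_k)\}$. Then, in method (21), the following inequality holds:
$$f_{\text{best}}^{(k)} - f^\star \leq \frac{R^2 + G^2 \sum_{i=0}^{k} \gamma_i^2}{2 \sum_{i=0}^{k} \gamma_i}, \tag{24}$$
where $R$ is s.t. $R \geq \|x_0 - x^\star\|$ and $G$ is s.t. $\|g_k\| \leq G$ for all $k$.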
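The standard bound for the basic subgradient method in [57] has the form $(R^2 + G^2 \sum_i \gamma_i^2) / (2 \sum_i \gamma_i)$, where $\gamma_i$ are the step sizes, $R$ bounds the initial distance to a minimizer, and $G$ bounds the subgradient norms. The following Python sketch evaluates this expression for three step size rules. The constants $R = G = 3$, the rule parameters, and the names are hypothetical values chosen for illustration. No optimization is run here; only the bound itself is computed, to show how it behaves as the number of iterations grows.

```python
import math


def subgradient_bound(steps, R, G):
    """Evaluate (R^2 + G^2 * sum(gamma_i^2)) / (2 * sum(gamma_i))."""
    return (R ** 2 + G ** 2 * sum(s * s for s in steps)) / (2.0 * sum(steps))


R, G = 3.0, 3.0    # hypothetical problem constants


def constant(k):
    return 0.1


def square_summable(k):
    return 1.0 / (k + 1)


def diminishing(k):
    return 1.0 / math.sqrt(k + 1)


bounds_by_K = {}
for K in (1_000, 100_000):
    bounds_by_K[K] = {
        rule.__name__: subgradient_bound([rule(k) for k in range(K)], R, G)
        for rule in (constant, square_summable, diminishing)
    }
```

With a constant step size $\gamma$, the expression levels off near $G^2 \gamma / 2$ (here $0.45$) instead of vanishing, while for the two diminishing rules it keeps decreasing as the iteration count $K$ grows.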
The proof of Theorem 3 can be found in Section 3.2 of [57]. Using this theorem, one can show that, for the constant step size and constant step length rules, the subgradient method converges to within some range of the optimal value $f^\star$. For the other step size rules (square summable but not summable, nonsummable diminishing, and nonsummable diminishing step lengths), the subgradient method converges to the optimal value exactly. We discuss the convergence of the basic subgradient method empirically in the numerical results section, using the three step size rules presented above. In Section 8.2, we use a constrained optimization problem and focus on convergence under dual decomposition. Next, we state the following theorem, which gives the convergence of the subgradient method using Polyak's step length.
Theorem 4 (see [51]). Let the set $X^\star$ of minimizers of problem (13) (with nondifferentiable $f$) be nonempty and let $\gamma_k = (f(x_k) - f^\star)/\|g_k\|^2$. Then, in method (21), $x_k \to x^\star \in X^\star$.
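Polyak's step length uses the optimal value of the objective: at an iterate $x_k$ with subgradient $g_k$, the step is $(f(x_k) - f^\star)/\|g_k\|^2$. The following Python sketch applies it to a toy problem. The objective $f(x) = |x_1| + 2|x_2|$ (with known optimal value $f^\star = 0$ at the origin), the starting point, and the names are assumptions chosen for illustration. The sketch also records the distance to the minimizer after every iteration.

```python
import math

F_STAR = 0.0     # optimal value, assumed known for Polyak's step


def f(x):
    return abs(x[0]) + 2.0 * abs(x[1])


def sign(v):
    return (v > 0) - (v < 0)


def subgradient(x):
    return [float(sign(x[0])), 2.0 * sign(x[1])]


def polyak_subgradient_method(x0, iterations=100):
    """Subgradient method with step (f(x_k) - f*) / ||g_k||^2."""
    x = list(x0)
    distances = [math.hypot(x[0], x[1])]     # distance to the minimizer (0, 0)
    for _ in range(iterations):
        g = subgradient(x)
        g_norm_sq = sum(gi * gi for gi in g)
        gap = f(x) - F_STAR
        if gap <= 0.0 or g_norm_sq == 0.0:   # already optimal
            break
        step = gap / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
        distances.append(math.hypot(x[0], x[1]))
    return x, distances


x_polyak, distances = polyak_subgradient_method([3.0, -2.0])
```

On this instance the recorded distances never increase, which is the usual motivation for this step length when $f^\star$ is available.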
The proof of Theorem 4 is given by Polyak in his book [51]. Now, we discuss and analyze some studies of subgradient methods. In [22], the authors have considered a subgradient method to optimize a sum of convex objective functions corresponding to multiple agents. This work analyzes large-scale networked systems, where it is essential to design decentralized resource allocation methods, since centralized solution methods are not suitable. The paper considers a scenario where $m$ agents cooperatively minimize a common additive cost. The corresponding optimization problem can be posed as follows:
$$\text{minimize } \sum_{i=1}^{m} f_i(x) \quad \text{subject to } x \in \mathbb{R}^n, \tag{25}$$
where the function $f_i:\mathbb{R}^n \to \mathbb{R}$ represents the cost function of agent $i$, which is convex and not necessarily differentiable, and $x \in \mathbb{R}^n$ is the decision vector. To analyze this problem, the authors have proposed the following subgradient method:
$$x^i(k+1) = \sum_{j=1}^{m} a^i_j(k)\, x^j(k) - \alpha^i(k)\, d_i(k), \tag{26}$$
where $a^i_j(k)$ represents the weight that agent $i$ assigns to the information $x^j(k)$ received from a neighboring agent $j$ and the scalar $\alpha^i(k) > 0$ represents the step size used by agent $i$. The vector $d_i(k)$ is a subgradient of agent $i$'s objective function $f_i$ at $x^i(k)$. Next, to analyze the convergence of method (26), they have used a different representation of the method, in which each iterate $x^i(k+1)$ is expressed in terms of the information and estimates available at earlier times. In this study, the authors have considered an unconstrained optimization problem, but in general the problem can be viewed in a more advanced setting, in the presence of constraints. This motivates extending the seminal work of Nedić and Ozdaglar along a different path of research, which will lead to a different line of convergence analysis. Furthermore, their model assumes that agents can exchange exact information, which is not possible in practice because of limited communication bandwidth. Therefore, the information is usually quantized before being sent, and quantization is considered to reduce the communication cost in networked control systems [59–61].
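The averaging-plus-subgradient structure of the distributed method discussed above can be sketched in a few lines of Python. Everything specific in this sketch is an assumption made for illustration: three agents with scalar local costs $f_i(x) = |x - c_i|$, fixed doubly stochastic weights, and a common diminishing step size $1/(k+1)$. The analysis in [22] covers more general, time-varying weights, so this is not the authors' exact setting. The sum of the local costs is minimized at the median of the $c_i$, here $x = 2$.

```python
def sign(v):
    return (v > 0) - (v < 0)


targets = [1.0, 2.0, 6.0]            # agent i holds f_i(x) = |x - targets[i]|
weights = [[0.50, 0.25, 0.25],       # fixed, doubly stochastic averaging weights
           [0.25, 0.50, 0.25],
           [0.25, 0.25, 0.50]]


def distributed_subgradient(x0, iterations=5000):
    """Each agent averages its neighbors' estimates, then takes a local
    subgradient step on its own cost function."""
    x = list(x0)
    m = len(x)
    for k in range(iterations):
        step = 1.0 / (k + 1)
        d = [sign(x[i] - targets[i]) for i in range(m)]   # local subgradients
        x = [sum(weights[i][j] * x[j] for j in range(m)) - step * d[i]
             for i in range(m)]
    return x


estimates = distributed_subgradient([1.0, 2.0, 6.0])
```

After the run, all three local estimates are close to one another and to the minimizer of the sum, even though no agent ever sees the other agents' cost functions.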
In [11], the authors have considered the distributed subgradient method discussed in [22] and have presented improved convergence results. Furthermore, using results from their prior work [62], they have shown that the upper bounds on the difference between the estimated objective function value and the exact optimal value of the problem have a polynomial dependence on the number of agents $m$. We can view these bounds as improved versions of the error bounds obtained in [22, 42], which involve an exponential dependence on $m$. Moreover, the authors have studied the subgradient method when the communicated information is quantized, to address the fact that perfect message exchange between agents cannot be performed. Some other works along the same line of research are [9, 10, 12].
5.2. Projected Subgradient Method
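The projected subgradient method is an extension of the basic subgradient method to constrained optimization problems. Consider an optimization problem of the form
$$\text{minimize } f(x) \quad \text{subject to } x \in X, \tag{27}$$
where $f:\mathbb{R}^n \to \mathbb{R}$ and $X \subseteq \mathbb{R}^n$ are convex. Then, the projected subgradient method can be given by
$$x_{k+1} = P_X\left(x_k - \gamma_k g_k\right), \tag{28}$$
where $g_k$ is any subgradient of $f$ at $x_k$ and $P_X$ denotes the projection onto $X$. Convergence of method (28) can be attained under the same step size rules described for the basic subgradient method [57].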
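A minimal Python sketch of the projected subgradient iteration is given below. The problem instance is an assumption chosen for illustration: minimize $f(x) = |x_1 - 0.5| + |x_2 - 2|$ over the unit Euclidean ball, with step size $1/(k+1)$. For this instance the constrained minimizer is $(0.5, \sqrt{0.75})$, which lies on the boundary of the ball at a point where $f$ is not differentiable.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def subgradient(x):
    # a subgradient of f(x) = |x1 - 0.5| + |x2 - 2|
    return [sign(x[0] - 0.5), sign(x[1] - 2.0)]


def project_onto_unit_ball(x):
    """Euclidean projection onto {x : ||x|| <= 1}."""
    norm = math.hypot(x[0], x[1])
    return list(x) if norm <= 1.0 else [xi / norm for xi in x]


def projected_subgradient(x0, iterations=5000):
    x = project_onto_unit_ball(x0)
    for k in range(iterations):
        step = 1.0 / (k + 1)                       # diminishing step size
        g = subgradient(x)
        x = project_onto_unit_ball([xi - step * gi for xi, gi in zip(x, g)])
    return x


x_proj = projected_subgradient([0.0, 0.0])
x_expected = [0.5, math.sqrt(0.75)]
```

The projection step keeps every iterate feasible, and the iterates settle near `x_expected` as the step size shrinks.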
The authors in [43] have presented distributed algorithms to solve a constrained consensus problem and a constrained optimization problem. They have used a distributed projected subgradient method to solve the constrained optimization problem, which consists of minimizing a sum of convex local objective functions. They have shown that their method converges to the optimal solution under the square summable but not summable step size rule. In [24], Madan and Lall have proposed two distributed projected subgradient methods to find an optimal routing flow that maximizes the network lifetime, in a partially and in a fully decentralized manner. In their solution, subgradient methods are applied to the dual problem. We have noticed that most studies of distributed optimization use the original primal objective function in the optimization process. They have not shown much interest in duality theory, which provides many advantages in solving constrained optimization problems. Against this background, the work of Madan and Lall [24] adds considerable value to the study of distributed optimization.
6. Alternating Direction Method of Multipliers
ADMM is a simple but powerful method used in distributed optimization [32]. ADMM is a variant of the augmented Lagrangian method and the method of multipliers that retains the decomposability of dual ascent. In [32], the augmented Lagrangian and the method of multipliers are discussed for the following equality constrained optimization problem:
$$\text{minimize } f(x) \quad \text{subject to } Ax = b, \tag{29}$$
where $x \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $f:\mathbb{R}^n \to \mathbb{R}$ is convex. Then, the augmented Lagrangian for problem (29) is given by
$$L_\rho(x, y) = f(x) + y^T (Ax - b) + \frac{\rho}{2} \|Ax - b\|^2, \tag{30}$$
where $\rho > 0$ is known as the penalty parameter. The corresponding dual function is given by $g_\rho(y) = \inf_x L_\rho(x, y)$. The authors have used the gradient method to minimize $-g_\rho$ with the penalty parameter $\rho$ as the step size. The method of multipliers can be viewed as a more robust version of the dual ascent method, and it converges under more general conditions than dual ascent. However, the concern raised in [32] is that "when $f$ is separable, $L_\rho$ is not separable." When $L_\rho$ is not separable, the minimization cannot be carried out in parallel, and hence the method of multipliers cannot be used for decomposition. Therefore, an alternative way of viewing problem (29) is needed, and ADMM has been introduced to address this issue. ADMM is well suited to distributed optimization settings that consist of large-scale problems. In [32], the authors have considered the following variation of problem (29), which puts it in separable form and leads to ADMM:
$$\text{minimize } f(x) + g(z) \quad \text{subject to } Ax + Bz = c, \tag{31}$$
where $x \in \mathbb{R}^n$, $z \in \mathbb{R}^m$, $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^p$. Moreover, $f$ and $g$ are convex functions. Then, the distributed algorithm for ADMM can be given using Algorithm 1.
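The alternating structure of ADMM can be sketched on a small separable problem. The following Python code is an illustration under assumptions and is not a transcription of Algorithm 1: it uses the instance $f(x) = \tfrac{1}{2}\|x - a\|^2$, $g(z) = \lambda \|z\|_1$, and the constraint $x - z = 0$ (that is, $A = I$, $B = -I$, $c = 0$), with the unscaled updates from the standard ADMM literature: minimize the augmented Lagrangian over $x$, then over $z$, then update the dual variable $y$ with step size $\rho$. The data vector, $\lambda$, $\rho$, and all names are hypothetical choices.

```python
def soft_threshold(v, kappa):
    """Minimizer over z of kappa * |z| + 0.5 * (z - v)^2."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def admm(a, lam=1.0, rho=1.0, iterations=100):
    """ADMM for: minimize 0.5 * ||x - a||^2 + lam * ||z||_1  s.t.  x - z = 0."""
    n = len(a)
    x, z, y = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iterations):
        # x-update: argmin_x 0.5 * (x - a)^2 + y * x + (rho / 2) * (x - z)^2
        x = [(a[i] - y[i] + rho * z[i]) / (1.0 + rho) for i in range(n)]
        # z-update: argmin_z lam * |z| - y * z + (rho / 2) * (x - z)^2
        z = [soft_threshold(x[i] + y[i] / rho, lam / rho) for i in range(n)]
        # dual update: ascent step of size rho on the residual x - z
        y = [y[i] + rho * (x[i] - z[i]) for i in range(n)]
    return x, z, y


x_admm, z_admm, y_admm = admm([3.0, -0.5, 1.2])
```

For this instance the solution is the soft-thresholded data vector, $(2, 0, 0.2)$, and both $x$ and $z$ approach it while the residual $x - z$ goes to zero. The $x$- and $z$-updates each touch only one of the two functions, which is what makes the scheme attractive for distributed settings.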
There are many early studies of the method of multipliers and ADMM [63, 64]. Some recent studies of ADMM can be found in [65–69]. In [65], Erseghe has proposed a fully distributed algorithm for optimal power flow using ADMM. In this paper, the author has introduced another variation on ADMM Algorithm 1 with assumptions such as , where is contained in a linear space with associated orthogonal projector and also with certain assumptions on initial choices. In the study [66], the authors have presented a decomposed solution approach with ADMM to solve a cost minimization problem, where the objective consists of energy and battery degradation costs. This work has used a modified version of ADMM, which helps to reduce the computational cost and ensures the stability of the solution. Most researchers who have worked on ADMM, including those mentioned above, have not considered the noise that can enter their models through the different types of errors occurring in practice, for example, due to limited communication bandwidth. This gap motivates further work on ADMM in this direction.

7. Distributed Optimization with Noise
Distributed methods for solving optimization problems can be applied in their pure form only if errors and inaccuracies are fully avoided, which is hardly possible in the real world. For example, errors or noise can arise from inexact computation or measurement of subgradients and function values, sparsification [70], and quantization [8, 71]. The noise can be deterministic or random, depending on the behaviour of the application domain. Most real-world problems involve large-scale networked systems whose subsystems interactively optimize a common objective function. In such situations, subsystems have to exchange their private information with neighboring subsystems during the optimization process. However, the subsystems may not be able to communicate exact information, for reasons such as security measures and communication overheads. Therefore, it is very important to analyze distributed methods with noise imposed on the system.
7.1. Distributed Methods with Noise for Optimizing Smooth Functions
In distributed methods for optimizing differentiable (smooth) functions, we always deal with the computation of a gradient. Instead of the exact value of the gradient $\nabla f(x^k)$, we may only have it computed with error,
$$s^k = \nabla f(x^k) + r^k, \tag{32}$$
where $r^k$ denotes the noise. In Chapter 4 of [51], Polyak discusses the four most important classes of noise:
(1) Absolute deterministic noise: $r^k$ is deterministic and satisfies the boundedness condition $\|r^k\| \le \varepsilon$
(2) Relative deterministic noise: $r^k$ is deterministic and satisfies the condition $\|r^k\| \le \alpha \|\nabla f(x^k)\|$
(3) Absolute random noise: $r^k$ is random, independent, centered, and has bounded variance, $\mathbb{E}[r^k] = 0$ and $\mathbb{E}\|r^k\|^2 \le \sigma^2$
(4) Relative random noise: $r^k$ satisfies the condition $\mathbb{E}\|r^k\|^2 \le \alpha \|\nabla f(x^k)\|^2$
In the above classes of noise, $\varepsilon$, $\alpha$, and $\sigma$ represent positive constants. The same text [51] discusses the convergence of the gradient method (14) when the gradient is computed with error as given in (32). The convergence properties of the gradient method are analyzed there under all four types of error mentioned above, under the assumption that the objective function is strongly convex and has a gradient satisfying a Lipschitz condition.
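To make the difference between absolute and relative deterministic noise concrete, the following Python sketch runs a gradient iteration with an additive error term on a hypothetical strongly convex quadratic. The objective 0.5*MU*x**2, the constants `MU`, `EPS`, and `ALPHA`, the step size, and the particular noise sequences are all illustrative assumptions, not values from [51].

```python
def noisy_gradient_descent(grad, noise, x0, step, iters):
    """Gradient iteration in which each exact gradient g is replaced by g + noise(k, x, g)."""
    x = x0
    for k in range(iters):
        g = grad(x)
        x = x - step * (g + noise(k, x, g))
    return x


MU = 2.0      # curvature of the assumed objective f(x) = 0.5 * MU * x**2
EPS = 0.1     # bound on the absolute noise, |r| <= EPS
ALPHA = 0.5   # relative noise level, |r| <= ALPHA * |gradient|, with ALPHA < 1


def grad(x):
    return MU * x


# Absolute deterministic noise: a constant bias of magnitude EPS.
x_abs = noisy_gradient_descent(grad, lambda k, x, g: EPS, x0=5.0, step=0.1, iters=500)

# Relative deterministic noise: an error that opposes the gradient and scales with it.
x_rel = noisy_gradient_descent(grad, lambda k, x, g: -ALPHA * g, x0=5.0, step=0.1, iters=500)

print(x_abs, x_rel)
```

In this toy setting the biased iteration settles at a distance EPS/MU from the minimizer x = 0, whereas the relative-noise iteration still reaches the minimizer because the error shrinks together with the gradient.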
Most of the related literature on gradient-like methods in the presence of noise either relies on boundedness assumptions on the objective function and the decision variable or shows only [72–75]. The authors of [55] derived convergence results for the following method while removing various boundedness conditions, such as boundedness from below of , boundedness of , or boundedness of :where represents a descent direction of a function and is a deterministic or stochastic error. They first consider the above method with deterministic error, with satisfying the following conditions:where and are some positive scalars. The convergence of method (33) is then established by the following theorem.
Theorem 5 (see [55]). Suppose that in method (33) is a descent direction satisfying for some positive scalars and , and for all ,
Then, for with a square summable but not summable step size rule, method (33) is guaranteed to converge to the optimal solution.
Next, the authors obtained convergence results for minimizing a sum of a large number of functions using incremental gradient methods, and they also considered stochastic gradient methods. In the recent study [68], the authors analyzed the convergence of distributed ADMM for consensus optimization in the presence of random error. They presented lower and upper bounds on the mean squared steady-state error of the algorithm when the individual objective functions are strongly convex and their gradients are Lipschitz continuous. Furthermore, they showed that the steady-state error of their noisy ADMM algorithm is bounded when the random error is bounded and the individual objectives are proper, closed, and convex.
7.2. Distributed Methods with Noise for Optimizing Nonsmooth Functions
In Chapter 5 of [51], Polyak introduced the well-known subgradient method for optimizing nondifferentiable (nonsmooth) problems with noise,where is the noise imposed on the subgradient. The convergence of the noisy subgradient method (36) is discussed by the same author under the classes of noise described in the previous subsection. In the early study [76], Polyak studied methods for minimizing a nonlinear function with nonlinear constraints when the values of the objective function, constraints, and gradients are computed with errors. In [77], the authors studied the effect of noise on subgradient methods for convex constrained optimization problems of the form (27). They discussed the convergence properties of the following projected subgradient method when the noise is deterministic and bounded:where is an approximate subgradient of the form , where is the noise and is an $\epsilon$-subgradient of at for some . The convergence properties of method (37) were analyzed under three step size rules, namely, the constant step size rule, the diminishing step size rule, and the dynamic step size rule, which is given bywhere is a function value computed with error and is a target level approximating the optimal value . First, the convergence of method (37) was established when the constraint set is compact. Second, the authors analyzed their method for a convex objective function that has a sharp set of minima. The main results were as follows: (a) in the first scenario, the method converges to the optimal value within some tolerance and (b) in the second scenario, the method converges exactly to the optimal value without any error.
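The following Python sketch illustrates a projected subgradient iteration whose subgradients are corrupted by bounded deterministic noise. It is a toy instance only: the objective |x - 1|, the interval [-2, 2], the constant noise of magnitude 0.5, the diminishing step 1/(k + 1), and the function names are assumptions made for illustration and are not the setting analyzed in [77].

```python
def project(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def noisy_projected_subgradient(x0, noise, lo=-2.0, hi=2.0, iters=5000):
    """Minimize the assumed objective f(x) = |x - 1| over [lo, hi] with noisy subgradients."""
    x = x0
    for k in range(iters):
        # A subgradient of |x - 1|; at x = 1 any value in [-1, 1] is valid, here -1.
        g = 1.0 if x > 1.0 else -1.0
        g_noisy = g + noise(k)        # approximate subgradient = subgradient + noise
        step = 1.0 / (k + 1)          # diminishing step size
        x = project(x - step * g_noisy, lo, hi)
    return x


# Deterministic noise bounded by 0.5, which is smaller than the slope of f away from x = 1.
x_final = noisy_projected_subgradient(x0=-2.0, noise=lambda k: 0.5)
print(x_final)
```

Because the assumed objective has a sharp minimum and the noise magnitude stays below its slope, every noisy step still points toward x = 1, so the iterates in this example approach the minimizer despite the persistent error.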
It is very important to consider stochastic optimization processes, since many practical problems cannot be viewed as deterministic structures. Studies that address this area include [76, 78]. The authors of [78] studied stochastic quasigradient methods, which allow optimization problems to be solved without calculating exact values of the objectives and constraints. In [76], a general convex problem with noise was solved under the following assumptions:
(i) The objective function and inequality constraint functions are convex and continuous
(ii) The feasible set is convex, closed, and bounded
(iii) The Slater condition holds
(iv) All noise terms have zero mean and bounded variance and are independent at different points
8. Numerical Results
In this section, we discuss the convergence of the gradient method, subgradient method, and ADMM empirically by using some numerical examples.
8.1. Example 1 (Gradient Method: Primal Decomposition)
Here, we consider an unconstrained minimization problem with two users as follows:where and with and . Here, and are positive definite matrices. We use primal decomposition and analyze the convergence of the gradient method (14) for this problem with the use of Theorem 1. The subproblems related to (39) can be given as follows: Subproblem 1: Subproblem 2:
Then, the master problem corresponding to (39) is given by
Analytically, by solving the subproblems, we can show that , where and with , and for . Then, is quadratic as and are quadratic. Moreover, and are strongly convex since and are positive definite. Hence, is also strongly convex and is Lipschitz continuous. Therefore, we can apply Theorem 1 to solve problem (40) using the gradient method (14). We use Algorithm 2 to solve (40). In this algorithm, at each iteration, the gradient update is given by , where and .
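The structure of primal decomposition with a gradient-based master update can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each user's objective is taken to be the hypothetical scalar quadratic (x - y)**2 + (x - t)**2, where y is the complicating variable and t is a user-specific target, and the function names, step size, and targets are illustrative rather than the data of problem (39) or the steps of Algorithm 2.

```python
def solve_subproblem(t, y):
    """User subproblem: minimize (x - y)**2 + (x - t)**2 over x for a fixed y.

    Returns the minimizer and the derivative of the optimal value with respect to y.
    """
    x = 0.5 * (y + t)          # closed-form minimizer of the assumed quadratic
    grad_y = -2.0 * (x - y)    # partial derivative in y, evaluated at the minimizer
    return x, grad_y


def primal_decomposition(t1, t2, y0=0.0, step=0.25, iters=100):
    """Master problem: gradient steps on the complicating variable y."""
    y = y0
    for _ in range(iters):
        _, g1 = solve_subproblem(t1, y)   # user 1 reports its gradient
        _, g2 = solve_subproblem(t2, y)   # user 2 reports its gradient
        y = y - step * (g1 + g2)          # master gradient update
    x1, _ = solve_subproblem(t1, y)
    x2, _ = solve_subproblem(t2, y)
    return x1, x2, y


x1, x2, y = primal_decomposition(t1=1.0, t2=3.0)
print(x1, x2, y)
```

For the assumed targets, the master variable should settle at the average of the two targets, and each user's variable at the midpoint between that value and its own target. The users only ever exchange gradients with the master, which is the point of the decomposition.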

First, we illustrate our results with scalar-valued primal variables , and ( case) for different values of constant step sizes . Figure 2 shows the convergence of with , and . Next, we show the convergence results for different dimensions of the complicating variable with and . Figure 3 shows the convergence of the residuals with step size , for , and , where represents the optimal value of . Figure 4 plots the log values of , in order to analyze the convergence of the residuals as they approach zero. For the same set of dimensions of and the same step size, the convergence of the iterates is shown in Figure 5. Moreover, Figure 6 shows that the primal variable iterates and converge exactly to their optimal solutions using and .
8.2. Example 2 (Subgradient Method: Dual Decomposition)
Here, we focus on a problem which is not quadratic. We consider the problem in the following form with two users:where and with , and . Here, we intend to solve this problem in a fully distributed manner using dual decomposition. We implement our results for (scalar valued variables). We consider , and . The dual function corresponding to the primal problem (41) is given byand we use corresponding subproblems in dual decomposition as follows:
Then, the dual problem corresponding to the primal problem (41) is given by . We know that is always concave (see Chapter 5 of [5]). The graph of is shown in Figure 7, which also confirms the concavity of . Moreover, this figure shows that is nondifferentiable, as it has a sharp point around . Hence, is convex and nondifferentiable, and therefore we use the subgradient method (21) to minimize using Algorithm 3.

We analyze the convergence results of the subgradient method using Theorem 3 discussed under Section 5. Therefore, we have to check whether Assumptions 1–3 used in Theorem 3 hold for our particular problem considered here. Figure 7 shows that there exists an optimal solution to the dual function . Hence, Assumption 1 holds. At each iteration in Algorithm 3, used in the dual variable update represents a subgradient of at ( represents a subgradient of at ). We can observe that as . Hence, Assumption 2 holds. Moreover, we use the initialization , and we found that using the CVX solver in Matlab. Therefore, it turns out that , from which Assumption 3 follows. Hence, we can use Theorem 3 to analyze the convergence of the subgradient method.
We have obtained the convergence results with constant, square summable but not summable, and nonsummable diminishing step size rules. Figure 8 shows the convergence of log values of for different constant step sizes. This figure shows that large step sizes give fast convergence. Next, we show the convergence with step sizes , and , in the same figure (Figure 9) so as to identify the effect of different step size rules. Here, we considered the convergence up to tolerance. We can observe a slower convergence using and than that for the constant step size rule.
In Algorithm 3, both users solve their subproblems separately and find optimal primal variables locally at each iteration. Next, they exchange their information and with each other and update the dual variable individually. In general, their iterates and are not feasible. Therefore, at each iteration, they agree on a feasible solution given by . Next, by using these primal variable iterates and the updated dual variable , user 1 and user 2 can compute and , respectively. Then, can be calculated. This is always a lower bound on , the optimal value of the primal problem [5]. Moreover, at each iteration, the users can compute two upper bounds on as follows [14]:where and . In [14], and are defined as the worst bound and the better bound. The worst bound is the primal objective function value evaluated at each iteration using the feasible points and . The better bound is obtained by replacing and with and then solving the subproblems of the related primal decomposition structure of (41). Figure 10 shows the convergence of , the better bound, and the worst bound using the constant step size rule and scalar-valued primal variables. Here, we can observe that for this particular problem, the lower bound and the two upper bounds converge exactly to .
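The interplay between local subproblems, the dual update, and the lower and upper bounds can be sketched in Python on a small stand-in problem. Everything specific here is an assumption made for illustration: the problem is to minimize y1**2 + |y2 - 2| subject to y1 = y2 with y2 restricted to [0, 3], the feasible point is taken as the clipped average of the two local solutions, and the step size is 1/(k + 1). It is not problem (41) or Algorithm 3.

```python
def user1(lam):
    """User 1: minimize y1**2 + lam*y1 over all real y1 (closed form)."""
    y1 = -0.5 * lam
    return y1, y1 * y1 + lam * y1


def user2(lam):
    """User 2: minimize |y2 - 2| - lam*y2 over [0, 3].

    The function is piecewise linear, so a minimizer lies at 0, 2, or 3.
    """
    y2 = min((0.0, 2.0, 3.0), key=lambda v: abs(v - 2.0) - lam * v)
    return y2, abs(y2 - 2.0) - lam * y2


def primal_objective(y):
    """Assumed primal objective evaluated at a common feasible value y."""
    return y * y + abs(y - 2.0)


def dual_subgradient(lam0=0.0, iters=2000):
    lam, best_lower, best_upper = lam0, float("-inf"), float("inf")
    for k in range(iters):
        y1, v1 = user1(lam)
        y2, v2 = user2(lam)
        best_lower = max(best_lower, v1 + v2)           # dual value: lower bound
        y_avg = min(max(0.5 * (y1 + y2), 0.0), 3.0)     # feasible point from the average
        best_upper = min(best_upper, primal_objective(y_avg))  # primal value: upper bound
        lam = lam + (y1 - y2) / (k + 1)                 # subgradient step on the dual
    return lam, best_lower, best_upper


lam, best_lower, best_upper = dual_subgradient()
print(lam, best_lower, best_upper)
```

In this stand-in problem the dual function has a kink at its maximizer, so the multiplier oscillates around it with shrinking amplitude, while the best lower bound stays below and the best upper bound stays above the primal optimal value, as weak duality requires.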
8.3. Example 3 (ADMM)
Here, we first discuss the robustness of ADMM compared with the gradient method. Let us consider the following linear programme:where and are decision variables of the problem, with and is a constant vector. Suppose that the set of solutions of (45) is nonempty.
The dual function , where with and , for problem (45) is given by
Then, analytically we can obtain
Next, the dual problem is given by
Here, we can easily observe that the optimal value of the dual problem (48) is , which is attained when and . The following subproblems are usually used when the gradient method is applied to solve (48):
Algorithm 4 represents the corresponding gradient algorithm.

We can observe that the and minimization steps given in Algorithm 4 cannot proceed for an arbitrarily chosen , since and are unbounded below. Hence, the gradient method fails to solve (48), and therefore the linear programme (45) cannot be solved in this way either. However, ADMM solves this problem without any issue, which shows its robustness compared with the gradient method.
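The failure mechanism and the effect of the quadratic penalty can be sketched in Python on a hypothetical linear programme: minimize c^T x - c^T z subject to x = z. This stand-in problem, the function names, and the starting values are assumptions for illustration; they are not problem (45), Algorithm 4, or Algorithm 5. With a linear objective, the ordinary Lagrangian is linear in x, so its minimization is unbounded below unless the multiplier happens to equal -c, whereas the augmented Lagrangian is strictly convex in each block and every update is well defined.

```python
def dual_ascent_step_is_well_defined(c, y):
    """The Lagrangian term (c + y)^T x has a finite infimum over x only if c + y = 0."""
    return all(ci + yi == 0.0 for ci, yi in zip(c, y))


def admm_linear(c, x, z, y, rho=1.0, iters=10):
    """ADMM sketch for the assumed problem: minimize c^T x - c^T z  s.t.  x - z = 0."""
    for _ in range(iters):
        # x-update: c + y + rho*(x - z) = 0
        x = [zi - (ci + yi) / rho for zi, ci, yi in zip(z, c, y)]
        # z-update: -c - y - rho*(x - z) = 0
        z = [xi + (ci + yi) / rho for xi, ci, yi in zip(x, c, y)]
        # dual update along the constraint residual
        y = [yi + rho * (xi - zi) for yi, xi, zi in zip(y, x, z)]
    return x, z, y


c = [1.0, -2.0]
y0 = [0.0, 0.0]
print(dual_ascent_step_is_well_defined(c, y0))   # arbitrary multiplier: minimization unbounded
x, z, y = admm_linear(c, x=[0.0, 0.0], z=[3.0, 1.0], y=y0)
print(x, z, y)
```

In this example the ADMM iterates should become feasible (x equal to z) and the multiplier should reach -c, the only value at which the ordinary dual function is finite.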
To solve (48) using ADMM, we consider the augmented Lagrangian as follows:where represents the penalty parameter. Then, the corresponding dual function is given by . Next, we maximize by using Algorithm 5. In this algorithm, represents a suitably chosen step size. Here, we discuss the convergence of iterates (Algorithm 5) with
