
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 63, NO. 22, NOVEMBER 15, 2015 6013

A Proximal Gradient Algorithm for Decentralized Composite Optimization

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin

Abstract—This paper proposes a decentralized algorithm for solving a consensus optimization problem defined in a static networked multi-agent system, where the local objective functions have the smooth+nonsmooth composite form. Examples of such problems include decentralized constrained quadratic programming and compressed sensing problems, as well as many regularization problems arising in inverse problems, signal processing, and machine learning, which have decentralized applications. This paper addresses the need for efficient decentralized algorithms that take advantage of proximal operations for the nonsmooth terms. We propose a proximal gradient exact first-order algorithm (PG-EXTRA) that utilizes the composite structure and has the best known convergence rate. It is a nontrivial extension of the recent algorithm EXTRA. At each iteration, each agent locally computes a gradient of the smooth part of its objective and a proximal map of the nonsmooth part, and exchanges information with its neighbors. The algorithm is "exact" in the sense that an exact consensus minimizer can be obtained with a fixed step size, whereas most previous methods must use diminishing step sizes. When the smooth part has Lipschitz gradients, PG-EXTRA has an ergodic convergence rate of $O(1/k)$ in terms of the first-order optimality residual. When the smooth part vanishes, PG-EXTRA reduces to P-EXTRA, an algorithm without the gradients (so no "G" in the name), which has a slightly improved convergence rate of $o(1/k)$ in a standard (non-ergodic) sense. Numerical experiments demonstrate the effectiveness of PG-EXTRA and validate our convergence results.

Index Terms—Composite objective, decentralized optimization, multi-agent network, nonsmooth, proximal, regularization.

I. INTRODUCTION

THIS paper considers a connected network of $n$ agents that cooperatively solve the consensus optimization problem in the form

$$\min_{x\in\mathbb{R}^p} \; \bar{f}(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad f_i(x) = s_i(x) + r_i(x), \tag{1}$$

Manuscript received March 16, 2015; revised July 11, 2015; accepted July 13, 2015. Date of publication July 28, 2015; date of current version October 07, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sergios Theodoridis. The work of Q. Ling is supported in part by NSFC grants 61004137 and 61573331. The work of W. Yin is supported in part by NSF grant DMS-1317602. Part of this paper appears in the Fortieth International Conference on Acoustics, Speech, and Signal Processing, Brisbane, Australia, April 19–25, 2015 [1].

W. Shi, Q. Ling, and G. Wu are with the Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China (e-mail: [email protected]).

W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2015.2461520

where $s_i$ and $r_i$ are convex differentiable and possibly nondifferentiable functions, respectively, that are kept private by agent $i$. We say that the objective $f_i = s_i + r_i$ has the smooth+nonsmooth composite structure. We develop an algorithm for all the agents in the network to obtain a consensual solution to problem (1). In the algorithm, each agent locally computes the gradient of $s_i$ and the so-called proximal map of $r_i$ (see Section I-C for its definition) and performs one-hop communication with its neighbors. The iterations of the agents are synchronized.

The smooth+nonsmooth structure of the local objectives arises in a large number of signal processing, statistical inference, and machine learning problems. Specific examples include: (i) the geometric median problem, in which $s_i$ vanishes and $r_i$ is the $\ell_2$-norm [2], [3]; (ii) the compressed sensing problem, where $s_i$ is the data-fidelity term, which is often differentiable, and $r_i$ is a sparsity-promoting regularizer such as the $\ell_1$-norm [4], [5]; (iii) optimization problems with per-agent constraints, where $s_i$ is a differentiable objective function of agent $i$ and $r_i$ is the indicator function of the constraint set of agent $i$, that is, $r_i(x) = 0$ if $x$ satisfies the constraint and $+\infty$ otherwise [6]–[8].
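For concreteness, one instance of this structure (the decentralized LASSO revisited in Section IV-B; the data symbols here are illustrative) is

$$s_i(x) = \frac{1}{2}\left\|y_{(i)} - A_{(i)} x\right\|_2^2, \qquad r_i(x) = \lambda_i\|x\|_1,$$

where agent $i$ privately holds $(A_{(i)}, y_{(i)}, \lambda_i)$ while all agents must agree on a common minimizer of (1).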

A. Background and Prior Art

Pioneered by the seminal work [9], [10] in the 1980s, decentralized optimization, control, and decision-making in networked multi-agent systems have attracted increasing interest in recent years due to the rapid development of communication and computation technologies [11]–[13]. Different from centralized processing, which requires a fusion center to collect data, decentralized approaches rely on information exchange among neighbors in the network and autonomous optimization by all the individual agents; they are robust to the failure of critical relaying agents and scalable to the network size. These advantages lead to successful applications of decentralized optimization in robotic networks [14], [15], wireless sensor networks [4], [16], smart grids [17], [18], and distributed machine learning systems [19], [20], just to name a few. In these applications, problem (1) arises as a generic model.

The existing algorithms that solve problem (1) include primal-dual domain methods such as the decentralized alternating direction method of multipliers (DADMM) [16], [21] and primal domain methods such as the distributed subgradient method (DSM) [22]. DADMM reformulates problem (1) in a form to which ADMM applies as a decentralized algorithm. In this algorithm, each agent minimizes the sum of its local objective and a quadratic function that involves local variables from its neighbors. DADMM does not take advantage



of the smooth+nonsmooth structure. In DSM, each agent averages its local variable with those of its neighbors and moves along a negative subgradient direction of its local objective. DSM is computationally cheap but does not take advantage of the smooth+nonsmooth structure either. When the local objectives are Lipschitz differentiable, the recent exact first-order algorithm EXTRA [23] is much faster, yet it cannot handle nonsmooth terms.

The algorithms that consider smooth+nonsmooth objectives in the form of (1) include the following primal-domain methods: the (fast) distributed proximal gradient method (DPGM) [24] and the distributed iterative soft thresholding algorithm (DISTA) [25], [26]. Both DPGM and DISTA consist of a gradient step for the smooth part and a proximal step for the nonsmooth part. DPGM uses two loops, where the inner one is dedicated to consensus; the nonsmooth terms of the objective functions of all agents must be the same. DISTA is designed exclusively for $\ell_1$-regularized minimization problems and has a similar restriction on the nonsmooth part. In addition, primal-dual type methods include [7], [27], which are based on DADMM. In this paper, we propose a simpler algorithm that does not explicitly use any dual variable. We establish convergence under weaker conditions and show that the residual of the first-order optimality condition reduces at the rate of $O(1/k)$, where $k$ is the iteration number.

When $r_i \equiv 0$, the proposed algorithm PG-EXTRA reduces to EXTRA [23]. Clearly, PG-EXTRA extends EXTRA to handle nonsmooth objective terms. This extension is not the same as the extension from the gradient method to the proximal-gradient method. As the reader will see, PG-EXTRA maintains two interlaced sequences of iterates, whereas the proximal-gradient method simply inherits the single sequence of iterates in the gradient method.

B. Paper Organization and Contributions

Section II of this paper develops PG-EXTRA, which takes advantage of the smooth+nonsmooth structure of the objective functions. The details are given in Section II-A. The special cases of PG-EXTRA are discussed in Section II-B. In particular, it reduces to an algorithm P-EXTRA when all $s_i$ vanish and the gradient (the "G") steps are no longer needed.

Section III establishes the convergence and derives the rates for PG-EXTRA and P-EXTRA. Under the Lipschitz assumption on $\nabla s_i$, the iterates of PG-EXTRA converge to a solution and the first-order optimality condition asymptotically holds at an ergodic rate of $O(1/k)$. The rate improves to a non-ergodic $o(1/k)$ for P-EXTRA.

The performance of PG-EXTRA and P-EXTRA is numerically evaluated in Section IV, on a decentralized geometric median problem (Section IV-A), a decentralized compressed sensing problem (Section IV-B), and a decentralized quadratic program (Section IV-C). Simulation results confirm our theoretical findings and validate the competitiveness of the proposed algorithms.

We have not yet found ways to further improve the convergence rates or to relax our algorithms to take stochastic or asynchronous steps with provable performance guarantees, though some numerical experiments with modified algorithms appeared to be successful.

C. Notation

Each agent $i$ holds a local variable $x_{(i)} \in \mathbb{R}^p$, whose value at iteration $k$ is denoted by $x_{(i)}^k$. We introduce an objective function that sums all the local terms as

$$f(\mathbf{x}) \triangleq \sum_{i=1}^{n} f_i(x_{(i)}),$$

where

$$\mathbf{x} \triangleq \begin{bmatrix} x_{(1)}^{\mathsf T} \\ \vdots \\ x_{(n)}^{\mathsf T} \end{bmatrix} \in \mathbb{R}^{n\times p}.$$

The $i$th row of $\mathbf{x}$ corresponds to the agent $i$. We say that $\mathbf{x}$ is consensual if all of its rows are identical, i.e., $x_{(1)} = x_{(2)} = \cdots = x_{(n)}$. Similar to the definition of $f$, we define

$$s(\mathbf{x}) \triangleq \sum_{i=1}^{n} s_i(x_{(i)}), \qquad r(\mathbf{x}) \triangleq \sum_{i=1}^{n} r_i(x_{(i)}).$$

By definition, $f(\mathbf{x}) = s(\mathbf{x}) + r(\mathbf{x})$. The gradient of $s$ at $\mathbf{x}$ is given by

$$\nabla s(\mathbf{x}) \triangleq \begin{bmatrix} \nabla s_1(x_{(1)})^{\mathsf T} \\ \vdots \\ \nabla s_n(x_{(n)})^{\mathsf T} \end{bmatrix} \in \mathbb{R}^{n\times p},$$

where, for each $i$, $\nabla s_i(x_{(i)})$ is the gradient of $s_i$ at $x_{(i)}$. We let $\tilde\nabla r(\mathbf{x})$ denote a subgradient of $r$ at $\mathbf{x}$:

$$\tilde\nabla r(\mathbf{x}) \triangleq \begin{bmatrix} \tilde\nabla r_1(x_{(1)})^{\mathsf T} \\ \vdots \\ \tilde\nabla r_n(x_{(n)})^{\mathsf T} \end{bmatrix} \in \mathbb{R}^{n\times p},$$

where $\tilde\nabla r_i(x_{(i)})$ is a subgradient of $r_i$ at $x_{(i)}$. The $i$th rows of $\mathbf{x}$, $\nabla s(\mathbf{x})$, and $\tilde\nabla r(\mathbf{x})$ belong to the agent $i$.

In the proposed algorithm, each agent needs to compute the proximal map of $r_i$, which is the subproblem

$$\operatorname{prox}_{\alpha r_i}(y) \triangleq \arg\min_{x\in\mathbb{R}^p} \; r_i(x) + \frac{1}{2\alpha}\|x - y\|^2,$$

where $y$ is a given point and $\alpha > 0$ is a scalar. We assume that $r_i$ is proximable, that is, the subproblem has a closed-form solution or can be solved at a low complexity such as $O(p)$ or $O(p\log p)$. This is true when $r_i$ is the $\ell_1$ norm, the composition of the $\ell_1$ norm and an orthogonal matrix, the $\ell_2$-norm, the indicator function of simple constraints, etc.

The Frobenius norm of a matrix is denoted as $\|\cdot\|_F$. Given a symmetric positive semidefinite matrix $G$, we define the $G$-norm $\|\mathbf{x}\|_G \triangleq \sqrt{\langle \mathbf{x}, G\mathbf{x}\rangle}$. The largest singular value of a matrix is denoted as $\sigma_{\max}(\cdot)$. The largest and


smallest eigenvalues of a symmetric matrix are denoted as $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$, respectively. The smallest nonzero eigenvalue of a symmetric positive semidefinite matrix is denoted as $\tilde\lambda_{\min}(\cdot)$. Let $\operatorname{null}\{\cdot\}$ denote the null space of a matrix and $\operatorname{span}\{\cdot\}$ denote the subspace spanned by its columns.
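To illustrate what "proximable" means in practice, here are minimal sketches of the three proximal maps that appear in the experiments of Section IV: soft-thresholding for the $\ell_1$ norm, shrinkage toward a point for the $\ell_2$-norm, and projection for the indicator of a halfspace constraint. The function names are ours and the snippets are illustrative, not the paper's implementation.

```python
import numpy as np

def prox_l1(y, tau):
    """Proximal map of tau*||.||_1: element-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_l2_norm(y, tau, center=None):
    """Proximal map of tau*||. - center||_2 (block soft-thresholding)."""
    c = np.zeros_like(y) if center is None else center
    d = y - c
    nrm = np.linalg.norm(d)
    if nrm <= tau:
        return c.copy()
    return c + (1.0 - tau / nrm) * d

def proj_halfspace(y, a, b):
    """Projection onto {x : a^T x <= b}, the prox of its indicator function."""
    viol = a @ y - b
    if viol <= 0:
        return y.copy()
    return y - (viol / (a @ a)) * a
```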

II. ALGORITHM DEVELOPMENT

This section derives PG-EXTRA for problem (1) in Section II-A and discusses its special cases in Section II-B.

A. Proposed Algorithm: PG-EXTRA

PG-EXTRA starts from an arbitrary initial point $\mathbf{x}^0 \in \mathbb{R}^{n\times p}$, that is, each agent $i$ holds an arbitrary point $x_{(i)}^0$. The next point $\mathbf{x}^1$ is generated by a proximal gradient iteration

$$\mathbf{x}^{\frac{1}{2}} = W\mathbf{x}^0 - \alpha\nabla s(\mathbf{x}^0), \tag{2a}$$

$$\mathbf{x}^{1} = \arg\min_{\mathbf{x}} \; r(\mathbf{x}) + \frac{1}{2\alpha}\left\|\mathbf{x} - \mathbf{x}^{\frac{1}{2}}\right\|_F^2, \tag{2b}$$

where $\alpha > 0$ is the step size and $W \in \mathbb{R}^{n\times n}$ is the mixing matrix which we will discuss later. All the subsequent points $\mathbf{x}^{k+2}$, $k = 0, 1, \ldots$, are obtained through the following update:

$$\mathbf{x}^{k+\frac{3}{2}} = W\mathbf{x}^{k+1} + \mathbf{x}^{k+\frac{1}{2}} - \tilde W\mathbf{x}^{k} - \alpha\left[\nabla s(\mathbf{x}^{k+1}) - \nabla s(\mathbf{x}^{k})\right], \tag{3a}$$

$$\mathbf{x}^{k+2} = \arg\min_{\mathbf{x}} \; r(\mathbf{x}) + \frac{1}{2\alpha}\left\|\mathbf{x} - \mathbf{x}^{k+\frac{3}{2}}\right\|_F^2. \tag{3b}$$

In (3a), $\tilde W \in \mathbb{R}^{n\times n}$ is another mixing matrix, which we typically set as $\tilde W = \frac{I + W}{2}$, though there are more general choices. With that typical choice, $\tilde W\mathbf{x}^{k+1}$ can be easily computed from $W\mathbf{x}^{k+1}$. PG-EXTRA is outlined in Algorithm 1, where the computation for all individual agents is presented.

Algorithm 1: PG-EXTRA

Set mixing matrices $W$ and $\tilde W$; choose step size $\alpha > 0$;
1. All agents $i$ pick arbitrary initial $x_{(i)}^0$ and do
   $x_{(i)}^{\frac{1}{2}} = \sum_{j} w_{ij} x_{(j)}^{0} - \alpha \nabla s_i(x_{(i)}^{0})$;
   $x_{(i)}^{1} = \arg\min_{x} \; r_i(x) + \frac{1}{2\alpha}\|x - x_{(i)}^{\frac{1}{2}}\|^2$;
2. for $k = 0, 1, \ldots$, all agents $i$ do
   $x_{(i)}^{k+\frac{3}{2}} = x_{(i)}^{k+\frac{1}{2}} + \sum_{j} w_{ij} x_{(j)}^{k+1} - \sum_{j} \tilde w_{ij} x_{(j)}^{k} - \alpha \left[\nabla s_i(x_{(i)}^{k+1}) - \nabla s_i(x_{(i)}^{k})\right]$;
   $x_{(i)}^{k+2} = \arg\min_{x} \; r_i(x) + \frac{1}{2\alpha}\|x - x_{(i)}^{k+\frac{3}{2}}\|^2$;
end for

Algorithm 2: P-EXTRA

Set mixing matrices $W$ and $\tilde W$; choose step size $\alpha > 0$;
1. All agents $i$ pick arbitrary initial $x_{(i)}^0$ and do
   $x_{(i)}^{\frac{1}{2}} = \sum_{j} w_{ij} x_{(j)}^{0}$;
   $x_{(i)}^{1} = \arg\min_{x} \; r_i(x) + \frac{1}{2\alpha}\|x - x_{(i)}^{\frac{1}{2}}\|^2$;
2. for $k = 0, 1, \ldots$, all agents $i$ do
   $x_{(i)}^{k+\frac{3}{2}} = x_{(i)}^{k+\frac{1}{2}} + \sum_{j} w_{ij} x_{(j)}^{k+1} - \sum_{j} \tilde w_{ij} x_{(j)}^{k}$;
   $x_{(i)}^{k+2} = \arg\min_{x} \; r_i(x) + \frac{1}{2\alpha}\|x - x_{(i)}^{k+\frac{3}{2}}\|^2$;
end for
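As a sanity check of the updates above, the following is a minimal centralized simulation of PG-EXTRA in the stacked matrix form (2)–(3). It is an illustrative sketch under our reconstruction of the updates; the function names and signatures (grad_s, prox_r) are ours, not from the paper.

```python
import numpy as np

def pg_extra(W, W_tilde, grad_s, prox_r, x0, alpha, num_iters):
    """Centralized simulation of PG-EXTRA in stacked form.

    W, W_tilde       : (n, n) mixing matrices.
    grad_s(x)        : stacked gradients; row i is the gradient of s_i at row i of x.
    prox_r(y, alpha) : row-wise proximal map of the r_i's.
    x0               : (n, p) initial iterate, one row per agent.
    """
    x_prev = x0
    x_half = W @ x0 - alpha * grad_s(x0)     # step (2a)
    x = prox_r(x_half, alpha)                # step (2b)
    for _ in range(num_iters):
        x_half_new = (x_half + W @ x - W_tilde @ x_prev
                      - alpha * (grad_s(x) - grad_s(x_prev)))   # step (3a)
        x_new = prox_r(x_half_new, alpha)                        # step (3b)
        x_prev, x, x_half = x, x_new, x_half_new
    return x
```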

We will require $w_{ij} = \tilde w_{ij} = 0$ if agents $i$ and $j$ are not neighbors and $i \neq j$. Then, terms like $\sum_j w_{ij} x_{(j)}$ and $\sum_j \tilde w_{ij} x_{(j)}$ only involve $x_{(i)}$, as well as the $x_{(j)}$ that come from the neighbors of agent $i$. All the other terms use only local information. Note that the algorithm maintains both of the sequences $\{\mathbf{x}^k\}$ and $\{\mathbf{x}^{k+\frac{1}{2}}\}$. The latter sequence cannot be eliminated since $\mathbf{x}^{k+2}$ is generated from $\mathbf{x}^{k+\frac{3}{2}}$, which explicitly depends on $\mathbf{x}^{k+\frac{1}{2}}$.

We impose the following assumptions on $W$ and $\tilde W$.

Assumption 1 (Mixing Matrices): Consider a connected network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ consisting of a set of agents $\mathcal{V} = \{1, 2, \ldots, n\}$ and a set of undirected edges $\mathcal{E}$. An unordered pair $(i, j) \in \mathcal{E}$ if agents $i$ and $j$ have a direct communication link. The mixing matrices $W = [w_{ij}] \in \mathbb{R}^{n\times n}$ and $\tilde W = [\tilde w_{ij}] \in \mathbb{R}^{n\times n}$ satisfy
1) (Decentralization property) If $i \neq j$ and $(i, j) \notin \mathcal{E}$, then $w_{ij} = \tilde w_{ij} = 0$.
2) (Symmetry property) $W = W^{\mathsf T}$, $\tilde W = \tilde W^{\mathsf T}$.
3) (Null space property) $\operatorname{null}\{W - \tilde W\} = \operatorname{span}\{\mathbf{1}\}$; $\operatorname{null}\{I - \tilde W\} \supseteq \operatorname{span}\{\mathbf{1}\}$.
4) (Spectral property) $\tilde W \succ 0$ and $\frac{I + W}{2} \succeq \tilde W$.

The first two conditions together are standard (see [22], for example). The first condition alone ensures that communications occur only between neighboring agents. All four conditions together ensure that $W$ satisfies $\lambda_{\max}(W) = 1$ and its other eigenvalues lie in $(-1, 1)$. Typical choices of $W$ can be found in [23], [28]. If a matrix $W$ satisfies all the conditions, then $\tilde W = \frac{I + W}{2}$ also satisfies the conditions.
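One typical choice used in the experiments of Section IV is the Metropolis constant edge weight together with $\tilde W = \frac{I+W}{2}$. A small sketch of that construction follows (our helper; for a given connected graph, the spectral property of Assumption 1 should still be verified):

```python
import numpy as np

def metropolis_weights(edges, n):
    """Symmetric, doubly stochastic mixing matrix W from an undirected edge list
    using Metropolis constant edge weights, plus the typical W_tilde = (I + W)/2."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # diagonal entries make rows sum to 1
    W_tilde = (np.eye(n) + W) / 2.0
    return W, W_tilde
```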

B. Special Cases: EXTRA and P-EXTRA

When the possibly-nondifferentiable term $r \equiv 0$, we have $\mathbf{x}^1 = \mathbf{x}^{\frac{1}{2}}$ in (2a) and (2b) and thus

$$\mathbf{x}^1 = W\mathbf{x}^0 - \alpha\nabla s(\mathbf{x}^0). \tag{4}$$


In (3a) and (3b), we have $\mathbf{x}^{k+2} = \mathbf{x}^{k+\frac{3}{2}}$ and thus

$$\mathbf{x}^{k+2} = (I + W)\mathbf{x}^{k+1} - \tilde W\mathbf{x}^{k} - \alpha\left[\nabla s(\mathbf{x}^{k+1}) - \nabla s(\mathbf{x}^{k})\right]. \tag{5}$$

The updates (4) and (5) are known as EXTRA, a recent algorithm for decentralized differentiable optimization [23]. When the differentiable term $s \equiv 0$, PG-EXTRA reduces to P-EXTRA by removing all gradient computation, which is given in Algorithm 2.
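To see why the half-iterates disappear, note that with $r \equiv 0$ the proximal steps (2b) and (3b) are identities (assuming the update forms reconstructed above), so $\mathbf{x}^{k+1} = \mathbf{x}^{k+\frac{1}{2}}$ for every $k$; substituting this into (3a) gives

$$\mathbf{x}^{k+2} = \mathbf{x}^{k+\frac{3}{2}} = W\mathbf{x}^{k+1} + \mathbf{x}^{k+1} - \tilde W\mathbf{x}^{k} - \alpha\left[\nabla s(\mathbf{x}^{k+1}) - \nabla s(\mathbf{x}^{k})\right] = (I+W)\mathbf{x}^{k+1} - \tilde W\mathbf{x}^{k} - \alpha\left[\nabla s(\mathbf{x}^{k+1}) - \nabla s(\mathbf{x}^{k})\right],$$

which is exactly (5).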

III. CONVERGENCE ANALYSIS

A. Preliminaries

Unless otherwise stated, the convergence results in this section are given under Assumptions 1–3.

Assumption 2 (Convex Objective Functions and the Smooth Parts Having Lipschitz Gradients): For all $i$, the functions $s_i$ and $r_i$ are proper closed convex, and $\nabla s_i$ satisfies

$$\|\nabla s_i(x) - \nabla s_i(\tilde x)\| \le L_i\|x - \tilde x\|, \quad \forall x, \tilde x \in \mathbb{R}^p,$$

where $L_i \ge 0$ are constant.

To prove global convergence of PG-EXTRA, we require that the objective functions are convex and the smooth parts have Lipschitz gradients. The assumption of Lipschitz gradients holds in many decentralized detection and estimation applications; for example, when the smooth parts are least-squares terms. Even in some cases where the gradients are not Lipschitz over the whole space, this assumption still holds in a bounded region. Note that our sequence is provably bounded since its distance to the solution set is monotonically nonincreasing.

Following Assumption 2, $s(\mathbf{x}) = \sum_{i=1}^{n} s_i(x_{(i)})$ is proper closed convex and satisfies

$$\|\nabla s(\mathbf{x}) - \nabla s(\tilde{\mathbf{x}})\|_F \le L\|\mathbf{x} - \tilde{\mathbf{x}}\|_F, \quad \forall \mathbf{x}, \tilde{\mathbf{x}} \in \mathbb{R}^{n\times p},$$

with constant $L = \max_i\{L_i\}$.

Assumption 3 (Solution Existence): The set of solution(s) to problem (1) is nonempty.

We first give a lemma on the first-order optimality condition of problem (1).

Lemma 1 (First-Order Optimality Condition): Given mixing

matrices and and the economical-form singular value de-composition , define

. Then, under Assumptions 1–3, the fol-lowing two statements are equivalent• is consensual, that is,

, and every is optimal to problem (1);• There exists for some and subgradient

such that

(6a)(6b)

Proof: According to Assumption 1 and the definition of ,we have

Hence is consensual if and only if (6b) holds.

Next, any row of the consensual is optimal if and onlyif . Since is symmetric and

, (6a) gives . Conversely,if , then

follows from and thusfor some . Let .

Then, and (6a) holds.Let and satisfy the optimality conditions (6a) and (6b).

Introduce an auxiliary sequence

The next lemma restates the updates of PG-EXTRA in terms of, , , and for convergence analysis.Lemma 2 (Recursive Relations of PG-EXTRA): In

PG-EXTRA, the quadruple sequence obeys

(7)

and

(8)

for any .Proof: By giving the first-order optimality conditions of

the subproblems (2b) and (3b) and eliminating the auxiliary se-quence for , we have the following equivalentsubgradient recursions with respect to :

Summing these subgradient recursions over times 1 through, we get

(9)

Using and the decomposition, (7) follows from (9) immediately.Since , and

, we have

(10)

Subtracting (10) from (7) and addingto (7), we obtain (8).

The recursive relations of P-EXTRA are shown in the fol-lowing corollary of Lemma 2.Corollary 1 (Recursive Relations of P-EXTRA): In

P-EXTRA, the quadruple sequence obeys

(11)


and

(12)

for any .The convergence analysis of PG-EXTRA is based on the re-

cursions (7) and (8) and that of P-EXTRA is based on (11) and(12). Define

For PG-EXTRA, we show that the iterate sequence converges to a solution and the successive iterative difference converges to 0 at an ergodic rate of $O(1/k)$ (see Theorem 2); the same ergodic rates hold for the first-order optimality residuals, which are defined in Theorem 2. For the special case P-EXTRA, the iterate sequence also converges to an optimal solution; the successive difference and the first-order optimality residuals converge to 0 at improved non-ergodic rates of $o(1/k)$ (see Theorem 3).

B. Convergence and Convergence Rates of PG-EXTRA1) Convergence of PG-EXTRA: We first give a theorem that

shows the contractive property of PG-EXTRA. This theoremprovides a sufficient condition for PG-EXTRA to converge toa solution. In addition, it prepares for analyzing convergencerates of PG-EXTRA in Section III-B-2 and its limit case givesthe contractive property of P-EXTRA (see Section III-C).Theorem 1: Under Assumptions 1–3, if we set the step size

, then the sequence generated byPG-EXTRA satisfies

(13)where . Furthermore, converges to anoptimal .

Proof: By Assumption 2, $s$ and $r$ are convex and $\nabla s$ is Lipschitz continuous with constant $L$, so we have

(14)

and

(15)

Substituting (8) from Lemma 2 for, it follows from (14) and (15) that

(16)

For the terms on the right-hand side of (16), we have

(17)

(18)

and

(19)

Plugging (17)–(19) into (16), we have

(20)

Using the definitions of , and , (20) is equivalent to

(21)

Applying the basic equality

(22)

to (21), we have

(23)

By Assumption 1, in particular, , we haveand thus

(24)

where . The last inequality holds since

. It follows from (24) that, for any solution, the iterate sequence is bounded and contractive. Therefore, it is convergent as long as the step size satisfies the condition of the theorem. The convergence of the iterates to a


solution follows from the standard analysis for contraction methods; see, for example, Theorem 3 in [29].

2) Ergodic Rates of PG-EXTRA: To establish the rates of convergence, we need the following proposition. Parts of it appeared in recent works [30]–[32].

Proposition 1: If a sequence $\{a_k\}$ obeys: (1) $a_k \ge 0$ and (2) $\sum_{k=1}^{\infty} a_k < \infty$, then we have: (i) $\lim_{k\to\infty} a_k = 0$; (ii) $\frac{1}{t}\sum_{k=1}^{t} a_k = O(1/t)$; (iii) $\min_{k\le t} a_k = o(1/t)$. If the sequence further obeys: (3) $a_{k+1} \le a_k$ for all $k$, then in addition, we have: (iv) $a_k = o(1/k)$.

Proof: Part (i) is obvious. Let . By theassumptions, is uniformly bounded and obeys

from which part (ii) follows. Since is monotoni-cally non-increasing, we have

This, with , gives us orpart (iii).If is further monotonically non-increasing, we have

Since the sequence is summable, we get part (iv).

This proposition serves for the proof of Theorem 2, as well as that of Theorem 3 appearing in Section III-C. We give the ergodic convergence rates of PG-EXTRA below.

Theorem 2: In the same setting as Theorem 1, the following

rates hold for PG-EXTRA:(i) Running-average successive difference:

(ii) Running-best successive difference:

(iii) Running-average optimality residuals:

(iv) Running-best optimality residuals:

Before proving the theorem, let us explain the rates. The first two rates on the squared successive difference are used to deduce the last two rates on the optimality residuals. Since our algorithm is not guaranteed to reduce the objective functions in a monotonic manner, we choose to establish our convergence rates in terms of optimality residuals, which show how quickly the residuals of the KKT system (6) decrease. Note that the rates are given on the standard squared quantities since they are summable and naturally appear in the convergence analysis. Notably, these rates match those on the squared successive difference and the optimality residual in the classical (centralized) gradient-descent method.
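To make the running-average and running-best quantities in Theorem 2 concrete, here is a small helper (ours, intended only for numerical verification) that computes both curves from a recorded nonnegative residual sequence, e.g., the squared successive differences:

```python
import numpy as np

def rate_metrics(residuals):
    """Running-average and running-best of a nonnegative residual sequence."""
    r = np.asarray(residuals, dtype=float)
    running_avg = np.cumsum(r) / np.arange(1, len(r) + 1)   # should decay like O(1/t)
    running_best = np.minimum.accumulate(r)                 # should decay like o(1/t)
    return running_avg, running_best
```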

Proof: Parts (i) and (ii): Since converges to 0when goes to , we are able to sum (13) in Theorem 1 over

through and apply the telescopic cancellation, whichyields

(25)

Then, the results follow from Proposition 1 immediately.Parts (iii) and (iv): Using the basic inequality

which holds for any and any matricesand of the same size, it follows that

(26)

Since and are symmetric and

there exists a bounded such that

It follows from (26) that

(27)


As part (i) shows that , wehaveand .From (27) and (25), we see that both

and aresummable. Again, by Proposition 1, we have part (iv), the

rates of the running-best first-order optimality residuals.

The monotonicity of the successive difference is an open question. If it holds, then the $o(1/k)$ convergence rates will apply to the sequence itself.

Remark 1: The $O(1/k)$ convergence rate implies that PG-EXTRA needs at most $O(1/\epsilon)$ iterations in order to reach an accuracy of $\epsilon$ (in terms of the first-order optimality residual). Multiplying this iteration count by the per-iteration cost of calculating the gradients and proximal maps gives the overall computational complexity.
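As a hypothetical numerical illustration of Remark 1 (the target accuracy below is ours, not from the paper):

$$\epsilon = 10^{-4} \;\Longrightarrow\; O(1/\epsilon) = O(10^{4}) \ \text{iterations},$$

each iteration costing one local gradient, one proximal map, and one round of neighbor communication per agent.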

C. Convergence Rates of P-EXTRA

Convergence of P-EXTRA follows from that of PG-EXTRA directly. Since P-EXTRA is a special case of PG-EXTRA and its updates are free of gradient steps, it enjoys slightly better convergence rates: non-ergodic $o(1/k)$. Let us outline our steps. First, as a special case of Theorem 1 with $s \equiv 0$, the successive difference sequence is summable. Second, this sequence is shown to be monotonic for P-EXTRA in Lemma 3. Based on these results, the non-ergodic $o(1/k)$ convergence rates are then established for the successive difference and the first-order optimality residuals.

Lemma 3: Under the same assumptions as Theorem 1 except $s \equiv 0$, for any step size $\alpha > 0$, the sequence generated by P-EXTRA satisfies

(28)

for all .Proof: To simplify the description of the proof, de-

fine , ,, and .

By convexity of in Assumption 2, we have

(29)

Taking difference of (7) at the -th and -th iterationsyields

(30)

Combining (29) and (30), it follows that

(31)

Using the definition of , . Thus, we have

(32)

Substituting (32) into (31) yields

(33)

or equivalently

(34)

By applying the basic equality to (34), we finally have

(35)

which implies (28) and completes the proof.

The next theorem gives the $o(1/k)$ convergence rates of P-EXTRA. Its proof is omitted as it is similar to that of Theorem 2. The only difference is that the successive difference has been shown to be monotonic in P-EXTRA. Invoking fact (iv) in Proposition 1, the rates are improved from ergodic $O(1/k)$ to non-ergodic $o(1/k)$.

Theorem 3: In the same setting of Lemma 3, the following

rates hold for P-EXTRA:(i) Successive difference:

(ii) First-order optimality residuals:

Remark (Less Restriction on Step Size): We can see fromSection III-C that P-EXTRA accepts a larger range of step sizethan PG-EXTRA.

IV. NUMERICAL EXPERIMENTS

In this section, we provide three numerical experiments, decentralized geometric median, decentralized compressed sensing, and decentralized quadratic programming, to demonstrate the effectiveness of the proposed algorithms. All the experiments are conducted over the randomly generated connected network shown in Fig. 1.

In the numerical experiments, we use the relative error $\|\mathbf{x}^k - \mathbf{x}^*\|_F / \|\mathbf{x}^0 - \mathbf{x}^*\|_F$ and the successive difference as performance metrics; the former is a standard metric to assess the solution optimality and the latter evaluates the bounds of the rates proved in this paper. It is worth noting that all the decentralized algorithms numerically evaluated in this section have the same communication cost per iteration. Consequently,


Fig. 1. The underlying network for the experiments.

the amounts of information exchange over the network are proportional to their numbers of iterations.

A. Decentralized Geometric Median

Consider a decentralized geometric median problem. Each agent $i$ holds a vector $y_{(i)} \in \mathbb{R}^p$, and all the agents collaboratively calculate the geometric median $x \in \mathbb{R}^p$ of all the $y_{(i)}$. This task can be formulated as solving the following minimization problem:

$$\min_{x\in\mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \left\|x - y_{(i)}\right\|_2.$$

Computing decentralized geometric medians has interesting applications: (i) in [2], the multi-agent system locates a facility to minimize the cost of transportation in a decentralized manner; (ii) in cognitive robotics [33], a group of collaborative robots sets up a rally point such that the overall moving cost is minimal; (iii) in distributed robust Bayesian learning [34], the decentralized geometric median is also an important subproblem.

The above problem can further be generalized as the group least absolute deviations problem

$$\min_{x\in\mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \left\|y_{(i)} - A_{(i)} x\right\|_2$$

($A_{(i)}$ is the measurement matrix on agent $i$), which can be considered as a variant of cooperative least squares estimation that is also capable of detecting anomalous agents and protecting the system from the harmful effects of failed agents.

The geometric median problem is solved by P-EXTRA. The minimization subproblem in P-EXTRA has an explicit solution, stated in the sketch below.
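A sketch of that explicit solution, assuming the subproblem is the proximal map of $r_i(x) = \|x - y_{(i)}\|_2$ evaluated at $x_{(i)}^{k+\frac{3}{2}}$ with parameter $\alpha$ (how the $1/n$ factor in the objective is absorbed into $r_i$ may rescale the threshold):

$$x_{(i)}^{k+2} = y_{(i)} + \max\!\left\{0,\; 1 - \frac{\alpha}{\left\|x_{(i)}^{k+\frac{3}{2}} - y_{(i)}\right\|_2}\right\}\left(x_{(i)}^{k+\frac{3}{2}} - y_{(i)}\right).$$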

Fig. 2. Relative error and successive difference of P-EXTRA, DSM, and DADMM in the decentralized geometric median problem.

We set , that is, each point . Data aregenerated following the uniform distribution in

. The algorithm starts from.We use the Metropolis constant edge weight for and

, as well as a constant step size.

We compare P-EXTRA to DSM [22] and DADMM [21]. In DSM, at each iteration, each agent combines the local variables from its neighbors with a Metropolis constant edge weight and performs a subgradient step along the negative subgradient of its own objective with a diminishing step size. In DADMM, at each iteration, each agent updates its primal local copy by solving an optimization problem and then updates its local dual variable with simple vector operations. DADMM has a penalty coefficient as its parameter. We have hand-optimized this parameter and the step sizes for DADMM, DSM, and P-EXTRA.

The numerical results are illustrated in Fig. 2. It shows that the relative errors of DADMM and P-EXTRA both drop to a low level within 100 iterations, while DSM still has a much larger relative error before 400 iterations. P-EXTRA is better than DSM because it utilizes the problem structure, which is ignored by DSM. In this case, both P-EXTRA and DADMM can be considered as proximal point algorithms and thus have similar convergence performance.


B. Decentralized Compressed Sensing

Consider a decentralized compressed sensing problem. Each agent $i$ holds its own measurement equations, $y_{(i)} = A_{(i)} x + e_{(i)}$, where $y_{(i)} \in \mathbb{R}^{m_i}$ is a measurement vector, $A_{(i)} \in \mathbb{R}^{m_i \times p}$ is a sensing matrix, $x \in \mathbb{R}^p$ is an unknown sparse signal, and $e_{(i)}$ is an i.i.d. Gaussian noise vector. The goal is to estimate the sparse vector $x$. The number of total measurements $\sum_i m_i$ is often less than the number of unknowns $p$, which fails the ordinary least squares. We instead solve an $\ell_1$-regularized least squares problem

$$\min_{x\in\mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \left( \frac{1}{2}\left\|y_{(i)} - A_{(i)} x\right\|_2^2 + \lambda_i \|x\|_1 \right),$$

where $s_i(x) = \frac{1}{2}\|y_{(i)} - A_{(i)} x\|_2^2$, $r_i(x) = \lambda_i\|x\|_1$, and $\lambda_i$ is the regularization parameter on agent $i$. Decentralized compressed sensing is a special case of

the general distributed compressed sensing [35] where its intra-signal correlation is consensus. This case appears specifically in cooperative spectrum sensing for cognitive radio networks. More of its applications can be found in [5] and the references therein.

Considering the smooth+nonsmooth structure, PG-EXTRA

is applied here. The minimization subproblem in PG-EXTRA has an explicit solution

$$x_{(i)}^{k+2} = \operatorname{soft}_{\alpha\lambda_i}\!\left(x_{(i)}^{k+\frac{3}{2}}\right),$$

which utilizes the soft-thresholding operator $\operatorname{soft}_{\tau}(z) \triangleq \operatorname{sign}(z)\max\{|z| - \tau,\, 0\}$ in an element-wise manner.
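A minimal sketch of the per-agent ingredients for this experiment, which could be plugged into the PG-EXTRA simulation sketched after Algorithm 2 (the variable names A_i, y_i, lam_i are ours and stand for $A_{(i)}$, $y_{(i)}$, $\lambda_i$):

```python
import numpy as np

def grad_s_i(x, A_i, y_i):
    """Gradient of the local data-fidelity term 0.5 * ||y_i - A_i x||^2."""
    return A_i.T @ (A_i @ x - y_i)

def prox_r_i(v, alpha, lam_i):
    """Proximal map of alpha * lam_i * ||.||_1, i.e., element-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha * lam_i, 0.0)
```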

In this experiment, each agent holds its own set of measurements and the dimension of $x$ is $p$. Measurement matrices and noise vectors are randomly generated with their elements following an i.i.d. Gaussian distribution, and the matrices have been normalized. The signal $x$ is randomly generated and has a sparsity of 0.8 (containing 10 nonzero elements). The algorithm starts from an arbitrarily chosen initial point. In PG-EXTRA, we use the Metropolis constant edge weight for $W$ and $\tilde W$, and a constant step size.

We compare PG-EXTRA with the recent work DISTA [25], [26], which has two free parameters: a temperature parameter and a step-size-like parameter. We have hand-optimized the former and show the effect of the latter in our experiment.

The numerical results are illustrated in Fig. 3. It shows that

the relative error of PG-EXTRA drops to a low level within 1000 iterations, while DISTA still has a larger relative error when it is terminated at 4000 iterations. Both PG-EXTRA and DISTA utilize the smooth+nonsmooth structure of the problem, but PG-EXTRA achieves faster convergence.

C. Decentralized Quadratic Programming

Fig. 3. Relative error and successive difference of PG-EXTRA and DISTA in the decentralized compressed sensing problem. The constant step size for PG-EXTRA is the one given in Theorem 1; the other constant is a parameter of DISTA.

We use decentralized quadratic programming as an example to show how PG-EXTRA solves a constrained optimization problem. Each agent $i$ has a local quadratic objective $\frac{1}{2}x^{\mathsf T} Q_{(i)} x + c_{(i)}^{\mathsf T} x$ and a local linear constraint $a_{(i)}^{\mathsf T} x \le b_{(i)}$, where the symmetric positive semidefinite matrix $Q_{(i)} \in \mathbb{R}^{p\times p}$, the vectors $c_{(i)} \in \mathbb{R}^p$ and $a_{(i)} \in \mathbb{R}^p$, and the scalar $b_{(i)} \in \mathbb{R}$ are stored at agent $i$. The agents collaboratively minimize the average of the local objectives subject to all local constraints. The quadratic program is

$$\min_{x\in\mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \left(\frac{1}{2}x^{\mathsf T} Q_{(i)} x + c_{(i)}^{\mathsf T} x\right) \quad \text{subject to} \quad a_{(i)}^{\mathsf T} x \le b_{(i)}, \quad i = 1, \ldots, n.$$

We recast it as

$$\min_{x\in\mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \left(\frac{1}{2}x^{\mathsf T} Q_{(i)} x + c_{(i)}^{\mathsf T} x + \mathcal{I}_i(x)\right), \tag{36}$$

where

$$\mathcal{I}_i(x) = \begin{cases} 0, & \text{if } a_{(i)}^{\mathsf T} x \le b_{(i)},\\ +\infty, & \text{otherwise}, \end{cases}$$

is an indicator function. Setting $s_i(x) = \frac{1}{2}x^{\mathsf T} Q_{(i)} x + c_{(i)}^{\mathsf T} x$ and $r_i(x) = \mathcal{I}_i(x)$, it has the form of (1) and can be solved by PG-EXTRA. The minimization subproblem in PG-EXTRA


Fig. 4. Relative error and successive difference of PG-EXTRA, DSPM1, and DSPM2 in the decentralized quadratic programming problem. The constant step size used for PG-EXTRA is the critical one given in Theorem 1.

has an explicit solution. Indeed, for agent $i$, the solution is

$$x_{(i)}^{k+2} = \begin{cases} x_{(i)}^{k+\frac{3}{2}}, & \text{if } a_{(i)}^{\mathsf T} x_{(i)}^{k+\frac{3}{2}} \le b_{(i)},\\[4pt] x_{(i)}^{k+\frac{3}{2}} - \dfrac{a_{(i)}^{\mathsf T} x_{(i)}^{k+\frac{3}{2}} - b_{(i)}}{\|a_{(i)}\|_2^2}\, a_{(i)}, & \text{otherwise}. \end{cases}$$
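Analogously, a minimal sketch of the per-agent gradient and proximal map for this quadratic program, to be plugged into the earlier PG-EXTRA sketch (the variable names Q_i, c_i, a_i, b_i are ours and stand for $Q_{(i)}$, $c_{(i)}$, $a_{(i)}$, $b_{(i)}$):

```python
import numpy as np

def grad_s_i(x, Q_i, c_i):
    """Gradient of the local quadratic 0.5 * x^T Q_i x + c_i^T x."""
    return Q_i @ x + c_i

def prox_r_i(v, alpha, a_i, b_i):
    """Proximal map of the indicator of {x : a_i^T x <= b_i}.
    The prox of an indicator is a projection, so the step size alpha drops out."""
    viol = a_i @ v - b_i
    return v if viol <= 0 else v - (viol / (a_i @ a_i)) * a_i
```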

In this experiment, each $Q_{(i)}$ is generated by a random matrix, whose elements follow an i.i.d. Gaussian distribution, multiplying its transpose. Each $c_{(i)}$'s elements are generated following an i.i.d. Gaussian distribution. We also randomly generate the constraint data $a_{(i)}$ and $b_{(i)}$ but guarantee that the feasible set is nonempty and that the optimal solution to problem (36) is different from that of the optimization problem with the same objective as (36) but without the constraints. In this way we make sure at least one of the constraints is active. In PG-EXTRA, we use the Metropolis constant edge weight for $W$ and $\tilde W$, and a constant step size.

The numerical experiment result is illustrated in Fig. 4.

We compare PG-EXTRA with two distributed subgradient

projection methods (DSPMs) [36] (denoted as DSPM1 and DSPM2 in Fig. 4). DSPM1 assumes that each agent knows all the constraints so that DSPM can be applied to solve (36); at every iteration, each agent takes a subgradient step followed by a projection onto the intersection of all constraint sets. The projection step employs the alternating projection method [8] and its computational cost is high. To address this issue, we modify DSPM1 to DSPM2, which replaces this expensive projection with a computationally cheaper one. DSPM2 is likely to be convergent but has no theoretical guarantee. Both DSPM1 and DSPM2 use diminishing step sizes and we hand-optimize the initial step sizes.

It is shown in Fig. 4 that the relative error of PG-EXTRA drops to a low level within 4000 iterations, while DSPM1 and DSPM2 have much larger relative errors when they are terminated at 20000 iterations. PG-EXTRA is better than DSPM1 and DSPM2 because it utilizes the specific problem structure.

V. CONCLUSION

This paper attempts to solve a broad class of decentralized optimization problems with local objectives in the smooth+nonsmooth form by extending the recent method EXTRA, which integrates gradient descent with consensus averaging. We proposed PG-EXTRA, which inherits most properties of EXTRA and can take advantage of easy proximal operations on many nonsmooth functions. We proved its convergence and established its $O(1/k)$ convergence rate. The preliminary numerical results demonstrate its competitiveness, especially over the subgradient and double-loop algorithms on the tested smooth+nonsmooth problems. It remains open to improve the rate to $O(1/k^2)$ with certain acceleration techniques, and to extend our method to asynchronous and stochastic settings.

REFERENCES[1] W. Shi, Q. Ling, G. Wu, and W. Yin, “A Proximal Gradient Algo-

rithm for Decentralized Nondifferentiable Optimization,” in Proc. 40thIEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2015, pp.2964–2968.

[2] Foundations of Location Analysis, H. Eiselt and V. Marianov, Eds.New York, NY, USA: Springer, 2011.

[3] H. Lopuhaa and P. Rousseeuw, “Breakdown points of affine equi-variant estimators of multivariate location and covariance matrices,”Ann. Statist., vol. 19, no. 1, pp. 229–248, 1991.

[4] Q. Ling and Z. Tian, “Decentralized sparse signal recovery forcompressive sleeping wireless sensor networks,” IEEE Trans. SignalProcess., vol. 58, no. 7, pp. 3816–3827, 2010.

[5] G. Mateos, J. Bazerque, and G. Giannakis, “Distributed sparselinear regression,” IEEE Trans. Signal Process., vol. 58, no. 10, pp.5262–5276, 2010.

[6] S. Lee and A. Nedic, “Distributed random projection algorithm forconvex optimization,” IEEE J. Sel. Topics Signal Process., vol. 7, no.2, pp. 221–229, 2013.

[7] T. Chang, M. Hong, and X. Wang, “Multi-agent distributed optimiza-tion via inexact consensus ADMM,” IEEE Trans. Signal Process., vol.63, no. 2, pp. 482–497, 2015.

[8] C. Pang, “Set intersection problems: Supporting hyperplanes andquadratic programming,” Math. Programm., vol. 149, no. 1–2, pp.329–359, 2015.

[9] J. Tsitsiklis, “Problems in decentralized decision making and computa-tion,” Ph.D. dissertation, Electr. Eng. Comput. Sci. Dept., Massachu-setts Inst. of Technol., Cambridge, MA, USA, 1984.

[10] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronousdeterministic and stochastic gradient optimization algorithms,” IEEETrans. Autom. Control, vol. 31, no. 9, pp. 803–812, 1986.


[11] B. Johansson, “On distributed optimization in networked systems,”Ph.D. dissertation, School of Electrical Engineering, Royal Inst. ofTechnol., Stockholm, Sweden, 2008.

[12] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progressin the study of distributed multi-agent coordination,” IEEE Trans. Ind.Informat., vol. 9, no. 1, pp. 427–438, 2013.

[13] A. Sayed, “Adaptation, learning, and optimization over networks,”Found. Trends Mach. Learn., vol. 7, no. 4–5, pp. 311–801, 2014.

[14] F. Bullo, J. Cortes, and S. Martinez, Distributed Control of RoboticNetworks. Princeton, NJ, USA: Princeton Univ. Press, 2009.

[15] K. Zhou and S. Roumeliotis, “Multirobot active target tracking withcombinations of relative observations,” IEEE Trans. Robot., vol. 27,no. 4, pp. 678–695, 2010.

[16] I. Schizas, A. Ribeiro, and G. Giannakis, “Consensus in ad hocWSNswith noisy links-part I: Distributed estimation of deterministic signals,”IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.

[17] V. Kekatos and G. Giannakis, “Distributed robust power system stateestimation,” IEEE Trans. Power Syst., vol. 28, no. 2, pp. 1617–1626,2013.

[18] G. Giannakis, V. Kekatos, N. Gatsis, S. Kim, H. Zhu, and B. Wol-lenberg, “Monitoring and optimization for power grids: A signal pro-cessing perspective,” IEEE Signal Process. Mag., vol. 30, no. 5, pp.107–128, 2013.

[19] P. Forero, A. Cano, and G. Giannakis, “Consensus-based dis-tributed support vector machines,” J. Mach. Learn. Res., vol. 11, pp.1663–1707, 2010.

[20] F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi, “Distributed Au-tonomous Online Learning: Regrets and Intrinsic Privacy-PreservingProperties,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 11, pp.2483–2493, 2013.

[21] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear conver-gence of the ADMM in decentralized consensus optimization,” IEEETrans. Signal Process., vol. 62, no. 7, pp. 1750–1761, 2014.

[22] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp.48–61, 2009.

[23] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-orderalgorithm for decentralized consensus optimization,” SIAM J. Optim.,vol. 25, no. 2, pp. 944–966, 2015.

[24] A. Chen, “Fast Distributed First-Order Methods,” M.S. thesis, Electr.Eng. Comput. Sci. Dept., Massachusetts Inst. of Technol., Cambridge,MA, USA, 2012.

[25] C. Ravazzi, S. Fosson, and E. Magli, “Distributed iterative thresh-olding for -regularized linear inverse problems,” IEEE Trans.Inf. Theory, vol. 61, no. 4, pp. 2081–2100, 2015.

[26] C. Ravazzi, S. M. Fosson, and E. Magli, “Distributed soft thresholdingfor sparse signal recovery,” in Proc. IEEE Global Commun. Conf.(GLOBECOM), 2013, pp. 3429–3434.

[27] P. Bianchi,W.Hachem, and F. Iutzeler, “A stochastic primal-dual algo-rithm for distributed asynchronous composite optimization,” in Proc.2nd Global Conf. Signal Inf. Process. (GlobalSIP), 2014, pp. 732–736.

[28] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on agraph,” SIAM Rev., vol. 46, no. 4, pp. 667–689, 2004.

[29] B. He, “A new method for a class of linear variational inequalities,”Math. Programm., vol. 66, no. 1–3, pp. 137–144, 1994.

[30] W. Deng, M. Lai, Z. Peng, and W. Yin, “Parallel multi-block ADMMwith convergence,” UCLA CAM Rep. 13-64, 2013.

[31] D. Davis and W. Yin, “Convergence rate analysis of several splittingschemes,” 2014, arXiv preprint arXiv:1406.4834.

[32] D. Davis and W. Yin, “Convergence rates of relaxed Peaceman-Rach-ford and ADMM under regularity assumptions,” 2014, arXiv preprintarXiv:1407.5210.

[33] R. Ravichandran, G. Gordon, and S. Goldstein, “A scalable distributedalgorithm for shape transformation in multi-robot systems,” in Proc.IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2007, pp. 4188–4193.

[34] S. Minsker, S. Srivastava, L. Lin, and D. Dunson, “Scalable and robustBayesian inference via the median posterior,” in Proc. 31st Int. Conf.Mach. Learn. (ICML), 2014, pp. 1656–1664.

[35] D. Baron, M. Duarte, M. Wakin, S. Sarvotham, and R. Baraniuk, “Dis-tributed compressive sensing,” 2009, arXiv preprint arXiv:0901.3403.

[36] S. Ram, A. Nedic, and V. Veeravalli, “Distributed stochastic subgra-dient projection algorithms for convex optimization,” J. Optim. TheoryAppl., vol. 147, no. 3, pp. 516–545, 2010.

Wei Shi received the B.E. degree in automation from University of Science and Technology of China in 2010. He is now pursuing his Ph.D. degree in control theory and control engineering in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.

Qing Ling received the B.E. degree in automation and the Ph.D. degree in control theory and control engineering from University of Science and Technology of China in 2001 and 2006, respectively. From 2006 to 2009, he was a Post-Doctoral Research Fellow in the Department of Electrical and Computer Engineering, Michigan Technological University. Since 2009, he has been an Associate Professor in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.

Gang Wu received the B.E. degree in automation and the M.S. degree in control theory and control engineering from University of Science and Technology of China in 1986 and 1989, respectively. Since 1991, he has been in the Department of Automation, University of Science and Technology of China, where he is now a Professor. His current research interests are advanced control and optimization of industrial processes.

Wotao Yin received the B.S. degree in mathematics and applied mathematics from Nanjing University in 2001, and the M.S. and Ph.D. degrees in operations research from Columbia University in 2003 and 2006, respectively. From 2006 to 2013, he was an Assistant Professor and then an Associate Professor in the Department of Computational and Applied Mathematics, Rice University. Since 2013, he has been a Professor in the Department of Mathematics, University of California, Los Angeles. His current research interest is large-scale decentralized/distributed optimization.