A Flexible ADMM for Big Data Applications
Daniel Robinson and Rachael Tappenden, Johns Hopkins University

Introduction

We study the optimization problem:

minimize_{x_1, ..., x_n}   f(x) := \sum_{i=1}^{n} f_i(x_i)   subject to   \sum_{i=1}^{n} A_i x_i = b.     (1)

• Functions f_i : R^{N_i} → R ∪ {∞} are strongly convex with constant μ_i > 0
• Vector b ∈ R^m and matrices A_i ∈ R^{m×N_i} are problem data.
• We can write x = [x_1, ..., x_n] ∈ R^N, A = [A_1, ..., A_n], with x_i ∈ R^{N_i} and N = \sum_{i=1}^{n} N_i
• We assume a solution to (1) exists

What’s new?!

1. Present a new Flexible Alternating Direction Method of Multipliers (F-ADMM) to solve (1).
   • For strongly convex f_i; for general n ≥ 2; uses a Gauss-Seidel updating scheme.
2. Incorporate (very general) regularization matrices {P_i}_{i=1}^{n}
   • Stabilize the iterates; make subproblems easier to solve
   • Assume: {P_i}_{i=1}^{n} are symmetric and sufficiently positive definite.
3. Prove that F-ADMM is globally convergent.
4. Introduce a Hybrid ADMM variant (H-ADMM) that is partially parallelizable.
5. Special case of H-ADMM can be applied to convex functions

A Flexible ADMM (F-ADMM)

Algorithm 1 F-ADMM for solving problem (1).
1: Initialize: x^{(0)}, y^{(0)}, parameters ρ > 0, γ ∈ (0, 2), and matrices {P_i}_{i=1}^{n}
2: for k = 0, 1, 2, ... do
3:   for i = 1, ..., n do   (update the primal variables in a Gauss-Seidel fashion)

       x_i^{(k+1)} ← argmin_{x_i} { f_i(x_i) + (ρ/2) ‖ A_i x_i + \sum_{j=1}^{i-1} A_j x_j^{(k+1)} + \sum_{l=i+1}^{n} A_l x_l^{(k)} − b − y^{(k)}/ρ ‖_2^2 + (1/2) ‖ x_i − x_i^{(k)} ‖_{P_i}^2 }

4:   end for
5:   Update dual variables: y^{(k+1)} ← y^{(k)} − γρ(Ax^{(k+1)} − b)
6: end for
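To make the update structure concrete, here is a minimal Python sketch of one F-ADMM iteration: a serial Gauss-Seidel sweep over the blocks followed by a single dual step. The names (f_admm_sweep, f_blocks, A_blocks, P_blocks) are ours, not from the poster, and each block subproblem is simply handed to a generic smooth solver.

# Minimal sketch of one F-ADMM iteration (Algorithm 1); all names are illustrative.
import numpy as np
from scipy.optimize import minimize

def f_admm_sweep(f_blocks, A_blocks, P_blocks, b, x, y, rho, gamma):
    """One outer iteration: Gauss-Seidel primal sweep, then one dual update.

    x is a list of block vectors x_1, ..., x_n; y is the multiplier vector."""
    n = len(x)
    for i in range(n):
        # Residual of all *other* blocks: earlier blocks already hold their (k+1)
        # values because x is updated in place (this is the Gauss-Seidel part).
        r = sum(A_blocks[j] @ x[j] for j in range(n) if j != i) - b - y / rho
        A_i, P_i, x_i_old = A_blocks[i], P_blocks[i], x[i]

        def subproblem(x_i):
            # f_i(x_i) + (rho/2)||A_i x_i + r||^2 + (1/2)||x_i - x_i^(k)||^2_{P_i}
            # (assumes a smooth f_i; non-smooth blocks would need a prox-capable solver)
            d = x_i - x_i_old
            return (f_blocks[i](x_i)
                    + 0.5 * rho * np.sum((A_i @ x_i + r) ** 2)
                    + 0.5 * d @ (P_i @ d))

        x[i] = minimize(subproblem, x_i_old, method="L-BFGS-B").x

    y = y - gamma * rho * (sum(A_blocks[j] @ x[j] for j in range(n)) - b)
    return x, y

Because later blocks see the freshly updated earlier blocks within the same sweep, this is the Gauss-Seidel scheme of Algorithm 1 rather than the fully parallel Jacobi scheme of [1].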

Theorem 1. Under our assumptions, the sequence {(x^{(k)}, y^{(k)})}_{k≥0} generated by Algorithm 1 converges to some vector (x*, y*) that is a solution to problem (1).

A Hybrid ADMM (H-ADMM)

F-ADMM is inherently serial (it has a Gauss-Seidel-type updating scheme)
Jacobi-type (parallel) methods are imperative for big data problems
Goal: find a balance between algorithm speed via parallelization (Jacobi) and allowing up-to-date information to be fed back into the algorithm (Gauss-Seidel).
• Apply F-ADMM to “grouped data” and choose the regularization matrix carefully
• New Hybrid Gauss-Seidel/Jacobi ADMM method (H-ADMM) that is partially parallelizable

Group the data: form ℓ groups of p blocks, where n = ℓp,

x = [ x_1, ..., x_p | x_{p+1}, ..., x_{2p} | ... | x_{(ℓ−1)p+1}, ..., x_n ]   (groups denoted x̄_1, x̄_2, ..., x̄_ℓ)

and

A = [ A_1, ..., A_p | A_{p+1}, ..., A_{2p} | ... | A_{(ℓ−1)p+1}, ..., A_n ]   (groups denoted Ā_1, Ā_2, ..., Ā_ℓ)

Problem (1) becomes:

minimize_{x ∈ R^N}   f(x) ≡ \sum_{j=1}^{ℓ} f̄_j(x̄_j)   subject to   \sum_{j=1}^{ℓ} Ā_j x̄_j = b,     (2)

where f̄_j sums the f_i over the blocks in group j.

Choice of the ‘group’ regularization matrices P̄_i is crucial: choosing

P̄_i := blkdiag(P_{S_i,1}, ..., P_{S_i,p}) − ρ Ā_i^T Ā_i     (3)

(with index set S_i = {(i−1)p+1, ..., ip} for 1 ≤ i ≤ ℓ) makes the group iterations separable/parallelizable!
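To see why, write c for everything in the group-i subproblem that does not depend on x̄_i (the other groups, b, and y^{(k)}/ρ), and d = x̄_i − x̄_i^{(k)} with blocks d_1, ..., d_p. A short expansion (our own sketch, up to additive constants) gives

(ρ/2) ‖ Ā_i x̄_i + c ‖_2^2 + (1/2) ‖ d ‖_{P̄_i}^2  =  (1/2) \sum_{t=1}^{p} d_t^T P_{S_i,t} d_t + (terms linear in the individual blocks) + const,

because the quadratic coupling (ρ/2) x̄_i^T Ā_i^T Ā_i x̄_i contributed by the first norm is cancelled by the −ρ Ā_i^T Ā_i part of P̄_i in (3). Every remaining term touches only a single block (recall Ā_i x̄_i = \sum_{j∈S_i} A_j x_j), so the p blocks of group i can be updated independently; this is what step 5 of Algorithm 2 exploits.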

Algorithm 2 H-ADMM for solving problem (1).
1: Initialize: x^{(0)}, y^{(0)}, ρ > 0 and γ ∈ (0, 2), index sets {S_i}_{i=1}^{ℓ}, matrices {P_i}_{i=1}^{n}.
2: for k = 0, 1, 2, ... do
3:   for i = 1, ..., ℓ (in a serial Gauss-Seidel fashion solve) do
4:     Set b_i ← b − \sum_{q=1}^{i−1} Ā_q x̄_q^{(k+1)} − \sum_{s=i}^{ℓ} Ā_s x̄_s^{(k)} + y^{(k)}/ρ
5:     for j ∈ S_i (in parallel Jacobi fashion solve) do

         x_j^{(k+1)} ← argmin_{x_j} { f_j(x_j) + (ρ/2) ‖ A_j (x_j − x_j^{(k)}) − b_i ‖_2^2 + (1/2) ‖ x_j − x_j^{(k)} ‖_{P_j}^2 }

6:     end for
7:   end for
8:   Update the dual variables: y^{(k+1)} ← y^{(k)} − γρ(Ax^{(k+1)} − b).
9: end for

Key features of H-ADMM

• H-ADMM is a special case of F-ADMM (convergence is automatic from F-ADMM theory)
• Groups updated in Gauss-Seidel fashion; blocks within a group updated in Jacobi fashion
  – Decision variables x_j for j ∈ S_i are solved for in parallel!
• Regularization matrices P̄_i:
  – Never explicitly form P̄_i for i = 1, ..., ℓ. Only need P_1, ..., P_n
  – Regularization matrix P̄_i makes the subproblem for group x̄_i^{(k+1)} separable into p blocks
  – Assume: matrices P̄_i are symmetric and sufficiently positive definite
• Computational considerations
  – “Optimize” H-ADMM to the number of processors: for a machine with p processors, set the group size to p too (see the dispatch sketch below)
  – Still using “new” information, just not as much as in the ‘individual block’ setting
  – Special case: Hybrid ADMM with ℓ = 2 ⇒ convergence holds for convex f_i
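As a structural illustration only (not from the poster), the inner Jacobi loop of Algorithm 2 maps naturally onto a pool of p workers; h_admm_group_update and update_block below are our own placeholder names for whichever per-block subproblem solver is used.

# Dispatch the blocks of one group to p workers (illustrative structure only).
from concurrent.futures import ThreadPoolExecutor

def h_admm_group_update(update_block, group, x, b_i, p):
    """Jacobi step for one group: update_block(j, x[j], b_i) returns the new x_j."""
    with ThreadPoolExecutor(max_workers=p) as pool:
        new_values = list(pool.map(lambda j: update_block(j, x[j], b_i), group))
    for j, v in zip(group, new_values):   # write back only after all blocks are computed
        x[j] = v

In practice the per-block work must be heavy enough (or release the GIL, e.g. inside NumPy) for threads or processes to pay off; the point is only that every block in the group reads the same b_i and the same x^{(k)}, so the updates are independent.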

Numerical Experiments

Determine the solution to an underdetermined system of equations with the smallest 2-norm:

minimize_{x ∈ R^N}   (1/2) ‖x‖_2^2   subject to   Ax = b.     (4)

• n = 100 blocks; p = 10 processors; ℓ = 10 groups; block size N_i = 100 ∀i
• Convexity parameter μ_i = 1 ∀i; A is 3·10^3 × 10^4 and sparse; noiseless data
• Algorithm parameters: ρ = 0.1, γ = 1; stopping condition: (1/2)‖Ax − b‖_2^2 ≤ 10^{−10}.

Choosing P_j = τ_j I − ρ A_j^T A_j  ∀ j ∈ S_i, for some τ_j > 0, gives the closed-form update

x_j^{(k+1)} = (τ_j / (τ_j + 1)) x_j^{(k)} + (ρ / (τ_j + 1)) A_j^T b_i    ∀ j ∈ S_i.
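Below is a self-contained NumPy sketch of H-ADMM on problem (4) using this closed-form update. The problem sizes, the single shared value of τ (taken as the most conservative tuned value from the table later in this section, rescaled to this A), and all variable names are our own illustrative choices rather than the reported experimental setup, and the inner block loop is written serially even though its iterations are independent.

# Minimal NumPy sketch of H-ADMM for problem (4) with the closed-form block update.
import numpy as np

rng = np.random.default_rng(0)
m, n_blocks, block_size, p = 30, 8, 5, 4           # A is m x (n_blocks * block_size)
ell = n_blocks // p                                # number of groups
rho, gamma = 0.1, 1.0

A_blocks = [rng.standard_normal((m, block_size)) for _ in range(n_blocks)]
A = np.hstack(A_blocks)
b = A @ rng.standard_normal(n_blocks * block_size)     # consistent, noiseless data
tau = 0.5 * rho**2 * np.linalg.norm(A, 2) ** 4         # (rho^2/2)*||A||^4, a conservative choice

x = [np.zeros(block_size) for _ in range(n_blocks)]
y = np.zeros(m)

for k in range(50000):
    for i in range(ell):                           # groups: serial, Gauss-Seidel
        # b_i (step 4 of Algorithm 2): blocks in earlier groups already hold their
        # (k+1) values because x is updated in place; the rest still hold x^(k).
        b_i = b - sum(Aj @ xj for Aj, xj in zip(A_blocks, x)) + y / rho
        for j in range(i * p, (i + 1) * p):        # blocks within the group: independent (Jacobi)
            x[j] = (tau * x[j] + rho * A_blocks[j].T @ b_i) / (tau + 1.0)
    residual = A @ np.concatenate(x) - b
    y = y - gamma * rho * residual
    if 0.5 * residual @ residual <= 1e-10:         # stopping rule used in the experiments
        break

x_min_norm = np.linalg.pinv(A) @ b                 # minimum-norm solution of Ax = b, for reference
print("epochs:", k + 1, "  distance to min-norm solution:", np.linalg.norm(np.concatenate(x) - x_min_norm))

Setting p = 1 (so ℓ = n) recovers the fully serial block-wise F-ADMM sweep, while ℓ = 1 gives a fully Jacobi sweep similar in spirit to J-ADMM [1]; this is how the hybrid interpolates between the two extremes.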

We compare F-ADMM and H-ADMM with Jacobi ADMM (J-ADMM) [1]. (J-ADMM is similar to F-ADMM, but the blocks are updated using a Jacobi-type scheme.)

Results using theoretical τj values

[Figure: magnitude of τ_j for each block (curves for J-ADMM, F-ADMM, H-ADMM; vertical axis roughly 0–300).]

Method   Epochs
J-ADMM   4358.2
H-ADMM    214.1
F-ADMM    211.3

Left plot: the magnitude of τ_j for each block 1 ≤ j ≤ n for problem (4). The τ_j values are much smaller for H-ADMM and F-ADMM than for J-ADMM; a larger τ_j corresponds to a ‘more positive definite’ regularization matrix P_j. Right table: number of epochs required by J-ADMM, F-ADMM, and H-ADMM for problem (4) using theoretical values of τ_j for j = 1, ..., n (averaged over 100 runs). H-ADMM and F-ADMM require significantly fewer epochs than J-ADMM using theoretical values of τ_j.

Results using parameter tuning

Parameter tuning can help practical performance! Assign the same value τ_j to all blocks j = 1, ..., n and all algorithms (i.e., τ_1 = τ_2 = ··· = τ_n).

τ_j                     J-ADMM   H-ADMM   F-ADMM
(ρ²/2)‖A‖⁴               530.0    526.3    526.2
0.6 · (ρ²/2)‖A‖⁴         324.0    320.1    319.9
0.4 · (ρ²/2)‖A‖⁴         217.7    214.5    214.1
0.22 · (ρ²/2)‖A‖⁴        123.1    119.3    119.0
0.2 · (ρ²/2)‖A‖⁴           —       95.8     95.5
0.1 · (ρ²/2)‖A‖⁴           —       75.3     73.0

The table presents the number of epochs required by J-ADMM, F-ADMM, and H-ADMM on problem (4) as τ_j varies. For each τ_j we run each algorithm (J-ADMM, F-ADMM, and H-ADMM) on 100 random instances of the problem formulation described above. Here, τ_j takes the same value for all blocks j = 1, ..., n. All algorithms require fewer epochs as τ_j decreases, and F-ADMM requires the smallest number of epochs.

• As τ_j decreases, the number of epochs decreases
• For fixed τ_j, F-ADMM and H-ADMM outperform J-ADMM
• F-ADMM converges for very small τ_j values

References
1. Wei Deng, Ming-Jun Lai, Zhimin Peng, and Wotao Yin, “Parallel Multi-Block ADMM with o(1/k) Convergence”, Technical Report, UCLA (2014).
2. Daniel Robinson and Rachael Tappenden, “A Flexible ADMM Algorithm for Big Data Applications”, Technical Report, JHU, arXiv:1502.04391 (2015).

{daniel.p.robinson,rtappen1}@jhu.edu
