8/12/2019 Aug Lagrangian and ADMM
ST810 Lecture 24
Augmented Lagrangian method
Outline
Augmented Lagrangian method
ADMM
Final words
Augmented Lagrangian method
Consider minimizing f(x) subject to equality constraints g_i(x) = 0 for i = 1, \ldots, q.
Inequality constraints are ignored for simplicity.
Assume f and the g_i are smooth for simplicity.
At a constrained minimum, the Lagrange multiplier condition
  0 = \nabla f(x) + \sum_{i=1}^q \lambda_i \nabla g_i(x)
holds provided the gradients \nabla g_i(x) are linearly independent.
Augmented Lagrangian:
  \mathcal{L}_\rho(x, \lambda) = f(x) + \sum_{i=1}^q \lambda_i g_i(x) + \frac{\rho}{2} \sum_{i=1}^q g_i(x)^2
The penalty term (\rho/2) \sum_{i=1}^q g_i(x)^2 punishes violations of the equality constraints g_i(x) = 0.
Idea: optimize the augmented Lagrangian and adjust \lambda in the hope of matching the true Lagrange multipliers.
For large enough (finite) \rho, the unconstrained minimizer of the augmented Lagrangian coincides with the constrained solution of the original problem.
At convergence, the penalty gradient \rho \sum_i g_i(x) \nabla g_i(x) vanishes (since g_i(x) = 0) and we recover the standard multiplier rule.
Algorithm: take \rho initially large or gradually increase it; iterate:
Find the unconstrained minimum
  x^{(t+1)} \leftarrow \arg\min_x \mathcal{L}_\rho(x, \lambda^{(t)})
Update the multiplier vector
  \lambda_i^{(t+1)} \leftarrow \lambda_i^{(t)} + \rho g_i(x^{(t+1)}), \quad i = 1, \ldots, q
Intuition for updating \lambda: if x^{(t+1)} is the unconstrained minimum of \mathcal{L}_\rho(x, \lambda^{(t)}), then the stationarity condition says
  0 = \nabla f(x^{(t+1)}) + \sum_{i=1}^q \lambda_i^{(t)} \nabla g_i(x^{(t+1)}) + \rho \sum_{i=1}^q g_i(x^{(t+1)}) \nabla g_i(x^{(t+1)})
    = \nabla f(x^{(t+1)}) + \sum_{i=1}^q \left[ \lambda_i^{(t)} + \rho g_i(x^{(t+1)}) \right] \nabla g_i(x^{(t+1)})
so the updated multipliers \lambda^{(t+1)} are exactly the bracketed quantities in the Lagrange multiplier condition.
For non-smooth f, replace the gradient \nabla f by the subdifferential \partial f.
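To make the two steps concrete, here is a minimal NumPy sketch (our illustration, not from the slides) on a toy problem: minimize \|x\|_2^2 subject to x_1 + x_2 - 1 = 0, whose constrained solution is x^* = (1/2, 1/2) with multiplier \lambda^* = -1. The function names, step sizes, and iteration counts are all illustrative choices; the inner minimization is plain gradient descent.

```python
import numpy as np

# Toy problem: minimize f(x) = ||x||^2 subject to g(x) = x[0] + x[1] - 1 = 0.
# Constrained optimum: x* = (0.5, 0.5) with multiplier lambda* = -1.
f_grad = lambda x: 2.0 * x
g = lambda x: x[0] + x[1] - 1.0
g_grad = lambda x: np.array([1.0, 1.0])

def aug_lagrangian(x, lam, rho=10.0, outer=50, inner=200, lr=1e-2):
    """Augmented Lagrangian method: inner gradient descent + multiplier update."""
    for _ in range(outer):
        # x-update: (approximately) minimize L_rho(x, lam) with lam fixed
        for _ in range(inner):
            grad = f_grad(x) + lam * g_grad(x) + rho * g(x) * g_grad(x)
            x = x - lr * grad
        # multiplier update: lam <- lam + rho * g(x)
        lam = lam + rho * g(x)
    return x, lam

x_star, lam_star = aug_lagrangian(np.array([0.0, 0.0]), 0.0)
# x_star is approx (0.5, 0.5) and lam_star is approx -1
```

Note that the multiplier update is exactly the bracketed quantity from the stationarity condition evaluated at the inner minimizer.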
Example: basis pursuit
The basis pursuit problem seeks the sparsest solution subject to linear constraints:
  minimize \|x\|_1
  subject to Ax = b
Take \rho initially large or gradually increase it; iterate according to
  x^{(t+1)} \leftarrow \arg\min_x \|x\|_1 + \langle \lambda^{(t)}, Ax - b \rangle + \frac{\rho}{2} \|Ax - b\|_2^2 \quad (lasso)
  \lambda^{(t+1)} \leftarrow \lambda^{(t)} + \rho (Ax^{(t+1)} - b)
Converges in a finite (small) number of steps (Yin et al., 2008).
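A sketch of this iteration in NumPy (our illustration, not from the slides): the inner lasso subproblem is solved approximately by warm-started ISTA rather than a dedicated lasso solver, and all names and defaults are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the prox operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def basis_pursuit_al(A, b, rho=1.0, outer=100, inner=500):
    """Augmented Lagrangian (Bregman) iteration for min ||x||_1 s.t. Ax = b.
    Each lasso subproblem is solved approximately by warm-started ISTA."""
    m, n = A.shape
    x, lam = np.zeros(n), np.zeros(m)
    step = 1.0 / (rho * np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(outer):
        for _ in range(inner):
            # ISTA step on ||x||_1 + <lam, Ax - b> + (rho/2)||Ax - b||^2
            grad = A.T @ (lam + rho * (A @ x - b))
            x = soft_threshold(x - step * grad, step)
        lam = lam + rho * (A @ x - b)               # multiplier update
    return x

# Demo: recover a 3-sparse vector from 20 random linear measurements
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 40]] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = basis_pursuit_al(A, b)
```

A production implementation would replace ISTA with FISTA or a coordinate-descent lasso solver, but the outer structure (lasso solve, then multiplier update) is the same.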
Remarks
The augmented Lagrangian method dates back to the late 1960s (Hestenes, 1969; Powell, 1969); the monograph by Bertsekas (1982) provides a general treatment.
Same as the Bregman iteration (Yin et al., 2008) proposed for basis pursuit (compressive sensing).
Equivalent to the proximal point algorithm applied to the dual; can be accelerated (Nesterov).
ADMM
Alternating direction method of multipliers.
Consider minimizing f(x) + g(y) subject to the affine constraint Ax + By = c.
The augmented Lagrangian is
  \mathcal{L}_\rho(x, y, \lambda) = f(x) + g(y) + \langle \lambda, Ax + By - c \rangle + \frac{\rho}{2} \|Ax + By - c\|_2^2
Idea: perform block descent on x and y and then update the multiplier vector:
  x^{(t+1)} \leftarrow \arg\min_x f(x) + \langle \lambda^{(t)}, Ax + By^{(t)} - c \rangle + \frac{\rho}{2} \|Ax + By^{(t)} - c\|_2^2
  y^{(t+1)} \leftarrow \arg\min_y g(y) + \langle \lambda^{(t)}, Ax^{(t+1)} + By - c \rangle + \frac{\rho}{2} \|Ax^{(t+1)} + By - c\|_2^2
  \lambda^{(t+1)} \leftarrow \lambda^{(t)} + \rho (Ax^{(t+1)} + By^{(t+1)} - c)
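As a concrete instance, taking f(x) = \frac{1}{2}\|Mx - b\|_2^2, g(y) = \tau\|y\|_1, A = I, B = -I, c = 0 recovers the lasso. A minimal NumPy sketch (our own illustration; variable names and defaults are not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the prox operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(M, b, tau, rho=1.0, iters=500):
    """ADMM for min (1/2)||Mx - b||^2 + tau * ||y||_1 subject to x - y = 0,
    i.e. the lasso in the f(x) + g(y), Ax + By = c form with A = I, B = -I, c = 0."""
    n = M.shape[1]
    x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    L = np.linalg.cholesky(M.T @ M + rho * np.eye(n))  # factor once, reuse
    for _ in range(iters):
        # x-update: quadratic, closed form via the cached Cholesky factor
        rhs = M.T @ b + rho * y - lam
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # y-update: prox of tau * ||.||_1 (soft-thresholding)
        y = soft_threshold(x + lam / rho, tau / rho)
        # multiplier update
        lam = lam + rho * (x - y)
    return y

# Sanity check with M = I: the lasso solution is soft_threshold(b, tau)
y_hat = admm_lasso(np.eye(3), np.array([3.0, -0.5, 1.0]), tau=1.0)
# y_hat is approx (2.0, 0.0, 0.0)
```

Caching the Cholesky factor of M^T M + \rho I makes every x-update two triangular solves, which is why ADMM scales well here.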
Example: fused lasso
The fused lasso problem minimizes
  \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j|
Define \theta = D\beta, where D is the (p-1) \times p first-difference matrix
  D = \begin{pmatrix} -1 & 1 & & \\ & -1 & 1 & \\ & & \ddots & \ddots \end{pmatrix}
so that (D\beta)_j = \beta_{j+1} - \beta_j.
Then we minimize \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\theta\|_1 subject to D\beta = \theta.
The augmented Lagrangian is
  \mathcal{L}_\rho(\beta, \theta, \nu) = \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\theta\|_1 + \nu^T (D\beta - \theta) + \frac{\rho}{2} \|D\beta - \theta\|_2^2
ADMM:
The \beta update is a smooth quadratic problem.
The \theta update is a separable lasso problem (elementwise soft-thresholding).
Update the multipliers:
  \nu^{(t+1)} \leftarrow \nu^{(t)} + \rho (D\beta^{(t+1)} - \theta^{(t+1)})
The same algorithm applies to a general regularization matrix D (generalized lasso).
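The three updates can be sketched in NumPy as follows (our illustration, not from the slides; the demo uses X = I, and names and defaults are ours):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the prox operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_fused_lasso(X, y, lam, rho=1.0, iters=1000):
    """ADMM for min (1/2)||y - X beta||^2 + lam * ||theta||_1
    subject to D beta = theta, with D the first-difference matrix."""
    n, p = X.shape
    D = np.diff(np.eye(p), axis=0)           # (p-1) x p, rows (-1, 1)
    beta, theta, nu = np.zeros(p), np.zeros(p - 1), np.zeros(p - 1)
    K = X.T @ X + rho * D.T @ D              # beta-update system matrix
    for _ in range(iters):
        # beta-update: smooth quadratic -> linear system
        beta = np.linalg.solve(K, X.T @ y + D.T @ (rho * theta - nu))
        # theta-update: separable lasso -> elementwise soft-thresholding
        theta = soft_threshold(D @ beta + nu / rho, lam / rho)
        # multiplier update
        nu = nu + rho * (D @ beta - theta)
    return beta

# Demo: with X = I and a large penalty, the estimate fuses to the mean of y
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0])
beta_hat = admm_fused_lasso(np.eye(5), y, lam=100.0)
# beta_hat is approx (2.6, 2.6, 2.6, 2.6, 2.6)
```

Swapping D for another regularization matrix gives the generalized lasso with no change to the algorithm; in practice one would also factor K once (e.g. by Cholesky) instead of re-solving each iteration.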
Remarks on ADMM
Related algorithms: the split Bregman iteration (Goldstein and Osher, 2009), Dykstra's (1983) alternating projection algorithm, ...
Proximal point algorithm applied to the dual.
Numerous applications in statistics and machine learning: lasso, generalized lasso, graphical lasso, (overlapping) group lasso, ...
Embraces distributed computing for big data (Boyd et al., 2011).
Final words
Take-home messages from this course
Statistics, the science of data analysis, is the applied mathematics of the 21st century.
Read the first few pages and the last few pages of Tukey (1962)'s "The future of data analysis" (posted on course website). They are a must for every statistician.
Big data era: wiki, WSJ, White House, McKinsey report, ...
Challenges:
  methodology: big p
  efficiency: big n and/or big p
  memory: big n; distributed computing via MapReduce (Hadoop), online algorithms
Links:
http://en.wikipedia.org/wiki/Big_data
http://online.wsj.com/article/SB10001424127887323751104578147311334491922.html
http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
http://www.slideshare.net/fred.zimny/mckinsey-quarterlys-2011-report-the-challenge-and-opportunityof-big-data
Coding:
  Prototyping: R and Matlab
  A real programming language: C/C++, Fortran, Python
  Scripting languages: Python, Perl, JavaScript
Numerical linear algebra: use standard libraries (BLAS, LAPACK)! Sparse linear algebra is critical for exploiting sparsity structure in big data.
Optimization:
  Disciplined convex programming (LS, LP, QP, GP, SOCP, SDP)
  Convex programming is becoming a technology, just like least squares (LS). Many statisticians don't realize this.
  Specialized tools in statistics: EM/MM, Fisher scoring, Gauss-Newton, simulated annealing, ...
  Combinatorial optimization techniques: divide-and-conquer, dynamic programming, greedy algorithms, ...
About the final project
In your presentation:
  describe your research question
  describe which variables/features in the data are used
  describe the preprocessing procedure
  describe implementation details: language, software, algorithm, timing, ...
  describe the difficulties you met: which approaches are or are not working?
Send us your slides before your presentation so we can give better feedback.
(Partial) answers to your questions
Wei's group lasso (with equality constraint) problem: SOCP, accelerated proximal gradient (Nesterov) method, ADMM.
Tian's fused-lasso problem: QP, ADMM, accelerated proximal gradient method coupled with dynamic programming, re-parameterization to lasso, path algorithm.
Kehui's composite quantile regression problem: LP (although the original problem is non-convex).
Shikai: SDP.
Feel free to ask more.
References
Bertsekas, D. P. (1982). Constrained Optimization and Lagrange Multiplier Methods. Computer Science and Applied Mathematics. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122.
Dykstra, R. L. (1983). An algorithm for restricted least squares regression. J. Amer. Statist. Assoc., 78(384):837–842.
Goldstein, T. and Osher, S. (2009). The split Bregman method for l1-regularized problems. SIAM J. Img. Sci., 2:323–343.
Hestenes, M. R. (1969). Multiplier and gradient methods. J. Optimization Theory Appl., 4:303–320.
Powell, M. J. D. (1969). A method for nonlinear constraints in minimization problems. In Optimization (Sympos., Univ. Keele, Keele, 1968), pages 283–298. Academic Press, London.
Tukey, J. W. (1962). The future of data analysis. Ann. Math. Statist., 33:1–67.
Yin, W., Osher, S., Goldfarb, D., and Darbon, J. (2008). Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM J. Imaging Sci., 1(1):143–168.