
Alternating Minimization (and Friends)

Lecture 7: 6.883, Spring 2016

Suvrit Sra

Massachusetts Institute of Technology

Feb 29, 2016

Background: Coordinate Descent

For $x \in \mathbb{R}^n$ consider

$$\min\; f(x) = f(x_1, x_2, \ldots, x_n)$$

Ancient idea: optimize over individual coordinates.

Coordinate descent

Coordinate descent: for $k = 0, 1, \ldots$

Pick an index $i$ from $\{1, \ldots, n\}$.
Optimize the $i$-th coordinate:

$$x_i^{k+1} \leftarrow \operatorname*{argmin}_{\xi \in \mathbb{R}}\; f\bigl(\underbrace{x_1^{k+1}, \ldots, x_{i-1}^{k+1}}_{\text{done}},\; \underbrace{\xi}_{\text{current}},\; \underbrace{x_{i+1}^{k}, \ldots, x_n^{k}}_{\text{todo}}\bigr)$$

Decide when/how to stop; return $x^k$.

($x_i^{k+1}$ overwrites the value in $x_i^k$ in an implementation)
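To make the loop above concrete, here is a minimal sketch of cyclic coordinate descent in Python. The use of scipy.optimize.minimize_scalar as the one-dimensional inner solver and the small quadratic test function are illustrative assumptions, not part of the lecture.

```python
# A minimal sketch of cyclic coordinate descent: each 1-D subproblem is solved
# numerically with scipy.optimize.minimize_scalar (one possible inner solver).
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, max_sweeps=100, tol=1e-8):
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_sweeps):
        x_old = x.copy()
        for i in range(x.size):            # pick coordinates cyclically
            def f_i(xi):                   # f restricted to coordinate i
                x[i] = xi
                return f(x)
            x[i] = minimize_scalar(f_i).x  # overwrite x_i in place
        if np.linalg.norm(x - x_old) <= tol * (1 + np.linalg.norm(x_old)):
            break                          # stop when a full sweep barely moves x
    return x

# Example: a simple strongly convex quadratic (made-up test problem)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = coordinate_descent(lambda x: 0.5 * x @ Q @ x - b @ x, np.zeros(2))
```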

Coordinate descent

♣ One of the simplest optimization methods
♣ Various ideas for the next coordinate to optimize
♣ Old idea: Gauss-Seidel, Jacobi methods for Ax = b
♣ Can be “slow”; sometimes very competitive
♣ Gradient, stochastic gradient also “slow”
♣ But scalable (e.g., libsvm)
♣ Renewed interest; esp. stochastic CD
♣ Notice: in general CD is “derivative free”

Example: Least-squares

Assume $A \in \mathbb{R}^{m \times n}$

$$\min\; \|Ax - b\|_2^2$$

Coordinate descent update:

$$x_j \leftarrow \frac{\sum_{i=1}^{m} a_{ij}\bigl(b_i - \sum_{l \neq j} a_{il} x_l\bigr)}{\sum_{i=1}^{m} a_{ij}^2}$$

(dropped superscripts, since we overwrite)
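A minimal sketch of this update in Python, keeping a running residual so each coordinate step costs O(m). The random test data and the assumption that no column of A is identically zero are mine, not the slide's.

```python
# Cyclic coordinate descent for min ||Ax - b||_2^2 using the update above.
import numpy as np

def cd_least_squares(A, b, sweeps=50):
    m, n = A.shape
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)              # precompute sum_i a_ij^2
    r = b - A @ x                               # residual b - Ax, kept up to date
    for _ in range(sweeps):
        for j in range(n):
            # r + A[:, j] * x[j] equals b_i - sum_{l != j} a_il x_l, entrywise
            new_xj = A[:, j] @ (r + A[:, j] * x[j]) / col_sq[j]
            r += A[:, j] * (x[j] - new_xj)      # update residual incrementally
            x[j] = new_xj                       # overwrite (no superscripts needed)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)
x_cd = cd_least_squares(A, b)
# x_cd approaches the least-squares solution np.linalg.lstsq(A, b, rcond=None)[0]
```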

Coordinate descent remarks

Advantages
♦ Each iteration usually cheap (single variable optimization)
♦ No extra storage vectors needed
♦ No stepsize tuning
♦ No other pesky parameters that must be tuned
♦ Simple to implement
♦ Works well for large-scale problems
♦ Currently quite popular; parallel versions exist

Disadvantages
♠ Tricky if single variable optimization is hard
♠ Convergence theory can be complicated
♠ Can slow down near optimum
♠ Non-differentiable case more tricky

Block coordinate descent (BCD)

$$\min\; f(x) := f(x_1, \ldots, x_m), \qquad x \in \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_m.$$

Gauss-Seidel update

$$x_i^{k+1} \leftarrow \operatorname*{argmin}_{\xi \in \mathcal{X}_i}\; f\bigl(\underbrace{x_1^{k+1}, \ldots, x_{i-1}^{k+1}}_{\text{done}},\; \underbrace{\xi}_{\text{current}},\; \underbrace{x_{i+1}^{k}, \ldots, x_m^{k}}_{\text{todo}}\bigr)$$

Jacobi update (easy to parallelize)

$$x_i^{k+1} \leftarrow \operatorname*{argmin}_{\xi \in \mathcal{X}_i}\; f\bigl(\underbrace{x_1^{k}, \ldots, x_{i-1}^{k}}_{\text{don't clobber}},\; \underbrace{\xi}_{\text{current}},\; \underbrace{x_{i+1}^{k}, \ldots, x_m^{k}}_{\text{todo}}\bigr)$$
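A minimal sketch contrasting the two sweeps on a strongly convex quadratic $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, where each block subproblem has a closed form. The quadratic test problem and the fixed two-block partition are assumptions for illustration.

```python
# Gauss-Seidel vs. Jacobi block updates on a quadratic: the block minimizer
# over index set I solves Q[I,I] x_I = b[I] - Q[I,J] x_J.
import numpy as np

def block_argmin(Q, b, x, I):
    # exact minimizer over block I with the other coordinates of x held fixed
    J = np.setdiff1d(np.arange(len(x)), I)
    return np.linalg.solve(Q[np.ix_(I, I)], b[I] - Q[np.ix_(I, J)] @ x[J])

def gauss_seidel_sweep(Q, b, x, blocks):
    for I in blocks:
        x[I] = block_argmin(Q, b, x, I)        # later blocks see the fresh values
    return x

def jacobi_sweep(Q, b, x, blocks):
    x_old = x.copy()                           # "don't clobber": every block reads
    for I in blocks:                           # the old iterate, so the loop could
        x[I] = block_argmin(Q, b, x_old, I)    # run in parallel
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Q, b = M @ M.T + np.eye(6), rng.standard_normal(6)
blocks = [np.arange(0, 3), np.arange(3, 6)]
x = np.zeros(6)
for _ in range(50):
    x = gauss_seidel_sweep(Q, b, x, blocks)    # converges to np.linalg.solve(Q, b)
```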

Two block BCD

$$\min\; f(x, y), \quad \text{s.t. } x \in \mathcal{X},\; y \in \mathcal{Y}.$$

Theorem (Grippo & Sciandrone (2000)). Let $f$ be continuously differentiable, and let $\mathcal{X}$, $\mathcal{Y}$ be closed and convex sets. Assume both subproblems have solutions and that the sequence $(x^k, y^k)$ has limit points. Then every limit point is stationary.

▸ Subproblems need not have unique solutions
▸ BCD for 2 blocks is also known as Alternating Minimization

AltMin


Clustering

[Figure: an example data matrix shown three ways: the original matrix; the clustered matrix, after clustering and permutation of the columns; and the co-clustered matrix, after co-clustering and permutation of rows and columns.]

Clustering

▸ Let $X \in \mathbb{R}^{m \times n}$ be the input matrix
▸ Cluster columns of $X$
▸ The well-known k-means clustering problem can be written as

$$\min_{B, C}\; \tfrac{1}{2}\|X - BC\|_F^2 \quad \text{s.t. } CC^T = \mathrm{Diag}(\text{sizes})$$

where $B \in \mathbb{R}^{m \times k}$ and $C \in \{0, 1\}^{k \times n}$.
▸ Optimization problem with 2 blocks; $\min F(B, C)$ (see the sketch below)

Exercise: Write co-clustering in matrix form. Hint: Write using 3 blocks.
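As referenced above, a minimal sketch of k-means viewed as two-block alternating minimization of $F(B, C) = \tfrac{1}{2}\|X - BC\|_F^2$: the C-step assigns each column of X to its nearest centroid, the B-step recomputes the centroids. The initialization from random data columns and the synthetic two-cluster example are assumptions for illustration.

```python
# k-means as alternating minimization over centroids B (m x k) and the 0/1
# assignment matrix C (k x n).
import numpy as np

def kmeans_altmin(X, k, iters=50, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    B = X[:, rng.choice(n, size=k, replace=False)]            # init centroids from data
    for _ in range(iters):
        # C-step: assign each column of X to its nearest centroid
        d2 = ((X[:, None, :] - B[:, :, None]) ** 2).sum(axis=0)  # k x n squared distances
        assign = d2.argmin(axis=0)
        C = np.zeros((k, n))
        C[assign, np.arange(n)] = 1.0
        # B-step: each centroid is the mean of its assigned columns
        counts = np.maximum(C.sum(axis=1), 1.0)               # avoid division by zero
        B = (X @ C.T) / counts
    return B, C

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0, 1, (2, 50)), rng.normal(5, 1, (2, 50))])
B, C = kmeans_altmin(X, k=2)
```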

Matrix Completion

Given a matrix with missing entries, fill in the rest.
Recall the Netflix million-$ prize problem:
given user-movie ratings, recommend movies to users.

Matrix Completion

Input: matrix A with missing entries.
“Predict” missing entries to “complete” the matrix.
Netflix: movies × users matrix; available entries were ratings given to movies by users.
Task: predict missing entries.
Winning methods based on low-rank matrix completion.

$$A_{m \times n} \;\approx\; B_{m \times k}\, C_{k \times n}$$

Task: recover matrix A given a sampling of its entries.
Theorem: can recover most low-rank matrices!

Matrix completion

$$\min\; \mathrm{rank}(X) \quad \text{s.t. } X_{ij} = A_{ij},\ \forall (i,j) \in \Omega = \text{rating pairs}$$

another formulation

$$\min\; \sum_{(i,j) \in \Omega} (X_{ij} - A_{ij})^2 \quad \text{s.t. } \mathrm{rank}(X) \le k.$$

Both are NP-Hard problems.

convex relaxation

$$\mathrm{rank}(X) \le k \;\mapsto\; \Bigl(\|X\|_* := \sum\nolimits_{j=1}^{m} \sigma_j(X)\Bigr) \le k$$

Candès and Recht prove that the convex relaxation solves matrix completion (under assumptions on Ω and A).

Matrix completion

▸ The convex relaxation does not scale well
▸ A commonly used heuristic is Alternating Minimization
▸ Write $X = BC$ where $B$ is $m \times k$, $C$ is $k \times n$

$$\min_{B, C}\; F(B, C) := \|P_\Omega(A) - P_\Omega(BC)\|_F^2,$$

where $[P_\Omega(X)]_{ij} = X_{ij}$ for $(i, j) \in \Omega$, and 0 otherwise.

Result:
▸ Initialize B, C using the SVD of $P_\Omega(A)$
▸ AltMin iterations to compute B and C
▸ It can be shown (Jain, Netrapalli, Sanghavi 2012), under assumptions on Ω (uniform sampling) and A (incoherence, most entries similar in magnitude), that AltMin generates B and C such that $\|A - BC\|_F \le \epsilon$ after $O(\log(1/\epsilon))$ steps.
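A minimal sketch of the AltMin heuristic for this objective. For brevity it uses random initialization and a tiny ridge term in each least-squares solve (the guarantee quoted above assumes an SVD-based initialization instead); Ω is represented by a boolean mask, and the test problem is synthetic.

```python
# AltMin for F(B, C) = ||P_Omega(A) - P_Omega(B C)||_F^2: each subproblem splits
# into a small least-squares solve per column of C / per row of B over the
# observed entries.
import numpy as np

def altmin_completion(A, M, k, iters=30, ridge=1e-8, seed=0):
    m, n = A.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((m, k))
    C = rng.standard_normal((k, n))
    I = ridge * np.eye(k)                        # keeps the solves well posed
    for _ in range(iters):
        for j in range(n):                       # update column j of C
            rows = M[:, j]                       # observed entries in column j
            Bj = B[rows]
            C[:, j] = np.linalg.solve(Bj.T @ Bj + I, Bj.T @ A[rows, j])
        for i in range(m):                       # update row i of B
            cols = M[i]
            Ci = C[:, cols]
            B[i] = np.linalg.solve(Ci @ Ci.T + I, Ci @ A[i, cols])
    return B, C

rng = np.random.default_rng(1)
A_true = rng.standard_normal((40, 5)) @ rng.standard_normal((5, 30))
M = rng.random((40, 30)) < 0.7                   # observe about 70% of the entries
B, C = altmin_completion(A_true, M, k=5)         # with enough samples, B @ C ~ A_true
```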

$$p(x) := \sum_{k=1}^{K} \pi_k\, p_{\mathcal{N}}(x;\, \Sigma_k, \mu_k)$$

Gaussian Mixture Model

$$\tfrac{1}{2}\|a * x - y\|^2 + \lambda\,\Omega(x)$$

Image deblurring

(Mairal et al., 2010)

$$\sum_{i=1}^{n} \tfrac{1}{2}\|y^i - D c^i\|^2 + \Omega_1(c^i) + \Omega_2(D)$$

Dictionary learning, matrix factorization

Online matrix factorization

At time $t$: $\;y^t = a^t * x + n^t$

[Figure: a sequence of observed frames $y^0, y^1, y^2, \ldots, y^k$, each modeled as the underlying image $x$ convolved with a per-frame blur $a^t$ plus noise $n^t$.]

Non-online formulation

$$\begin{bmatrix} | & & | \\ y^1 & \cdots & y^t \\ | & & | \end{bmatrix} \;\approx\; \begin{bmatrix} | & & | \\ a^1 & \cdots & a^t \\ | & & | \end{bmatrix} * x$$

Rewrite: $a * x = Ax = Xa$

$$\begin{bmatrix} y^1 & y^2 & \cdots & y^t \end{bmatrix} \;\approx\; X \begin{bmatrix} a^1 & a^2 & \cdots & a^t \end{bmatrix}, \qquad Y \approx XA$$

Why online?

Example: 5000 frames of size 512 × 512

$$Y_{262144 \times 5000} \;\approx\; X_{262144 \times 262144}\, A_{262144 \times 5000}$$

Without structure: ≈ 70 billion parameters!
With structure: ≈ 4.8 million parameters!

Despite structure, alternating minimization is impractical:
fix X, solve for A, requires updating ≈ 4.5 million params.

Online matrix factorization

$$\min_{A_t,\, x}\; \sum_{t=1}^{T} \tfrac{1}{2}\|y^t - A_t x\|^2 + \Omega(x) + \Gamma(A_t)$$

Initialize guess $x^0$. For $t = 1, 2, \ldots$

1. Observe image $y^t$;
2. Use $x^{t-1}$ to estimate $A_t$;
3. Solve optimization subproblem to obtain $x^t$.

Step 2: model and estimate the blur $A_t$ — separate lecture.
Step 3: convex subproblem — reuse convex subroutines.

Do Steps 2, 3 online ⇒ realtime processing!

Some math: NMF

Nonnegative matrix factorization

We want a low-rank approximation A ≈ BC

SVD yields dense B and C
B and C contain negative entries, even if A ≥ 0
SVD factors may not be that easy to interpret

NMF imposes B ≥ 0, C ≥ 0

Algorithms

$$A \approx BC \quad \text{s.t. } B, C \ge 0$$

Least-squares NMF

$$\min\; \tfrac{1}{2}\|A - BC\|_F^2 \quad \text{s.t. } B, C \ge 0.$$

KL-Divergence NMF

$$\min\; \sum_{ij} a_{ij}\log\frac{a_{ij}}{(BC)_{ij}} - a_{ij} + (BC)_{ij} \quad \text{s.t. } B, C \ge 0.$$

♣ NP-Hard (Vavasis 2007) – no surprise
♣ Recently, Arora et al. showed that if the matrix A has a special “separable” structure, then actually globally optimal NMF is approximately solvable. More recent progress too!
♣ We look at only basic methods in this lecture

NMF Algorithms

Hack: Compute TSVD; “zero-out” negative entries
Alternating minimization (AM)
Majorize-Minimize (MM)
Global optimization (not covered)
“Online” algorithms (not covered)

AltMin / AltDesc

$$\min\; F(B, C)$$

Alternating Descent

1. Initialize $B^0$, $k \leftarrow 0$
2. Compute $C^{k+1}$ s.t. $F(B^k, C^{k+1}) \le F(B^k, C^k)$
3. Compute $B^{k+1}$ s.t. $F(B^{k+1}, C^{k+1}) \le F(B^k, C^{k+1})$
4. $k \leftarrow k + 1$, and repeat until stopping criteria are met.

$$F(B^{k+1}, C^{k+1}) \;\le\; F(B^k, C^{k+1}) \;\le\; F(B^k, C^k)$$

AltMin

Alternating Least Squares

$$C = \operatorname*{argmin}_{C}\; \|A - B^k C\|_F^2; \qquad C^{k+1} \leftarrow \max(0, C)$$

$$B = \operatorname*{argmin}_{B}\; \|A - B C^{k+1}\|_F^2; \qquad B^{k+1} \leftarrow \max(0, B)$$

ALS is fast, simple, often effective, but ...

$$\|A - B^{k+1}C^{k+1}\|_F^2 \;\le\; \|A - B^k C^{k+1}\|_F^2 \;\le\; \|A - B^k C^k\|_F^2$$

descent need not hold

[Figure: the unconstrained minimizer $x^{\mathrm{uc}}$, its clipped version $(x^{\mathrm{uc}})_+$, and the constrained optimum $x^*$.]
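A minimal sketch of the ALS iteration above: an unconstrained least-squares solve for each factor, followed by clipping at zero. The random nonnegative initialization and test matrix are assumptions.

```python
# ALS heuristic for NMF: lstsq solves for C and B, then clip negatives to zero.
import numpy as np

def nmf_als(A, k, iters=100, seed=0):
    m, n = A.shape
    rng = np.random.default_rng(seed)
    B = np.abs(rng.standard_normal((m, k)))
    for _ in range(iters):
        C = np.linalg.lstsq(B, A, rcond=None)[0]        # argmin_C ||A - B C||_F^2
        C = np.maximum(C, 0.0)                          # clip: C^{k+1} = max(0, C)
        Bt = np.linalg.lstsq(C.T, A.T, rcond=None)[0]   # argmin_B ||A - B C||_F^2
        B = np.maximum(Bt.T, 0.0)                       # clip: B^{k+1} = max(0, B)
    return B, C

A = np.abs(np.random.default_rng(1).standard_normal((30, 20)))
B, C = nmf_als(A, k=5)
```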

Alternating Minimization: correctly

Use alternating nonnegative least-squares

$$C^{k+1} = \operatorname*{argmin}_{C}\; \|A - B^k C\|_F^2 \quad \text{s.t. } C \ge 0$$

$$B^{k+1} = \operatorname*{argmin}_{B}\; \|A - B C^{k+1}\|_F^2 \quad \text{s.t. } B \ge 0$$

Advantages: guaranteed descent. The theory of block-coordinate descent guarantees convergence to a stationary point.

Disadvantages: more complex; slower than ALS
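A minimal sketch of this corrected scheme. Solving the nonnegative least-squares subproblems column by column with scipy.optimize.nnls is just one simple (and not particularly fast) choice of inner solver; the random initialization is an assumption.

```python
# Alternating nonnegative least-squares for NMF, with exact NNLS subproblems.
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, k, iters=50, seed=0):
    m, n = A.shape
    rng = np.random.default_rng(seed)
    B = np.abs(rng.standard_normal((m, k)))
    C = np.abs(rng.standard_normal((k, n)))
    for _ in range(iters):
        for j in range(n):                  # C^{k+1} = argmin_{C >= 0} ||A - B C||_F^2
            C[:, j] = nnls(B, A[:, j])[0]
        for i in range(m):                  # B^{k+1} = argmin_{B >= 0} ||A - B C||_F^2
            B[i] = nnls(C.T, A[i])[0]
    return B, C

A = np.abs(np.random.default_rng(2).standard_normal((30, 20)))
B, C = nmf_anls(A, k=5)
```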

Just Descent

Majorize-Minimize (MM)

Consider $F(B, C) = \tfrac{1}{2}\|A - BC\|_F^2$: convex separately in B and in C.

We use $F(C)$ to denote the function restricted to C.

Since $F(C)$ is separable (over the columns of C), we just illustrate

$$\min_{c \ge 0}\; f(c) = \tfrac{1}{2}\|a - Bc\|_2^2$$

Recall, our aim is: find $C^{k+1}$ such that $F(B^k, C^{k+1}) \le F(B^k, C^k)$

Majorize-Minimize (MM)

[Figure: the surrogate $G(c^t; c)$ lies above $F(c)$ and touches it at $c^t$; minimizing successive surrogates $G(c^t; c), \ldots, G(c^{t+k}; c)$ produces iterates $c^t, c^{t+1}, \ldots, c^{t+k}$ that descend on $F$.]

Descent technique

$$\min_{c \ge 0}\; f(c) = \tfrac{1}{2}\|a - Bc\|_2^2$$

1. Find a function $g(c, \bar{c})$ that satisfies:

$$g(c, c) = f(c) \quad \text{for all } c, \qquad g(c, \bar{c}) \ge f(c) \quad \text{for all } c, \bar{c}.$$

2. Compute $c^{t+1} = \operatorname*{argmin}_{c \ge 0}\; g(c, c^t)$.

3. Then we have descent:

$$f(c^{t+1}) \;\overset{\text{def}}{\le}\; g(c^{t+1}, c^t) \;\overset{\text{argmin}}{\le}\; g(c^t, c^t) \;\overset{\text{def}}{=}\; f(c^t).$$

Constructing g for f

We exploit that $h(x) = \tfrac{1}{2}x^2$ is a convex function:

$$h\Bigl(\sum\nolimits_i \lambda_i x_i\Bigr) \le \sum\nolimits_i \lambda_i h(x_i), \quad \text{where } \lambda_i \ge 0,\ \sum\nolimits_i \lambda_i = 1$$

$$\begin{aligned}
f(c) = \tfrac{1}{2}\sum_i \bigl(a_i - b_i^T c\bigr)^2
&= \tfrac{1}{2}\sum_i \bigl[a_i^2 - 2 a_i b_i^T c + (b_i^T c)^2\bigr] \\
&= \tfrac{1}{2}\sum_i \bigl[a_i^2 - 2 a_i b_i^T c\bigr] + \tfrac{1}{2}\sum_i \Bigl(\sum_j b_{ij} c_j\Bigr)^2 \\
&= \tfrac{1}{2}\sum_i \bigl[a_i^2 - 2 a_i b_i^T c\bigr] + \tfrac{1}{2}\sum_i \Bigl(\sum_j \lambda_{ij}\, \frac{b_{ij} c_j}{\lambda_{ij}}\Bigr)^2 \\
&\overset{\text{cvx}}{\le} \tfrac{1}{2}\sum_i \bigl[a_i^2 - 2 a_i b_i^T c\bigr] + \tfrac{1}{2}\sum_{ij} \lambda_{ij} \Bigl(\frac{b_{ij} c_j}{\lambda_{ij}}\Bigr)^2 \\
&=: g(c, \bar{c}), \quad \text{where the } \lambda_{ij} \text{ are convex coefficients}
\end{aligned}$$

Constructing g for f

$$f(c) = \tfrac{1}{2}\|a - Bc\|_2^2$$

$$g(c, \bar{c}) = \tfrac{1}{2}\|a\|_2^2 - \sum_i a_i b_i^T c + \tfrac{1}{2}\sum_{ij} \lambda_{ij}\bigl(b_{ij} c_j / \lambda_{ij}\bigr)^2.$$

It only remains to pick the $\lambda_{ij}$ as functions of $\bar{c}$:

$$\lambda_{ij} = \frac{b_{ij} \bar{c}_j}{\sum_k b_{ik} \bar{c}_k} = \frac{b_{ij} \bar{c}_j}{b_i^T \bar{c}}$$

Exercise: Verify that $g(c, c) = f(c)$.
Exercise: Let $f(c) = \sum_i a_i \log\bigl(a_i / (Bc)_i\bigr) - a_i + (Bc)_i$. Derive an auxiliary function $g(c, \bar{c})$ for this $f(c)$.

NMF updates

Key step

$$c^{t+1} = \operatorname*{argmin}_{c \ge 0}\; g(c, c^t).$$

Exercise: Solve $\partial g(c, c^t)/\partial c_p = 0$ to obtain

$$c_p = c_p^t\, \frac{[B^T a]_p}{[B^T B c^t]_p}$$

This yields the “multiplicative update” algorithm of Lee/Seung (1999).
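A minimal sketch of the resulting multiplicative updates, applying the column update above to every column of C at once and using the symmetric update for B. The small eps in the denominators and the random initialization are assumptions added to avoid division by zero.

```python
# Multiplicative updates for least-squares NMF: elementwise ratio updates that
# keep B and C nonnegative and never increase the objective.
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-12, seed=0):
    m, n = A.shape
    rng = np.random.default_rng(seed)
    B = np.abs(rng.standard_normal((m, k)))
    C = np.abs(rng.standard_normal((k, n)))
    for _ in range(iters):
        C *= (B.T @ A) / (B.T @ B @ C + eps)   # c_p <- c_p [B^T a]_p / [B^T B c]_p, per column
        B *= (A @ C.T) / (B @ C @ C.T + eps)   # symmetric update for the rows of B
    return B, C

A = np.abs(np.random.default_rng(0).standard_normal((30, 20)))
B, C = nmf_multiplicative(A, k=5)
```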

MM algorithms

We exploited convexity of $x^2$
The Expectation Maximization (EM) algorithm exploits convexity of $-\log x$
Other choices possible, e.g., by varying the $\lambda_{ij}$
Our technique is one variant of the repertoire of Majorization-Minimization (MM) algorithms
Gradient descent is also an MM algorithm
Related to d.c. programming
MM algorithms are the subject of a separate lecture!

EM algorithm

Assume $p(x) = \sum_{j=1}^{K} \pi_j\, p(x; \theta_j)$ is a mixture density.

$$\ell(X; \Theta) := \sum_{i=1}^{n} \ln\Bigl(\sum_{j=1}^{K} \pi_j\, p(x_i; \theta_j)\Bigr).$$

Use convexity of $-\log t$ to compute a lower bound:

$$\ell(X; \Theta) \ge \sum_{ij} \beta_{ij} \ln\bigl(\pi_j\, p(x_i; \theta_j) / \beta_{ij}\bigr).$$

E-Step: optimize over the $\beta_{ij}$, setting them to the posterior probabilities:

$$\beta_{ij} := \frac{\pi_j\, p(x_i; \theta_j)}{\sum_l \pi_l\, p(x_i; \theta_l)}.$$

M-Step: optimize the bound over $\Theta$, using the above $\beta$ values.
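A minimal sketch of EM for a Gaussian mixture, alternating the E-step (the $\beta_{ij}$ above) with the M-step for $(\pi_j, \mu_j, \Sigma_j)$. The random initialization, the small covariance regularizer, and the synthetic two-component example are assumptions.

```python
# EM for a Gaussian mixture model: posteriors in the E-step, weighted moment
# estimates in the M-step.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=100, reg=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(K)])
    for _ in range(iters):
        # E-step: beta_ij proportional to pi_j p(x_i; mu_j, Sigma_j)
        beta = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(K)])
        beta /= beta.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound over (pi, mu, Sigma)
        Nj = beta.sum(axis=0)
        pi = Nj / n
        mu = (beta.T @ X) / Nj[:, None]
        for j in range(K):
            Xc = X - mu[j]
            Sigma[j] = (beta[:, j, None] * Xc).T @ Xc / Nj[j] + reg * np.eye(d)
    return pi, mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
pi, mu, Sigma = em_gmm(X, K=2)
```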

Other Alternating methods

Alternating Projections
Alternating Reflections
(Nonconvex) ADMM (e.g., arXiv:1410.1390)
(Nonconvex) Douglas-Rachford (e.g., Borwein's webpage!)
AltMin for global optimization (we saw)
BCD with more than 2 blocks
ADMM with more than 2 blocks
Several others...

Alternating Proximal Method

$$\min\; L(x, y) := f(x, y) + g(x) + h(y).$$

Assume: $\nabla f$ Lipschitz cont. on bounded subsets of $\mathbb{R}^m \times \mathbb{R}^n$;
$g$: lower semicontinuous on $\mathbb{R}^m$;
$h$: lower semicontinuous on $\mathbb{R}^n$.
Example: $f(x, y) = \tfrac{1}{2}\|x - y\|^2$

Alternating Proximal Method

$$x^{k+1} \in \operatorname*{argmin}_{x}\; L(x, y^k) + \frac{1}{2 c_k}\|x - x^k\|^2$$

$$y^{k+1} \in \operatorname*{argmin}_{y}\; L(x^{k+1}, y) + \frac{1}{2 c'_k}\|y - y^k\|^2,$$

here $c_k$, $c'_k$ are suitable sequences of positive scalars.

[arXiv:0801.1780. Attouch, Bolte, Redont, Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems.]
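A minimal sketch of the alternating proximal iteration with numerically solved subproblems via scipy.optimize.minimize. The constant choices of $c_k$, $c'_k$ and the smooth stand-in for a nonsmooth $g$ (so the default BFGS solver applies) are assumptions for illustration, not part of the method's theory.

```python
# Alternating proximal method: each half-step adds a proximal term that anchors
# the new iterate to the previous one.
import numpy as np
from scipy.optimize import minimize

def alternating_prox(L, x0, y0, c=1.0, cp=1.0, iters=50):
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    for _ in range(iters):
        x = minimize(lambda u: L(u, y) + np.sum((u - x) ** 2) / (2 * c), x).x
        y = minimize(lambda v: L(x, v) + np.sum((v - y) ** 2) / (2 * cp), y).x
    return x, y

# Example with L(x, y) = 0.5 ||x - y||^2 + g(x) + h(y), using smoothed penalties
g = lambda x: np.sum(np.sqrt(x ** 2 + 1e-6))     # smooth stand-in for ||x||_1
h = lambda y: 0.5 * np.sum(y ** 2)
L = lambda x, y: 0.5 * np.sum((x - y) ** 2) + g(x) + h(y)
x, y = alternating_prox(L, np.ones(3), -np.ones(3))
```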