Page 1:

ELE 522: Large-Scale Optimization for Data Science

Gradient methods for constrained problems

Yuxin Chen

Princeton University, Fall 2019

Page 2:

Outline

• Frank-Wolfe algorithm

• Projected gradient methods

Gradient methods (constrained case) 3-2

Page 3:

Constrained convex problems

minimizex f(x) subject to x ∈ C

• f(·): convex function
• C ⊆ Rn: closed convex set

Gradient methods (constrained case) 3-3

Page 4:

Feasible direction methods

Generate a feasible sequence {xt} ⊆ C via the iterations

xt+1 = xt + ηtdt

where dt is a feasible direction (s.t. xt + ηtdt ∈ C)

• Question: can we guarantee feasibility while enforcing cost improvement?

Gradient methods (constrained case) 3-4

Page 5:

Frank-Wolfe algorithm

The Frank-Wolfe algorithm was developed by Philip Wolfe and Marguerite Frank while they worked at / visited Princeton

Page 6:

Frank-Wolfe / conditional gradient algorithm

Algorithm 3.1 Frank-Wolfe (a.k.a. conditional gradient) algorithm
1: for t = 0, 1, · · · do
2:    yt := arg minx∈C 〈∇f(xt), x〉   (direction finding)
3:    xt+1 = (1 − ηt)xt + ηt yt   (line search and update)

Equivalently, the direction-finding step can be written as

yt = arg minx∈C 〈∇f(xt), x − xt〉

• main step: linearization of the objective function (equivalent to minimizing f(xt) + 〈∇f(xt), x − xt〉)
  =⇒ linear optimization over a convex set
• appealing when linear optimization is cheap
• stepsize ηt determined by line search, or ηt = 2/(t + 2) (which biases the update towards xt for large t)

Gradient methods (constrained case) 3-6
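To make the template concrete, here is a minimal Python sketch of Algorithm 3.1, assuming purely for illustration that C is the ℓ1 ball (whose linear minimization oracle is a signed coordinate vector); the data and names below are hypothetical, not from the slides.

import numpy as np

def frank_wolfe(grad_f, x0, num_iters=100):
    # Frank-Wolfe with the default stepsize eta_t = 2 / (t + 2)
    x = x0.copy()
    for t in range(num_iters):
        g = grad_f(x)
        # linear minimization oracle for the l1 ball: move towards the signed
        # vertex of the coordinate with the largest |gradient|
        i = int(np.argmax(np.abs(g)))
        y = np.zeros_like(x)
        y[i] = -np.sign(g[i])
        eta = 2.0 / (t + 2)
        x = (1 - eta) * x + eta * y   # convex combination keeps x feasible
    return x

# hypothetical usage: least squares over the unit l1 ball
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x_hat = frank_wolfe(lambda x: A.T @ (A @ x - b), x0=np.zeros(5))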


Page 8:

Frank-Wolfe can also be applied to nonconvex problems

Example (Luss & Teboulle ’13)

minimizex − x>Qx subject to ‖x‖2 ≤ 1 (3.1)

for some Q ⪰ 0

Gradient methods (constrained case) 3-7

Page 9:

Frank-Wolfe can also be applied to nonconvex problems

We now apply Frank-Wolfe to solve (3.1). Clearly,

yt = arg min‖x‖2≤1 〈∇f(xt), x〉 = −∇f(xt)/‖∇f(xt)‖2 = Qxt/‖Qxt‖2

=⇒ xt+1 = (1 − ηt)xt + ηt Qxt/‖Qxt‖2

Setting ηt = arg min0≤η≤1 f((1 − η)xt + η Qxt/‖Qxt‖2) = 1 (check) gives

xt+1 = Qxt/‖Qxt‖2

which is essentially the power method for finding the leading eigenvector of Q

Gradient methods (constrained case) 3-8
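A small numerical check of this connection (not from the slides): running the update xt+1 = Qxt/‖Qxt‖2 on a random positive semidefinite Q, the Rayleigh quotient should approach the largest eigenvalue of Q, exactly as power iteration would.

import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
Q = B @ B.T                                  # positive semidefinite Q

x = rng.standard_normal(6)
x /= np.linalg.norm(x)
for _ in range(200):
    # Frank-Wolfe step with exact line search (eta_t = 1) on f(x) = -x'Qx over
    # the unit l2 ball, i.e. one power iteration
    x = Q @ x / np.linalg.norm(Q @ x)

rayleigh = x @ Q @ x                         # approximates lambda_max(Q)
print(rayleigh, np.linalg.eigvalsh(Q)[-1])   # the two values should be close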

Page 10:

Convergence for convex and smooth problems

Theorem 3.1 (Frank-Wolfe for convex and smooth problems, Jaggi ’13)

Let f be convex and L-smooth. With ηt = 2/(t + 2), one has

f(xt) − f(x∗) ≤ 2L dC² / (t + 2)

where dC = supx,y∈C ‖x − y‖2

• for compact constraint sets, Frank-Wolfe attains ε-accuracy within O(1/ε) iterations

Gradient methods (constrained case) 3-9

Page 11:

Proof of Theorem 3.1

By smoothness,

f(xt+1) − f(xt) ≤ ∇f(xt)>(xt+1 − xt) + (L/2)‖xt+1 − xt‖2²

where xt+1 − xt = ηt(yt − xt) and ‖xt+1 − xt‖2² = ηt²‖yt − xt‖2² ≤ ηt²dC². Hence

f(xt+1) − f(xt) ≤ ηt ∇f(xt)>(yt − xt) + (L/2)ηt²dC²
               ≤ ηt ∇f(xt)>(x∗ − xt) + (L/2)ηt²dC²   (since yt is the minimizer)
               ≤ ηt (f(x∗) − f(xt)) + (L/2)ηt²dC²    (by convexity)

Letting ∆t := f(xt) − f(x∗) we get

∆t+1 ≤ (1 − ηt)∆t + (L dC²/2) ηt²

We then complete the proof by induction (sketched below)

Gradient methods (constrained case) 3-10
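For completeness, one way to carry out the omitted induction with ηt = 2/(t + 2), written out here in LaTeX (a standard argument in the spirit of Jaggi ’13, not taken verbatim from the slides); the claim is ∆t ≤ 2LdC²/(t + 2) for all t ≥ 1.

% base case (t = 1): eta_0 = 1, so the recursion gives
%   \Delta_1 \le (1 - 1)\Delta_0 + \tfrac{L d_C^2}{2} = \tfrac{L d_C^2}{2} \le \tfrac{2 L d_C^2}{3}.
% inductive step: assume \Delta_t \le \tfrac{2 L d_C^2}{t+2}; with \eta_t = \tfrac{2}{t+2},
\begin{align*}
\Delta_{t+1}
  &\le \Big(1 - \tfrac{2}{t+2}\Big)\Delta_t + \frac{L d_C^2}{2}\cdot\frac{4}{(t+2)^2} \\
  &\le \frac{t}{t+2}\cdot\frac{2 L d_C^2}{t+2} + \frac{2 L d_C^2}{(t+2)^2}
   = \frac{2 L d_C^2 (t+1)}{(t+2)^2}
   \le \frac{2 L d_C^2}{t+3},
\end{align*}
% where the last inequality uses (t+1)(t+3) \le (t+2)^2.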

Page 12:

Strongly convex problems?

Can we hope to improve convergence guarantees of Frank-Wolfe in the presence of strong convexity?

• in general, NO
• maybe improvable under additional conditions

Gradient methods (constrained case) 3-11

Page 13:

A negative result

Example:

minimizex∈Rn (1/2)x>Qx + b>x (3.2)

subject to x = [a1, · · · , ak]v, v ≥ 0, 1>v = 1  (i.e., x ∈ convex hull{a1, · · · , ak} =: Ω)

• suppose interior(Ω) ≠ ∅
• suppose the optimal point x∗ lies on the boundary of Ω and is not an extreme point

Gradient methods (constrained case) 3-12

Page 14:

A negative result

Theorem 3.2 (Canon & Cullum ’68)

Let {xt} be the Frank-Wolfe iterates with exact line search for solving (3.2). Then ∃ an initial point x0 s.t. for every ε > 0,

f(xt) − f(x∗) ≥ 1/t^(1+ε) for infinitely many t

• example: choose x0 ∈ interior(Ω) obeying f(x0) < mini f(ai)
• in general, one cannot improve the O(1/t) convergence guarantee

Gradient methods (constrained case) 3-13

Page 15:

Positive results?

To achieve faster convergence, one needs additional assumptions

• example: strongly convex feasible set C
• active research topics

Gradient methods (constrained case) 3-14

Page 16:

An example of positive results

A set C is said to be µ-strongly convex if ∀λ ∈ [0, 1] and ∀x, z ∈ C:

B(λx + (1 − λ)z, (µ/2)λ(1 − λ)‖x − z‖2²) ⊆ C,

where B(a, r) := {y | ‖y − a‖2 ≤ r}

• example: ℓ2 ball

Theorem 3.3 (Levitin & Polyak ’66)

Suppose f is convex and L-smooth, and C is µ-strongly convex. Suppose ‖∇f(x)‖2 ≥ c > 0 for all x ∈ C. Then under mild conditions, Frank-Wolfe with exact line search converges linearly

Gradient methods (constrained case) 3-15
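To flesh out the ℓ2-ball example: the ball C = {x : ‖x‖2 ≤ R} is (1/R)-strongly convex, a standard fact that can be checked directly from the definition above (the sketch below is supplied for illustration, not taken from the slides).

% for x, z with \|x\|_2, \|z\|_2 \le R and \lambda \in [0, 1]:
\begin{align*}
\|\lambda x + (1-\lambda) z\|_2^2
  &= \lambda\|x\|_2^2 + (1-\lambda)\|z\|_2^2 - \lambda(1-\lambda)\|x - z\|_2^2
   \le R^2 - \lambda(1-\lambda)\|x - z\|_2^2, \\
\|\lambda x + (1-\lambda) z\|_2
  &\le \sqrt{R^2 - \lambda(1-\lambda)\|x - z\|_2^2}
   \le R - \frac{\lambda(1-\lambda)}{2R}\|x - z\|_2^2,
\end{align*}
% using \sqrt{R^2 - a} \le R - a/(2R) for 0 \le a \le R^2; hence the ball of radius
% \tfrac{1}{2R}\lambda(1-\lambda)\|x - z\|_2^2 around \lambda x + (1-\lambda) z lies in C,
% i.e. the definition holds with \mu = 1/R.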

Page 17:

Projected gradient methods

Page 18:

Projected gradient descent

[PICTURE: projected gradient iterates x0, x1, x2, x3]

for t = 0, 1, · · · :

xt+1 = PC(xt − ηt∇f(xt))

where PC(x) := arg minz∈C ‖x − z‖2² is the Euclidean projection (a quadratic minimization) onto C

• works well if the projection onto C can be computed efficiently

Gradient methods (constrained case) 3-17
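As an illustration (not from the slides), a minimal projected gradient descent sketch in Python, assuming the constraint set is a Euclidean ball of radius R so that the projection has a closed form; the data and names are hypothetical.

import numpy as np

def project_l2_ball(x, R=1.0):
    # Euclidean projection onto {x : ||x||_2 <= R} is a simple rescaling
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

def projected_gd(grad_f, x0, eta, num_iters=200, R=1.0):
    x = x0.copy()
    for _ in range(num_iters):
        x = project_l2_ball(x - eta * grad_f(x), R)   # gradient step, then projection
    return x

# example: least squares constrained to the unit ball, with eta = 1/L and L = ||A||_2^2
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
L = np.linalg.norm(A, 2) ** 2
x_hat = projected_gd(lambda x: A.T @ (A @ x - b), np.zeros(8), eta=1.0 / L)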

Page 19:

Descent direction

[PICTURE: a point x and its projection xC = PC(x) onto C]

Fact 3.4 (Projection theorem)

Let C be closed & convex. Then xC is the projection of x onto C iff

(x − xC)>(z − xC) ≤ 0, ∀z ∈ C

Gradient methods (constrained case) 3-18

Page 20:

Descent direction

[PICTURE: the gradient step yt = xt − ηt∇f(xt) and its projection xt+1 = PC(yt)]

Let yt := xt − ηt∇f(xt) be the gradient update before projection, so that xt+1 = PC(yt). Applying Fact 3.4 with x = yt and z = xt gives (yt − xt+1)>(xt − xt+1) ≤ 0, and hence

∇f(xt)>(xt+1 − xt) = (1/ηt)(xt − yt)>(xt+1 − xt) ≤ −(1/ηt)‖xt+1 − xt‖2² ≤ 0

From the above figure, we know

−∇f(xt)>(xt+1 − xt) ≥ 0

xt+1 − xt is positively correlated with the steepest descent direction

Gradient methods (constrained case) 3-19

Page 21:

Strongly convex and smooth problems

minimizex f(x) subject to x ∈ C

• f(·): µ-strongly convex and L-smooth
• C ⊆ Rn: closed and convex

Gradient methods (constrained case) 3-20

Page 22:

Convergence for strongly convex and smooth problems

[PICTURE: recap of the unconstrained analysis (proof of Lemma 2.5)]

Let’s start with the simple case when x∗ lies in the interior of C (so that ∇f(x∗) = 0)

Gradient methods (constrained case) 3-21

Page 23:

Convergence for strongly convex and smooth problems

Theorem 3.5

Suppose x∗ ∈ int(C), and let f be µ-strongly convex and L-smooth. If ηt = 2/(µ + L), then

‖xt − x∗‖2 ≤ ((κ − 1)/(κ + 1))^t ‖x0 − x∗‖2

where κ = L/µ is the condition number

• the same convergence rate as for the unconstrained case

Gradient methods (constrained case) 3-22

Page 24:

Aside: nonexpansiveness of projection operator

[PICTURE: points x, z and their projections PC(x), PC(z) onto C]

Fact 3.6 (Nonexpansiveness of projection)

For any x and z, one has ‖PC(x)− PC(z)‖2 ≤ ‖x− z‖2

Gradient methods (constrained case) 3-23
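For reference, Fact 3.6 follows from Fact 3.4 together with Cauchy-Schwarz; a short sketch (a reconstruction, not verbatim from the slides), writing x̂ := PC(x) and ẑ := PC(z):

% Fact 3.4 applied at x (with feasible point \hat{z}) and at z (with feasible point \hat{x}):
\begin{align*}
(x - \hat{x})^\top (\hat{z} - \hat{x}) \le 0,
\qquad
(z - \hat{z})^\top (\hat{x} - \hat{z}) \le 0 .
\end{align*}
% adding the two inequalities and rearranging gives
\begin{align*}
\|\hat{x} - \hat{z}\|_2^2 \;\le\; (x - z)^\top (\hat{x} - \hat{z})
                          \;\le\; \|x - z\|_2 \,\|\hat{x} - \hat{z}\|_2 ,
\end{align*}
% where the last step is Cauchy-Schwarz; dividing by \|\hat{x} - \hat{z}\|_2 when it is
% nonzero (the claim is trivial otherwise) yields Fact 3.6.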

Page 25:

Proof of Theorem 3.5

We have shown for the unconstrained case that

‖xt − ηt∇f(xt) − x∗‖2 ≤ ((κ − 1)/(κ + 1)) ‖xt − x∗‖2

From the nonexpansiveness of PC, we know

‖xt+1 − x∗‖2 = ‖PC(xt − ηt∇f(xt)) − PC(x∗)‖2
             ≤ ‖xt − ηt∇f(xt) − x∗‖2
             ≤ ((κ − 1)/(κ + 1)) ‖xt − x∗‖2

Apply it recursively to conclude the proof

Gradient methods (constrained case) 3-24

Page 26:

Convergence for strongly convex and smooth problems

[PICTURE: recap of the unconstrained analysis (proof of Lemma 2.5)]

What happens if we don’t know whether x∗ ∈ int(C)?

• main issue: ∇f(x∗) may not be 0 (so the prior analysis might fail)

Gradient methods (constrained case) 3-25

Page 27:

Convergence for strongly convex and smooth problems

Theorem 3.7 (projected GD for strongly convex and smooth problems)

Let f be µ-strongly convex and L-smooth. If ηt ≡ η = 1/L, then

‖xt − x∗‖2² ≤ (1 − µ/L)^t ‖x0 − x∗‖2²

• slightly weaker convergence guarantees than Theorem 3.5

Gradient methods (constrained case) 3-26

Page 28:

Proof of Theorem 3.7

Let x+ := PC(x − (1/L)∇f(x)) and gC(x) := (1/η)(x − x+) = L(x − x+)   (negative descent direction)

• gC(x) generalizes ∇f(x) and obeys gC(x∗) = 0

Main pillar:

〈gC(x), x − x∗〉 ≥ (µ/2)‖x − x∗‖2² + (1/2L)‖gC(x)‖2²   (3.3)

• this generalizes the regularity condition for GD

With (3.3) in place, repeating the GD analysis under the regularity condition gives

‖xt+1 − x∗‖2² ≤ (1 − µ/L)‖xt − x∗‖2²

which immediately establishes Theorem 3.7

Gradient methods (constrained case) 3-27
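To spell out that last step: with η = 1/L the update is xt+1 = xt − (1/L)gC(xt), so (3.3) gives the contraction directly (a reconstruction of the "repeating the GD analysis" remark, not verbatim from the slides):

\begin{align*}
\|x_{t+1} - x^*\|_2^2
  &= \big\| x_t - x^* - \tfrac{1}{L} g_C(x_t) \big\|_2^2 \\
  &= \|x_t - x^*\|_2^2 - \tfrac{2}{L} \langle g_C(x_t),\, x_t - x^* \rangle + \tfrac{1}{L^2}\|g_C(x_t)\|_2^2 \\
  &\le \|x_t - x^*\|_2^2 - \tfrac{\mu}{L}\|x_t - x^*\|_2^2 - \tfrac{1}{L^2}\|g_C(x_t)\|_2^2 + \tfrac{1}{L^2}\|g_C(x_t)\|_2^2
   \qquad \text{(by (3.3))} \\
  &= \Big(1 - \tfrac{\mu}{L}\Big)\|x_t - x^*\|_2^2 .
\end{align*}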

Page 29:

Proof of Theorem 3.7 (cont.)

It remains to justify (3.3). To this end, it is seen that

0 ≤ f(x+) − f(x∗) = f(x+) − f(x) + f(x) − f(x∗)
  ≤ ∇f(x)>(x+ − x) + (L/2)‖x+ − x‖2²   (smoothness)
    + ∇f(x)>(x − x∗) − (µ/2)‖x − x∗‖2²   (strong convexity)
  = ∇f(x)>(x+ − x∗) + (1/2L)‖gC(x)‖2² − (µ/2)‖x − x∗‖2²,

which would establish (3.3) if

∇f(x)>(x+ − x∗) ≤ gC(x)>(x+ − x∗) = gC(x)>(x − x∗) − (1/L)‖gC(x)‖2²   (3.4)
(projection only makes it better)

This inequality is equivalent to

(x+ − (x − (1/L)∇f(x)))>(x+ − x∗) ≤ 0   (3.5)

Fact (3.5) follows directly from Fact 3.4

Gradient methods (constrained case) 3-28

Page 30:

Remark

[PICTURE: x, the gradient step x − (1/L)∇f(x), its projection x+, and a point y ∈ C]

One can easily generalize (3.4) to (via the same proof)

∇f(x)>(x+ − y) ≤ gC(x)>(x+ − y), ∀y ∈ C (3.6)

This proves useful for subsequent analysis

Gradient methods (constrained case) 3-29

Page 31:

Convex and smooth problems

minimizex f(x) subject to x ∈ C

• f(·): convex and L-smooth
• C ⊆ Rn: closed and convex

Gradient methods (constrained case) 3-30

Page 32:

Convergence for convex and smooth problems

Theorem 3.8 (projected GD for convex and smooth problems)

Let f be convex and L-smooth. If ηt ≡ η = 1/L, then

f(xt) − f(x∗) ≤ (3L‖x0 − x∗‖2² + f(x0) − f(x∗)) / (t + 1)

• a convergence rate similar to the unconstrained case

Gradient methods (constrained case) 3-31

Page 33:

Proof of Theorem 3.8

We first recall our main steps when handling the unconstrained case

Step 1: show cost improvement

f(xt+1) ≤ f(xt) − (1/2L)‖∇f(xt)‖2²

Step 2: connect ‖∇f(xt)‖2 with f(xt)

‖∇f(xt)‖2 ≥ (f(xt) − f(x∗)) / ‖xt − x∗‖2 ≥ (f(xt) − f(x∗)) / ‖x0 − x∗‖2

Step 3: let ∆t := f(xt) − f(x∗) to get

∆t+1 − ∆t ≤ −(∆t)² / (2L‖x0 − x∗‖2²)

and complete the proof by induction

Gradient methods (constrained case) 3-32

Page 34:

Proof of Theorem 3.8 (cont.)

We then modify these steps for the constrained case. As before, set gC(xt) = L(xt − xt+1), which generalizes ∇f(xt) in the constrained case

Step 1: show cost improvement

f(xt+1) ≤ f(xt) − (1/2L)‖gC(xt)‖2²

Step 2: connect ‖gC(xt)‖2 with f(xt)

‖gC(xt)‖2 ≥ (f(xt+1) − f(x∗)) / ‖xt − x∗‖2 ≥ (f(xt+1) − f(x∗)) / ‖x0 − x∗‖2

Step 3: let ∆t := f(xt) − f(x∗) to get

∆t+1 − ∆t ≤ −(∆t+1)² / (2L‖x0 − x∗‖2²)

and complete the proof by induction

Gradient methods (constrained case) 3-33

Page 35:

Proof of Theorem 3.8 (cont.)

Main pillar: generalize smoothness condition as follows

Lemma 3.9

Suppose f is convex and L-smooth. For any x, y ∈ C, let x+ = PC(x − (1/L)∇f(x)) and gC(x) = L(x − x+). Then

f(y) ≥ f(x+) + gC(x)>(y − x) + (1/2L)‖gC(x)‖2²

Gradient methods (constrained case) 3-34

Page 36:

Proof of Theorem 3.8 (cont.)

Step 1: set x = y = xt in Lemma 3.9 to reach

f(xt) ≥ f(xt+1) + (1/2L)‖gC(xt)‖2²

as desired

Step 2: set x = xt and y = x∗ in Lemma 3.9 to get

0 ≥ f(x∗) − f(xt+1) ≥ gC(xt)>(x∗ − xt) + (1/2L)‖gC(xt)‖2² ≥ gC(xt)>(x∗ − xt)

which together with Cauchy-Schwarz yields

‖gC(xt)‖2 ≥ (f(xt+1) − f(x∗)) / ‖xt − x∗‖2   (3.7)

Gradient methods (constrained case) 3-35

Page 37:

Proof of Theorem 3.8 (cont.)

It also follows from our analysis for the strongly convex case that (by taking µ = 0 in Theorem 3.7)

‖xt − x∗‖2 ≤ ‖x0 − x∗‖2

which combined with (3.7) reveals

‖gC(xt)‖2 ≥ (f(xt+1) − f(x∗)) / ‖x0 − x∗‖2

Step 3: letting ∆t = f(xt) − f(x∗), the previous bounds together give

∆t+1 − ∆t ≤ −(∆t+1)² / (2L‖x0 − x∗‖2²)

Use induction to finish the proof (sketched below)

Gradient methods (constrained case) 3-36
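One way to finish (a reconstructed sketch of the omitted induction, not from the slides): write R := ‖x0 − x∗‖2 and c := 3LR² + ∆0, note that the recursion reads ∆t+1 + ∆²t+1/(2LR²) ≤ ∆t with a left-hand side that is increasing in ∆t+1 ≥ 0, and prove the claim ∆t ≤ c/(t + 1) by induction.

% base cases: \Delta_0 \le c trivially, and \Delta_1^2/(2LR^2) \le \Delta_0 gives
%   \Delta_1 \le \sqrt{2 L R^2 \Delta_0} \le \tfrac{1}{2}(3LR^2 + \Delta_0) = c/2,
%   since 9L^2R^4 - 2LR^2\Delta_0 + \Delta_0^2 = (\Delta_0 - LR^2)^2 + 8L^2R^4 \ge 0.
% inductive step (t \ge 1): if \Delta_{t+1} > c/(t+2), then by monotonicity of the left-hand side
\begin{align*}
\Delta_t \;\ge\; \Delta_{t+1} + \frac{\Delta_{t+1}^2}{2LR^2}
         \;>\; \frac{c}{t+2} + \frac{c^2}{2LR^2 (t+2)^2}
         \;\ge\; \frac{c}{t+2} + \frac{c}{(t+1)(t+2)}
         \;=\; \frac{c}{t+1} ,
\end{align*}
% contradicting \Delta_t \le c/(t+1); the middle inequality uses
% c/(2LR^2) = 3/2 + \Delta_0/(2LR^2) \ge (t+2)/(t+1) for t \ge 1.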

Page 38:

Proof of Lemma 3.9

f(y) − f(x+) = f(y) − f(x) − (f(x+) − f(x))
  ≥ ∇f(x)>(y − x)   (convexity)
    − (∇f(x)>(x+ − x) + (L/2)‖x+ − x‖2²)   (smoothness)
  = ∇f(x)>(y − x+) − (L/2)‖x+ − x‖2²
  ≥ gC(x)>(y − x+) − (L/2)‖x+ − x‖2²   (by (3.6))
  = gC(x)>(y − x) + gC(x)>(x − x+) − (L/2)‖x+ − x‖2²
  = gC(x)>(y − x) + (1/L)‖gC(x)‖2² − (1/2L)‖gC(x)‖2²   (since x − x+ = (1/L)gC(x))
  = gC(x)>(y − x) + (1/2L)‖gC(x)‖2²

Gradient methods (constrained case) 3-37

Page 39:

Summary

• Frank-Wolfe: projection-free

  convex & smooth problems: stepsize rule ηt ≍ 1/t, convergence rate O(1/t), iteration complexity O(1/ε)

• projected gradient descent

  convex & smooth problems: stepsize rule ηt = 1/L, convergence rate O(1/t), iteration complexity O(1/ε)

  strongly convex & smooth problems: stepsize rule ηt = 1/L, convergence rate O((1 − 1/κ)^t), iteration complexity O(κ log(1/ε))

Gradient methods (constrained case) 3-38

Page 40:

Reference

[1] "Nonlinear programming (3rd edition)," D. Bertsekas, 2016.
[2] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[3] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[4] "Convex optimization and algorithms," D. Bertsekas, 2015.
[5] "Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint," R. Luss, M. Teboulle, SIAM Review, 2013.
[6] "Revisiting Frank-Wolfe: projection-free sparse convex optimization," M. Jaggi, ICML, 2013.
[7] "A tight upper bound on the rate of convergence of Frank-Wolfe algorithm," M. Canon and C. Cullum, SIAM Journal on Control, 1968.

Gradient methods (constrained case) 3-39

Page 41:

Reference

[8] "Constrained minimization methods," E. Levitin and B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1966.

Gradient methods (constrained case) 3-40