Convex Optimization (EE227A: UC Berkeley)
Lecture 19 (Stochastic optimization)
02 Apr, 2013
Suvrit Sra
Page 1 (source: suvrit.de/teach/ee227a/lect19.pdf)

Page 2

Admin

♠ HW3 due 4/04/2013

♠ HW4 on bSpace later today; due 4/18/2013

♠ Project report (4 pages) due 11 April

♠ LaTeX template for projects on bSpace


Page 3

Recap

♠ Convex sets, functions

♠ Convex models, LP, QP, SOCP, SDP

♠ Subdifferentials, basic optimality conditions

♠ Weak duality

♠ Lagrangians, strong duality, KKT conditions

♠ Subgradient method

♠ Gradient descent, feasible descent

♠ Optimal gradient methods

♠ Constrained problems, conditional gradient

♠ Nonsmooth problems, proximal methods

♠ Proximal splitting, Douglas-Rachford

♠ Monotone operators, product-space trick

♠ Incremental gradient methods


Page 4

Incremental methods

min [f(x) = ∑_i f_i(x)] + r(x)

Incremental subgradient step:

x^{k+1} = x^k − α_k g_{i(k)},   g_{i(k)} ∈ ∂f_{i(k)}(x^k)

Incremental proximal step:

x^{k+1} = prox_{α_k f_{i(k)}}(x^k),   k = 0, 1, . . .

Incremental gradient step with a prox step on r:

x^{k+1} = prox_{α_k r}(x^k − η_k ∑_{i=1}^m ∇f_i(z_i)),   k = 0, 1, . . . ,
    z_1 = x^k,
    z_{i+1} = z_i − α_k ∇f_i(z_i),   i = 1, . . . , m − 1.

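The first update rule above (incremental gradient) can be sketched in code. The toy least-squares instance below is our own illustrative assumption, not from the slides: f_i(x) = 0.5 (a_i·x − b_i)², r(x) = 0, with a consistent system so f vanishes at the solution.

```python
import numpy as np

# Minimal sketch of the incremental gradient update
#   x^{k+1} = x^k - alpha_k * grad f_{i(k)}(x^k)
# on an illustrative least-squares problem of our choosing:
# f_i(x) = 0.5 * (a_i @ x - b_i)**2, with b = A @ x_true (consistent).

rng = np.random.default_rng(0)
m, n = 20, 3
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                      # consistent: every f_i vanishes at x_true

def grad_fi(x, i):
    """Gradient of the i-th component f_i at x."""
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)
alpha = 0.05                        # small constant step (our choice)
for k in range(20000):
    i = k % m                       # cyclic index choice i(k)
    x = x - alpha * grad_fi(x, i)
```

Because the system is consistent, each component gradient vanishes at the solution, so even this constant step size drives the iterates to x_true.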

Page 7

Incremental methods

x^{k+1} = P_X(x^k − α_k ∇f_{i(k)}(x^k))

Choices of i(k)

▶ Cyclic: i(k) = 1 + (k mod m)

▶ Randomized: pick i(k) uniformly from {1, . . . , m}

♣ Many other variations of incremental methods

♣ Read (omitting proofs) this nice survey by D. P. Bertsekas

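The two index rules can be written out directly; the helper names below are our own, and indices are 1-based as on the slide.

```python
import random

def cyclic_index(k, m):
    # Cyclic rule from the slide: i(k) = 1 + (k mod m)
    return 1 + (k % m)

def random_index(m, rng):
    # Randomized rule: pick i(k) uniformly from {1, ..., m}
    return rng.randint(1, m)

m = 5
schedule = [cyclic_index(k, m) for k in range(7)]   # sweeps 1..m, then wraps
```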

Page 9

Stochastic Optimization


Page 10

Stochastic gradients

min f(x) = (1/m) ∑_{i=1}^m f_i(x)

Recall the incremental gradient method

▶ Let x^0 ∈ R^n

▶ For k ≥ 0:
    1. Pick i(k) ∈ {1, 2, . . . , m} uniformly at random
    2. x^{k+1} = x^k − α_k ∇f_{i(k)}(x^k)

g ≡ ∇f_{i(k)}(x^k) may be viewed as a stochastic gradient:

g := g_true + e, where e is mean-zero noise: E[e] = 0


Page 15

Stochastic gradients

▶ Index i(k) chosen uniformly from {1, . . . , m}

▶ Thus, in expectation:

E[g] = E_i[∇f_i(x)] = ∑_i (1/m) ∇f_i(x) = ∇f(x)

▶ Alternatively, E[g − g_true] = E[e] = 0

▶ We call g an unbiased estimate of the gradient

▶ Here, we obtained g in a two-step process:

Sample: pick an index i(k) uniformly at random
Oracle: compute a stochastic gradient based on i(k)

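The unbiasedness claim E[g] = ∇f(x) is easy to verify numerically. The least-squares components below are an illustrative assumption of ours, not from the slides: averaging the oracle's output over all m equally likely indices must reproduce the full gradient exactly.

```python
import numpy as np

# Check that the sample-then-oracle construction is unbiased:
# E_i[grad f_i(x)] = grad f(x) for f(x) = (1/m) * sum_i f_i(x),
# with toy components f_i(x) = 0.5 * (a_i @ x - b_i)**2 (our choice).

rng = np.random.default_rng(1)
m, n = 10, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

def grad_fi(x, i):
    # oracle output for index i
    return (A[i] @ x - b[i]) * A[i]

full_grad = A.T @ (A @ x - b) / m                       # grad f(x), directly
expected_g = sum(grad_fi(x, i) for i in range(m)) / m   # E_i[grad f_i(x)]
```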

Page 22

Stochastic programming

min f(x) := E_ω[F(x, ω)]

▶ ω follows some known distribution

▶ In the previous example, ω takes values in a discrete set of size m (might as well say ω ∈ {1, . . . , m})

▶ so that F(x, ω) = f_ω(x); assuming a uniform distribution, we see that f(x) = E_ω F(x, ω) = (1/m) ∑_{i=1}^m f_i(x)

▶ Usually ω will be non-discrete, and we won't be able to compute the expectation in closed form, since

f(x) = ∫ F(x, ω) dP(ω)

is going to be a difficult high-dimensional integral.

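When the integral has no closed form, Monte Carlo sampling is the standard workaround. The toy F below is our own choice, deliberately one with a known closed form so the estimate can be checked: F(x, ω) = (x − ω)² with ω ~ N(0, 1) gives f(x) = x² + 1.

```python
import numpy as np

# Monte Carlo estimate of f(x) = ∫ F(x, ω) dP(ω) by a sample average.
# Toy instance (our assumption): F(x, ω) = (x - ω)**2, ω ~ N(0, 1),
# for which f(x) = x**2 + 1 exactly.

rng = np.random.default_rng(2)

def F(x, omega):
    return (x - omega) ** 2

x = 0.5
omegas = rng.standard_normal(200_000)     # iid samples of ω
f_mc = F(x, omegas).mean()                # Monte Carlo estimate of f(x)
f_exact = x**2 + 1.0                      # closed form for this toy F
```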

Page 26

Stochastic programming – digression

Certainty-equivalent / mean approximation

▶ Say F(x, ω) is a linear function of x

▶ Then, we may write F(x, ω) = ⟨c(ω), x⟩

▶ In this case, f(x) = ∫ ⟨c(ω), x⟩ dP(ω) = ⟨E[c(ω)], x⟩

▶ What if F(x, ω) is convex in x for every ω?

▶ Jensen's inequality gives us a trivial lower bound:

f(x) = ∫ F(x, ω) dP(ω) ≥ F(x, E[ω])

▶ The bound may be too weak, even useless

▶ Thus, let us try to directly minimize f(x)

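How weak the certainty-equivalent bound can be is easy to see on a toy F of our own choosing: with F(x, ω) = (x − ω)² and ω ~ N(0, 1), we have f(x) = x² + 1 but F(x, E[ω]) = x², so the bound misses the variance of ω entirely.

```python
import numpy as np

# Numeric check of f(x) >= F(x, E[ω]) on a toy F (our assumption):
# F(x, ω) = (x - ω)**2, ω ~ N(0, 1). The gap f(x) - F(x, E[ω])
# equals Var(ω) = 1, no matter how large or small x is.

rng = np.random.default_rng(3)
omegas = rng.standard_normal(100_000)

x = 2.0
f_x = np.mean((x - omegas) ** 2)       # Monte Carlo value of f(x)
bound = (x - omegas.mean()) ** 2       # F(x, E[ω]), with E[ω] estimated
gap = f_x - bound                      # ≈ Var(ω) = 1
```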

Page 33

Stochastic programming – setup

min_{x∈X} f(x) := E_ω[F(x, ω)]

Setup and Assumptions

1. X ⊂ R^n is nonempty, closed, bounded, and convex
2. ω is a random vector whose probability distribution P is supported on Ω ⊂ R^d; so F : X × Ω → R
3. The expectation
       E[F(x, ω)] = ∫_Ω F(x, ω) dP(ω)
   is well-defined and finite-valued for every x ∈ X
4. For every ω ∈ Ω, F(·, ω) is convex

Convex stochastic optimization problem


Page 37

Stochastic programming – setup

▶ We cannot compute the expectation with high accuracy in general

▶ So, we use computational techniques based on Monte Carlo sampling

Assumption 1: It is possible to generate independent, identically distributed (iid) samples ω_1, ω_2, . . .
Assumption 2: For a given input (x, ω) ∈ X × Ω, we can compute (via an oracle) a stochastic gradient G(x, ω) such that

g(x) := E[G(x, ω)] satisfies g(x) ∈ ∂f(x).

▶ How do we get these stochastic subgradients?

Theorem. Let ω ∈ Ω. If F(·, ω) is convex and f(·) is finite-valued in a neighborhood of a point x, then

∂f(x) = E[∂_x F(x, ω)].

▶ So we may pick G(x, ω) ∈ ∂_x F(x, ω) as a stochastic subgradient.

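Picking G(x, ω) ∈ ∂_x F(x, ω) can be illustrated on a toy nonsmooth F of our own choosing: F(x, ω) = |x − ω|, which is convex in x but nondifferentiable at x = ω, where any value in [−1, 1] is a valid subgradient.

```python
import numpy as np

# One valid selection G(x, ω) ∈ ∂_x |x - ω| (toy F, our assumption):
# sign(x - ω) away from the kink, and 0 (an arbitrary valid choice) at it.

def F(x, omega):
    return abs(x - omega)

def G(x, omega):
    if x > omega:
        return 1.0
    if x < omega:
        return -1.0
    return 0.0          # any element of [-1, 1] works at x = ω

# Spot-check the subgradient inequality
#   F(y, ω) >= F(x, ω) + G(x, ω) * (y - x)
# over a grid of points, including x = ω:
ok = all(
    F(y, w) >= F(x, w) + G(x, w) * (y - x) - 1e-12
    for w in (-1.0, 0.0, 2.0)
    for x in np.linspace(-3, 3, 13)
    for y in np.linspace(-3, 3, 13)
)
```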

Page 42: Convex Optimization - Suvritsuvrit.de/teach/ee227a/lect19.pdf · Convex Optimization (EE227A: UC Berkeley) ... [rf i(x)] = X i 1 m rf ... I Here, we obtained gin a two step process:

Stochastic programming – setup

I Cannot compute expectation with high-accuracy in general

I So, computational techniques based on Monte Carlo sampling

Assumption 1: Possible to generate independent identically dis-tributed (iid) samples ω1, ω2, . . .Assumption 2: For a given input (x, ω) ∈ X ×Ω, we can compute(oracle) a stochastic gradient G(x, ω)

g(x) := E[G(x, ω)] s.t. g(x) ∈ ∂f(x).

I How to get these stochastic subgradients?

Theorem Let ω ∈ Ω; If F (·, ω) is convex, and f(·) is finite valued ina neighborhood of a point x, then

∂f(x) = E[∂xF (x, ω)].

I So we may pick G(x, ω) ∈ ∂xF (x, ω) as stochastic subgradient.

12 / 34

Page 43: Convex Optimization - Suvritsuvrit.de/teach/ee227a/lect19.pdf · Convex Optimization (EE227A: UC Berkeley) ... [rf i(x)] = X i 1 m rf ... I Here, we obtained gin a two step process:

Stochastic programming

♣ Stochastic Approximation (SA)

▶ Sample ω_k iid

▶ Generate a stochastic subgradient G(x, ω_k)

▶ Use it in a subgradient method!

♣ Sample average approximation (SAA)

▶ Generate N iid samples ω_1, . . . , ω_N

▶ Consider the empirical objective f_N(x) := N^{−1} ∑_i F(x, ω_i)

▶ SAA refers to the creation of this sample-average problem

▶ Minimizing f_N still needs to be done!

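A minimal SAA sketch, on a toy problem of our own choosing where the empirical objective can be minimized in closed form: F(x, ω) = (x − ω)² with ω ~ N(0, 1), so f_N is quadratic and its minimizer is simply the sample mean.

```python
import numpy as np

# SAA: draw N iid samples, form f_N(x) = (1/N) Σ_i F(x, ω_i), then
# minimize f_N deterministically. Toy F (our assumption): (x - ω)**2.

rng = np.random.default_rng(4)
N = 50_000
samples = rng.standard_normal(N)          # ω_1, ..., ω_N

def f_N(x):
    # the sample-average problem replacing the intractable expectation
    return np.mean((x - samples) ** 2)

x_saa = samples.mean()                    # argmin of this quadratic f_N
# The true minimizer of f(x) = E[(x - ω)^2] is E[ω] = 0, so x_saa ≈ 0.
```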

Page 50

Stochastic approximation – SA

SA, or the stochastic (sub)gradient method

▶ Let x^0 ∈ X

▶ For k ≥ 0:
    Sample ω_k iid; generate G(x^k, ω_k)
    Update x^{k+1} = P_X(x^k − α_k G(x^k, ω_k)), where α_k > 0

Henceforth, we'll simply write:

x^{k+1} = P_X(x^k − α_k G^k)

Does this work?

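The SA iteration can be sketched directly. The toy objective, constraint set, and step-size schedule below are illustrative assumptions of ours: F(x, ω) = (x − ω)² with ω ~ N(0, 1) and X = [−1, 1], so f(x) = x² + 1 is minimized over X at x* = 0.

```python
import numpy as np

# SA sketch: x^{k+1} = P_X(x^k - α_k G^k), with P_X clipping to [-1, 1],
# G^k = ∇_x F(x^k, ω_k) = 2(x^k - ω_k), and diminishing α_k = 1/(2k).

rng = np.random.default_rng(5)

def project(x):
    # P_X for the interval X = [-1, 1]
    return min(1.0, max(-1.0, x))

x = 1.0                                  # x^0 ∈ X
for k in range(1, 100_001):
    omega = rng.standard_normal()        # sample ω_k iid
    Gk = 2.0 * (x - omega)               # stochastic gradient G(x^k, ω_k)
    x = project(x - Gk / (2.0 * k))      # step size α_k = 1/(2k)
# x should now be close to x* = 0
```

With this step size the unprojected recursion reduces to a running average of the samples ω_k, which is why the iterate settles near 0.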

Page 53

Stochastic approximation – analysis

Setup

▶ x^k depends on the random variables ω_1, . . . , ω_{k−1}, so it is itself random

▶ Of course, x^k does not depend on ω_k

▶ The subgradient method analysis hinged upon ‖x^k − x*‖₂²

▶ The stochastic subgradient analysis hinges upon E[‖x^k − x*‖₂²]

Denote R_k := ‖x^k − x*‖₂² and r_k := E[R_k] = E[‖x^k − x*‖₂²]

Bounding R_{k+1}

R_{k+1} = ‖x^{k+1} − x*‖₂² = ‖P_X(x^k − α_k G^k) − P_X(x*)‖₂²
        ≤ ‖x^k − x* − α_k G^k‖₂²
        = R_k + α_k² ‖G^k‖₂² − 2 α_k ⟨G^k, x^k − x*⟩.



Stochastic approximation – analysis

Rk+1 ≤ Rk + αk²‖Gk‖₂² − 2αk〈Gk, xk − x∗〉

▶ Assume: ‖Gk‖₂ ≤ M on X
▶ Taking expectations:

    rk+1 ≤ rk + αk²M² − 2αkE[〈Gk, xk − x∗〉].

▶ We now need to get a handle on the last term
▶ Since xk is independent of ωk, the tower property of conditional expectation gives

    E[〈xk − x∗, G(xk, ωk)〉] = E[ E[〈xk − x∗, G(xk, ωk)〉 | ω1..(k−1)] ]
                             = E[〈xk − x∗, E[G(xk, ωk) | ω1..(k−1)]〉]
                             = E[〈xk − x∗, gk〉],   gk ∈ ∂f(xk).

16 / 34


Stochastic approximation – analysis

Thus, we need to bound: E[〈xk − x∗, gk〉]

▶ Since f is convex, f(x) ≥ f(xk) + 〈gk, x − xk〉 for any x ∈ X.
▶ Thus, in particular, we have

    2αkE[f(x∗) − f(xk)] ≥ 2αkE[〈gk, x∗ − xk〉]

Now plug this bound back into the rk+1 inequality:

    rk+1 ≤ rk + αk²M² − 2αkE[〈gk, xk − x∗〉]
    2αkE[〈gk, xk − x∗〉] ≤ rk − rk+1 + αk²M²
    2αkE[f(xk) − f(x∗)] ≤ rk − rk+1 + αk²M².

What now?

17 / 34


Stochastic approximation – analysis

2αkE[f(xk) − f(x∗)] ≤ rk − rk+1 + αk²M².

Sum up over k = 1, . . . , T, to obtain

    ∑_{k=1}^T 2αkE[f(xk) − f(x∗)] ≤ r1 − rT+1 + M²∑k αk²
                                   ≤ r1 + M²∑k αk².

To further analyze this sum, divide both sides by 2∑k αk:

▶ Set γk = αk / ∑_{k=1}^T αk.
▶ Thus, γk ≥ 0 and ∑k γk = 1; this allows us to write

    E[∑k γkf(xk) − f(x∗)] ≤ (r1 + M²∑k αk²) / (2∑k αk)

18 / 34


Stochastic approximation – analysis

▶ The bound looks similar to the bound for the subgradient method!
▶ But we wish to say something about an actual point, e.g., the iterate xT
▶ Since γk ≥ 0 and ∑k γk = 1, the bound involves the convex combination ∑k γkf(xk)
▶ So it is easier to talk about the averaged iterate

    x_av^T := ∑_{k=1}^T γkxk.

▶ f(x_av^T) ≤ ∑k γkf(xk) due to convexity (Jensen's inequality)
▶ So we finally obtain the inequality

    E[f(x_av^T) − f(x∗)] ≤ (r1 + M²∑k αk²) / (2∑k αk).

19 / 34


Stochastic approximation – analysis

Exercise

♠ Let DX := max_{x∈X} ‖x − x∗‖₂
♠ Assume αk = α is a constant. Then observe that

    E[f(x_av^T) − f(x∗)] ≤ (DX² + M²Tα²) / (2Tα)

♠ Minimize the rhs over α > 0 to obtain the best stepsize
♠ Show that this choice then yields: E[f(x_av^T) − f(x∗)] ≤ DX M / √T
♠ If T is not fixed in advance, then choose

    αk = θDX / (M√k),   k = 1, 2, . . .

♠ Analyze E[f(x_av^T) − f(x∗)] with this choice of stepsize

20 / 34
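For the constant-stepsize part of the exercise, the minimization over α can be carried out in closed form; a sketch of the calculus step, using the slide's DX, M, T:

```latex
% Minimize the right-hand side g(\alpha) over \alpha > 0:
g(\alpha) = \frac{D_X^2 + M^2 T \alpha^2}{2T\alpha}
          = \frac{D_X^2}{2T\alpha} + \frac{M^2 \alpha}{2}.
% Setting g'(\alpha) = -D_X^2/(2T\alpha^2) + M^2/2 = 0 gives
\alpha^\ast = \frac{D_X}{M\sqrt{T}},
\qquad
g(\alpha^\ast) = \frac{D_X M}{2\sqrt{T}} + \frac{D_X M}{2\sqrt{T}}
               = \frac{D_X M}{\sqrt{T}}.
```

Substituting α∗ back into both terms shows they balance, which is exactly the DX M/√T rate claimed on the slide.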


Sample average approximation

Assumption: regularization ‖x‖₂ ≤ B; ω ∈ Ω closed, bounded.

Function estimate: f(x) = E[F(x, ω)]
Subgradient: E[G(x, ω)] ∈ ∂f(x)

Sample Average Approximation (SAA):

    Collect samples ω1, . . . , ωN
    Empirical objective: fN(x) := (1/N)∑_{i=1}^N F(x, ωi)
    aka Empirical Risk Minimization

Confusing: machine learners often optimize fN using stochastic subgradient; but the theoretical guarantees are then only on the empirical suboptimality E[fN(xk)] ≤ . . .
For guarantees on f(xk), extra work is needed (regularization + uniform concentration), yielding

    f(xk) − f(x∗) ≤ O(1/√k) + O(1/√N)

21 / 34
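The SAA recipe can be sketched with a toy loss (the loss and distribution below are illustrative choices, not from the lecture): for F(x, ω) = (x − ω)², the true objective f(x) = E[F(x, ω)] is minimized at x∗ = E[ω], while the empirical objective fN is minimized at the sample mean.

```python
import random

def f_N(x, samples):
    """Empirical (SAA) objective: average of F(x, w_i) over the samples."""
    return sum((x - w) ** 2 for w in samples) / len(samples)

def minimize_saa(samples, steps=200, alpha=0.1):
    """Plain gradient descent on the empirical objective f_N."""
    x = 0.0
    for _ in range(steps):
        grad = sum(2 * (x - w) for w in samples) / len(samples)
        x -= alpha * grad
    return x

random.seed(0)
samples = [random.uniform(0.0, 1.0) for _ in range(1000)]  # w ~ U[0, 1]
x_hat = minimize_saa(samples)
# x_hat converges to the sample mean, which concentrates around E[w] = 0.5
```

The gap between the SAA minimizer and the true minimizer here is the O(1/√N) statistical term in the bound above; running the inner optimizer longer only shrinks the O(1/√k) optimization term.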


Stochastic Programming – modeling

Stochastic LP

    min   x1 + x2
    s.t.  ω1x1 + x2 ≥ 10
          ω2x1 + x2 ≥ 5
          x1, x2 ≥ 0,

where ω1 ∼ U[1, 5] and ω2 ∼ U[1/3, 1]

▶ The constraints are not deterministic!
▶ But we have an idea about what randomness is there
▶ How do we solve this LP?
▶ What does it even mean to solve it?
▶ If ω has been observed, the problem becomes deterministic, and can be solved as a usual LP (aka wait-and-watch)

22 / 34


Stochastic Programming – modeling

▶ But we cannot “wait-and-watch” — we need to decide on x before knowing the value of ω
▶ What to do without knowing exact values for ω1, ω2?
▶ Some ideas:

    Guess the uncertainty
    Probabilistic / chance constraints
    . . .

23 / 34


Stochastic Programming – modeling

Some guesses

♠ Unbiased / Average case: Choose mean values for each r.v.

♠ Robust / Worst case: Choose worst case values

♠ Explorative / Best case: Choose best case values

24 / 34


Stochastic Programming – Example

    min   x1 + x2
    s.t.  ω1x1 + x2 ≥ 10
          ω2x1 + x2 ≥ 5
          x1, x2 ≥ 0,

where ω1 ∼ U[1, 5] and ω2 ∼ U[1/3, 1]

Unbiased / Average case: E[ω1] = 3, E[ω2] = 2/3

    min   x1 + x2
    s.t.  3x1 + x2 ≥ 10
          (2/3)x1 + x2 ≥ 5
          x1, x2 ≥ 0,

    x∗1 + x∗2 = 40/7 ≈ 5.7143,   (x∗1, x∗2) = (15/7, 25/7).

25 / 34
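A quick numerical check of the average-case LP above: at the optimum both inequality constraints are tight, so the tiny instance reduces to a 2×2 linear system and needs no LP solver (the helper below is just for this illustration).

```python
# Both constraints are active at the optimum of the average-case LP:
#   3*x1 + x2 = 10,  (2/3)*x1 + x2 = 5.

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1, a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y

x1, x2 = solve_2x2(3, 1, 10, 2 / 3, 1, 5)
# (x1, x2) = (15/7, 25/7), objective value x1 + x2 = 40/7 ≈ 5.7143
```

The same two-line check, with the worst-case coefficients (1, 1/3) or best-case coefficients (5, 1) substituted, reproduces the optimal values 10 and 5 on the next two slides.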


Stochastic Programming – Example

    min   x1 + x2
    s.t.  ω1x1 + x2 ≥ 10
          ω2x1 + x2 ≥ 5
          x1, x2 ≥ 0,

where ω1 ∼ U[1, 5] and ω2 ∼ U[1/3, 1]

Robust / Worst case: ω1 = 1, ω2 = 1/3

    min   x1 + x2
    s.t.  1x1 + x2 ≥ 10
          (1/3)x1 + x2 ≥ 5
          x1, x2 ≥ 0,

    x∗1 + x∗2 = 10,   e.g., (x∗1, x∗2) = (41/12, 79/12).

26 / 34


Stochastic Programming – Example

    min   x1 + x2
    s.t.  ω1x1 + x2 ≥ 10
          ω2x1 + x2 ≥ 5
          x1, x2 ≥ 0,

where ω1 ∼ U[1, 5] and ω2 ∼ U[1/3, 1]

Explorative / Best case: ω1 = 5, ω2 = 1

    min   x1 + x2
    s.t.  5x1 + x2 ≥ 10
          1x1 + x2 ≥ 5
          x1, x2 ≥ 0,

    x∗1 + x∗2 = 5,   e.g., (x∗1, x∗2) = (17/8, 23/8).

27 / 34


Online optimization

28 / 34


Online optimization

• We have a fixed and known loss F(x, ω)
• ω1, ω2, . . . are presented to us sequentially
  Can be chosen adversarially!
• Guess xk; observe ωk; incur cost F(xk, ωk); update to xk+1
• We get to see things only sequentially, and the sequence of samples shown to us by nature may depend on our guesses
• So a typical goal is to minimize Regret:

    (1/T)∑_{k=1}^T F(xk, ωk) − min_{x∈X} (1/T)∑_{k=1}^T F(x, ωk)

• That is, the difference from the best possible fixed solution we could have attained, had we been shown all the samples (ωk) in advance.
• Online optimization is an important idea in machine learning, game theory, decision making, etc.

29 / 34
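The guess/observe/incur/update protocol and the regret quantity can be sketched with online (sub)gradient descent. The loss, feasible set, and stepsize below are toy choices for illustration, not from the lecture:

```python
import math, random

# Toy instance: X = [-1, 1], F(x, w) = |x - w|, stepsize alpha_k ~ 1/sqrt(k).

def project(x):                          # projection onto X = [-1, 1]
    return max(-1.0, min(1.0, x))

def subgrad(x, w):                       # a subgradient of |x - w| in x
    return 1.0 if x > w else (-1.0 if x < w else 0.0)

random.seed(1)
ws = [random.uniform(-1, 1) for _ in range(2000)]

x, total = 0.0, 0.0
for k, w in enumerate(ws, start=1):
    total += abs(x - w)                  # incur cost F(x_k, w_k)
    x = project(x - (1.0 / math.sqrt(k)) * subgrad(x, w))

# Best fixed comparator in hindsight: a minimizer of sum_k |x - w_k| over X
# is a median of the samples (here all samples already lie in X).
best = sorted(ws)[len(ws) // 2]
comparator = sum(abs(best - w) for w in ws)
regret = (total - comparator) / len(ws)  # average regret, should be small
```

Here the samples are drawn i.i.d. for simplicity, but nothing in the loop depends on that: the same code runs against an adversarially chosen sequence, which is the point of the regret framework.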

Page 105: Convex Optimization - Suvritsuvrit.de/teach/ee227a/lect19.pdf · Convex Optimization (EE227A: UC Berkeley) ... [rf i(x)] = X i 1 m rf ... I Here, we obtained gin a two step process:

Online optimization

• We have fixed and known F (x, ω)

• ω1, ω2, . . . presented to us sequentially

Can be chosen adversarially!

• Guess xk; Observe ωk; incur cost F (xk, ωk); Update to xk+1

• We get to see things only sequentially, and the sequence ofsamples shown to us by nature may depend on our guesses

• So a typical goal is to minimize Regret

1T

∑Tk=1 F (xk, zk)−minx∈X

1T

∑Tk=1 F (x, zk)

• That is, difference from the best possible solution we could haveattained, had we been shown all the examples (zk).

• Online optimization is an important idea in machine learning,game theory, decision making, etc.

29 / 34

Page 106: Convex Optimization - Suvritsuvrit.de/teach/ee227a/lect19.pdf · Convex Optimization (EE227A: UC Berkeley) ... [rf i(x)] = X i 1 m rf ... I Here, we obtained gin a two step process:

Online optimization

• We have fixed and known F (x, ω)

• ω1, ω2, . . . presented to us sequentially

Can be chosen adversarially!

• Guess xk;

Observe ωk; incur cost F (xk, ωk); Update to xk+1

• We get to see things only sequentially, and the sequence ofsamples shown to us by nature may depend on our guesses

• So a typical goal is to minimize Regret

1T

∑Tk=1 F (xk, zk)−minx∈X

1T

∑Tk=1 F (x, zk)

• That is, difference from the best possible solution we could haveattained, had we been shown all the examples (zk).

• Online optimization is an important idea in machine learning,game theory, decision making, etc.

29 / 34

Online gradient descent

Based on Zinkevich (2003)

Slight generalization: F (x, ω) convex (in x); possibly nonsmooth

x ∈ X , a closed, bounded set

Simplify notation: fk(x) ≡ F (x, ωk)

Regret: RT := ∑_{k=1}^{T} fk(xk) − min_{x∈X} ∑_{k=1}^{T} fk(x)

30 / 34

Online gradient descent

Algorithm:

1 Select some x0 ∈ X , and α0 > 0

2 Round k of the algorithm (k ≥ 0):

  Output xk
  Receive k-th function fk
  Incur loss fk(xk)
  Pick gk ∈ ∂fk(xk)
  Update: xk+1 = PX (xk − αk gk)

Using αk = c/√(k + 1) and assuming ‖gk‖2 ≤ G, it can be shown that the average regret satisfies (1/T) RT ≤ O(1/√T)

31 / 34
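The loop above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the losses fk(x) = |x − ωk|, the interval X = [−1, 1], the constant c = 0.5, and the constant target sequence are all assumptions of the sketch.

```python
import numpy as np

def ogd(targets, x0=0.0, c=0.5):
    """Online gradient descent for f_k(x) = |x - w_k| on X = [-1, 1]."""
    x, played_losses = x0, []
    for k, w in enumerate(targets):
        played_losses.append(abs(x - w))        # incur f_k(x_k)
        gk = np.sign(x - w)                     # g_k in the subdifferential of f_k at x_k
        alpha = c / np.sqrt(k + 1)              # alpha_k = c / sqrt(k+1)
        x = np.clip(x - alpha * gk, -1.0, 1.0)  # projection P_X onto [-1, 1]
    return played_losses

targets = [0.3] * 400                           # nature plays a fixed point
losses = ogd(targets)
# The best fixed x in hindsight is 0.3, with total loss 0, so regret = sum(losses).
avg_regret = sum(losses) / len(losses)
```

Since the total loss of the best fixed point is zero here, the average loss itself is the average regret, and it decays roughly like the step sizes, consistent with the O(1/√T) bound.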

OGD – regret bound

Assumption: Lipschitz condition ‖∂fk‖2 ≤ G

x∗ = argmin_{x∈X} ∑_{k=1}^{T} fk(x)

Since gk ∈ ∂fk(xk), we have

  fk(x∗) ≥ fk(xk) + 〈gk, x∗ − xk〉,  or
  fk(xk) − fk(x∗) ≤ 〈gk, xk − x∗〉

Further analysis depends on bounding ‖xk+1 − x∗‖2²

32 / 34

OGD regret – bounding distance

Recall: xk+1 = PX (xk − αk gk). Thus,

  ‖xk+1 − x∗‖2² = ‖PX (xk − αk gk) − x∗‖2²
               = ‖PX (xk − αk gk) − PX (x∗)‖2²
               ≤ ‖xk − x∗ − αk gk‖2²        (PX is nonexpansive)
               = ‖xk − x∗‖2² + αk² ‖gk‖2² − 2αk 〈gk, xk − x∗〉

Rearranging,

  〈gk, xk − x∗〉 ≤ (‖xk − x∗‖2² − ‖xk+1 − x∗‖2²) / (2αk) + (αk/2) ‖gk‖2²

Now invoke fk(xk) − fk(x∗) ≤ 〈gk, xk − x∗〉:

  fk(xk) − fk(x∗) ≤ (‖xk − x∗‖2² − ‖xk+1 − x∗‖2²) / (2αk) + (αk/2) ‖gk‖2²

Sum over k = 1, . . . , T , set αk = c/√(k + 1), and use ‖gk‖2 ≤ G to obtain RT ≤ O(√T)

33 / 34
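The per-step inequality above can be checked numerically. The sketch below is illustrative: the losses fk(x) = |〈ak, x〉 − bk|, the unit-ball constraint set, and the random data are assumptions. It verifies fk(xk) − fk(x∗) ≤ (‖xk − x∗‖² − ‖xk+1 − x∗‖²)/(2αk) + (αk/2)‖gk‖² at every round, for an arbitrary fixed comparator x∗ ∈ X.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 3
A, b = rng.normal(size=(T, d)), rng.normal(size=T)  # losses f_k(x) = |<a_k, x> - b_k|

def proj(x):
    """P_X: projection onto the unit Euclidean ball."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

xstar = proj(rng.normal(size=d))  # any fixed comparator in X works
x = np.zeros(d)
max_violation = -np.inf
for k in range(T):
    alpha = 1.0 / np.sqrt(k + 1)
    r = A[k] @ x - b[k]
    gk = np.sign(r) * A[k]                     # subgradient of f_k at x_k
    x_next = proj(x - alpha * gk)              # projected subgradient step
    lhs = abs(r) - abs(A[k] @ xstar - b[k])    # f_k(x_k) - f_k(x*)
    rhs = (np.linalg.norm(x - xstar) ** 2
           - np.linalg.norm(x_next - xstar) ** 2) / (2 * alpha) \
          + (alpha / 2) * np.linalg.norm(gk) ** 2
    max_violation = max(max_violation, lhs - rhs)
    x = x_next
# lhs <= rhs should hold at every step, so max_violation <= 0 (up to roundoff)
```

Summing the right-hand side over k telescopes the distance terms, which is exactly how the O(√T) regret bound is obtained.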

References

♠ A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization (2009).

♠ M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML (2003).

♠ J. Linderoth. Lecture slides on Stochastic Programming (2003).

34 / 34