Introduction to Optimization Theory

Post on 10-Jun-2022


Lecture #7 - 10/6/20
MS&E 213 / CS 269O

Aaron Sidford
sidford@stanford.edu

[Figure: plot of a function $f$ over $\mathbb{R}$, marking the minimizer $x_*$ and the optimal value $f_*$.]

Plan for Today

• Recap: Accelerated Gradient Descent (AGD)
• Proof: approximately optimal AGD for smooth strongly convex functions
• Extensions: non-strongly convex, optimal complexity, momentum
• Thursday: generalizations and applications

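Recap

Problem: $\min_{x \in \mathbb{R}^n} f(x)$

| Regularity | Oracle | Goal | Algorithm | Iterations |
|------------|--------|------|-----------|------------|
| $n = 1$, $f(x) \in [0,1]$, $x_* \in [0,1]$ | value | ½-optimal | anything | $\infty$ |
| $n = 1$, $x_* \in [0,1]$, $L$-Lipschitz | value | $\epsilon$-optimal | $\epsilon$-net | $\Theta(L/\epsilon)$ |
| $x_* \in [0,1]^n$, $L$-Lipschitz in $\|\cdot\|_\infty$ | value | $\epsilon$-optimal | $\epsilon$-net | $\Theta((L/\epsilon)^n)$ |
| $L$-smooth and bounded | value, gradient | $\epsilon$-optimal | $\epsilon$-net | exponential |
| $L$-smooth | gradient | $\epsilon$-critical | gradient descent | $O\left(\frac{L\,(f(x_0) - f_*)}{\epsilon^2}\right)$ |
| $L$-smooth, $\mu$-strongly convex | gradient | $\epsilon$-optimal | gradient descent | $O\left(\frac{L}{\mu}\log\frac{f(x_0) - f_*}{\epsilon}\right)$ |
| $L$-smooth, convex | gradient | $\epsilon$-optimal | gradient descent | $O\left(\frac{L\,\|x_0 - x_*\|_2^2}{\epsilon}\right)$ |

Today: prove and discuss improvements to $O\left(\frac{L}{\mu}\log\frac{f(x_0) - f_*}{\epsilon}\right)$ and $O\left(\frac{L\,\|x_0 - x_*\|_2^2}{\epsilon}\right)$.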

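Recap

Theorem: $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex (with respect to $\|\cdot\|_2$) if and only if the following hold for all $x, y$:

• $f(y) \le U_x(y) \stackrel{\text{def}}{=} f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|_2^2$
• $f(y) \ge L_x(y) \stackrel{\text{def}}{=} f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|_2^2$

Goal #1: improve $O\left(\frac{L}{\mu}\log\frac{f(x_0) - f_*}{\epsilon}\right)$ to $\tilde{O}\left(\sqrt{\frac{L}{\mu}}\log\frac{f(x_0) - f_*}{\epsilon}\right)$

[Figure: plot of $f$ over $\mathbb{R}$ with a point $x$ marked.]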

Approach

• Maintain a point $x_k \in \mathbb{R}^n$
• Maintain a lower bound $L_k : \mathbb{R}^n \to \mathbb{R}$ such that $L_k(x) \le f(x)$ for all $x \in \mathbb{R}^n$
• Update both in each iteration

Progress Measure

• $f(x_k) - \min_{x \in \mathbb{R}^n} L_k(x) \ge f(x_k) - L_k(x_*) \ge f(x_k) - f_*$

Choice of proof

• There are many proofs to choose from
• This one is intuitive and mechanical, and it lets us touch on ideas from other proofs
• It loses a logarithmic factor; we will explain how to remove it

Tools: Quadratic Lower Bounds

Lemma 1: $L_y(x) = f(y) + \nabla f(y)^\top (x - y) + \frac{\mu}{2}\|x - y\|_2^2 = \psi_y + \frac{\mu}{2}\|x - v_y\|_2^2$
for $\psi_y = f(y) - \frac{1}{2\mu}\|\nabla f(y)\|_2^2$ and $v_y = y - \frac{1}{\mu}\nabla f(y)$.

Lemma 2: If $f_1, f_2 : \mathbb{R}^n \to \mathbb{R}$ are defined for all $x \in \mathbb{R}^n$ by
$f_1(x) = \psi_1 + \frac{\mu}{2}\|x - v_1\|_2^2$ and $f_2(x) = \psi_2 + \frac{\mu}{2}\|x - v_2\|_2^2$,
then for all $\alpha \in [0,1]$ we have

$$f_\alpha(x) = \alpha \cdot f_1(x) + (1 - \alpha) \cdot f_2(x) = \psi_\alpha + \frac{\mu}{2}\|x - v_\alpha\|_2^2$$

where
• $v_\alpha = \alpha \cdot v_1 + (1 - \alpha) \cdot v_2$
• $\psi_\alpha = \alpha \cdot \psi_1 + (1 - \alpha) \cdot \psi_2 + \frac{\mu}{2}\alpha(1 - \alpha)\|v_1 - v_2\|_2^2$
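Both lemmas are algebraic identities, so they can be sanity-checked numerically. The sketch below is only an illustration: the diagonal quadratic test function, the constant `MU`, and all helper names are assumptions introduced here, not part of the lecture.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Hypothetical test function: f(x) = 0.5 * sum(d_i * x_i^2), so grad f(x)_i = d_i * x_i.
D = [1.0, 4.0, 10.0]
MU = min(D)  # strong convexity parameter of this particular f

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad(x):
    return [d * xi for d, xi in zip(D, x)]

def lemma1_gap(x, y):
    """Difference between the two expressions for L_y(x) in Lemma 1."""
    g = grad(y)
    lhs = f(y) + dot(g, [xi - yi for xi, yi in zip(x, y)]) + 0.5 * MU * sq_dist(x, y)
    psi_y = f(y) - dot(g, g) / (2 * MU)
    v_y = [yi - gi / MU for yi, gi in zip(y, g)]
    return lhs - (psi_y + 0.5 * MU * sq_dist(x, v_y))

def lemma2_gap(x, psi1, v1, psi2, v2, alpha):
    """Difference between the convex combination and its closed form in Lemma 2."""
    f1 = psi1 + 0.5 * MU * sq_dist(x, v1)
    f2 = psi2 + 0.5 * MU * sq_dist(x, v2)
    v_a = [alpha * a + (1 - alpha) * b for a, b in zip(v1, v2)]
    psi_a = (alpha * psi1 + (1 - alpha) * psi2
             + 0.5 * MU * alpha * (1 - alpha) * sq_dist(v1, v2))
    return alpha * f1 + (1 - alpha) * f2 - (psi_a + 0.5 * MU * sq_dist(x, v_a))

rng = random.Random(0)

def rand_vec():
    return [rng.uniform(-1, 1) for _ in D]

max_gap1 = max(abs(lemma1_gap(rand_vec(), rand_vec())) for _ in range(100))
max_gap2 = max(abs(lemma2_gap(rand_vec(), rng.uniform(-1, 1), rand_vec(),
                              rng.uniform(-1, 1), rand_vec(), rng.random()))
               for _ in range(100))
print(max_gap1, max_gap2)  # both should sit at floating-point rounding level
```

Lemma 1 holds for any gradient vector, so the check does not depend on the specific test function; the quadratic is used only to have concrete numbers.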

Accelerated Gradient Descent (AGD)

• Initial $x_0 \in \mathbb{R}^n$ and $L_0(x) = \psi_0 + \frac{\mu}{2}\|x - v_0\|_2^2$ such that $f(x) \ge L_0(x)$ for all $x$
• Repeat for $k = 0, 1, 2, \ldots$
  • $y_k = \alpha \cdot x_k + (1 - \alpha) \cdot v_k$ where $\alpha \in [0,1]$
  • $L_{y_k}(x) = f(y_k) + \nabla f(y_k)^\top (x - y_k) + \frac{\mu}{2}\|x - y_k\|_2^2$
  • $L_{k+1}(x) = \psi_{k+1} + \frac{\mu}{2}\|x - v_{k+1}\|_2^2 = \beta L_k(x) + (1 - \beta) L_{y_k}(x)$ where $\beta \in [0,1]$
  • $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$

Theorem: $L_k(x) \le f(x)$ for all $k \ge 0$ and $x \in \mathbb{R}^n$. If $\kappa = \frac{L}{\mu}$, $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$, and $\beta = 1 - \kappa^{-1/2}$, then

$$f(x_{k+1}) - \psi_{k+1} \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)\left(f(x_k) - \psi_k\right)$$

and $\approx \sqrt{\kappa}$ iterations suffice to shrink the gap by a constant factor.

Proof?


Accelerated Gradient Descent (AGD): explicit update

Applying Lemma 1 to $L_{y_k}$ and Lemma 2 to the combination $L_{k+1} = \beta L_k + (1 - \beta) L_{y_k}$ gives the center of $L_{k+1}$ in closed form:

$$v_{k+1} = \beta v_k + (1 - \beta)\left(y_k - \frac{1}{\mu}\nabla f(y_k)\right)$$

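Some Intuition

• Initial $x_0 \in \mathbb{R}^n$ and $L_0(x) = \psi_0 + \frac{\mu}{2}\|x - v_0\|_2^2$ such that $f(x) \ge L_0(x)$ for all $x$
• Repeat for $k = 0, 1, 2, \ldots$
  • $y_k = \alpha \cdot x_k + (1 - \alpha) \cdot v_k$ where $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$ and $\kappa = \frac{L}{\mu}$
  • $v_{k+1} = \beta v_k + (1 - \beta)\left(y_k - \frac{1}{\mu}\nabla f(y_k)\right)$ where $\beta = 1 - \frac{1}{\sqrt{\kappa}}$
  • $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$

Note
• $\uparrow \kappa \Rightarrow \uparrow \alpha$ (i.e. the more we use the gradient point $x_k$)
• $\uparrow \kappa \Rightarrow \uparrow \beta$ (i.e. the less we use the new lower bound)
• $\uparrow \kappa \Rightarrow \uparrow (1 - \beta)/\mu$ (i.e. the bigger the "gradient step" for $v_{k+1}$)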
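The three monotonicity notes can be seen on concrete numbers. This is a minimal sketch; the function name `agd_parameters` and the choice to fix $L = 1$ while shrinking $\mu$ are assumptions made here for illustration (the third note is read with $L$ held fixed).

```python
import math

def agd_parameters(L, mu):
    """Return (kappa, alpha, beta, v_step) for the parameter choices on this slide."""
    kappa = L / mu
    alpha = math.sqrt(kappa) / (math.sqrt(kappa) + 1.0)
    beta = 1.0 - 1.0 / math.sqrt(kappa)
    v_step = (1.0 - beta) / mu  # effective gradient step size in the v update
    return kappa, alpha, beta, v_step

# Hypothetical setting: fix L = 1 and shrink mu so that kappa grows.
rows = [agd_parameters(1.0, mu) for mu in (1.0, 0.25, 0.01, 0.0001)]
for kappa, alpha, beta, v_step in rows:
    print(f"kappa={kappa:9.1f}  alpha={alpha:.4f}  beta={beta:.4f}  (1-beta)/mu={v_step:.2f}")
```

As $\kappa$ grows, all three printed parameter columns increase, matching the notes above.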

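Proof Plan

Theorem: $L_k(x) \le f(x)$ for all $k \ge 0$ and $x \in \mathbb{R}^n$, and if $\kappa = \frac{L}{\mu}$, $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$, and $\beta = 1 - \frac{1}{\sqrt{\kappa}}$, then

$$f(x_{k+1}) - \psi_{k+1} \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)\left(f(x_k) - \psi_k\right)$$

and $\approx \sqrt{\kappa}$ iterations suffice to shrink the gap by a constant factor.

Plan (the fact $L_k(x) \le f(x)$ is immediate, since each $L_{k+1}$ is a convex combination of lower bounds)
• Upper bound $f(x_{k+1})$ (gradient descent step)
• Lower bound $\psi_{k+1}$ (lower bound combination analysis)
• Leverage choice of $y_k$ (algebra)
• Pick $\alpha$ and $\beta$ so everything cancels (more algebra)

Recall the algorithm:
• $y_k = \alpha \cdot x_k + (1 - \alpha) \cdot v_k$
• $L_{y_k}(x) = f(y_k) + \nabla f(y_k)^\top (x - y_k) + \frac{\mu}{2}\|x - y_k\|_2^2$
• $L_{k+1}(x) = \psi_{k+1} + \frac{\mu}{2}\|x - v_{k+1}\|_2^2 = \beta L_k(x) + (1 - \beta) L_{y_k}(x)$
• $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$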

Upper bound

• $f(x_{k+1}) \le \;???$
• Gradient descent!
• $f(x_{k+1}) \le f(y_k) - \frac{1}{2L}\|\nabla f(y_k)\|_2^2$
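The gradient descent bound $f(y - \frac{1}{L}\nabla f(y)) \le f(y) - \frac{1}{2L}\|\nabla f(y)\|_2^2$ can be checked on a simple smooth function. The sketch below assumes a hypothetical diagonal quadratic with $L$ equal to its largest curvature; the names `D`, `L_SMOOTH`, and `descent_slack` are introduced here for illustration only.

```python
import random

# Hypothetical L-smooth test function: f(x) = 0.5 * sum(d_i * x_i^2) with L = max(d_i).
D = [0.5, 2.0, 8.0]
L_SMOOTH = max(D)

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad(x):
    return [d * xi for d, xi in zip(D, x)]

def descent_slack(y):
    """Return f(y) - ||grad f(y)||^2 / (2L) - f(y - grad f(y) / L).

    L-smoothness implies this quantity is nonnegative.
    """
    g = grad(y)
    x_next = [yi - gi / L_SMOOTH for yi, gi in zip(y, g)]
    return f(y) - sum(gi * gi for gi in g) / (2 * L_SMOOTH) - f(x_next)

rng = random.Random(1)
slacks = [descent_slack([rng.uniform(-5, 5) for _ in D]) for _ in range(200)]
print(min(slacks))  # nonnegative up to floating-point rounding
```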


Lower Bound

• Apply Tool #1 (Lemma 1)
  • $L_{y_k}(x) = \psi_{y_k} + \frac{\mu}{2}\|x - v_{y_k}\|_2^2$
  • $\psi_{y_k} = f(y_k) - \frac{1}{2\mu}\|\nabla f(y_k)\|_2^2$ and $v_{y_k} = y_k - \frac{1}{\mu}\nabla f(y_k)$
• Apply Tool #2 (Lemma 2)
  • $\psi_{k+1} = \beta\psi_k + (1 - \beta)\psi_{y_k} + \frac{\mu}{2}\beta(1 - \beta)\|v_k - v_{y_k}\|_2^2$
• Algebra
  • $\|v_k - v_{y_k}\|_2^2 = \|v_k - y_k\|_2^2 + \frac{2}{\mu}\nabla f(y_k)^\top (v_k - y_k) + \frac{1}{\mu^2}\|\nabla f(y_k)\|_2^2$
• More algebra
  • $\psi_{k+1} \ge \beta\psi_k + (1 - \beta)\left[f(y_k) - \frac{1 - \beta}{2\mu}\|\nabla f(y_k)\|_2^2 + \beta\nabla f(y_k)^\top (v_k - y_k)\right]$
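For readers who want the step from "Algebra" to "More algebra" written out, here is a sketch of the substitution. The shorthand $g = \nabla f(y_k)$ is introduced here only to keep the lines short; the final inequality drops the nonnegative term $\frac{\mu}{2}\beta(1-\beta)\|v_k - y_k\|_2^2$.

```latex
\begin{align*}
\psi_{k+1}
&= \beta\psi_k + (1-\beta)\Big(f(y_k) - \tfrac{1}{2\mu}\|g\|_2^2\Big)
   + \tfrac{\mu}{2}\beta(1-\beta)\Big(\|v_k - y_k\|_2^2 + \tfrac{2}{\mu}\, g^\top (v_k - y_k) + \tfrac{1}{\mu^2}\|g\|_2^2\Big) \\
&= \beta\psi_k + (1-\beta)\Big(f(y_k) - \tfrac{1-\beta}{2\mu}\|g\|_2^2 + \beta\, g^\top (v_k - y_k)\Big)
   + \tfrac{\mu}{2}\beta(1-\beta)\|v_k - y_k\|_2^2 \\
&\ge \beta\psi_k + (1-\beta)\Big(f(y_k) - \tfrac{1-\beta}{2\mu}\|g\|_2^2 + \beta\, g^\top (v_k - y_k)\Big).
\end{align*}
```

The coefficient of $\|g\|_2^2$ in the second line comes from $-\frac{1-\beta}{2\mu} + \frac{\beta(1-\beta)}{2\mu} = -\frac{(1-\beta)^2}{2\mu}$.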


Choice of $y_k$

• Goal
  • Lower bound $\nabla f(y_k)^\top (v_k - y_k)$
• Note
  • $(1 - \alpha)(v_k - y_k) + \alpha(x_k - y_k) = 0$
  • $v_k - y_k = \frac{\alpha}{1 - \alpha}(y_k - x_k)$
  • (note that for every $\gamma > 0$ there is an $\alpha \in [0,1]$ such that $\frac{\alpha}{1 - \alpha} = \gamma$)
• Convexity
  • $f(x_k) \ge f(y_k) + \nabla f(y_k)^\top (x_k - y_k)$
  • (note, this is the first time we have used convexity between two points where one of the points is not $x_*$)
• Algebra
  • $\nabla f(y_k)^\top (v_k - y_k) \ge \frac{\alpha}{1 - \alpha}\left[f(y_k) - f(x_k)\right]$


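Algebra

So far (the fact $L_k(x) \le f(x)$ is immediate)
• Upper bound: $f(x_{k+1}) \le f(y_k) - \frac{1}{2L}\|\nabla f(y_k)\|_2^2$
• Lower bound: $\psi_{k+1} \ge \beta\psi_k + (1 - \beta)\left[f(y_k) - \frac{1 - \beta}{2\mu}\|\nabla f(y_k)\|_2^2 + \beta\nabla f(y_k)^\top (v_k - y_k)\right]$
• Choice of $y_k$: $\nabla f(y_k)^\top (v_k - y_k) \ge \frac{\alpha}{1 - \alpha}\left[f(y_k) - f(x_k)\right]$

Rearranging

$$\begin{aligned}
f(x_{k+1}) - \psi_{k+1}
&\le f(y_k) - \frac{1}{2L}\|\nabla f(y_k)\|_2^2 - \beta\psi_k - (1 - \beta)\left[f(y_k) - \frac{1 - \beta}{2\mu}\|\nabla f(y_k)\|_2^2\right] - \beta(1 - \beta)\frac{\alpha}{1 - \alpha}\left[f(y_k) - f(x_k)\right] \\
&= \beta\left[\frac{\alpha(1 - \beta)}{1 - \alpha} f(x_k) - \psi_k\right] + \beta\left[1 - \frac{\alpha(1 - \beta)}{1 - \alpha}\right] f(y_k) + \frac{1}{2}\left[\frac{(1 - \beta)^2}{\mu} - \frac{1}{L}\right]\|\nabla f(y_k)\|_2^2
\end{aligned}$$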


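Cancellations

Pick $\alpha$ and $\beta$ so the extra terms cancel in

$$f(x_{k+1}) - \psi_{k+1} \le \beta\left[\frac{\alpha(1 - \beta)}{1 - \alpha} f(x_k) - \psi_k\right] + \beta\left[1 - \frac{\alpha(1 - \beta)}{1 - \alpha}\right] f(y_k) + \frac{1}{2}\left[\frac{(1 - \beta)^2}{\mu} - \frac{1}{L}\right]\|\nabla f(y_k)\|_2^2$$

Choice of $\beta$
• $\frac{(1 - \beta)^2}{\mu} - \frac{1}{L} = 0$
• $\Leftrightarrow (1 - \beta)^2 = \kappa^{-1}$
• $\Leftrightarrow \beta = 1 - \kappa^{-1/2}$

Choice of $\alpha$
• $\frac{\alpha(1 - \beta)}{1 - \alpha} = 1 \Leftrightarrow \frac{\alpha}{1 - \alpha} = \frac{1}{1 - \beta} = \sqrt{\kappa}$
• $\Leftrightarrow \alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$

With $\kappa = L/\mu$, $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$, and $\beta = 1 - \kappa^{-1/2}$, the $f(y_k)$ and gradient terms vanish and the bound becomes $f(x_{k+1}) - \psi_{k+1} \le \beta\left(f(x_k) - \psi_k\right)$.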
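The cancellation can be confirmed on concrete numbers. In the sketch below, `residual_coefficients` is a hypothetical helper that evaluates the three coefficients of the rearranged bound, and the values $L = 100$, $\mu = 1$ are arbitrary choices for illustration.

```python
import math

def residual_coefficients(L, mu, alpha, beta):
    """Coefficients (c_x, c_y, c_g) of the rearranged bound

    f(x_{k+1}) - psi_{k+1} <= beta * (c_x * f(x_k) - psi_k)
                              + c_y * f(y_k) + c_g * ||grad f(y_k)||^2.
    """
    ratio = alpha * (1.0 - beta) / (1.0 - alpha)
    c_x = ratio
    c_y = beta * (1.0 - ratio)
    c_g = 0.5 * ((1.0 - beta) ** 2 / mu - 1.0 / L)
    return c_x, c_y, c_g

L, mu = 100.0, 1.0  # hypothetical values, so kappa = 100
kappa = L / mu
beta = 1.0 - kappa ** -0.5
alpha = math.sqrt(kappa) / (math.sqrt(kappa) + 1.0)
c_x, c_y, c_g = residual_coefficients(L, mu, alpha, beta)
print(c_x, c_y, c_g, beta)  # c_x near 1, c_y and c_g near 0, beta = 1 - 1/sqrt(kappa)
```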

Proof Plan (complete)

Theorem: $L_k(x) \le f(x)$ for all $k \ge 0$ and $x \in \mathbb{R}^n$, and if $\kappa = \frac{L}{\mu}$, $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$, and $\beta = 1 - \frac{1}{\sqrt{\kappa}}$, then

$$f(x_{k+1}) - \psi_{k+1} \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)\left(f(x_k) - \psi_k\right)$$

and $\approx \sqrt{\kappa}$ iterations suffice to shrink the gap by a constant factor.

Plan (the fact $L_k(x) \le f(x)$ is immediate)
• ✓ Upper bound: $f(x_{k+1}) \le f(y_k) - \frac{1}{2L}\|\nabla f(y_k)\|_2^2$
• ✓ Lower bound: $\psi_{k+1} \ge \beta\psi_k + (1 - \beta)\left[f(y_k) - \frac{1 - \beta}{2\mu}\|\nabla f(y_k)\|_2^2 + \beta\nabla f(y_k)^\top (v_k - y_k)\right]$
• ✓ Choice of $y_k$: $\nabla f(y_k)^\top (v_k - y_k) \ge \frac{\alpha}{1 - \alpha}\left[f(y_k) - f(x_k)\right]$
• ✓ Pick $\alpha$ and $\beta$ so everything cancels (more algebra)


Initial Lower Bound?

• Goal: $L_0(x) = \psi_0 + \frac{\mu}{2}\|x - v_0\|_2^2$ such that $f(x) \ge L_0(x)$
• Idea: $L_{x_0}(x) = f(x_0) + \nabla f(x_0)^\top (x - x_0) + \frac{\mu}{2}\|x - x_0\|_2^2$
• $L_{x_0}(x) = \psi_0 + \frac{\mu}{2}\|x - v_0\|_2^2$
• $\psi_0 = f(x_0) - \frac{1}{2\mu}\|\nabla f(x_0)\|_2^2$
• $v_0 = x_0 - \frac{1}{\mu}\nabla f(x_0)$
• One gradient evaluation!

A Proof!!

• For initial $x_0 \in \mathbb{R}^n$ compute $v_0 = x_0 - \frac{1}{\mu}\nabla f(x_0)$
• Repeat for $k = 0, 1, 2, \ldots$
  • $y_k = \alpha \cdot x_k + (1 - \alpha) \cdot v_k$ where $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$ and $\kappa = \frac{L}{\mu}$
  • $v_{k+1} = \beta v_k + (1 - \beta)\left(y_k - \frac{1}{\mu}\nabla f(y_k)\right)$ where $\beta = 1 - \frac{1}{\sqrt{\kappa}}$
  • $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$

• Theorem: $f(x_{k+1}) - \psi_{k+1} \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)\left(f(x_k) - \psi_k\right)$ for all $k \ge 0$, where each $\psi_k \le f(x_*)$ and $\psi_0 = f(x_0) - \frac{1}{2\mu}\|\nabla f(x_0)\|_2^2$
• Corollary: Can compute an $\epsilon$-optimal point in $O\left(\sqrt{\kappa}\log\left(\kappa\left[f(x_0) - f_*\right]/\epsilon\right)\right)$ queries!
• Proof: $\|\nabla f(x_0)\|_2^2 \le 2L\left[f(x_0) - f_*\right]$ and $f(x_k) - f_* \le f(x_k) - \psi_k \le \left(1 - \kappa^{-1/2}\right)^k \cdot 2\kappa\left[f(x_0) - f_*\right]$
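The iteration and the per-step contraction of $f(x_k) - \psi_k$ can be traced in code. The following is a minimal sketch, assuming a hypothetical diagonal quadratic with $x_* = 0$ and $f_* = 0$; the function `agd_with_lower_bound` and all helper names are introduced here and are not from the lecture. It updates $\psi_k$ with Lemma 2 and $v_k$ with the explicit update.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * sum(d_i * x_i^2), minimized at x* = 0 with f* = 0.
D = [1.0, 10.0, 100.0]
MU, L_SMOOTH = min(D), max(D)

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad(x):
    return [d * xi for d, xi in zip(D, x)]

def sq_norm(a):
    return sum(ai * ai for ai in a)

def agd_with_lower_bound(x0, iters):
    """Run the AGD iteration above; return (gaps, final x, beta) with gaps[k] = f(x_k) - psi_k."""
    sqrt_kappa = math.sqrt(L_SMOOTH / MU)
    alpha = sqrt_kappa / (sqrt_kappa + 1.0)
    beta = 1.0 - 1.0 / sqrt_kappa
    g0 = grad(x0)
    x = list(x0)
    v = [xi - gi / MU for xi, gi in zip(x0, g0)]  # v_0 from the initial lower bound
    psi = f(x0) - sq_norm(g0) / (2 * MU)          # psi_0
    gaps = [f(x) - psi]
    for _ in range(iters):
        y = [alpha * xi + (1 - alpha) * vi for xi, vi in zip(x, v)]
        g = grad(y)
        v_y = [yi - gi / MU for yi, gi in zip(y, g)]  # center of L_{y_k} (Lemma 1)
        psi_y = f(y) - sq_norm(g) / (2 * MU)          # minimum value of L_{y_k} (Lemma 1)
        dist = sq_norm([a - b for a, b in zip(v, v_y)])
        # Lemma 2: minimum value and center of beta * L_k + (1 - beta) * L_{y_k}.
        psi = beta * psi + (1 - beta) * psi_y + 0.5 * MU * beta * (1 - beta) * dist
        v = [beta * vi + (1 - beta) * vyi for vi, vyi in zip(v, v_y)]
        x = [yi - gi / L_SMOOTH for yi, gi in zip(y, g)]
        gaps.append(f(x) - psi)
    return gaps, x, beta

gaps, x_final, beta = agd_with_lower_bound([1.0, 1.0, 1.0], 100)
print(gaps[0], gaps[-1], f(x_final))
```

Because $f_* = 0$ for this test problem, the final gap upper-bounds $f(x_k) - f_*$, which is how the corollary uses the theorem.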


How to improve?

Improved Potential Function

• For initial $x_0 \in \mathbb{R}^n$ let $v_0 = x_0$
• Repeat for $k = 0, 1, 2, \ldots$
  • $y_k = \alpha \cdot x_k + (1 - \alpha) \cdot v_k$ where $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$ and $\kappa = \frac{L}{\mu}$
  • $v_{k+1} = \beta v_k + (1 - \beta)\left(y_k - \frac{1}{\mu}\nabla f(y_k)\right)$ where $\beta = 1 - \frac{1}{\sqrt{\kappa}}$
  • $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$

• Theorem: $p_k = f(x_k) - f_* + \frac{\mu}{2}\|v_k - x_*\|_2^2$ satisfies $p_{k+1} \le \left(1 - \kappa^{-1/2}\right) p_k$ for all $k \ge 0$
• Corollary: Can compute an $\epsilon$-optimal point in $O\left(\sqrt{\kappa}\log\left(\left[f(x_0) - f_*\right]/\epsilon\right)\right)$ queries!
• Proof: $\frac{\mu}{2}\|x_0 - x_*\|_2^2 \le f(x_0) - f_*$
• Proof: $f(x_k) - f_* \le p_k \le \left(1 - \kappa^{-1/2}\right)^k p_0 \le \left(1 - \kappa^{-1/2}\right)^k \cdot 2\left[f(x_0) - f_*\right]$
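The potential $p_k$ can be tracked directly when $x_*$ and $f_*$ are known. The sketch below assumes a hypothetical diagonal quadratic (so $x_* = 0$ and $f_* = 0$); `potential` and `run_agd` are illustrative names introduced here.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * sum(d_i * x_i^2), with x* = 0 and f* = 0.
D = [1.0, 10.0, 100.0]
MU, L_SMOOTH = min(D), max(D)

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad(x):
    return [d * xi for d, xi in zip(D, x)]

def potential(x, v):
    """p_k = f(x_k) - f* + (mu / 2) * ||v_k - x*||^2, using f* = 0 and x* = 0 here."""
    return f(x) + 0.5 * MU * sum(vi * vi for vi in v)

def run_agd(x0, iters):
    """Run the iteration above with v_0 = x_0; return (list of potentials, beta)."""
    sqrt_kappa = math.sqrt(L_SMOOTH / MU)
    alpha = sqrt_kappa / (sqrt_kappa + 1.0)
    beta = 1.0 - 1.0 / sqrt_kappa
    x, v = list(x0), list(x0)
    potentials = [potential(x, v)]
    for _ in range(iters):
        y = [alpha * xi + (1 - alpha) * vi for xi, vi in zip(x, v)]
        g = grad(y)
        v = [beta * vi + (1 - beta) * (yi - gi / MU) for vi, yi, gi in zip(v, y, g)]
        x = [yi - gi / L_SMOOTH for yi, gi in zip(y, g)]
        potentials.append(potential(x, v))
    return potentials, beta

x0 = [1.0, 1.0, 1.0]
potentials, beta = run_agd(x0, 50)
print(potentials[0], potentials[-1], beta)
```

Each consecutive pair of potentials should satisfy $p_{k+1} \le \beta\, p_k$, and $p_0 \le 2\left[f(x_0) - f_*\right]$ as used in the corollary.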

Momentum?

Parameters: $\kappa = \frac{L}{\mu}$, $\alpha = \frac{\sqrt{\kappa}}{\sqrt{\kappa} + 1}$, and $\beta = 1 - \frac{1}{\sqrt{\kappa}}$

Algorithm 1 (initial $x_0 \in \mathbb{R}^n$)
• Let $v_0 = x_0$
• Repeat for $k = 0, 1, 2, \ldots$
  • $y_k = \alpha \cdot x_k + (1 - \alpha) \cdot v_k$
  • $v_{k+1} = \beta v_k + (1 - \beta)\left(y_k - \frac{1}{\mu}\nabla f(y_k)\right)$
  • $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$

Algorithm 2 (initial $x_0 \in \mathbb{R}^n$)
• Let $x_1 = x_0 - \frac{1}{L}\nabla f(x_0)$
• Repeat for $k = 1, 2, \ldots$
  • $y_k = x_k + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\left(x_k - x_{k-1}\right)$
  • $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$

These algorithms are equivalent!

The $x_k$ are identical in each algorithm.
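The equivalence can be observed by running both forms side by side. This is a sketch under assumptions: the diagonal quadratic test function and the names `algorithm_1` and `algorithm_2` are introduced here for illustration.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * sum(d_i * x_i^2), so grad f(x)_i = d_i * x_i.
D = [1.0, 10.0, 100.0]
MU, L_SMOOTH = min(D), max(D)
SQRT_KAPPA = math.sqrt(L_SMOOTH / MU)

def grad(x):
    return [d * xi for d, xi in zip(D, x)]

def algorithm_1(x0, iters):
    """Three-sequence form (x_k, y_k, v_k) with v_0 = x_0; returns [x_0, ..., x_iters]."""
    alpha = SQRT_KAPPA / (SQRT_KAPPA + 1.0)
    beta = 1.0 - 1.0 / SQRT_KAPPA
    x, v = list(x0), list(x0)
    xs = [list(x)]
    for _ in range(iters):
        y = [alpha * xi + (1 - alpha) * vi for xi, vi in zip(x, v)]
        g = grad(y)
        v = [beta * vi + (1 - beta) * (yi - gi / MU) for vi, yi, gi in zip(v, y, g)]
        x = [yi - gi / L_SMOOTH for yi, gi in zip(y, g)]
        xs.append(x)
    return xs

def algorithm_2(x0, iters):
    """Momentum form; returns [x_0, ..., x_iters]."""
    momentum = (SQRT_KAPPA - 1.0) / (SQRT_KAPPA + 1.0)
    x_prev = list(x0)
    x = [xi - gi / L_SMOOTH for xi, gi in zip(x0, grad(x0))]
    xs = [x_prev, x]
    for _ in range(iters - 1):
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad(y)
        x_prev, x = x, [yi - gi / L_SMOOTH for yi, gi in zip(y, g)]
        xs.append(x)
    return xs

xs1 = algorithm_1([1.0, -2.0, 3.0], 30)
xs2 = algorithm_2([1.0, -2.0, 3.0], 30)
max_diff = max(abs(a - b) for p, q in zip(xs1, xs2) for a, b in zip(p, q))
print(max_diff)  # should be at floating-point rounding level
```

The link between the two forms is the identity $v_{k+1} = x_k + \sqrt{\kappa}\,(x_{k+1} - x_k)$, which follows by substituting $v_k = (\sqrt{\kappa} + 1)\,y_k - \sqrt{\kappa}\,x_k$ into the $v$ update.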

What if not strongly convex?

Idea
• Solve $\min_x g(x)$ where $g(x) = f(x) + \frac{\lambda}{2}\|x - x_0\|_2^2$
• $g(x)$ is $\lambda$-strongly convex and $(L + \lambda)$-smooth
• Can compute $x_k$, an $\frac{\epsilon}{2}$-optimal point for $g$, in $O\left(\sqrt{\frac{L + \lambda}{\lambda}}\log\frac{g(x_0) - g_*}{\epsilon}\right)$ steps
• $f(x) \le g(x)$ so $g_* \ge f_*$
• $g(x_0) - g_* = f(x_0) - g_* \le f(x_0) - f_* \le \frac{L}{2}\|x_0 - x_*\|_2^2$
• $f(x_k) \le g(x_k) \le g_* + \frac{\epsilon}{2} \le g(x_*) + \frac{\epsilon}{2} = f(x_*) + \frac{\lambda}{2}\|x_0 - x_*\|_2^2 + \frac{\epsilon}{2}$
• If $\lambda = \frac{\epsilon}{\|x_0 - x_*\|_2^2}$, we have an $\epsilon$-optimal point in $O\left(\sqrt{\frac{L\|x_0 - x_*\|_2^2}{\epsilon}}\log\frac{L\|x_0 - x_*\|_2^2}{\epsilon}\right)$ queries

Can remove the log factor by both a better reduction and a more direct algorithm (see notes).
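The reduction above can be sketched in a few lines. Everything concrete here is an assumption for illustration: the test function $f(x) = \frac{1}{2}x_2^2$ on $\mathbb{R}^2$ (convex but not strongly convex, with $f_* = 0$), the name `minimize_via_regularization`, and the argument `radius_sq`, which stands in for $\|x_0 - x_*\|_2^2$ and is generally not known in practice. The regularized problem is solved with the momentum form of AGD.

```python
import math

# Hypothetical convex but NOT strongly convex test function on R^2: f(x) = 0.5 * x_2^2.
D = [0.0, 1.0]
L_SMOOTH = 1.0

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad_f(x):
    return [d * xi for d, xi in zip(D, x)]

def minimize_via_regularization(x0, radius_sq, eps):
    """Run momentum AGD on g(x) = f(x) + (lam / 2) * ||x - x0||^2 with lam = eps / radius_sq.

    radius_sq is an assumed upper bound on ||x0 - x*||^2.
    """
    lam = eps / radius_sq
    L_g = L_SMOOTH + lam  # g is (L + lam)-smooth and lam-strongly convex
    sqrt_kappa = math.sqrt(L_g / lam)
    momentum = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
    # Enough iterations that exp(-k / sqrt_kappa) * L * radius_sq <= eps / 2.
    iters = math.ceil(sqrt_kappa * math.log(2.0 * L_SMOOTH * radius_sq / eps))

    def grad_g(x):
        return [gi + lam * (xi - x0i) for gi, xi, x0i in zip(grad_f(x), x, x0)]

    x_prev = list(x0)
    x = [xi - gi / L_g for xi, gi in zip(x0, grad_g(x0))]
    for _ in range(iters - 1):
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, x = x, [yi - gi / L_g for yi, gi in zip(y, grad_g(y))]
    return x, iters

eps = 1e-2
# The nearest minimizer to x0 = (3, 4) is (3, 0), so ||x0 - x*||^2 = 16.
x_out, iters = minimize_via_regularization([3.0, 4.0], radius_sq=16.0, eps=eps)
print(iters, f(x_out))  # f* = 0 here, so f(x_out) is the optimality gap
```

The iteration count uses $g(x_0) - g_* \le \frac{L}{2}\|x_0 - x_*\|_2^2$ together with the potential-function bound, so it scales like $\sqrt{L\|x_0 - x_*\|_2^2/\epsilon}$ times a logarithm, as on the slide.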
