
Global rates of convergence of algorithms for nonconvex smooth optimization

Coralia Cartis (University of Oxford)

joint with

Nick Gould (RAL, UK) & Philippe Toint (Namur, Belgium)

Katya Scheinberg (Lehigh, USA)

ICML Workshop on Optimization Methods for the Next Generation of Machine Learning
ICML, New York City, June 23–24, 2016



Unconstrained optimization — a “mature” area?

Nonconvex local unconstrained optimization:

minimize_{x ∈ IR^n} f(x),   where f ∈ C1(IR^n) or C2(IR^n).

Currently two main competing methodologies:
Linesearch methods
Trust-region methods
to globalize gradient and (approximate) Newton steps.
Much reliable, efficient software for (large-scale) problems.

Is there anything more to say?...

Global rates of convergence of optimization algorithms
⇐⇒ Evaluation complexity of methods (from any initial guess)
[well studied for convex problems, but largely unstudied for nonconvex problems until recently]


Evaluation complexity of unconstrained optimization

Relevant analyses of iterative optimization algorithms:

Global convergence to first/second-order critical points (from any initial guess)

Local convergence and local rates (sufficiently close initial guess, well-behaved minimizer)

[Newton’s method: Q-quadratic; steepest descent: linear]

Global rates of convergence (from any initial guess) ⇐⇒ worst-case function evaluation complexity

evaluations are often expensive in practice (climate modelling, molecular simulations, etc.)
black-box/oracle computational model (suitable for the different ‘shapes and sizes’ of nonlinear problems)

[Nemirovskii & Yudin (’83); Vavasis (’92), Sikorski (’01), Nesterov (’04)]


Overview

Evaluation complexity of standard methods

Improved complexity for cubic regularization

Regularization - and other methods - with only occasionally accurate information



Global efficiency of steepest-descent methods

Steepest descent method (with linesearch or trust-region):

f ∈ C1(IRn) with Lipschitz continuous gradient.

to generate a gradient with ‖g(x_k)‖ ≤ ǫ, it requires at most

⌈ κ_sd · Lips_g · (f(x_0) − f_low) · ǫ^(−2) ⌉

function evaluations.   [Nesterov (’04); Gratton, Sartenaer & Toint (’08)]
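
As an illustration of the kind of method this bound covers, here is a minimal sketch of steepest descent with a backtracking Armijo linesearch that counts function evaluations until the gradient tolerance is met (the test function and parameters are illustrative, not the worst-case example of the slides):

    # Illustrative steepest descent with backtracking Armijo linesearch,
    # counting function evaluations until ||g(x_k)|| <= eps.
    import numpy as np

    def steepest_descent(f, grad, x0, eps=1e-4, c1=1e-4, alpha0=1.0, max_iter=100000):
        x, f_evals = x0.copy(), 0
        for k in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:
                return x, k, f_evals
            fx = f(x); f_evals += 1
            alpha = alpha0
            # Backtrack until the Armijo sufficient-decrease condition holds.
            while f(x - alpha * g) > fx - c1 * alpha * g.dot(g):
                f_evals += 1
                alpha *= 0.5
            f_evals += 1
            x = x - alpha * g
        return x, max_iter, f_evals

    # Example on a smooth nonconvex function (illustrative choice).
    f = lambda x: x[0]**2 + 3 * np.cos(x[1])
    grad = lambda x: np.array([2 * x[0], -3 * np.sin(x[1])])
    print(steepest_descent(f, grad, np.array([2.0, 1.0])))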

The worst-case bound is sharp for steepest descent: [CGT(’10)]

For any ǫ > 0 and τ > 0, (inexact-linesearch) steepest descent applied to this f takes precisely ⌈ ǫ^(−2+τ) ⌉ function evaluations to generate |g(x_k)| ≤ ǫ.

[Figure: plots of the worst-case example’s objective function and gradient.]


Worst-case bound is sharp for steepest descent

Steepest descent method with exact linesearch

x_{k+1} = x_k − α_k g(x_k)   with   α_k = argmin_{α ≥ 0} f(x_k − α g(x_k))

takes ⌈ ǫ^(−2+τ) ⌉ iterations to generate ‖g(x_k)‖ ≤ ǫ.

[Figure: contour lines of f(x_1, x_2) and the path of iterates.]


Global efficiency of Newton’s method

Newton’s method: x_{k+1} = x_k − H_k^(−1) g_k, with H_k ≻ 0.

Newton’s method: as slow as steepest descent

may require ⌈ ǫ^(−2+τ) ⌉ evaluations/iterations, the same as the steepest descent method
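
For reference, a minimal sketch of the (pure) Newton iteration discussed here, with a simple positive-definiteness check; the steepest-descent fallback and the test function are assumptions for illustration, not the slides’ globalization:

    # Illustrative Newton iteration with a Cholesky-based positive-definiteness
    # check; the steepest-descent fallback is an assumption, not the slides'
    # trust-region/linesearch globalization.
    import numpy as np

    def newton(grad, hess, x0, eps=1e-8, max_iter=100):
        x = x0.copy()
        for k in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:
                return x, k
            H = hess(x)
            try:
                np.linalg.cholesky(H)          # succeeds iff H is positive definite
                s = -np.linalg.solve(H, g)     # Newton step
            except np.linalg.LinAlgError:
                s = -g                         # simple safeguard: gradient step
            x = x + s
        return x, max_iter

    # Example: f(x) = 0.5*||x||^2 + cos(x1)  (illustrative).
    grad = lambda x: x + np.array([-np.sin(x[0]), 0.0])
    hess = lambda x: np.eye(2) + np.diag([-np.cos(x[0]), 0.0])
    print(newton(grad, hess, np.array([1.0, 1.0])))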

[Figure: worst-case example for Newton’s method, with globally Lipschitz continuous gradient and Hessian.]


Worst-case bound for Newton’s method

When globalized with a trust-region or a linesearch, Newton’s method will take at most

⌈ κ_N · ǫ^(−2) ⌉

evaluations to generate ‖g_k‖ ≤ ǫ

similar worst-case complexity for classical trust-region and linesearch methods

Is there any method with better evaluation complexity than steepest descent?


Improved complexity for cubic regularization



Improved complexity for cubic regularization

A cubic model: [Griewank (’81, TR), Nesterov & Polyak (’06), Weiser et al (’07)]

H is globally Lipschitz continuous with Lipschitz constant 2σ: Taylor, Cauchy-Schwarz and Lipschitz =⇒

f(x_k + s) ≤ f(x_k) + s^T g(x_k) + (1/2) s^T H(x_k) s + (σ/3)‖s‖_2^3 =: m_k(s)

=⇒ reducing m_k from s = 0 decreases f, since m_k(0) = f(x_k).

Cubic regularization method: [Nesterov & Polyak (’06)]

x_{k+1} = x_k + s_k

compute s_k −→ min_s m_k(s) globally:   [tractable, even if m_k is nonconvex!]

Worst-case evaluation complexity: at most ⌈ κ_cr · ǫ^(−3/2) ⌉ function evaluations to ensure ‖g(x_k)‖ ≤ ǫ.   [Nesterov & Polyak (’06)]

Can we make cubic regularization computationally efficient?
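
To make the cubic model concrete, a small Python sketch (my illustration under the stated Lipschitz assumption, not code from the talk) that builds m_k and numerically checks the overestimation property f(x_k + s) ≤ m_k(s):

    # Illustrative check of the cubic overestimation bound f(x+s) <= m(s),
    # valid when 2*sigma is at least the Hessian's (local) Lipschitz constant.
    import numpy as np

    def cubic_model(fx, g, H, sigma):
        # m(s) = f(x) + s'g + 0.5 s'Hs + (sigma/3)||s||^3
        return lambda s: fx + s @ g + 0.5 * s @ H @ s + (sigma / 3) * np.linalg.norm(s)**3

    # Test function f(x) = x1^4 + x2^2 (Hessian Lipschitz on bounded sets).
    f = lambda x: x[0]**4 + x[1]**2
    g = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
    H = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])

    x = np.array([1.0, 1.0])
    sigma = 20.0                     # assumed large enough near x
    m = cubic_model(f(x), g(x), H(x), sigma)

    rng = np.random.default_rng(0)
    steps = rng.uniform(-0.5, 0.5, size=(1000, 2))
    print(all(f(x + s) <= m(s) + 1e-12 for s in steps))   # expect True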



Adaptive cubic regularization – a practical method

Use [C, Gould & Toint (CGT): Math Programming (2011) ]

cubic regularization model at xk

mk(s) ≡ f(xk) + sT g(xk) + 12sTBks + 1

3σk‖s‖

3

σk > 0 is the iteration-dependent regularization weight

Bk is an approximate Hessian

compute s_k ≈ argmin_s m_k(s)   [details to follow]

compute ρ_k = [f(x_k) − f(x_k + s_k)] / [f(x_k) − m_k(s_k)]

set x_{k+1} = x_k + s_k if ρ_k > η = 0.1, and x_{k+1} = x_k otherwise

set σ_{k+1} = σ_k/γ = 2σ_k when ρ_k < η; otherwise σ_{k+1} = max{γσ_k, σ_min}
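
A compact Python sketch of this adaptive loop (illustrative; the model minimization below is delegated to a generic local solver, which is an assumption rather than ARC’s dedicated subproblem solver):

    # Illustrative ARC-style outer loop following the update rules above.
    # The model minimization uses a generic local solver (an assumption);
    # ARC itself uses dedicated (subspace) subproblem solvers.
    import numpy as np
    from scipy.optimize import minimize

    def arc_sketch(f, grad, hess, x0, eps=1e-5, sigma0=1.0, sigma_min=1e-8,
                   eta=0.1, gamma=0.5, max_iter=500):
        x, sigma = x0.copy(), sigma0
        for k in range(max_iter):
            g, B = grad(x), hess(x)
            if np.linalg.norm(g) <= eps:
                return x, k
            fx = f(x)
            m = lambda s: fx + s @ g + 0.5 * s @ B @ s + (sigma / 3) * np.linalg.norm(s)**3
            # Short gradient step on which the model is already below m(0) = f(x).
            s0 = -g / (np.linalg.norm(B) + sigma + np.linalg.norm(g))
            s = minimize(m, s0).x              # approximate model minimizer
            if m(s) >= m(s0):
                s = s0                         # keep the guaranteed-decrease step
            rho = (fx - f(x + s)) / (fx - m(s))
            if rho > eta:                      # successful: accept step, relax sigma
                x, sigma = x + s, max(gamma * sigma, sigma_min)
            else:                              # unsuccessful: reject step, increase sigma
                sigma = sigma / gamma
        return x, max_iter

    # Example on the Rosenbrock function (illustrative).
    from scipy.optimize import rosen, rosen_der, rosen_hess
    print(arc_sketch(rosen, rosen_der, rosen_hess, np.array([-1.2, 1.0])))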



Adaptive Regularization with Cubics (ARC)

ARC: s_k = global min of m_k(s) over s ∈ S ≤ IR^n, with g ∈ S

−→ increase the subspace to satisfy the termination criterion: ‖∇_s m_k(s_k)‖ ≤ min(1, ‖s_k‖)·‖g_k‖

ARC has excellent convergence properties: globally, to second-order critical points, and locally, Q-quadratically.

‘Average-case’ performance of ARC variants (preliminary numerics)

[Figure: performance profile of iteration counts on 131 CUTEr problems; y-axis: fraction of problems for which a method is within α of the best. Curves: ARC with the g stopping rule (3 failures), ARC with the s stopping rule (3 failures), ARC with the s/σ stopping rule (3 failures), and trust-region (8 failures).]
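
For readers unfamiliar with performance profiles of the kind plotted above, here is a small sketch of how such curves are computed from per-problem iteration counts; the data below is made up purely for illustration and has nothing to do with the CUTEr results:

    # Sketch of a Dolan-More-style performance profile computed from
    # iteration counts; the counts below are made-up illustrative data.
    import numpy as np

    counts = {                            # solver -> iterations per problem (inf = failure)
        "ARC":          np.array([12, 30, 25, np.inf, 40]),
        "trust-region": np.array([20, 28, 60, 55, np.inf]),
    }

    def performance_profile(counts, alphas):
        best = np.min(np.vstack(list(counts.values())), axis=0)   # best count per problem
        profiles = {}
        for name, c in counts.items():
            ratios = c / best                                      # performance ratios
            profiles[name] = [(ratios <= a).mean() for a in alphas]
        return profiles

    alphas = np.linspace(1, 5, 9)
    for name, frac in performance_profile(counts, alphas).items():
        print(name, np.round(frac, 2))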



Worst-case performance of ARC

If H is Lipschitz continuous on the iterates’ path and ‖(B_k − H_k)s_k‖ = O(‖s_k‖^2) (∗), then ARC requires at most

⌈ κ_arc · L_H^(3/2) · (f(x_0) − f_low) · ǫ^(−3/2) ⌉   function evaluations

to ensure ‖g_k‖ ≤ ǫ.   [cf. Nesterov & Polyak]
(∗) achievable when B_k = H_k or when B_k is computed by gradient finite differences

Key ingredients:

sufficient function decrease:  f(x_k) − f(x_{k+1}) ≥ (η_1/6)·σ_k‖s_k‖^3
[local, approximate model minimization is sufficient here]

long successful steps:  ‖s_k‖ ≥ C‖g_{k+1}‖^(1/2)  (and σ_k ≥ σ_min > 0)

=⇒ while ‖g_k‖ ≥ ǫ and k is successful,  f(x_k) − f(x_{k+1}) ≥ (η_1/6)·σ_min·C^3·ǫ^(3/2)

summing up over the successful iterations k:  f(x_0) − f_low ≥ k_S·(η_1/6)·σ_min·C^3·ǫ^(3/2),  so the number k_S of successful iterations is O(ǫ^(−3/2)).


Cubic regularization: worst-case bound is optimal

Sharpness: for any ǫ > 0 and τ > 0, to generate |g(x_k)| ≤ ǫ, cubic regularization/ARC applied to this f takes precisely ⌈ ǫ^(−3/2+τ) ⌉ function evaluations

[Figure: the worst-case example’s objective function, gradient, second derivative and third derivative.]

ARC’s worst-case bound is optimal within a large class of second-order methods for f with Lipschitz continuous H. [CGT’11]


Second-order optimality complexity bounds

O(ǫ^(−3)) evaluations for ARC and trust-region to ensure both ‖g_k‖ ≤ ǫ and λ_min(B_k) ≥ −ǫ. [CGT’12]

This bound is tight for each method.
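
A small sketch of the approximate second-order criticality test used in this bound (the example point and tolerance are illustrative):

    # Illustrative check of approximate second-order criticality:
    # ||g(x)|| <= eps  and  lambda_min(H(x)) >= -eps.
    import numpy as np

    def is_second_order_critical(g, H, eps):
        return np.linalg.norm(g) <= eps and np.linalg.eigvalsh(H).min() >= -eps

    # Example at a saddle point of f(x) = x1^2 - x2^2 (illustrative):
    g = np.zeros(2)
    H = np.diag([2.0, -2.0])
    print(is_second_order_critical(g, H, eps=1e-3))   # False: negative curvature remains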

[Figure: the gradient g and the Hessian H of the worst-case example.]


Regularization methods with only occasionally accurate models


Probabilistic local models and methods

Context/purpose: f ∈ C1 or f ∈ C2, but derivatives are inaccurate/impossible/expensive to compute.

Use model-based derivative-free optimization algorithms

Models may be “good”/“sufficiently accurate” only with a certain probability, for example:
−→ models based on random sampling of function values (within a ball)
−→ finite-difference schemes in parallel, with the total probability of any processor failing less than 0.5

Consider cubic regularization local models with approximate first and second derivatives.

Expected number of iterations to generate sufficiently small ‘true’ gradients?
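
As a toy illustration of a model that is ‘sufficiently accurate’ only with a certain probability (this construction is mine, not from the slides): a gradient oracle that returns the exact gradient with probability p and a corrupted one otherwise:

    # Toy model of a probabilistically accurate gradient oracle: with
    # probability p it returns the true gradient, otherwise a corrupted one.
    # The construction and parameters are purely illustrative.
    import numpy as np

    def probabilistic_gradient(grad, x, p, rng):
        g = grad(x)
        if rng.random() < p:
            return g                                        # "true" iteration
        return g + rng.normal(scale=10.0, size=g.shape)     # "false" iteration

    grad = lambda x: 2 * x                                  # gradient of ||x||^2
    rng = np.random.default_rng(0)
    x = np.array([1.0, -2.0])
    samples = [probabilistic_gradient(grad, x, p=0.8, rng=rng) for _ in range(5)]
    print(np.round(samples, 2))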



Probabilistic ARC

In the ARC framework, each (realization of) the cubic regularization model [C & Scheinberg, 2015]

m_k(s) = f(x_k) + s^T g_k + (1/2) s^T b_k s + (σ_k/3)‖s‖^3

has g_k ≈ ∇f(x_k) and b_k ≈ H(x_k).

Random model/variable M_k −→ realization m_k(ω_k); random variables X_k, S_k, Σ_k −→ realizations x_k, s_k, σ_k

{M_k} is (p)-probabilistically ‘sufficiently accurate’ for P-ARC if, for {Σ_k, X_k}, the events

I_k = { ‖∇f(X_k) − G_k‖ ≤ κ_g‖S_k‖^2  and  ‖(H(X_k) − B_k)S_k‖ ≤ κ_H‖S_k‖^2 }

hold with probability at least p (conditioned on the past).
I_k occurs −→ iteration k is true; otherwise, it is false.
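
A literal translation of the accuracy event I_k into code, assuming the true derivatives are available for checking (which they would not be in practice; this is only to make the definition concrete, and all values are illustrative):

    # Concrete form of the accuracy event I_k, assuming access to the true
    # gradient and Hessian purely in order to check the definition.
    import numpy as np

    def is_true_iteration(grad_true, g_k, hess_true, B_k, s_k, kappa_g, kappa_H):
        s_norm2 = np.linalg.norm(s_k)**2
        grad_ok = np.linalg.norm(grad_true - g_k) <= kappa_g * s_norm2
        hess_ok = np.linalg.norm((hess_true - B_k) @ s_k) <= kappa_H * s_norm2
        return grad_ok and hess_ok

    # Illustrative data:
    grad_true, g_k = np.array([1.0, 0.0]), np.array([1.01, 0.0])
    hess_true, B_k = np.eye(2), 1.05 * np.eye(2)
    s_k = np.array([0.5, 0.5])
    print(is_true_iteration(grad_true, g_k, hess_true, B_k, s_k, kappa_g=0.1, kappa_H=0.2))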


Probabilistic ARC - complexity

If ∇f and H are Lipschitz continuous, then the expected number of iterations that P-ARC takes until ‖∇f(x_k)‖ ≤ ǫ satisfies

E(N_ǫ) ≤ [1/(2p − 1)] · κ_p-arc · (f(x_0) − f_low) · ǫ^(−3/2),

provided the probability of sufficiently accurate models is p > 1/2.

Analysis

Four types of iterations (successful, unsuccessful, true and false)

Analysis of the joint stochastic process {Σ_k, f(X_0) − f(X_k)}



Probabilistic ARC - analysis

Let N_ǫ be the hitting time for ‖∇f(X_k)‖ ≤ ǫ

Measure of progress towards optimality: F_k = f(X_0) − f(X_k)

As F_{k+1} ≥ F_k and F_k ≤ F_∗ = f(X_0) − f_low:  E(N_ǫ) ≤ E(T^{F_k}_{F_∗}).

If k is a true and successful iteration, then
F_{k+1} ≥ F_k + κ/(max{σ_k, σ_c})^(3/2) · ‖∇f(x_{k+1})‖^(3/2)   and   σ_{k+1} = max{γσ_k, σ_min}

If σ_k ≥ σ_c and iteration k is true, then it is also successful.

Split the iterations into K′ = {k : σ_k > σ_c} and K′′ = {k : σ_k ≤ σ_c}; analyze the joint stochastic process {Σ_k, F_k} for k ∈ K′ and k ∈ K′′.

Over K′: σ_k is a random walk (goes ’up’ w.p. 1 − p; ’down’ w.p. p). Hence σ_k = σ_c on average every 1/(2p − 1) iterations.

F_k increases by κ·(ǫ/σ_c)^(3/2) on average every 1/(2p − 1) iterations.
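
A quick sanity check of the random-walk picture above via simulation (a toy reflected walk, not the actual {Σ_k} process of P-ARC): for down-probability p > 1/2, the average spacing between visits to the floor state should not exceed 1/(2p − 1):

    # Toy simulation of the biased random walk used in the argument above:
    # a walk on {0, 1, 2, ...} reflected at 0, moving down w.p. p and up
    # w.p. 1 - p.  For p > 1/2, the average spacing between visits to 0
    # should not exceed 1/(2p - 1).  This is a sanity check only, not the
    # actual {Sigma_k} process of P-ARC.
    import numpy as np

    def mean_visit_spacing(p, n_steps, seed=0):
        rng = np.random.default_rng(seed)
        state, visits = 0, 0
        for _ in range(n_steps):
            if rng.random() < p:
                state = max(state - 1, 0)   # move down (reflected at 0)
            else:
                state += 1                  # move up
            visits += (state == 0)
        return n_steps / visits

    p = 0.7
    print(mean_visit_spacing(p, n_steps=200000), 1 / (2 * p - 1))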


Linesearch methods with occasionally accurate models


A probabilistic linesearch method

Initialization: Choose parameters γ, η ∈ (0, 1). At iteration k, do:

(Model and step calculation) Compute a random model m_k(s) = f(x_k) + s^T g_k and use it to generate the direction g_k. Set s_k = −α_k g_k.

(Sufficient decrease) Check if ρ_k = [f(x_k) − f(x_k + s_k)] / [f(x_k) − m_k(s_k)] ≥ η

[this is equivalent to the Armijo condition]

(Successful step) If ρ_k ≥ η, set x_{k+1} = x_k + s_k and α_{k+1} = min{γ^(−1)α_k, α_max}.

(Unsuccessful step) Else, set x_{k+1} = x_k and α_{k+1} = γα_k.

More general models mk and directions dk possible.
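
A compact sketch of this probabilistic linesearch loop, with the inexact gradient supplied by an external (possibly random) oracle; the stopping rule, noise model and parameter values are my illustrative choices:

    # Illustrative probabilistic linesearch (P-LS) loop following the steps above;
    # g_oracle returns a (possibly inaccurate, random) gradient estimate.
    import numpy as np

    def probabilistic_linesearch(f, g_oracle, x0, eps=1e-4, alpha0=1.0,
                                 alpha_max=10.0, gamma=0.5, eta=0.1, max_iter=10000):
        x, alpha = x0.copy(), alpha0
        for k in range(max_iter):
            g = g_oracle(x)                      # random model m_k(s) = f(x) + s'g
            if np.linalg.norm(g) <= eps:
                return x, k
            s = -alpha * g
            pred = alpha * (g @ g)               # model decrease f(x) - m_k(s)
            rho = (f(x) - f(x + s)) / pred
            if rho >= eta:                       # successful: accept and lengthen
                x, alpha = x + s, min(alpha / gamma, alpha_max)
            else:                                # unsuccessful: reject and shorten
                alpha = gamma * alpha
        return x, max_iter

    # Example: exact gradient, corrupted with noise 20% of the time (illustrative).
    rng = np.random.default_rng(1)
    f = lambda x: 0.5 * x @ x
    g_oracle = lambda x: x + (rng.normal(size=2) * 0.1 if rng.random() < 0.2 else 0.0)
    print(probabilistic_linesearch(f, g_oracle, np.array([3.0, -4.0])))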


A Probabilistic LineSearch (P-LS) method

The model {M_k} is (p)-probabilistically ‘sufficiently accurate’ for P-LS if, for the corresponding {A_k, X_k}, the events

I_k = { ‖∇f(X_k) − G_k‖ ≤ κ_g A_k ‖G_k‖ }

hold with probability at least p (conditioned on the past).
I_k occurs −→ iteration k is true; otherwise, it is false.

Complexity: If ∇f is Lipschitz continuous, then the expected number of iterations that P-LS takes until ‖∇f(x_k)‖ ≤ ǫ satisfies

E(N_ǫ) ≤ [1/(2p − 1)] · κ_p-ls · (f(x_0) − f_low) · ǫ^(−2),

provided the probability of sufficiently accurate models is p > 1/2.


P-LS method - complexity for special cases

f convex with bounded level sets: the expected number of iterations that P-LS takes until f(x_k) − f_low ≤ ǫ is

E(N_ǫ) ≤ [1/(2p − 1)] · κ_p-ls-c · D^2 · ǫ^(−1).

measure of progress: F_k = 1/(f(X_k) − f_low);  E(N_ǫ) = E(T^{F_k}_{ǫ^(−1)}).

f strongly convex: the expected number of iterations that P-LS takes until f(x_k) − f_low ≤ ǫ is

E(N_ǫ) ≤ [1/(2p − 1)] · κ_p-ls-cc · (L/µ) · log(ǫ^(−1)).

measure of progress: F_k = log(1/(f(X_k) − f_low));  E(N_ǫ) = E(T^{F_k}_{F_ǫ}), where F_ǫ = log(1/ǫ).


A generic algorithmic framework

Initialization: Choose a class of (possibly random) models m_k and parameters γ, η ∈ (0, 1). At iteration k, do:

(Model and step calculation) Compute m_k of f around x_k, and s_k = s_k(α_k) to reduce m_k(s).

(Sufficient decrease) Check if

ρ_k = [f(x_k) − f(x_k + s_k)] / [f(x_k) − m_k(s_k)] ≥ η

(Successful step) If ρ_k ≥ η, set x_{k+1} = x_k + s_k and α_{k+1} = min{γ^(−1)α_k, α_max}.

(Unsuccessful step) Else, set x_{k+1} = x_k and α_{k+1} = γα_k.

Examples: linesearch methods (s_k = α_k d_k); adaptive regularization (α_k = 1/σ_k); trust-region.


A generic algorithmic framework...

{M_k} is (p)-probabilistically ‘sufficiently accurate’ for P-Alg:
I_k = { M_k ‘sufficiently accurate’ | A_k and X_k } holds with probability p.
I_k occurs −→ iteration k is true; otherwise, it is false.

Assumption: the P-Alg construction and the probabilistic accuracy of M_k must ensure that there exists C > 0 such that if α_k ≤ C and iteration k is true, then k is also successful. Hence α_{k+1} = min{γ^(−1)α_k, α_max} and F_{k+1} ≥ F_k + h(α_k).

Result: For P-Alg with (p)-probabilistically accurate models, the expected number of iterations to reach the desired accuracy can be bounded as follows:

E(N_ǫ) ≤ [1/(2p − 1)] · κ_p-alg · F_ǫ / h(C),

where p > 1/2 and F_ǫ ≥ F_k is the total achievable function decrease.


Generating (p)-sufficiently accurate models

Stochastic gradient and batch sampling [Byrd et al, 2012]

‖∇f_{S_k}(x_k) − ∇f(x_k)‖ ≤ µ‖∇f_{S_k}(x_k)‖

with µ ∈ (0, 1) and fixed, sufficiently small α_k.

Models formed by sampling of function values in a ball B(x_k, ∆_k) (model-based DFO):

M_k is a (p)-fully linear model if the event
I^l_k = { ‖∇f(X_k) − G_k‖ ≤ κ_g ∆_k }
holds at least w.p. p (conditioned on the past).

M_k is a (p)-fully quadratic model if the event
I^q_k = { ‖∇f(X_k) − G_k‖ ≤ κ_g ∆_k^2  and  ‖H(X_k) − B_k‖ ≤ κ_H ∆_k }
holds at least w.p. p (conditioned on the past).
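
A small sketch of the batch-sampled gradient and the norm test above, on a synthetic finite-sum least-squares problem where the full gradient is cheap enough to check the condition directly; the problem, batch size and µ are illustrative choices:

    # Illustrative batch-sampled gradient for a finite-sum least-squares problem,
    # together with the norm test ||grad_S - grad|| <= mu * ||grad_S||.
    # The problem, batch size and mu are illustrative choices.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5
    A, b = rng.normal(size=(n, d)), rng.normal(size=n)

    def full_gradient(x):               # gradient of (1/2n) ||Ax - b||^2
        return A.T @ (A @ x - b) / n

    def batch_gradient(x, batch):
        return A[batch].T @ (A[batch] @ x - b[batch]) / len(batch)

    x = rng.normal(size=d)
    batch = rng.choice(n, size=200, replace=False)
    g_S, g = batch_gradient(x, batch), full_gradient(x)

    mu = 0.5
    print(np.linalg.norm(g_S - g) <= mu * np.linalg.norm(g_S))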


Conclusions and future directions

Some results not covered here (existing or in progress):

high-order adaptive regularization methods: [Birgin et al. (’15)]

m_k(s) = T_{p−1}(x_k, s) + (σ_k/p)‖s‖^p

where T_{p−1}(x_k, s) is the (p−1)st-order Taylor polynomial of f at x_k.
Complexity: O(ǫ^(−p/(p−1))) evaluations to ensure ‖g(x_k)‖ ≤ ǫ   [approximate, local model minimization]
(p = 2 recovers the ǫ^(−2) steepest-descent-type bound; p = 3 recovers the ǫ^(−3/2) cubic regularization bound.)
Complexity of pth-order criticality? [in progress]

Complexity of constrained optimization (with convex or nonconvex constraints): for carefully devised methods, it is the same as for the unconstrained case [CGT (’12,’16)]

Optimization with occasionally accurate models:
second-order criticality (in progress)
stochastic function values - trust-region approach [arXiv, with Blanchet, Menickelly, Scheinberg ’16]
