Algorithms for Nonsmooth Optimization

Frank E. Curtis, Lehigh University

presented at the Center for Optimization and Statistical Learning, Northwestern University

2 March 2018

Outline

- Motivating Examples
- Subdifferential Theory
- Fundamental Algorithms
- Nonconvex Nonsmooth Functions
- General Framework



Nonsmooth optimization

In mathematical optimization, one wants to

- minimize an objective
- subject to constraints,

i.e., solve $\min_{x \in X} f(x)$.

Why nonsmooth optimization? Nonsmoothness can arise for different reasons:

- physical (phenomena can be nonsmooth): phase changes in materials
- technological (constraints impose nonsmoothness): obstacles in shape design
- methodological (nonsmoothness introduced by the solution method): decompositions; penalty formulations
- numerical (analytically smooth, but practically nonsmooth): "stiff" problems

(Bagirov, Karmitsa, Mäkelä (2014))

Data fitting

$$\min_{x \in \mathbb{R}^n} \ \theta(x) + \psi(x), \quad \text{where, e.g., } \theta(x) = \|Ax - b\|_2^2 \ \text{ and } \ \psi(x) = \sum_{i=1}^n \phi(x_i)$$

with

$$\phi_1(t) = \frac{\alpha|t|}{1 + \alpha|t|}, \qquad \phi_2(t) = \log(\alpha|t| + 1), \qquad \phi_3(t) = |t|^q, \quad \text{or} \quad \phi_4(t) = \frac{\alpha - (\alpha - t)_+^2}{\alpha}.$$
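To make these penalties concrete, here is a minimal NumPy sketch of the penalized least-squares objective; the parameter defaults are illustrative assumptions, and phi4 implements the reconstructed formula above, which should be checked against the original slides.

```python
import numpy as np

# Nonsmooth sparsity-inducing penalties from the data-fitting slide.
# alpha > 0 and 0 < q <= 1 are illustrative assumptions.

def phi1(t, alpha=1.0):
    return alpha * np.abs(t) / (1.0 + alpha * np.abs(t))

def phi2(t, alpha=1.0):
    return np.log(alpha * np.abs(t) + 1.0)

def phi3(t, q=0.5):
    return np.abs(t) ** q

def phi4(t, alpha=1.0):
    # Reconstructed reading of the slide's formula; treat as an assumption.
    return (alpha - np.maximum(alpha - t, 0.0) ** 2) / alpha

def objective(x, A, b, phi=phi1):
    # theta(x) + psi(x) with theta the least-squares misfit
    return np.sum((A @ x - b) ** 2) + np.sum(phi(x))
```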

Clusterwise linear regression (CLR)

Given a dataset of pairs $A := \{(a_i, b_i)\}_{i=1}^l$, the goal of CLR is to simultaneously

- partition the dataset into $k$ disjoint clusters, and
- find regression coefficients $\{(x_j, y_j)\}_{j=1}^k$ for each cluster

in order to minimize the overall error in the fit; e.g.,

$$\min_{\{(x_j, y_j)\}} f_k(\{x_j, y_j\}), \quad \text{where } f_k(\{x_j, y_j\}) = \sum_{i=1}^l \min_{j \in \{1,\dots,k\}} |x_j^T a_i - y_j - b_i|^p.$$

This objective is nonconvex (though it is a difference of convex functions).
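A minimal sketch of evaluating $f_k$; the array shapes (data $a \in \mathbb{R}^{l \times n}$, $b \in \mathbb{R}^l$, coefficients $X \in \mathbb{R}^{k \times n}$, $y \in \mathbb{R}^k$) are assumptions for illustration.

```python
import numpy as np

def clr_objective(X, y, a, b, p=1):
    # residuals[i, j] = x_j^T a_i - y_j - b_i; each point is charged
    # only for its best-fitting (smallest-error) cluster.
    residuals = a @ X.T - y[None, :] - b[:, None]
    return np.sum(np.min(np.abs(residuals) ** p, axis=1))
```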


Decomposition

Various types of decomposition strategies introduce nonsmoothness.

- Primal decomposition can be used for
  $$\min_{(x_1, x_2, y)} \ f_1(x_1, y) + f_2(x_2, y),$$
  where $y$ is the complicating/linking variable; equivalent to
  $$\min_y \ \phi_1(y) + \phi_2(y), \quad \text{where } \phi_1(y) := \min_{x_1} f_1(x_1, y) \ \text{ and } \ \phi_2(y) := \min_{x_2} f_2(x_2, y).$$
  This master problem may be nonsmooth in $y$.
- Dual decomposition can be used for the same problem, reformulated as
  $$\min_{(x_1, x_2, y_1, y_2)} \ f_1(x_1, y_1) + f_2(x_2, y_2) \quad \text{s.t.} \quad y_1 = y_2.$$
  The Lagrangian is separable, meaning the dual function decomposes:
  $$g_1(\lambda) = \inf_{(x_1, y_1)} \ (f_1(x_1, y_1) + \lambda^T y_1), \qquad g_2(\lambda) = \inf_{(x_2, y_2)} \ (f_2(x_2, y_2) - \lambda^T y_2).$$
  The dual problem, to maximize $g(\lambda) = g_1(\lambda) + g_2(\lambda)$, may be nonsmooth in $\lambda$.

Dual decomposition with constraints

Consider the nearly separable problem

$$\min_{(x_1,\dots,x_m)} \ \sum_{i=1}^m f_i(x_i) \quad \text{s.t.} \quad x_i \in X_i \ \text{for all } i \in \{1,\dots,m\}, \quad \sum_{i=1}^m A_i x_i \le b \ \ \text{(e.g., a shared resource constraint)},$$

where the last are complicating/linking constraints; "dualizing" leads to

$$g(\lambda) := \min_{(x_1,\dots,x_m)} \ \sum_{i=1}^m f_i(x_i) + \lambda^T \left( \sum_{i=1}^m A_i x_i - b \right) \quad \text{s.t.} \quad x_i \in X_i \ \text{for all } i \in \{1,\dots,m\}.$$

Given a multiplier vector $\lambda$, the value $g(\lambda)$ comes from solving separable problems; the dual problem

$$\max_{\lambda \ge 0} \ g(\lambda)$$

is typically nonsmooth (and people often use poor algorithms!).
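As an illustration, the following is a hedged sketch of projected dual subgradient ascent on a toy instance with $f_i(x_i) = \tfrac{1}{2}\|x_i - c_i\|_2^2$ and box sets $X_i$ (my own illustrative construction, not from the slides), chosen so that each block subproblem has a closed-form clipped solution; a subgradient of the concave dual $g$ at $\lambda$ is $\sum_i A_i x_i(\lambda) - b$.

```python
import numpy as np

# Each block minimizer is x_i(lambda) = clip(c_i - A_i^T lambda, lo, hi)
# for quadratic f_i and a box X_i; a dual subgradient is then
# sum_i A_i x_i(lambda) - b, and we project onto lambda >= 0.

def dual_subgradient_ascent(A_blocks, c_blocks, b, lo, hi, steps=500):
    lam = np.zeros(b.shape)
    for k in range(1, steps + 1):
        xs = [np.clip(c - A.T @ lam, lo, hi) for A, c in zip(A_blocks, c_blocks)]
        subgrad = sum(A @ x for A, x in zip(A_blocks, xs)) - b
        lam = np.maximum(lam + (1.0 / k) * subgrad, 0.0)  # diminishing steps
    return lam, xs

# Toy instance: two blocks in R^2 sharing one resource constraint.
A_blocks = [np.array([[1.0, 1.0]]), np.array([[1.0, 0.0]])]
c_blocks = [np.array([2.0, 2.0]), np.array([3.0, 1.0])]
lam, xs = dual_subgradient_ascent(A_blocks, c_blocks,
                                  b=np.array([3.0]), lo=0.0, hi=5.0)
```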

Control of dynamical systems

Consider the discrete-time linear dynamical system:

$$y_{k+1} = A y_k + B u_k \quad \text{(state equation)}$$
$$z_k = C y_k \quad \text{(observation equation)}$$

Supposing we want to "design" a control such that $u_k = X C y_k$ (where $X$ is our variable), consider the "closed-loop system" given by

$$y_{k+1} = A y_k + B u_k = A y_k + B X C y_k = (A + BXC) y_k.$$

Common objectives are to minimize a stability measure $\rho(A + BXC)$; such measures are often functions of the eigenvalues of $A + BXC$.
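For instance, the spectral radius gives a nonconvex, nonsmooth objective; a minimal sketch, assuming discrete-time stability is measured by the largest eigenvalue modulus:

```python
import numpy as np

# Spectral-radius objective for static output feedback. It is nonsmooth
# wherever eigenvalue moduli tie or eigenvalues coalesce.

def spectral_radius_objective(X, A, B, C):
    return np.max(np.abs(np.linalg.eigvals(A + B @ X @ C)))
```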

Eigenvalue optimization

[Figure: plots of ordered eigenvalues as a matrix is perturbed along a given direction.]

Other sources of nonsmooth optimization problems

- Lagrangian relaxation
- Composite optimization (e.g., penalty methods for "soft constraints")
- Parametric optimization (e.g., for model predictive control)
- Multilevel optimization


Derivatives

When I teach an optimization class, I always start with the same question:

What is a derivative? ($f : \mathbb{R} \to \mathbb{R}$)

Answer I get: "slope of the tangent line"

[Figure: graph of $f$ with its tangent line at $x$; slope $= f'(x)$.]


Gradients

Then I ask:

What is a gradient? ($f : \mathbb{R}^n \to \mathbb{R}$)

Answer I get: "direction along which the function increases at the fastest rate"


Derivative vs. gradient

So if a derivative is a magnitude (here, a slope), then why does it generalize in multiple dimensions to something that is a direction?

$$(n = 1) \quad f'(x) = \frac{df}{dx}(x) = \frac{\partial f}{\partial x}(x)$$

$$(n \ge 1) \quad \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}$$

What's important: the magnitude or the direction?

Answer: The gradient is a vector in $\mathbb{R}^n$, which

- has magnitude (e.g., its 2-norm),
- can be viewed as a direction,
- and gives us a way to compute directional derivatives.


Differentiable f

How should we think about the gradient?

If $f$ is continuously differentiable (i.e., $f \in C^1$), then $\nabla f(\bar x)$ is the unique vector in the linear (Taylor) approximation of $f$ at $\bar x$.

[Figure: the graph of $f$ and the graph of its linearization $f(\bar x) + \nabla f(\bar x)^T (x - \bar x)$, tangent at $\bar x$. Both are graphs of functions of $x$!]

Differentiable and convex f

If $f \in C^1$ is convex, then

$$f(x) \ge f(\bar x) + \nabla f(\bar x)^T (x - \bar x) \quad \text{for all } (x, \bar x) \in \mathbb{R}^n \times \mathbb{R}^n.$$

[Figure: a convex $f$ lies above its linearization $f(\bar x) + \nabla f(\bar x)^T (x - \bar x)$ at every point.]

Graphs and epigraphs

There is another interpretation of a gradient that is also useful. First...

What is a graph? A set of points in $\mathbb{R}^{n+1}$, namely, $\{(x, z) : f(x) = z\}$.

A related quantity, another set, is the epigraph: $\{(x, z) : f(x) \le z\}$.

[Figure: the graph $\{(x, f(x))\}$ of $f$; the epigraph is the region on and above it.]

Differentiable and convex f

If $f \in C^1$ is convex, then, for all $(x, \bar x) \in \mathbb{R}^n \times \mathbb{R}^n$,

$$f(x) \ge f(\bar x) + \nabla f(\bar x)^T (x - \bar x)$$
$$\iff f(x) - \nabla f(\bar x)^T x \ge f(\bar x) - \nabla f(\bar x)^T \bar x$$
$$\iff \begin{bmatrix} -\nabla f(\bar x) \\ 1 \end{bmatrix}^T \begin{bmatrix} x \\ f(x) \end{bmatrix} \ge \begin{bmatrix} -\nabla f(\bar x) \\ 1 \end{bmatrix}^T \begin{bmatrix} \bar x \\ f(\bar x) \end{bmatrix}$$

Note: Given $\bar x$, the vector $\begin{bmatrix} -\nabla f(\bar x) \\ 1 \end{bmatrix}$ is given, so the inequality above involves a linear function over $\mathbb{R}^{n+1}$ and says that

the value at any point $\begin{bmatrix} x \\ f(x) \end{bmatrix}$ in the graph is at least the value at $\begin{bmatrix} \bar x \\ f(\bar x) \end{bmatrix}$.

Linearization and supporting hyperplane for epigraph

[Figure: the linearization $f(\bar x) + \nabla f(\bar x)^T (x - \bar x)$ supports the epigraph at the point $\begin{bmatrix} \bar x \\ f(\bar x) \end{bmatrix}$, with normal direction $\begin{bmatrix} -\nabla f(\bar x) \\ 1 \end{bmatrix}$.]

Subgradients (convex f)

Why was that useful? We can generalize this idea when the function is not differentiable somewhere.

[Figure: at a kink $\begin{bmatrix} \bar x \\ f(\bar x) \end{bmatrix}$, a supporting hyperplane with normal direction $\begin{bmatrix} -g \\ 1 \end{bmatrix}$.]

A vector $g \in \mathbb{R}^n$ is a subgradient of a convex $f : \mathbb{R}^n \to \mathbb{R}$ at $\bar x \in \mathbb{R}^n$ if

$$f(x) \ge f(\bar x) + g^T (x - \bar x) \iff \begin{bmatrix} -g \\ 1 \end{bmatrix}^T \begin{bmatrix} x \\ f(x) \end{bmatrix} \ge \begin{bmatrix} -g \\ 1 \end{bmatrix}^T \begin{bmatrix} \bar x \\ f(\bar x) \end{bmatrix}$$

Subdifferentials

Theorem
If $f$ is convex and differentiable at $\bar x$, then $\nabla f(\bar x)$ is its unique subgradient at $\bar x$.

But in general, the set of all subgradients for a convex $f$ at $\bar x$ is the subdifferential of $f$ at $\bar x$:

$$\partial f(\bar x) := \{g \in \mathbb{R}^n : g \text{ is a subgradient of } f \text{ at } \bar x\}.$$

From the definition, it is easily seen that

$x^*$ is a minimizer of $f$ if and only if $0 \in \partial f(x^*)$.

(For example, for $f(x) = |x|$ one has $\partial f(0) = [-1, 1]$, which contains $0$, certifying the minimizer at $0$.)

What about nonconvex, nonsmooth?

We need to generalize the idea of a subgradient further:

- Directional derivatives
- Subgradients
- Subdifferentials

Let's return to this after we discuss some algorithms...


A fundamental iteration

Thinking of $-\nabla f(x_k)$, we have a vector that

- directs us in a direction of descent, and
- vanishes as we approach a minimizer.

Algorithm: Gradient Descent
1: Choose an initial point $x_0 \in \mathbb{R}^n$ and stepsize $\alpha \in (0, 1/L]$.
2: for $k = 0, 1, 2, \dots$ do
3:   if $\|\nabla f(x_k)\| \approx 0$, then return $x_k$
4:   else set $x_{k+1} \leftarrow x_k - \alpha \nabla f(x_k)$

I call this a fundamental iteration.

Here, we suppose $\nabla f$ is Lipschitz continuous, i.e., there exists $L \ge 0$ such that

$$\|\nabla f(x) - \nabla f(\bar x)\|_2 \le L \|x - \bar x\|_2 \quad \text{for all } (x, \bar x) \in \mathbb{R}^n \times \mathbb{R}^n$$
$$\implies f(x) \le f(\bar x) + \nabla f(\bar x)^T (x - \bar x) + \tfrac{1}{2} L \|x - \bar x\|_2^2 \quad \text{for all } (x, \bar x) \in \mathbb{R}^n \times \mathbb{R}^n.$$
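A minimal sketch of this fundamental iteration; the quadratic test problem and tolerance are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # "if ||grad f(x_k)|| ~ 0, return x_k"
            break
        x = x - alpha * g              # x_{k+1} <- x_k - alpha * grad f(x_k)
    return x

# Example: f(x) = 0.5 x^T Q x with Q symmetric positive definite,
# so grad f(x) = Q x and L = lambda_max(Q).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.max(np.linalg.eigvalsh(Q))
x_star = gradient_descent(lambda x: Q @ x, x0=[5.0, -3.0], alpha=1.0 / L)
```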


Convergence of gradient descent

[Figure: $f$ together with, at $x_k$, the quadratic upper bound $f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2} L \|x - x_k\|_2^2$; minimizing the bound over $x$ recovers the gradient descent step.]

Gradient descent for f

Theorem
If $\nabla f$ is Lipschitz continuous with constant $L > 0$ and $\alpha \in (0, 1/L]$, then

$$\sum_{j=0}^\infty \|\nabla f(x_j)\|_2^2 < \infty, \quad \text{which implies} \quad \{\nabla f(x_j)\} \to 0.$$

Proof.
Let $k \in \mathbb{N}$ and recall that $x_{k+1} - x_k = -\alpha \nabla f(x_k)$. Then, since $\alpha \in (0, 1/L]$,

$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \tfrac{1}{2} L \|x_{k+1} - x_k\|_2^2$$
$$= f(x_k) - \alpha \|\nabla f(x_k)\|_2^2 + \tfrac{1}{2} \alpha^2 L \|\nabla f(x_k)\|_2^2$$
$$= f(x_k) - \alpha (1 - \tfrac{1}{2} \alpha L) \|\nabla f(x_k)\|_2^2$$
$$\le f(x_k) - \tfrac{1}{2} \alpha \|\nabla f(x_k)\|_2^2.$$

Thus, summing over $j \in \{0, \dots, k\}$, one finds (with $f_{\inf}$ a lower bound on $f$)

$$\infty > f(x_0) - f_{\inf} \ge f(x_0) - f(x_{k+1}) \ge \tfrac{1}{2} \alpha \sum_{j=0}^k \|\nabla f(x_j)\|_2^2. \qquad \blacksquare$$


Strong convexity

Now suppose that $f$ is $c$-strongly convex, which means that

$$f(x) \ge f(\bar x) + \nabla f(\bar x)^T (x - \bar x) + \tfrac{1}{2} c \|x - \bar x\|_2^2 \quad \text{for all } (x, \bar x) \in \mathbb{R}^n \times \mathbb{R}^n.$$

Important consequences of this are that

- $f$ has a unique global minimizer, call it $x^*$ with $f^* := f(x^*)$, and
- the gradient norm grows with the optimality error in that
  $$2c(f(x) - f^*) \le \|\nabla f(x)\|_2^2 \quad \text{for all } x \in \mathbb{R}^n.$$

Strong convexity, lower bound

[Figure: at $x_k$, the function $f$ is sandwiched between the quadratic lower bound $f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2} c \|x - x_k\|_2^2$ and the corresponding quadratic upper bound with $L$ in place of $c$.]

Gradient descent for strongly convex f

Theorem
If $\nabla f$ is Lipschitz with constant $L > 0$, $f$ is $c$-strongly convex, and $\alpha \in (0, 1/L]$, then

$$f(x_{j+1}) - f^* \le (1 - \alpha c)^{j+1} (f(x_0) - f^*) \quad \text{for all } j \in \mathbb{N}.$$

Proof.
Let $k \in \mathbb{N}$. Following the previous proof, one finds

$$f(x_{k+1}) \le f(x_k) - \tfrac{1}{2} \alpha \|\nabla f(x_k)\|_2^2 \le f(x_k) - \alpha c (f(x_k) - f^*).$$

Subtracting $f^*$ from both sides, one finds

$$f(x_{k+1}) - f^* \le (1 - \alpha c)(f(x_k) - f^*).$$

Applying the result repeatedly over $j \in \{0, \dots, k\}$ yields the result. $\blacksquare$
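A hedged numerical check of this linear rate on a strongly convex quadratic, where $c$ and $L$ are the extreme eigenvalues and $f^* = 0$; the test matrix is an illustrative choice.

```python
import numpy as np

# f(x) = 0.5 x^T Q x with Q = diag(1, 10), so c = 1, L = 10, f* = 0.
# The bound (1 - alpha*c)^k (f(x_0) - f*) should dominate f(x_k) - f*.
Q = np.diag([1.0, 10.0])
c, L = 1.0, 10.0
alpha = 1.0 / L
f = lambda x: 0.5 * x @ Q @ x

x = np.array([1.0, 1.0])
gap0 = f(x)
for k in range(1, 51):
    x = x - alpha * (Q @ x)
    assert f(x) <= (1.0 - alpha * c) ** k * gap0 + 1e-12
```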


A fundamental iteration when f is nonsmooth?

What is a fundamental iteration for nonsmooth optimization? Steepest descent!

For convex $f$, the directional derivative of $f$ at $x$ along $s$ is

$$f'(x; s) = \max_{g \in \partial f(x)} g^T s.$$

Along which direction is $f$ decreasing at the fastest rate? The solution of an optimization problem!

$$\min_{\|s\|_2 \le 1} f'(x; s) = \min_{\|s\|_2 \le 1} \max_{g \in \partial f(x)} g^T s$$
$$= \max_{g \in \partial f(x)} \min_{\|s\|_2 \le 1} g^T s \quad \text{(von Neumann minimax theorem)}$$
$$= \max_{g \in \partial f(x)} (-\|g\|_2)$$
$$= - \min_{g \in \partial f(x)} \|g\|_2 \implies \text{(need the minimum-norm subgradient)}$$


Main challenge

But, typically, we can only access some $g \in \partial f(x)$, not all of $\partial f(x)$.

I would argue: there is no practical fundamental iteration for general nonsmooth optimization (no computable descent direction that vanishes near a minimizer).

What are our options? There are a few ways to design a convergent algorithm:

- algorithmically (e.g., subgradient method)
- iteratively (e.g., cutting plane / bundle methods)
- randomly (e.g., gradient sampling)


Subgradient method

Algorithm: Subgradient method (not descent)
1: Choose an initial point $x_0 \in \mathbb{R}^n$.
2: for $k = 0, 1, 2, \dots$ do
3:   if a termination condition is satisfied, then return $x_k$
4:   else compute $g_k \in \partial f(x_k)$, choose $\alpha_k \in \mathbb{R}_{>0}$, and set $x_{k+1} \leftarrow x_k - \alpha_k g_k$
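A minimal sketch of this method with the diminishing stepsizes $\alpha_k = \alpha/k$ discussed below; because the method is not a descent method, tracking the best iterate seen so far is standard practice. The test function is an illustrative choice.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, alpha=1.0, max_iter=5000):
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, max_iter + 1):
        x = x - (alpha / k) * subgrad(x)   # x_{k+1} <- x_k - alpha_k g_k
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example on f(x) = ||x||_1, with subgradient sign(x) (any sign works at 0).
x_best, f_best = subgradient_method(lambda x: np.abs(x).sum(),
                                    lambda x: np.sign(x),
                                    x0=[2.0, -1.5])
```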

Why not "subgradient descent"?

Consider

$$\min_{x \in \mathbb{R}^2} f(x), \quad \text{where } f(x_1, x_2) := x_1 + x_2 + \max\{0, x_1^2 + x_2^2 - 4\}.$$

At $\bar x = (0, -2)$, we have

$$\partial f(\bar x) = \mathrm{conv}\left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \end{bmatrix} \right\}, \quad \text{but } -\begin{bmatrix} 1 \\ 1 \end{bmatrix} \text{ and } -\begin{bmatrix} 1 \\ -3 \end{bmatrix} \text{ are both directions of ascent for } f \text{ from } \bar x!$$
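A hedged numerical check of this example: stepping from $\bar x = (0, -2)$ along the negatives of both extreme subgradients increases $f$, so following an arbitrary negative subgradient is not a descent direction here.

```python
import numpy as np

def f(x):
    return x[0] + x[1] + max(0.0, x[0] ** 2 + x[1] ** 2 - 4.0)

x = np.array([0.0, -2.0])
for g in (np.array([1.0, 1.0]), np.array([1.0, -3.0])):
    t = 1e-4
    print(f(x - t * g) > f(x))   # True for both: ascent along -g
```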

Decreasing the distance to a solution

The objective $f$ is not the only measure of progress.

- Given an arbitrary subgradient $g_k$ for $f$ at $x_k$, we have
  $$f(x) \ge f(x_k) + g_k^T (x - x_k) \quad \text{for all } x \in \mathbb{R}^n, \quad (1)$$
  which means that all points with an objective value lower than $f(x_k)$ lie in
  $$H_k := \{x \in \mathbb{R}^n : g_k^T (x - x_k) \le 0\}.$$
- Thus, a small step along $-g_k$ should decrease the distance to a solution.
- (Convexity is crucial for this idea.)

"Algorithmic convergence"

Theorem
If $f$ has a minimizer, $\|g_k\|_2 \le G \in \mathbb{R}_{>0}$ for all $k \in \mathbb{N}$, and the stepsizes satisfy

$$\sum_{k=1}^\infty \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^\infty \alpha_k^2 < \infty, \quad (2)$$

then, with $f_j := f(x_j)$,

$$\lim_{k \to \infty} \left\{ \min_{j \in \{0,\dots,k\}} f_j \right\} = f^*.$$

- An example sequence satisfying (2) is $\alpha_k = \alpha / k$ for $k = 1, 2, \dots$

Proof that $\lim_{k \to \infty} \{\min_{j \in \{0,\dots,k\}} f_j\} = f^*$, part 1.

Let $k \in \mathbb{N}$. By (1), the iterates satisfy

$$\|x_{k+1} - x^*\|_2^2 = \|x_k - \alpha_k g_k - x^*\|_2^2 = \|x_k - x^*\|_2^2 - 2\alpha_k g_k^T (x_k - x^*) + \alpha_k^2 \|g_k\|_2^2 \le \|x_k - x^*\|_2^2 - 2\alpha_k (f_k - f^*) + \alpha_k^2 \|g_k\|_2^2.$$

Applying this inequality recursively, we have

$$0 \le \|x_{k+1} - x^*\|_2^2 \le \|x_0 - x^*\|_2^2 - 2 \sum_{j=0}^k \alpha_j (f_j - f^*) + \sum_{j=0}^k \alpha_j^2 \|g_j\|_2^2,$$

which implies that

$$2 \sum_{j=0}^k \alpha_j (f_j - f^*) \le \|x_0 - x^*\|_2^2 + \sum_{j=0}^k \alpha_j^2 \|g_j\|_2^2 \quad \Rightarrow \quad \min_{j \in \{0,\dots,k\}} f_j - f^* \le \frac{\|x_0 - x^*\|_2^2 + G^2 \sum_{j=0}^k \alpha_j^2}{2 \sum_{j=0}^k \alpha_j}. \quad (3)$$

Proof that $\lim_{k \to \infty} \{\min_{j \in \{0,\dots,k\}} f_j\} = f^*$, part 2.

Now consider an arbitrary scalar $\epsilon > 0$. By (2), there exists a nonnegative integer $K$ such that, for all $k > K$,

$$\alpha_k \le \frac{\epsilon}{G^2} \quad \text{and} \quad \sum_{j=0}^k \alpha_j \ge \frac{1}{\epsilon} \left( \|x_0 - x^*\|_2^2 + G^2 \sum_{j=0}^K \alpha_j^2 \right).$$

Then, by (3), it follows that for all $k > K$ we have

$$\min_{j \in \{0,\dots,k\}} f_j - f^* \le \frac{\|x_0 - x^*\|_2^2 + G^2 \sum_{j=0}^K \alpha_j^2}{2 \sum_{j=0}^k \alpha_j} + \frac{G^2 \sum_{j=K+1}^k \alpha_j^2}{2 \sum_{j=0}^K \alpha_j + 2 \sum_{j=K+1}^k \alpha_j}$$
$$\le \frac{\|x_0 - x^*\|_2^2 + G^2 \sum_{j=0}^K \alpha_j^2}{\frac{2}{\epsilon} \left( \|x_0 - x^*\|_2^2 + G^2 \sum_{j=0}^K \alpha_j^2 \right)} + \frac{G^2 \sum_{j=K+1}^k \frac{\epsilon}{G^2} \alpha_j}{2 \sum_{j=K+1}^k \alpha_j} = \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$

The result follows since $\epsilon > 0$ was chosen arbitrarily. $\blacksquare$

Cutting plane method

Subgradient methods lose previously computed information in every iteration.

- Suppose, after a sequence of iterates, we have the affine underestimators
  $$\underline{f}_i(x) = f(x_i) + g_i^T (x - x_i) \quad \text{for all } i \in \{0, \dots, k\}.$$

[Figure: $f$ with the cuts $f(x_0) + g_0^T (x - x_0)$ and $f(x_1) + g_1^T (x - x_1)$ and the iterates $x_0$, $x_1$, $x_2$.]

- At iteration $k$, we can compute the next iterate by solving the master problem
  $$x_{k+1} \leftarrow \arg\min_{x \in X} \underline{f}_k(x), \quad \text{where } \underline{f}_k(x) := \max_{i \in \{0,\dots,k\}} (f(x_i) + g_i^T (x - x_i)).$$
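A hedged sketch of this method, assuming $X$ is a box so the master problem is a linear program in $(x, v)$; scipy.optimize.linprog, the oracle interface, and the tolerance are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def cutting_plane(f, subgrad, x0, lo, hi, iters=30):
    x = np.asarray(x0, dtype=float)
    n = x.size
    cuts = []                                  # list of (x_i, f(x_i), g_i)
    for _ in range(iters):
        cuts.append((x.copy(), f(x), subgrad(x)))
        # LP variables z = (x, v): minimize v subject to, for each cut,
        # f(x_i) + g_i^T (x - x_i) <= v, i.e., g_i^T x - v <= g_i^T x_i - f(x_i).
        c = np.r_[np.zeros(n), 1.0]
        A_ub = np.array([np.r_[g, -1.0] for (xi, fi, g) in cuts])
        b_ub = np.array([g @ xi - fi for (xi, fi, g) in cuts])
        bounds = [(lo, hi)] * n + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        x, v = res.x[:n], res.x[n]
        if f(x) - v <= 1e-8:                   # lower bound meets value: optimal
            break
    return x
```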


Cutting plane method convergence

The iterates of the cutting plane method yield lower bounds on the optimal value:

$$v_{k+1} := \min_{x \in X} \underline{f}_k(x) \le \min_{x \in X} f(x) =: f^*.$$

Therefore, if $v_{k+1} = f(x_{k+1})$, then we terminate since $f(x_{k+1}) = f^*$.

- If $f$ is piecewise linear, then convergence occurs in finitely many iterations!

[Figure: the piecewise-linear model built from the cuts at $x_0$ and $x_1$.]

However, in general, we have the following theorem.

Theorem
The cutting plane method yields $\{x_k\}$ satisfying $\{f(x_k)\} \to f^*$.


Bundle method

A bundle method attempts to combine the practical advantages of a cutting plane method with the theoretical strengths of a proximal point method.

- Given $x_k$, consider the regularized master problem
  $$\min_{x \in \mathbb{R}^n} \left( \underline{f}_k(x) + \frac{\gamma}{2} \|x - x_k\|_2^2 \right), \quad \text{where } \underline{f}_k(x) := \max_{i \in I_k} (f(x_i) + g_i^T (x - x_i)).$$
  Here, $I_k \subseteq \{1, \dots, k-1\}$ indicates a subset of previous iterations.
- This problem is equivalent to the quadratic optimization problem
  $$\min_{(x,v) \in \mathbb{R}^n \times \mathbb{R}} \ v + \frac{\gamma}{2} \|x - x_k\|_2^2 \quad \text{s.t.} \quad f(x_i) + g_i^T (x - x_i) \le v \ \text{for all } i \in I_k.$$
- Only move to a "new" point when a sufficient decrease is obtained.

Convergence rate analyses are limited; $O(\frac{1}{\epsilon} \log(\frac{1}{\epsilon}))$ for strongly convex $f$.

Bundle method convergence

Analysis makes use of the Moreau-Yosida regularization function

$$f_\gamma(\bar x) = \min_{x \in \mathbb{R}^n} \left( f(x) + \frac{1}{2\gamma} \|x - \bar x\|_2^2 \right).$$

Theorem
If $x_k$ is not a minimizer, then $f_\gamma(x_k) < f(x_k)$.
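A worked example under the assumption $f(x) = |x|$: the inner minimizer is the soft-thresholding (proximal) point, $f_\gamma$ is the Huber function, and $f_\gamma(\bar x) < f(\bar x)$ away from the minimizer, exactly as the theorem states.

```python
import numpy as np

def moreau_abs(xbar, gamma):
    # Minimizer of |x| + (x - xbar)^2 / (2*gamma) is the soft-threshold of xbar.
    prox = np.sign(xbar) * np.maximum(np.abs(xbar) - gamma, 0.0)
    return abs(prox) + (prox - xbar) ** 2 / (2.0 * gamma)

for xbar in (0.0, 0.5, 2.0):
    # envelope <= |xbar|, with equality only at the minimizer xbar = 0
    print(moreau_abs(xbar, gamma=1.0), abs(xbar))
```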

Bundle method convergence

Theorem
For all $(k, j) \in \mathbb{N} \times \mathbb{N}$ in a bundle method,

$$v_{k,j} + \frac{1}{2\gamma} \|x_{k,j} - x_k\|_2^2 \le f_\gamma(x_k) < f(x_k).$$


Clarke subdifferential

What if $f$ is nonconvex and nonsmooth? What are subgradients?

We still need some structure; we assume

- $f$ is locally Lipschitz and
- $f$ is differentiable on a full measure set $D$.

The Clarke subdifferential of $f$ at $x$ is

$$\partial f(x) = \mathrm{conv}\left\{ \lim_{j \to \infty} \nabla f(x_j) : x_j \to x \text{ and } x_j \in D \right\},$$

i.e., the convex hull of limits of gradients of $f$ at points in $D$ converging to $x$.

Theorem
If $f$ is continuously differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.


Differentiable, but nonsmooth

Theorem
If $f$ is differentiable at $x$, then $\{\nabla f(x)\} \subseteq \partial f(x)$ (not necessarily equal).

Considering

$$f(x) = \begin{cases} x^2 \cos(1/x) & \text{if } x \ne 0 \\ 0 & \text{if } x = 0, \end{cases}$$

one finds that $f'(0) = 0$, yet $[-1, 1] \subseteq \partial f(0)$.
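A hedged numerical illustration: for $x \ne 0$ one has $f'(x) = 2x \cos(1/x) + \sin(1/x)$, which oscillates through all of $[-1, 1]$ as $x \to 0$; this is exactly why the Clarke subdifferential at $0$ contains $[-1, 1]$ even though $f'(0) = 0$.

```python
import numpy as np

# Sample f'(x) = 2x cos(1/x) + sin(1/x) at points approaching 0 from above.
fprime = lambda x: 2 * x * np.cos(1 / x) + np.sin(1 / x)
x = np.geomspace(1e-8, 1e-2, 100_000)
print(fprime(x).min(), fprime(x).max())   # approximately -1 and +1
```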

Clarke ε-subdifferential and gradient sampling

As before, we typically cannot compute $\partial f(x)$. It is approximated by the Clarke ε-subdifferential, namely,

$$\partial_\epsilon f(x) = \mathrm{conv}\{\partial f(B(x, \epsilon))\},$$

which in turn can be approximated as in

$$\partial_\epsilon f(x) \approx \mathrm{conv}\{\nabla f(x_k), \nabla f(x_{k,1}), \dots, \nabla f(x_{k,m})\}, \quad \text{where } \{x_{k,1}, \dots, x_{k,m}\} \subset B(x_k, \epsilon).$$

In gradient sampling, we compute the minimum norm element in

$$\mathrm{conv}\{\nabla f(x_k), \nabla f(x_{k,1}), \dots, \nabla f(x_{k,m})\},$$

which is equivalent to solving

$$\min_{(x,v) \in \mathbb{R}^n \times \mathbb{R}} \ v + \|x - x_k\|_2^2 \quad \text{s.t.} \quad f(x_k) + \nabla f(x_{k,i})^T (x - x_k) \le v \ \text{for all } i \in \{1, \dots, m\}.$$
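A minimal sketch of the min-norm computation, written directly as $\min_\omega \|G\omega\|_2^2$ over the simplex, where the columns of $G$ are the sampled gradients; SLSQP is an illustrative solver choice, not the QP solver used by production gradient-sampling codes.

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_element(G):
    m = G.shape[1]                         # one sampled gradient per column
    obj = lambda w: (G @ w) @ (G @ w)      # ||G w||_2^2
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, None)] * m,
                   constraints=cons, method='SLSQP')
    return G @ res.x

# Example: gradients of f(x) = |x_1| + |x_2| sampled around the kink at 0;
# their convex hull contains 0, so the min-norm element is (nearly) zero.
G = np.array([[1.0, 1.0, -1.0, -1.0],
              [1.0, -1.0, 1.0, -1.0]])
print(min_norm_element(G))
```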



Popular and effective method

Despite all I've talked about, a very effective method: BFGS.

Approximate second-order information with gradient displacements:

[Figure: the displacement from $x_k$ to $x_{k+1}$.]

Secant equation $H_k y_k = s_k$ to match the gradient of $f$ at $x_k$, where

$$s_k := x_{k+1} - x_k \quad \text{and} \quad y_k := \nabla f(x_{k+1}) - \nabla f(x_k).$$


BFGS-type updates

Inverse Hessian and Hessian approximation updating formulas ($s_k^T v_k > 0$):

$$W_{k+1} \leftarrow \left( I - \frac{v_k s_k^T}{s_k^T v_k} \right)^T W_k \left( I - \frac{v_k s_k^T}{s_k^T v_k} \right) + \frac{s_k s_k^T}{s_k^T v_k}$$

$$H_{k+1} \leftarrow \left( I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k} \right)^T H_k \left( I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k} \right) + \frac{v_k v_k^T}{s_k^T v_k}$$

With an appropriate technique for choosing $v_k$, we attain

- self-correcting properties for $\{H_k\}$ and $\{W_k\}$
- (inverse) Hessian approximations that can be used in other algorithms
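A minimal sketch of the inverse-Hessian update; with $v_k = y_k$ it reduces to the standard BFGS update, and the resulting $W_{k+1}$ satisfies the secant-type equation $W_{k+1} v_k = s_k$, checked numerically below on random data.

```python
import numpy as np

def bfgs_inverse_update(W, s, v):
    n = s.size
    rho = 1.0 / (s @ v)                    # requires curvature condition s^T v > 0
    E = np.eye(n) - rho * np.outer(v, s)   # I - v s^T / (s^T v)
    return E.T @ W @ E + rho * np.outer(s, s)

rng = np.random.default_rng(0)
s = rng.standard_normal(3)
v = rng.standard_normal(3)
if s @ v < 0:
    v = -v                                 # enforce s^T v > 0 for the test
W_new = bfgs_inverse_update(np.eye(3), s, v)
assert np.allclose(W_new @ v, s)           # secant-type equation holds
```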

Subproblems in nonsmooth optimization algorithms

With sets of points, scalars, and (sub)gradients

$$\{x_{k,j}\}_{j=1}^m, \quad \{f_{k,j}\}_{j=1}^m, \quad \{g_{k,j}\}_{j=1}^m,$$

nonsmooth optimization methods involve the primal subproblem

$$\min_{x \in \mathbb{R}^n} \left( \max_{j \in \{1,\dots,m\}} \{f_{k,j} + g_{k,j}^T (x - x_{k,j})\} + \frac{1}{2} (x - x_k)^T H_k (x - x_k) \right) \quad \text{s.t.} \quad \|x - x_k\| \le \delta_k, \quad (P)$$

but, with $G_k \leftarrow [g_{k,1} \cdots g_{k,m}]$, it is typically more efficient to solve the dual

$$\sup_{(\omega, \gamma) \in \mathbb{R}^m_+ \times \mathbb{R}^n} \ -\frac{1}{2} (G_k \omega + \gamma)^T W_k (G_k \omega + \gamma) + b_k^T \omega - \delta_k \|\gamma\|_* \quad \text{s.t.} \quad 1_m^T \omega = 1. \quad (D)$$

The primal solution can then be recovered by

$$x_k^* \leftarrow x_k - W_k \underbrace{(G_k \omega_k + \gamma_k)}_{g_k}.$$

Algorithm: Self-Correcting Variable-Metric Algorithm for Nonsmooth Optimization
1: Choose $x_1 \in \mathbb{R}^n$.
2: Choose a symmetric positive definite $W_1 \in \mathbb{R}^{n \times n}$.
3: Choose $\alpha \in (0, 1)$.
4: for $k = 1, 2, \dots$ do
5:   Solve (P)–(D) such that setting $G_k \leftarrow [g_{k,1} \cdots g_{k,m}]$, $s_k \leftarrow -W_k (G_k \omega_k + \gamma_k)$, and $x_{k+1} \leftarrow x_k + s_k$
6:   yields $f(x_{k+1}) \le f(x_k) - \frac{1}{2} \alpha (G_k \omega_k + \gamma_k)^T W_k (G_k \omega_k + \gamma_k)$.
7:   Choose $v_k$ (details omitted, but very simple).
8:   Set $W_{k+1} \leftarrow \left( I - \frac{v_k s_k^T}{s_k^T v_k} \right)^T W_k \left( I - \frac{v_k s_k^T}{s_k^T v_k} \right) + \frac{s_k s_k^T}{s_k^T v_k}$.

Instances of the framework

Cutting plane / bundle methods

- Points added incrementally until sufficient decrease obtained
- Finite number of additions until accepted step

Gradient sampling methods

- Points added randomly / incrementally until sufficient decrease obtained
- Sufficient number of iterations with "good" steps

In any case: convergence guarantees require $\{W_k\}$ to be uniformly positive definite and bounded on a sufficient number of accepted steps.

C++ implementation: NonOpt

BFGS w/ weak Wolfe line search

Name                 Exit        ε_end       f(x_end)    #iter  #func  #grad  #subs
maxq                 Stationary  +9.77e-05   +2.26e-07     450   1017    452    451
mxhilb               Stepsize    +3.13e-03   +9.26e-02     101   1886    113    102
chained lq           Stepsize    +5.00e-02   -6.93e+01     205   4754    207    206
chained cb3 1        Stepsize    +1.00e-01   +9.80e+01     347   7469    348    348
chained cb3 2        Stepsize    +1.00e-01   +9.80e+01      64   1496     69     65
active faces         Stepsize    +2.50e-02   +2.22e-16      24    672     27     25
brown function 2     Stepsize    +1.00e-01   +2.04e-05     395  17259    396    396
chained mifflin 2    Stepsize    +5.00e-02   -3.47e+01     476  10808    508    477
chained crescent 1   Stepsize    +1.00e-01   +2.18e-01      74   2278     91     75
chained crescent 2   Stepsize    +1.00e-01   +5.86e-02     313   7585    334    314

Bundle method with self-correcting properties

Name                 Exit        ε_end       f(x_end)    #iter  #func  #grad  #subs
maxq                 Stationary  +9.77e-05   +1.04e-06     193    441    635    440
mxhilb               Stationary  +9.77e-05   +2.25e-05      39    338    351    137
chained lq           Stationary  +9.77e-05   -6.93e+01      29    374    398    366
chained cb3 1        Stationary  +9.77e-05   +9.80e+01      50   1038   1069   1017
chained cb3 2        Stationary  +9.77e-05   +9.80e+01      29    174    204    173
active faces         Stationary  +9.77e-05   +2.09e-02      17    387    165     32
brown function 2     Stationary  +9.77e-05   +2.49e-03     232  10094   9674   9438
chained mifflin 2    Stationary  +9.77e-05   -3.48e+01     393  24410  19493  18924
chained crescent 1   Stationary  +9.77e-05   +2.73e-04      30     66     92     59
chained crescent 2   Stationary  +9.77e-05   +4.36e-05     137   6679   6140   5997

Thanks!

NonOpt coming soon...

- Andreas could finish in a day...
- ...what has taken me 6 months on sabbatical, so
- it'll be done when he has a free day ;-)

Thanks for listening!