Examples | Subdifferentials | Algorithms | Nonconvex-Nonsmooth | General Framework

Algorithms for Nonsmooth Optimization
Frank E. Curtis, Lehigh University
presented at
Center for Optimization and Statistical Learning, Northwestern University
2 March 2018

Algorithms for Nonsmooth Optimization 1 of 55
Outline

- Motivating Examples
- Subdifferential Theory
- Fundamental Algorithms
- Nonconvex Nonsmooth Functions
- General Framework
Nonsmooth optimization

In mathematical optimization, one wants to
- minimize an objective
- subject to constraints,
i.e., \min_{x \in X} f(x).

Why nonsmooth optimization? Nonsmoothness can arise for different reasons:
- physical (phenomena can be nonsmooth): phase changes in materials
- technological (constraints impose nonsmoothness): obstacles in shape design
- methodological (nonsmoothness introduced by solution method): decompositions; penalty formulations
- numerical (analytically smooth, but practically nonsmooth): "stiff" problems

(Bagirov, Karmitsa, Makela (2014))
Data fitting

  \min_{x \in \mathbb{R}^n} \theta(x) + \psi(x), where, e.g., \theta(x) = \|Ax - b\|_2^2 and \psi(x) = \sum_{i=1}^n \varphi(x_i)

with

  \varphi_1(t) = \frac{\alpha |t|}{1 + \alpha |t|},
  \varphi_2(t) = \log(\alpha |t| + 1),
  \varphi_3(t) = |t|^q, or
  \varphi_4(t) = \frac{\alpha - (\alpha - t)_+^2}{\alpha}
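These penalties are simple to code; a minimal sketch in Python (the function names and the sample values alpha = 2 and q = 1/2 are ours, not from the talk):

```python
import numpy as np

def phi1(t, alpha=2.0):
    # phi_1(t) = alpha|t| / (1 + alpha|t|): bounded, nonsmooth at t = 0
    return alpha * np.abs(t) / (1.0 + alpha * np.abs(t))

def phi2(t, alpha=2.0):
    # phi_2(t) = log(alpha|t| + 1): grows slowly, nonsmooth at t = 0
    return np.log(alpha * np.abs(t) + 1.0)

def phi3(t, q=0.5):
    # phi_3(t) = |t|^q with 0 < q < 1: nonconvex, nonsmooth at t = 0
    return np.abs(t) ** q

def objective(x, A, b, phi):
    # theta(x) + psi(x): least-squares fit plus a separable penalty
    return float(np.sum((A @ x - b) ** 2) + np.sum(phi(x)))
```

All three are differentiable away from zero but have a kink at zero, which is what promotes sparse solutions and what makes the problem nonsmooth.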
Clusterwise linear regression (CLR)

Given a dataset of pairs A := \{(a_i, b_i)\}_{i=1}^l, the goal of CLR is to simultaneously
- partition the dataset into k disjoint clusters, and
- find regression coefficients \{(x_j, y_j)\}_{j=1}^k for each cluster
in order to minimize overall error in the fit; e.g.,

  \min_{\{(x_j, y_j)\}} f_k(\{x_j, y_j\}), where f_k(\{x_j, y_j\}) = \sum_{i=1}^l \min_{j \in \{1, \dots, k\}} |x_j^T a_i - y_j - b_i|^p.

This objective is nonconvex (though it is a difference of convex functions).
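The objective f_k is easy to evaluate for given coefficients, since the inner min just charges each point the error of its best-fitting hyperplane; a small sketch (the function name and the p = 2 default are ours):

```python
import numpy as np

def clr_objective(coeffs, data, p=2):
    # f_k({x_j, y_j}) = sum_i min_j |x_j^T a_i - y_j - b_i|^p
    # coeffs: list of (x_j, y_j) pairs; data: list of (a_i, b_i) pairs
    total = 0.0
    for a_i, b_i in data:
        # inner min: assign point i to its best cluster j
        total += min(abs(float(np.dot(x_j, a_i)) - y_j - b_i) ** p
                     for (x_j, y_j) in coeffs)
    return total
```

The pointwise min over j is exactly what makes f_k nonsmooth (and nonconvex): it is a sum of minima of convex pieces.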
Decomposition

Various types of decomposition strategies introduce nonsmoothness.

- Primal decomposition can be used for

    \min_{(x_1, x_2, y)} f_1(x_1, y) + f_2(x_2, y),

  where y is the complicating/linking variable; equivalent to

    \min_y \varphi_1(y) + \varphi_2(y), where \varphi_1(y) := \min_{x_1} f_1(x_1, y) and \varphi_2(y) := \min_{x_2} f_2(x_2, y).

  This master problem may be nonsmooth in y.

- Dual decomposition can be used for the same problem, reformulating it as

    \min_{(x_1, y_1, x_2, y_2)} f_1(x_1, y_1) + f_2(x_2, y_2) s.t. y_1 = y_2.

  The Lagrangian is separable, meaning the dual function decomposes:

    g_1(\lambda) = \inf_{(x_1, y_1)} (f_1(x_1, y_1) + \lambda^T y_1)
    g_2(\lambda) = \inf_{(x_2, y_2)} (f_2(x_2, y_2) - \lambda^T y_2)

  The dual problem, to maximize g(\lambda) = g_1(\lambda) + g_2(\lambda), may be nonsmooth in \lambda.
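A toy illustration of why the master problem can be nonsmooth (the instance is ours, not from the talk): a value function \varphi_1(y) = \min_{x_1} f_1(x_1, y) is a pointwise min of smooth curves, and it has a kink wherever the minimizing x_1 switches.

```python
def f1(x1, y):
    # smooth in y for each fixed x1
    return (x1 - y) ** 2 + x1

X1 = [0.0, 1.0]  # assume x1 is restricted to a small finite set

def phi1(y):
    # value function: pointwise min of the smooth curves y -> f1(x1, y)
    return min(f1(x1, y) for x1 in X1)

# the minimizing x1 switches at y = 1, so the one-sided slopes of phi1 differ
eps = 1e-6
left_slope = (phi1(1.0) - phi1(1.0 - eps)) / eps    # slope of the y^2 branch, ~2
right_slope = (phi1(1.0 + eps) - phi1(1.0)) / eps   # slope of the (1-y)^2 + 1 branch, ~0
```

Even though each f1(x1, .) is smooth, phi1 is only piecewise smooth: the mismatch between left_slope and right_slope is the kink.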
Dual decomposition with constraints

Consider the nearly separable problem

  \min_{(x_1, \dots, x_m)} \sum_{i=1}^m f_i(x_i)
  s.t. x_i \in X_i for all i \in \{1, \dots, m\}
       \sum_{i=1}^m A_i x_i \leq b (e.g., shared resource constraint)

where the last are complicating/linking constraints; "dualizing" leads to

  g(\lambda) := \min_{(x_1, \dots, x_m)} \sum_{i=1}^m f_i(x_i) + \lambda^T \left( \sum_{i=1}^m A_i x_i - b \right)
  s.t. x_i \in X_i for all i \in \{1, \dots, m\}.

Given \lambda \in \mathbb{R}^m, the value g(\lambda) comes from solving separable problems; the dual

  \max_{\lambda \geq 0} g(\lambda)

is typically nonsmooth (and people often use poor algorithms!).
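On a tiny instance, g(\lambda) and a slope for it can be computed exactly; here f_i(x_i) = c_i x_i, X_i = [0, 1], and a single linking constraint \sum_i a_i x_i \leq b (the instance and names are ours). Once the inner minimizers x_i(\lambda) are known, \sum_i a_i x_i(\lambda) - b gives a slope for the piecewise-linear, concave g; the points where some x_i(\lambda) jumps are exactly where g is nonsmooth.

```python
def dual_value_and_slope(lam, c, a, b):
    # g(lam) = min over x in [0,1]^m of sum_i (c_i + lam*a_i) x_i - lam*b;
    # the problem separates, so each x_i is 0 or 1 by the sign of its coefficient
    x = [1.0 if c_i + lam * a_i < 0.0 else 0.0 for c_i, a_i in zip(c, a)]
    g = sum((c_i + lam * a_i) * x_i for c_i, a_i, x_i in zip(c, a, x)) - lam * b
    slope = sum(a_i * x_i for a_i, x_i in zip(a, x)) - b
    return g, slope
```

Evaluating at a few multipliers shows the slope jumping from positive to negative as \lambda grows: g is concave and piecewise linear, with kinks at the breakpoints.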
Control of dynamical systems

Consider the discrete-time linear dynamical system

  y_{k+1} = A y_k + B u_k (state equation)
  z_k = C y_k (observation equation)

Supposing we want to "design" a control such that

  u_k = X C y_k (where X is our variable),

consider the "closed-loop system" given by

  y_{k+1} = A y_k + B u_k = A y_k + B X C y_k = (A + B X C) y_k.

A common objective is to minimize a stability measure

  \rho(A + B X C),

which is often a function of the eigenvalues of A + B X C.
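For instance, reading \rho as the spectral radius (one common choice of stability measure; discrete-time stability requires it below 1), evaluating the objective is a single eigenvalue computation. A sketch assuming NumPy (the function name is ours):

```python
import numpy as np

def closed_loop_spectral_radius(A, B, C, X):
    # rho(A + B X C): largest modulus among eigenvalues of the closed-loop matrix
    return float(np.max(np.abs(np.linalg.eigvals(A + B @ X @ C))))
```

As a function of X this objective is nonsmooth (and nonconvex): eigenvalue moduli can coalesce, and that is where the nonsmoothness lives.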
Eigenvalue optimization

[figure: plots of ordered eigenvalues as the matrix is perturbed along a given direction]
Other sources of nonsmooth optimization problems

- Lagrangian relaxation
- Composite optimization (e.g., penalty methods for "soft constraints")
- Parametric optimization (e.g., for model predictive control)
- Multilevel optimization
Outline

- Motivating Examples
- Subdifferential Theory
- Fundamental Algorithms
- Nonconvex Nonsmooth Functions
- General Framework
Derivatives

When I teach an optimization class, I always start with the same question:

  What is a derivative? (f : \mathbb{R} \to \mathbb{R})

Answer I get: "slope of the tangent line"

[figure: graph of f with its tangent line at a point, slope = f'(x)]
Gradients

Then I ask:

  What is a gradient? (f : \mathbb{R}^n \to \mathbb{R})

Answer I get: "direction along which the function increases at the fastest rate"
Derivative vs. gradient

So if a derivative is a magnitude (here, a slope), then why does it generalize in multiple dimensions to something that is a direction?

  (n = 1) f'(x) = \frac{df}{dx}(x) = \frac{\partial f}{\partial x}(x)

  (n \geq 1) \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}

What's important? Magnitude? Direction?

Answer: The gradient is a vector in \mathbb{R}^n, which
- has magnitude (e.g., its 2-norm),
- can be viewed as a direction,
- and gives us a way to compute directional derivatives.
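That last point can be made concrete: with the gradient in hand, the directional derivative along any s is just \nabla f(x)^T s, and among unit directions it is maximized along the normalized gradient itself. A small numerical check (the quadratic test function and helper name are ours):

```python
import numpy as np

def grad_fd(f, x, h=1e-6):
    # forward-difference approximation of nabla f(x), one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x)) / h
    return g

f = lambda x: x[0] ** 2 + 3.0 * x[1] ** 2
x = np.array([1.0, 1.0])
g = grad_fd(f, x)                 # approximately (2, 6)
s_up = g / np.linalg.norm(g)      # unit direction of fastest increase
```

The directional derivative g @ s is largest, over all unit s, at s = s_up; that is the precise sense in which the gradient "is a direction".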
Differentiable f

How should we think about the gradient?

If f is continuously differentiable (i.e., f \in C^1), then \nabla f(\bar{x}) is the unique vector in the linear (Taylor) approximation of f at \bar{x}:

  f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x})

[figure: f(x) and its linearization at \bar{x}]

Both are graphs of functions of x!
Differentiable and convex f

If f \in C^1 is convex, then

  f(x) \geq f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) for all (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n.

[figure: f lies everywhere above its linearization f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x})]
Graphs and epigraphs

There is another interpretation of a gradient that is also useful. First...

What is a graph? A set of points in \mathbb{R}^{n+1}, namely \{(x, z) : f(x) = z\}.

A related quantity, another set, is the epigraph: \{(x, z) : f(x) \leq z\}.

[figure: the graph \{(x, f(x))\} with the epigraph shaded above it]
Differentiable and convex f

If f \in C^1 is convex, then, for all (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n,

  f(x) \geq f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x})
  \iff f(x) - \nabla f(\bar{x})^T x \geq f(\bar{x}) - \nabla f(\bar{x})^T \bar{x}
  \iff \begin{bmatrix} -\nabla f(\bar{x}) \\ 1 \end{bmatrix}^T \begin{bmatrix} x \\ f(x) \end{bmatrix} \geq \begin{bmatrix} -\nabla f(\bar{x}) \\ 1 \end{bmatrix}^T \begin{bmatrix} \bar{x} \\ f(\bar{x}) \end{bmatrix}

Note: Given \bar{x}, the vector \begin{bmatrix} -\nabla f(\bar{x}) \\ 1 \end{bmatrix} is given, so the inequality above involves a linear function over \mathbb{R}^{n+1} and says that

  the value at any point \begin{bmatrix} x \\ f(x) \end{bmatrix} in the graph is at least the value at \begin{bmatrix} \bar{x} \\ f(\bar{x}) \end{bmatrix}.
Linearization and supporting hyperplane for epigraph

[figure: the linearization f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) supports the epigraph of f at the point (\bar{x}, f(\bar{x})), with normal direction (-\nabla f(\bar{x}), 1)]
Subgradients (convex f)

Why was that useful? We can generalize this idea when the function is not differentiable somewhere.

[figure: a nonsmooth convex f with a supporting hyperplane at (\bar{x}, f(\bar{x})) having normal direction (-g, 1)]

A vector g \in \mathbb{R}^n is a subgradient of a convex f : \mathbb{R}^n \to \mathbb{R} at \bar{x} \in \mathbb{R}^n if

  f(x) \geq f(\bar{x}) + g^T (x - \bar{x})
  \iff \begin{bmatrix} -g \\ 1 \end{bmatrix}^T \begin{bmatrix} x \\ f(x) \end{bmatrix} \geq \begin{bmatrix} -g \\ 1 \end{bmatrix}^T \begin{bmatrix} \bar{x} \\ f(\bar{x}) \end{bmatrix}
Subdifferentials

Theorem. If f is convex and differentiable at \bar{x}, then \nabla f(\bar{x}) is its unique subgradient at \bar{x}.

But in general, the set of all subgradients for a convex f at \bar{x} is the subdifferential of f at \bar{x}:

  \partial f(\bar{x}) := \{g \in \mathbb{R}^n : g is a subgradient of f at \bar{x}\}.

From the definition, it is easily seen that

  x_* is a minimizer of f if and only if 0 \in \partial f(x_*).
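These definitions can be checked numerically on the classic example f(x) = |x|, whose subdifferential at zero is the interval [-1, 1] (a standard fact). The sketch below (helper name ours) tests the subgradient inequality on a grid of points; note that 0 lies in \partial f(0), certifying that 0 minimizes |.|:

```python
def is_subgradient(f, g, xbar, test_points):
    # check f(x) >= f(xbar) + g*(x - xbar) on a finite sample of points
    # (passing on a sample is necessary, not sufficient, in general)
    return all(f(x) >= f(xbar) + g * (x - xbar) - 1e-12 for x in test_points)

pts = [i / 10.0 for i in range(-50, 51)]
# for f = abs at xbar = 0: every g in [-1, 1] passes, anything outside fails
```

Away from the kink the subgradient is unique (the derivative): at xbar = 2 only g = 1 works for f = abs.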
What about nonconvex, nonsmooth?

We need to generalize the idea of a subgradient further.

- Directional derivatives
- Subgradients
- Subdifferentials

Let's return to this after we discuss some algorithms...
Outline

- Motivating Examples
- Subdifferential Theory
- Fundamental Algorithms
- Nonconvex Nonsmooth Functions
- General Framework
A fundamental iteration

Thinking of -\nabla f(x_k), we have a vector that
- directs us in a direction of descent, and
- vanishes as we approach a minimizer.

Algorithm: Gradient Descent
  1: Choose an initial point x_0 \in \mathbb{R}^n and stepsize \alpha \in (0, 1/L].
  2: for k = 0, 1, 2, ... do
  3:   if \|\nabla f(x_k)\| \approx 0, then return x_k
  4:   else set x_{k+1} \leftarrow x_k - \alpha \nabla f(x_k)

I call this a fundamental iteration.

Here, we suppose \nabla f is Lipschitz continuous, i.e., there exists L \geq 0 such that

  \|\nabla f(x) - \nabla f(\bar{x})\|_2 \leq L \|x - \bar{x}\|_2 for all (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n
  \implies f(x) \leq f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) + \tfrac{1}{2} L \|x - \bar{x}\|_2^2 for all (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n.
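The algorithm above translates directly into code; a minimal sketch on a quadratic with known Lipschitz constant (the instance and names are ours):

```python
import numpy as np

def gradient_descent(grad, x0, L, tol=1e-8, max_iter=10000):
    # fixed stepsize alpha = 1/L, the largest value the theory allows
    x, alpha = np.asarray(x0, dtype=float), 1.0 / L
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # the "||grad f(x_k)|| ~ 0" test
            return x
        x = x - alpha * g
    return x

# f(x) = 0.5 x^T Q x with Q = diag(1, 10), so grad f(x) = Q x and L = 10
Q = np.diag([1.0, 10.0])
x_final = gradient_descent(lambda x: Q @ x, [5.0, -3.0], L=10.0)
```

With \alpha = 1/L the iterates contract toward the minimizer x_* = 0, consistent with the convergence results that follow.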
Convergence of gradient descent

[figure: at x_k, the function f lies everywhere below the quadratic upper model f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2} L \|x - x_k\|_2^2]
Gradient descent for f

Theorem. If \nabla f is Lipschitz continuous with constant L > 0 and \alpha \in (0, 1/L], then

  \sum_{j=0}^{\infty} \|\nabla f(x_j)\|_2^2 < \infty, which implies \{\nabla f(x_j)\} \to 0.

Proof. Let k \in \mathbb{N} and recall that x_{k+1} - x_k = -\alpha \nabla f(x_k). Then, since \alpha \in (0, 1/L],

  f(x_{k+1}) \leq f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \tfrac{1}{2} L \|x_{k+1} - x_k\|_2^2
             = f(x_k) - \alpha \|\nabla f(x_k)\|_2^2 + \tfrac{1}{2} L \alpha^2 \|\nabla f(x_k)\|_2^2
             \leq f(x_k) - \tfrac{1}{2} \alpha \|\nabla f(x_k)\|_2^2.

Summing this inequality over k, and assuming f is bounded below, yields the result.
Strong convexity

Now suppose that f is c-strongly convex, which means that

  f(x) \geq f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) + \tfrac{1}{2} c \|x - \bar{x}\|_2^2 for all (x, \bar{x}) \in \mathbb{R}^n \times \mathbb{R}^n.

Important consequences of this are that
- f has a unique global minimizer, call it x_* with f_* := f(x_*), and
- the gradient norm grows with the optimality error in that

  2c (f(x) - f_*) \leq \|\nabla f(x)\|_2^2 for all x \in \mathbb{R}^n.
Strong convexity, lower bound

[figure: at x_k, f is sandwiched between the quadratic lower bound f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2} c \|x - x_k\|_2^2 and the quadratic upper bound f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2} L \|x - x_k\|_2^2]
Gradient descent for strongly convex f

Theorem. If \nabla f is Lipschitz with L > 0, f is c-strongly convex, and \alpha \in (0, 1/L], then

  f(x_{j+1}) - f_* \leq (1 - \alpha c)^{j+1} (f(x_0) - f_*) for all j \in \mathbb{N}.

Proof. Let k \in \mathbb{N}. Following the previous proof, one finds

  f(x_{k+1}) \leq f(x_k) - \tfrac{1}{2} \alpha \|\nabla f(x_k)\|_2^2
             \leq f(x_k) - \alpha c (f(x_k) - f_*).

Subtracting f_* from both sides, one finds

  f(x_{k+1}) - f_* \leq (1 - \alpha c) (f(x_k) - f_*).

Applying the result repeatedly over j \in \{0, \dots, k\} yields the result.
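The geometric rate in the theorem can be observed numerically; here on f(x) = \tfrac{1}{2} x^T Q x with Q = diag(c, L), so f_* = 0 (a toy instance of ours):

```python
import numpy as np

c, L = 1.0, 4.0
Q = np.diag([c, L])
f = lambda x: 0.5 * float(x @ Q @ x)   # c-strongly convex, minimized at x* = 0
alpha = 1.0 / L
x = np.array([1.0, 1.0])
f0 = f(x)
gaps = []
for _ in range(20):
    x = x - alpha * (Q @ x)            # gradient descent step
    gaps.append(f(x))                  # optimality gap f(x_{k+1}) - f*, f* = 0
# each gap should sit below the theorem's bound (1 - alpha*c)^{k+1} * (f(x_0) - f*)
bound_holds = all(gap <= (1.0 - alpha * c) ** (k + 1) * f0 + 1e-12
                  for k, gap in enumerate(gaps))
```

On this instance the bound holds with room to spare; the (1 - \alpha c) factor is a worst-case rate.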
A fundamental iteration when f is nonsmooth?

What is a fundamental iteration for nonsmooth optimization? Steepest descent!

For convex f, the directional derivative of f at x along s is

  f'(x; s) = \max_{g \in \partial f(x)} g^T s.

Along which direction is f decreasing at the fastest rate? The solution of an optimization problem!

  \min_{\|s\|_2 \leq 1} f'(x; s) = \min_{\|s\|_2 \leq 1} \max_{g \in \partial f(x)} g^T s
                                = \max_{g \in \partial f(x)} \min_{\|s\|_2 \leq 1} g^T s (von Neumann minimax theorem)
                                = \max_{g \in \partial f(x)} (-\|g\|_2)
                                = - \min_{g \in \partial f(x)} \|g\|_2 \implies (need minimum-norm subgradient)
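In the simplest nonsmooth case, f = max of two affine pieces, the subdifferential at a kink is the segment conv{g_1, g_2} between the two gradients, and the minimum-norm subgradient has a closed form: project the origin onto that segment. A sketch (helper name ours):

```python
import numpy as np

def min_norm_in_segment(g1, g2):
    # minimize ||g1 + t (g2 - g1)||_2 over t in [0, 1]; the unconstrained
    # minimizer is t = -g1.(g2 - g1) / ||g2 - g1||^2, then clamp to [0, 1]
    d = g2 - g1
    denom = float(d @ d)
    t = 0.0 if denom == 0.0 else min(1.0, max(0.0, -float(g1 @ d) / denom))
    return g1 + t * d
```

The steepest descent direction is the negative of this minimum-norm element (normalized); if the segment contains the origin, the point is stationary.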
Main challenge

But, typically, we can only access some g \in \partial f(x), not all of \partial f(x).

I would argue: there is no practical fundamental iteration for general nonsmooth optimization (no computable descent direction that vanishes near a minimizer).

What are our options? There are a few ways to design a convergent algorithm:
- algorithmically (e.g., subgradient method)
- iteratively (e.g., cutting plane / bundle methods)
- randomly (e.g., gradient sampling)
Subgradient method

Algorithm: Subgradient method (not descent)
  1: Choose an initial point x_0 \in \mathbb{R}^n.
  2: for k = 0, 1, 2, ... do
  3:   if a termination condition is satisfied, then return x_k
  4:   else compute g_k \in \partial f(x_k), choose \alpha_k \in \mathbb{R}_{>0}, and set x_{k+1} \leftarrow x_k - \alpha_k g_k
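A minimal implementation with the diminishing stepsize \alpha_k = 1/(k+1), a standard choice for convex f (the algorithm above leaves the stepsize rule open); since the iteration is not a descent method, we track the best iterate seen. The instance and names are ours:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, iters=5000):
    # x_{k+1} = x_k - alpha_k g_k with g_k in the subdifferential at x_k
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(iters):
        x = x - (1.0 / (k + 1)) * np.asarray(subgrad(x))
        if f(x) < f_best:              # not monotone, so remember the best
            f_best, x_best = f(x), x.copy()
    return x_best

# f(x) = ||x||_1, with sign(x) a valid subgradient everywhere
f = lambda x: float(np.sum(np.abs(x)))
x_best = subgradient_method(f, np.sign, [3.0, -2.0])
```

The iterates oscillate around the minimizer with amplitude on the order of the current stepsize, which is why a diminishing (but nonsummable) stepsize and best-iterate tracking are the standard workarounds.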