Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization
(Part I: Primal Methods)
Peter Richtarik and Mark Schmidt
International Conference on Machine Learning, July 2015
Context: Big Data and Big Models

We are collecting data at unprecedented rates.
- Seen across many fields of science and engineering.
- Not gigabytes, but terabytes or petabytes (and beyond).

Machine learning can use big data to fit richer models:
- Bioinformatics.
- Computer vision.
- Speech recognition.
- Product recommendation.
- Machine translation.
Common Framework: Empirical Risk Minimization

The most common framework is empirical risk minimization:

    min_{x ∈ R^P} (1/N) ∑_{i=1}^N L(x, a_i, b_i) + λ r(x)

(data-fitting term + regularizer). We have N observations a_i (and possibly labels b_i), and we want to find the optimal parameters x*.

Examples range from squared error with 2-norm regularization,

    min_{x ∈ R^P} (1/N) ∑_{i=1}^N (1/2)(a_i^T x − b_i)^2 + (λ/2)‖x‖^2,

to conditional random fields and deep neural networks.

Main practical challenges:
- Designing/learning good features a_i.
- Efficiently solving the problem when N or P are very large.
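To make the squared-error example concrete, here is a minimal NumPy sketch of that objective and its gradient (the names A, b, lam and the helper functions are my own, not from the tutorial):

```python
import numpy as np

def ls_objective(x, A, b, lam):
    """f(x) = (1/N) * sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2."""
    N = A.shape[0]
    r = A @ x - b                                  # residuals a_i^T x - b_i
    return 0.5 * np.dot(r, r) / N + 0.5 * lam * np.dot(x, x)

def ls_gradient(x, A, b, lam):
    """grad f(x) = (1/N) * A^T (A x - b) + lam * x."""
    return A.T @ (A @ x - b) / A.shape[0] + lam * x
```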
Motivation: Why Learn about Convex Optimization?

Why learn about large-scale optimization?
- Optimization is at the core of many ML algorithms.
- We can't solve huge problems with traditional techniques.

Why, in particular, learn about convex optimization?
- Convex problems are among the only continuous problems we can solve efficiently.
- You can do a lot with convex models. (least squares, lasso, generalized linear models, SVMs, CRFs, etc.)
- Empirically effective non-convex methods are often based on methods with good properties for convex objectives. (functions are locally convex around minimizers)
- Tools from convex analysis are being extended to the non-convex setting.
How hard is real-valued optimization?

How long does it take to find an ε-optimal minimizer of a real-valued function

    min_{x ∈ R^n} f(x)?

For a general function: impossible! We need to make some assumptions about the function. Assume f is Lipschitz-continuous (it cannot change too quickly):

    |f(x) − f(y)| ≤ L‖x − y‖.

Even then, after t iterations the error of any algorithm is Ω(1/t^{1/n}) (and grid search is nearly optimal). Optimization is hard, but assumptions make a big difference: we went from impossible to very slow.
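To make the grid-search claim concrete in one dimension, here is an illustrative sketch (the interval [0, 1] and the test function are my own choices): for an L-Lipschitz f, evaluating at t evenly spaced points leaves every point of the interval within 1/(2(t−1)) of a grid point, so the best grid value is within L/(2(t−1)) of the minimum; in n dimensions the same argument needs a grid of t^n points, which is the source of the 1/t^{1/n} behaviour.

```python
import numpy as np

def grid_search_1d(f, t):
    """Evaluate f at t evenly spaced points in [0, 1]; return the best point.

    For an L-Lipschitz f, the result is eps-optimal with eps = L / (2*(t-1)).
    """
    xs = np.linspace(0.0, 1.0, t)
    vals = np.array([f(x) for x in xs])
    i = np.argmin(vals)
    return xs[i], vals[i]

# Example: f(x) = |x - 0.3| is 1-Lipschitz, so the error shrinks like 1/t.
x_best, f_best = grid_search_1d(lambda x: abs(x - 0.3), t=100)
```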
Convex Functions: Three Characterizations

A function f is convex if for all x and y we have

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), for θ ∈ [0, 1].

The function lies below the linear interpolation between x and y. This implies that all local minima are global minima.

[Figure: f(0.5x + 0.5y) lies below 0.5f(x) + 0.5f(y) for a convex function; a non-convex function can have non-global local minima.]
A differentiable function f is convex if for all x and y we have

    f(y) ≥ f(x) + ∇f(x)^T (y − x).

The function lies globally above its tangent at x. In particular, if ∇f(y) = 0 then y is a global minimizer.

[Figure: the tangent line f(x) + ∇f(x)^T(y − x) lies below f(y) everywhere.]
A twice-differentiable function f is convex if for all x we have

    ∇²f(x) ⪰ 0,

i.e., all eigenvalues of the Hessian are non-negative: the function is flat or curved upwards in every direction. This is usually the easiest way to show a function is convex.
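For example, the regularized least-squares objective above has Hessian (1/N) A^T A + λI, so the criterion is easy to verify numerically (an illustrative sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
lam = 0.1

H = A.T @ A / A.shape[0] + lam * np.eye(5)   # Hessian of the least-squares objective
eigs = np.linalg.eigvalsh(H)                 # eigenvalues of a symmetric matrix
print(eigs.min() >= 0)                       # True: all eigenvalues are >= lam > 0
```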
Examples of Convex Functions

Some simple convex functions:
- f(x) = c
- f(x) = a^T x
- f(x) = ax^2 + b (for a > 0)
- f(x) = exp(ax)
- f(x) = x log x (for x > 0)
- f(x) = ‖x‖^2
- f(x) = ‖x‖_p
- f(x) = max_i x_i

Some other notable examples:
- f(x, y) = log(e^x + e^y)
- f(X) = −log det X (for X positive-definite)
- f(x, Y) = x^T Y^{−1} x (for Y positive-definite)
Operations that Preserve Convexity

1. Non-negative weighted sum: f(x) = θ_1 f_1(x) + θ_2 f_2(x).
2. Composition with an affine mapping: g(x) = f(Ax + b).
3. Pointwise maximum: f(x) = max_i f_i(x).

Example: least-residual problems are convex for any ℓ_p-norm,

    f(x) = ‖Ax − b‖_p,

since ‖·‖_p is a (convex) norm and convexity is preserved under affine composition (2).
Example: support vector machines are convex,

    f(x) = (1/2)‖x‖^2 + C ∑_{i=1}^n max{0, 1 − b_i a_i^T x}.

The first term has Hessian I ⪰ 0; for the second term, use the pointwise maximum (3) on the two (convex) arguments, then use the non-negative weighted sum (1) to put it all together.
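A minimal NumPy sketch of this objective, built exactly from those convexity-preserving pieces (the names A, b, C are mine):

```python
import numpy as np

def svm_objective(x, A, b, C):
    """f(x) = 0.5*||x||^2 + C * sum_i max(0, 1 - b_i * a_i^T x)."""
    margins = 1.0 - b * (A @ x)           # affine in x, so each term stays convex
    hinge = np.maximum(0.0, margins)      # pointwise maximum with 0
    return 0.5 * np.dot(x, x) + C * hinge.sum()
```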
Outline

1. Motivation
2. Gradient Method
3. Stochastic Subgradient
4. Finite-Sum Methods
5. Non-Smooth Objectives
Motivation for Gradient Methods

We can solve convex optimization problems in polynomial time by interior-point methods. But these solvers require O(P^2) or worse cost per iteration, which is infeasible for applications where P may be in the billions.

Large-scale problems have renewed interest in gradient methods:

    x^{t+1} = x^t − α_t ∇f(x^t).

These have only O(P) iteration cost! But how many iterations are needed?
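The iteration itself is a few lines; a bare-bones sketch (the fixed step size and iteration count are placeholders; suitable choices are discussed next):

```python
import numpy as np

def gradient_descent(grad, x0, alpha, iters):
    """Run x^{t+1} = x^t - alpha * grad(x^t) for a fixed number of iterations."""
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x
```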
Logistic Regression with 2-Norm Regularization

Let's consider logistic regression with 2-norm regularization:

    f(x) = ∑_{i=1}^n log(1 + exp(−b_i x^T a_i)) + (λ/2)‖x‖^2.

The objective f is convex. The first term is Lipschitz-continuous; the second term is not. But we have

    μI ⪯ ∇²f(x) ⪯ LI

for some L and μ (with L ≤ (1/4)‖A‖_2^2 + λ and μ ≥ λ). We say that the gradient is Lipschitz-continuous and that the function is strongly convex.
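A NumPy sketch of this objective and its gradient, with the rows of A holding the a_i (np.logaddexp computes log(1 + exp(·)) stably; the function names are mine):

```python
import numpy as np

def logreg_objective(x, A, b, lam):
    """f(x) = sum_i log(1 + exp(-b_i * a_i^T x)) + (lam/2)*||x||^2."""
    z = b * (A @ x)                              # margins b_i a_i^T x
    return np.logaddexp(0.0, -z).sum() + 0.5 * lam * np.dot(x, x)

def logreg_gradient(x, A, b, lam):
    """grad f(x) = -sum_i sigma(-b_i a_i^T x) * b_i a_i + lam * x."""
    z = b * (A @ x)
    s = 1.0 / (1.0 + np.exp(z))                  # sigma(-z_i)
    return -(A.T @ (s * b)) + lam * x
```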
Properties of Lipschitz-Continuous Gradient

From Taylor's theorem, for some z we have

    f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(z) (y − x).

Using ∇²f(z) ⪯ LI gives

    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖^2,

a global quadratic upper bound on the function value. We obtain a variant of the gradient method if we set x^{t+1} to the minimizing value of y:

    x^{t+1} = x^t − (1/L)∇f(x^t).

Plugging this value in gives

    f(x^{t+1}) ≤ f(x^t) − (1/(2L))‖∇f(x^t)‖^2,

a guaranteed decrease of the objective.
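To fill in the minimization step explicitly (a short derivation, using nothing beyond the bound above):

```latex
\nabla_y \Big[ f(x) + \nabla f(x)^T (y - x) + \tfrac{L}{2}\|y - x\|^2 \Big]
    = \nabla f(x) + L (y - x) = 0
    \quad\Longrightarrow\quad y = x - \tfrac{1}{L} \nabla f(x),

f\big(x^{t+1}\big) \le f(x^t) - \tfrac{1}{L}\|\nabla f(x^t)\|^2
    + \tfrac{L}{2}\cdot\tfrac{1}{L^2}\|\nabla f(x^t)\|^2
    = f(x^t) - \tfrac{1}{2L}\|\nabla f(x^t)\|^2.
```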
Properties of Strong-Convexity

From Taylor's theorem, for some z we have

    f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(z) (y − x).

Using ∇²f(z) ⪰ μI gives

    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (μ/2)‖y − x‖^2,

a global quadratic lower bound on the function value. Minimizing both sides over y gives

    f(x*) ≥ f(x) − (1/(2μ))‖∇f(x)‖^2,

an upper bound on how far we are from the solution.
Linear Convergence of Gradient Descent

We have bounds involving x^{t+1} and x*:

    f(x^{t+1}) ≤ f(x^t) − (1/(2L))‖∇f(x^t)‖^2,    f(x*) ≥ f(x^t) − (1/(2μ))‖∇f(x^t)‖^2.

[Figure: the quadratic upper bound gives the guaranteed progress of a step; the quadratic lower bound caps the maximum suboptimality.]

The second bound gives ‖∇f(x^t)‖^2 ≥ 2μ[f(x^t) − f(x*)]; substituting this into the first and subtracting f(x*) from both sides, we combine them to get

    f(x^{t+1}) − f(x*) ≤ (1 − μ/L)[f(x^t) − f(x*)].

This gives a linear convergence rate:

    f(x^t) − f(x*) ≤ (1 − μ/L)^t [f(x^0) − f(x*)].

Each iteration multiplies the error by a fixed amount. (very fast if μ/L is not too close to one)
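One can watch this rate numerically on the regularized logistic objective, reusing the logreg_gradient sketch above (random data; L taken from the bound (1/4)‖A‖_2^2 + λ stated earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = np.sign(rng.standard_normal(100))
lam = 1.0

L = 0.25 * np.linalg.norm(A, 2) ** 2 + lam   # Lipschitz constant of the gradient
x = np.zeros(10)
for t in range(200):
    x = x - (1.0 / L) * logreg_gradient(x, A, b, lam)
# f(x^t) - f(x*) decreases at least as fast as (1 - lam/L)^t.
```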
Maximum Likelihood Logistic Regression

Consider maximum-likelihood logistic regression:

    f(x) = ∑_{i=1}^n log(1 + exp(−b_i x^T a_i)).

We now only have

    0 ⪯ ∇²f(x) ⪯ LI.

Convexity only gives a linear upper bound on f(x*):

    f(x*) ≤ f(x) + ∇f(x)^T (x* − x).

If some x* exists, we have the sublinear convergence rate

    f(x^t) − f(x*) = O(1/t)

(compare to the slower Ω(1/t^{1/n}) for general Lipschitz functions). If f is convex, then f + λ‖x‖^2 is strongly convex.
Gradient Method: Practical Issues

In practice, searching for the step size (a line search) is usually much faster than using α = 1/L (and doesn't require knowledge of L).

Basic Armijo backtracking line search:
1. Start with a large value of α.
2. Divide α in half until we satisfy (a typical value is γ = 0.0001)

    f(x^{t+1}) ≤ f(x^t) − γα‖∇f(x^t)‖^2.

Practical methods may use Wolfe conditions (so α isn't too small), and/or use interpolation to propose trial step sizes. (with good interpolation, ≈ 1 evaluation of f per iteration)

Also, check your derivative code!

    ∇_i f(x) ≈ (f(x + δe_i) − f(x)) / δ.

For large-scale problems you can check a random direction d:

    ∇f(x)^T d ≈ (f(x + δd) − f(x)) / δ.
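Minimal sketches of both ideas, under the stated assumptions (γ = 1e-4; f and grad are objective/gradient functions like the earlier sketches):

```python
import numpy as np

def backtracking_step(f, grad, x, alpha0=1.0, gamma=1e-4):
    """Armijo backtracking: halve alpha until the sufficient-decrease test holds."""
    g = grad(x)
    gnorm2 = np.dot(g, g)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - gamma * alpha * gnorm2:
        alpha /= 2.0
    return x - alpha * g

def check_gradient(f, grad, x, delta=1e-6, seed=0):
    """Compare grad(x)^T d with a finite difference along a random direction d."""
    d = np.random.default_rng(seed).standard_normal(x.size)
    fd = (f(x + delta * d) - f(x)) / delta
    return np.dot(grad(x), d), fd   # the two numbers should nearly agree
```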
Accelerated Gradient Method

Is this the best algorithm under these assumptions? No:

    Algorithm   Assumptions        Rate
    Gradient    Convex             O(1/t)
    Nesterov    Convex             O(1/t^2)
    Gradient    Strongly-Convex    O((1 − μ/L)^t)
    Nesterov    Strongly-Convex    O((1 − √(μ/L))^t)

Nesterov's accelerated gradient method:

    x_{t+1} = y_t − α_t ∇f(y_t),
    y_{t+1} = x_t + β_t (x_{t+1} − x_t),

for appropriate α_t, β_t.

- The rate is nearly optimal for a dimension-independent algorithm.
- It is similar to heavy-ball/momentum and conjugate gradient methods.
- For logistic regression and many other losses, we can get linear convergence without strong convexity [Luo & Tseng, 1993].
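A sketch for the strongly convex case, written in the equivalent form y_{t+1} = x_{t+1} + β(x_{t+1} − x_t) with the common constant choices α = 1/L and β = (√L − √μ)/(√L + √μ) (these specific choices are standard defaults, not prescribed by the slides):

```python
import numpy as np

def nesterov(grad, x0, L, mu, iters):
    """Accelerated gradient method for an L-smooth, mu-strongly-convex f."""
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x = y = x0.copy()
    for _ in range(iters):
        x_new = y - (1.0 / L) * grad(y)    # gradient step from the extrapolated point
        y = x_new + beta * (x_new - x)     # momentum/extrapolation step
        x = x_new
    return x
```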
The oldest differentiable optimization method is Newton’s.(also called IRLS for functions of the form f (Ax))
Modern form uses the update
x t+1 = x t − αd ,where d is a solution to the system
∇2f (x)d = ∇f (x).(Assumes ∇2f (x) 0)
Equivalent to minimizing the quadratic approximation:
f (y) ≈ f (x) +∇f (x)T (y − x) +1
2α‖y − x‖2
∇2f (x).
(recall that ‖x‖2H = xTHx)
We can generalize the Armijo condition to
f (x t+1) ≤ f (x t) + γα∇f (x t)Td .
Has a natural step length of α = 1.(always accepted when close to a minimizer)
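A sketch of one damped Newton step (hessian is assumed to return ∇²f(x); the linear system is solved directly rather than forming an inverse):

```python
import numpy as np

def newton_step(grad, hessian, x, alpha=1.0):
    """x^{t+1} = x^t - alpha * d, where d solves (grad^2 f(x)) d = grad f(x)."""
    d = np.linalg.solve(hessian(x), grad(x))   # never form the inverse explicitly
    return x - alpha * d
```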
[Figure sequence: from a point x, the gradient step moves to x − α f′(x); Newton's method instead minimizes the local quadratic approximation Q(x) and moves to x − α H^{-1} f′(x).]
Convergence Rate of Newton's Method

If ∇²f(x) is Lipschitz-continuous and ∇²f(x) ⪰ μI, then close to x* Newton's method has local superlinear convergence:

    f(x^{t+1}) − f(x*) ≤ ρ_t [f(x^t) − f(x*)],

with lim_{t→∞} ρ_t = 0. It converges very fast: use it if you can! But it requires solving the system ∇²f(x)d = ∇f(x). Global rates are available under various assumptions (cubic regularization, acceleration, self-concordance).
Newton's Method: Practical Issues

There are many practical variants of Newton's method:
- Modify the Hessian to be positive-definite.
- Only compute the Hessian every m iterations.
- Only use the diagonal of the Hessian.
- Quasi-Newton: update a (diagonal plus low-rank) approximation of the Hessian (BFGS, L-BFGS).
- Hessian-free: compute d inexactly using Hessian-vector products (sketched below):

    ∇²f(x)d = lim_{δ→0} (∇f(x + δd) − ∇f(x)) / δ.

- Barzilai-Borwein: choose a step size that acts like the Hessian over the last iteration (sketched below):

    α = ((x^{t+1} − x^t)^T (∇f(x^{t+1}) − ∇f(x^t))) / ‖∇f(x^{t+1}) − ∇f(x^t)‖^2.

Another related method is nonlinear conjugate gradient.
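Sketches of the Hessian-vector product and Barzilai-Borwein ideas (grad as before; the finite-difference δ is a hypothetical small constant):

```python
import numpy as np

def hessian_vector_product(grad, x, d, delta=1e-6):
    """Approximate (grad^2 f(x)) d by finite differences of the gradient."""
    return (grad(x + delta * d) - grad(x)) / delta

def barzilai_borwein_step(x_new, x_old, g_new, g_old):
    """alpha = s^T y / ||y||^2 with s = x_new - x_old and y = g_new - g_old."""
    s, y = x_new - x_old, g_new - g_old
    return np.dot(s, y) / np.dot(y, y)
```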
Big-N Problems

Recall the regularized empirical risk minimization problem:

    min_{x ∈ R^P} (1/N) ∑_{i=1}^N L(x, a_i, b_i) + λ r(x).

What if the number of training examples N is very large? E.g., ImageNet has more than 14 million annotated images.
Stochastic vs. Deterministic Gradient Methods

We consider minimizing f(x) = (1/N) ∑_{i=1}^N f_i(x).

Deterministic gradient method [Cauchy, 1847]:

    x_{t+1} = x_t − α_t ∇f(x_t) = x_t − (α_t/N) ∑_{i=1}^N ∇f_i(x_t).

- Iteration cost is linear in N.
- Converges with constant α_t or a line search.

Stochastic gradient method [Robbins & Monro, 1951]: random selection of i(t) from {1, 2, ..., N},

    x_{t+1} = x_t − α_t ∇f_{i(t)}(x_t).

This gives an unbiased estimate of the true gradient:

    E[∇f_i(x)] = (1/N) ∑_{i=1}^N ∇f_i(x) = ∇f(x).

- Iteration cost is independent of N.
- Convergence requires α_t → 0.
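A minimal sketch of the stochastic iteration (grad_i(x, i) is assumed to return ∇f_i(x); the decreasing schedule α_t = c/(t+1) is one simple choice satisfying α_t → 0):

```python
import numpy as np

def sgd(grad_i, N, x0, c=1.0, iters=10_000, seed=0):
    """x_{t+1} = x_t - alpha_t * grad f_{i(t)}(x_t), i(t) uniform on {0,...,N-1}."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(iters):
        i = rng.integers(N)                   # one random training example
        x = x - (c / (t + 1)) * grad_i(x, i)  # step size decreases to zero
    return x
```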
Stochastic vs. Deterministic Gradient Methods

Stochastic iterations are N times faster, but how many iterations are needed?

    Assumption   Deterministic        Stochastic
    Convex       O(1/t^2)             O(1/√t)
    Strongly     O((1 − √(μ/L))^t)    O(1/t)

Stochastic methods have a low iteration cost but a slow convergence rate: sublinear even in the strongly convex case. These bounds are unimprovable if only unbiased gradient estimates are available.
Stochastic vs. Deterministic Convergence Rates

[Figure: plot of convergence rates in the strongly convex case, log(excess cost) against time; the deterministic method decreases linearly while the stochastic method decays quickly at first and then flattens out. Goal = best of both worlds: a linear rate with O(1) iteration cost.]

Stochastic methods will be superior in low-accuracy/limited-time situations.
Stochastic vs. Deterministic for Non-Smooth Objectives

Consider the binary support vector machine objective:

    f(x) = ∑_{i=1}^n max{0, 1 − b_i x^T a_i} + (λ/2)‖x‖^2.

Rates for subgradient methods on non-smooth objectives:

    Assumption   Deterministic   Stochastic
    Convex       O(1/√t)         O(1/√t)
    Strongly     O(1/t)          O(1/t)

Other black-box methods (e.g., cutting plane) are not faster. For non-smooth problems:
- Stochastic methods have the same rate as in the smooth case.
- Deterministic methods are not faster than stochastic methods.
- So use the stochastic subgradient method: its iterations are n times faster (see the sketch below).
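A sketch of the stochastic subgradient method on the averaged form (1/n)∑_i max{0, 1 − b_i a_i^T x} + (λ/2)‖x‖^2: when the sampled margin is violated, −b_i a_i is a valid subgradient of the hinge term, and otherwise 0 is; the Pegasos-style step α_t = 1/(λt) is one standard choice, not from the slides.

```python
import numpy as np

def svm_stochastic_subgradient(A, b, lam, iters=100_000, seed=0):
    """Minimize (1/n)*sum_i max(0, 1 - b_i a_i^T x) + (lam/2)*||x||^2."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    for t in range(1, iters + 1):
        i = rng.integers(n)
        g = lam * x                          # subgradient of the regularizer
        if b[i] * (A[i] @ x) < 1.0:          # margin violated: hinge term is active
            g = g - b[i] * A[i]
        x = x - g / (lam * t)                # step size 1/(lam*t)
    return x
```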
Page 121
Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives
Stochastic vs. Deterministic for Non-Smooth
Consider the binary support vector machine objective:
f (x) =n∑
i=1
max0, 1− bi (xTai )+
λ
2‖x‖2.
Rates for subgradient methods for non-smooth objectives:
Assumption Deterministic Stochastic
Convex O(1/√t) O(1/
√t)
Strongly O(1/t) O(1/t)
Other black-box methods (cutting plane) are not faster.
For non-smooth problems:
Stochastic methods have same rate as smooth case.Deterministic methods are not faster than stochastic method.So use stochastic subgradient (iterations are n times faster).
Page 122
Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives
Stochastic vs. Deterministic for Non-Smooth
Consider the binary support vector machine objective:
f (x) =n∑
i=1
max0, 1− bi (xTai )+
λ
2‖x‖2.
Rates for subgradient methods for non-smooth objectives:
Assumption Deterministic Stochastic
Convex O(1/√t) O(1/
√t)
Strongly O(1/t) O(1/t)
Other black-box methods (cutting plane) are not faster.
For non-smooth problems:
Stochastic methods have same rate as smooth case.Deterministic methods are not faster than stochastic method.So use stochastic subgradient (iterations are n times faster).
Page 123
Sub-Gradients and Sub-Differentials

Recall that for differentiable convex functions we have

f(y) ≥ f(x) + ∇f(x)^T(y − x), ∀x, y.

A vector d is a subgradient of a convex function f at x if

f(y) ≥ f(x) + d^T(y − x), ∀y.

[Figure: a convex f(x) with its tangent line f(x) + ∇f(x)^T(y − x) at a differentiable point, and several supporting lines at a non-differentiable kink.]

At a differentiable x: the only subgradient is ∇f(x).
At a non-differentiable x: we have a set of subgradients, called the sub-differential, ∂f(x).
Note that 0 ∈ ∂f(x) iff x is a global minimum.
Sub-Differential of Absolute Value and Max Functions

Sub-differential of the absolute value function:

∂|x| =
  1          x > 0
  −1         x < 0
  [−1, 1]    x = 0

(the sign of the variable if non-zero, anything in [−1, 1] at 0)

Sub-differential of the max function:

∂ max{f1(x), f2(x)} =
  ∇f1(x)                                   f1(x) > f2(x)
  ∇f2(x)                                   f2(x) > f1(x)
  θ∇f1(x) + (1 − θ)∇f2(x), θ ∈ [0, 1]      f1(x) = f2(x)

(any convex combination of the gradients of the argmax)
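These two rules are easy to turn into code; a minimal sketch (mine, not the tutorial's) that returns one valid subgradient in each case:

```python
import numpy as np

def subgrad_abs(x):
    """One valid subgradient of |x|: sign(x) if x != 0, else any value in [-1, 1]."""
    return np.sign(x) if x != 0 else 0.0  # 0.0 is a convenient choice from [-1, 1]

def subgrad_max(f1, g1, f2, g2, x, theta=0.5):
    """One valid subgradient of max{f1, f2} at x, given gradient oracles g1 and g2."""
    if f1(x) > f2(x):
        return g1(x)
    if f2(x) > f1(x):
        return g2(x)
    # At a tie, any theta in [0, 1] gives a valid subgradient.
    return theta * g1(x) + (1 - theta) * g2(x)
```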
Subgradient and Stochastic Subgradient Methods

The basic subgradient method:

x^{t+1} = x^t − αd,

for some d ∈ ∂f(x^t).

The steepest descent choice is given by argmin_{d∈∂f(x)} ‖d‖ (often hard to compute, but easy for ℓ1-regularization).
With other choices of d, the step may increase the objective even for small α.
But ‖x^{t+1} − x*‖ ≤ ‖x^t − x*‖ for small enough α.
For convergence, we require α → 0.

The basic stochastic subgradient method:

x^{t+1} = x^t − αd,

for some d ∈ ∂f_i(x^t) for some random i ∈ {1, 2, …, N}.
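Combining this with the SVM objective from earlier gives a short method; the following sketch (an illustration under assumed data, using the averaged form of the objective) applies the hinge-loss subgradient rule:

```python
import numpy as np

def svm_stochastic_subgradient(A, b, lam, alpha0, iters, rng):
    """Stochastic subgradient method for
    f(x) = (1/N) sum_i max{0, 1 - b_i (a_i^T x)} + (lam/2)*||x||^2."""
    N, P = A.shape
    x = np.zeros(P)
    for t in range(1, iters + 1):
        i = rng.integers(N)
        # Subgradient of the i-th hinge term: -b_i a_i if the margin is violated, else 0.
        hinge = -b[i] * A[i] if b[i] * (A[i] @ x) < 1 else 0.0
        d = hinge + lam * x
        x -= (alpha0 / np.sqrt(t)) * d  # decreasing step sizes, alpha_t -> 0
    return x
```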
Stochastic Subgradient Methods in Practice

The theory says to use the decreasing sequence α_t = 1/(μt):

i_t = rand(1, 2, …, N), α_t = 1/(μt),
x^{t+1} = x^t − α_t f′_{i_t}(x^t).

O(1/t) for smooth objectives.
O(log(t)/t) for non-smooth objectives.

Except for some special cases, you should not do this:
Initial steps are huge: usually μ = O(1/N) or O(1/√N).
Later steps are tiny: 1/t gets small very quickly.
The convergence rate is not robust to mis-specification of μ.
No adaptation to 'easier' problems than the worst case.

Tricks that can improve theoretical and practical properties:
1 Use smaller initial step sizes that go to zero more slowly.
2 Take a (weighted) average of the iterates or gradients:

x̄_t = ∑_{i=1}^t ω_i x^i,  d̄_t = ∑_{i=1}^t δ_i d^i.
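A hedged sketch of trick 2 with uniform weights ω_i = 1/t (a running mean of the iterates; the gradient oracle and step-size schedules are assumptions):

```python
import numpy as np

def sgd_with_averaging(grad_i, N, step, iters, P, rng):
    """SGD returning both the last iterate and the uniform average of all iterates."""
    x = np.zeros(P)
    x_avg = np.zeros(P)
    for t in range(1, iters + 1):
        x -= step(t) * grad_i(x, rng.integers(N))
        x_avg += (x - x_avg) / t  # running mean of x^1, ..., x^t
    return x, x_avg

# Example schedules: step = lambda t: 1/(mu*t) (the theory), or
# step = lambda t: a / t**0.75 (larger steps that decay more slowly).
```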
Speeding up Stochastic Subgradient Methods

Works that support using larger steps and averaging:

Rakhlin et al. [2011]:
Averaging the later iterates achieves O(1/t) in the non-smooth case.

Nesterov [2007], Xiao [2010]:
Gradient averaging improves constants ('dual averaging').
Finds the non-zero variables with sparse regularizers.

Bach & Moulines [2011]:
α_t = O(1/t^β) for β ∈ (0.5, 1) is more robust than α_t = O(1/t).

Nedic & Bertsekas [2000]:
A constant step size (α_t = α) achieves the rate
E[f(x^t)] − f(x*) ≤ (1 − 2μα)^t (f(x^0) − f(x*)) + O(α).

Polyak & Juditsky [1992]:
In the smooth case, iterate averaging is asymptotically optimal.
It achieves the same rate as an optimal stochastic Newton method.
Stochastic Newton Methods?

Should we use accelerated/Newton-like stochastic methods?
These do not improve the convergence rate. But some positive results exist:

Ghadimi & Lan [2010]:
Acceleration can improve the dependence on L and μ.
Improves performance at the start or if the noise is small.

Duchi et al. [2010]:
Newton-like methods can improve regret bounds.

Bach & Moulines [2013]:
A Newton-like method achieves O(1/t) without strong convexity (under an extra self-concordance assumption).
Outline
1 Motivation
2 Gradient Method
3 Stochastic Subgradient
4 Finite-Sum Methods
5 Non-Smooth Objectives
Big-N Problems

Recall the regularized empirical risk minimization problem:

min_{x∈R^P} (1/N) ∑_{i=1}^N L(x, a_i, b_i) + λr(x)
(data fitting term + regularizer)

Stochastic methods:
O(1/t) convergence, but require only 1 gradient per iteration.
The rates are unimprovable for general stochastic objectives.

Deterministic methods:
O(ρ^t) convergence, but require N gradients per iteration.
The faster rate is possible because N is finite.

For minimizing finite sums, can we design a better method?
Motivation for Hybrid Methods

[Figure: log(excess cost) vs. time. Goal: best of both worlds — a linear rate with an O(1) iteration cost, sketched as a hybrid curve that follows the stochastic method early and the deterministic (linear-rate) method later.]
Hybrid Deterministic-Stochastic

Approach 1: control the sample size.

The FG method uses all N gradients,

∇f(x^t) = (1/N) ∑_{i=1}^N ∇f_i(x^t).

The SG method approximates it with 1 sample,

∇f_{i_t}(x^t) ≈ (1/N) ∑_{i=1}^N ∇f_i(x^t).

A common variant is to use a larger sample B^t,

(1/|B^t|) ∑_{i∈B^t} ∇f_i(x^t) ≈ (1/N) ∑_{i=1}^N ∇f_i(x^t).
Approach 1: Batching

The SG method with a sample B^t uses the iterations

x^{t+1} = x^t − (α_t/|B^t|) ∑_{i∈B^t} f′_i(x^t).

For a fixed sample size |B^t|, the rate is sublinear.
The gradient error decreases as the sample size |B^t| increases.
It is common to gradually increase the sample size |B^t| [Bertsekas & Tsitsiklis, 1996].
We can choose |B^t| to achieve a linear convergence rate (a sketch follows below):
Early iterations are cheap like SG iterations.
Later iterations can use a Newton-like method.
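One way to realize this (an illustrative sketch, not the tutorial's code) is to grow the batch geometrically until it covers the full dataset:

```python
import numpy as np

def growing_batch_sg(grad_i, N, alpha, iters, P, rng, growth=1.1):
    """Mini-batch SG whose batch size grows geometrically, capped at N."""
    x = np.zeros(P)
    size = 1.0
    for _ in range(iters):
        m = min(N, int(size))
        batch = rng.choice(N, size=m, replace=False)   # the sample B^t
        g = sum(grad_i(x, i) for i in batch) / m       # averaged mini-batch gradient
        x -= alpha * g
        size *= growth  # geometric growth is one schedule compatible with a linear rate
    return x
```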
Evaluation on Chain-Structured CRFs

Results on a chain-structured conditional random field:

[Figure: objective minus optimal vs. passes through the data, comparing Stochastic (step sizes 1e−01, 1e−02, 1e−03), Hybrid, and Deterministic methods.]
Stochastic Average Gradient

Growing |B^t| eventually requires an O(N) iteration cost.
Can we have a rate of O(ρ^t) with only 1 gradient evaluation per iteration?

YES! The stochastic average gradient (SAG) algorithm [Le Roux et al., 2012]:
Randomly select i_t from {1, 2, …, N} and compute f′_{i_t}(x^t), then

x^{t+1} = x^t − (α_t/N) ∑_{i=1}^N y^t_i.

Memory: y^t_i = ∇f_i(x^t) from the last iteration t where i was selected.

This is a stochastic variant of the incremental average gradient (IAG) method [Blatt et al., 2007].
It assumes the gradients of the non-selected examples don't change; the assumption becomes accurate as ‖x^{t+1} − x^t‖ → 0.
Convergence Rate of SAG

If each f′_i is L-continuous and f is strongly-convex, then with α_t = 1/(16L) SAG has

E[f(x^t) − f(x*)] ≤ (1 − min{μ/(16L), 1/(8N)})^t C,

where

C = [f(x^0) − f(x*)] + (4L/N)‖x^0 − x*‖² + σ²/(16L).

A linear convergence rate, but only 1 gradient per iteration.
For well-conditioned problems, a constant reduction per pass:

(1 − 1/(8N))^N ≤ exp(−1/8) = 0.8825.

For ill-conditioned problems, almost the same rate as the deterministic method (but N times faster).
Rate of Convergence Comparison

Assume that N = 700000, L = 0.25, μ = 1/N:

Gradient method has rate ((L − μ)/(L + μ))² = 0.99998.
Accelerated gradient method has rate (1 − √(μ/L)) = 0.99761.
SAG (N iterations) has rate (1 − min{μ/(16L), 1/(8N)})^N = 0.88250.
Fastest possible first-order method: ((√L − √μ)/(√L + √μ))² = 0.99048.

SAG beats two lower bounds:
The stochastic gradient bound (of O(1/t)).
The deterministic gradient bound (for typical L, μ, and N).

Number of f′_i evaluations to reach ε:
Stochastic: O((L/μ)(1/ε)).
Gradient: O(N(L/μ) log(1/ε)).
Accelerated: O(N√(L/μ) log(1/ε)).
SAG: O(max{N, L/μ} log(1/ε)).
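The four rates above are simple to verify numerically (a quick check, using only the stated constants):

```python
import numpy as np

N, L = 700000, 0.25
mu = 1.0 / N

print(((L - mu) / (L + mu)) ** 2)                  # gradient:          ~0.99998
print(1 - np.sqrt(mu / L))                         # accelerated:       ~0.99761
print((1 - min(mu / (16 * L), 1 / (8 * N))) ** N)  # SAG, N steps:      ~0.88250
sL, smu = np.sqrt(L), np.sqrt(mu)
print(((sL - smu) / (sL + smu)) ** 2)              # first-order bound: ~0.99048
```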
Comparing Deterministic and Stochastic Methods

quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236):

[Figure: objective minus optimum vs. effective passes on both datasets, comparing AFG, L−BFGS, SG, ASG, and IAG.]
SAG Compared to FG and SG Methods

quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236):

[Figure: objective minus optimum vs. effective passes on both datasets, comparing AFG, L−BFGS, SG, ASG, IAG, and SAG−LS.]
Other Linearly-Convergent Stochastic Methods

Newer stochastic algorithms are now available with linear rates:

Stochastic dual coordinate ascent [Shalev-Shwartz & Zhang, 2013].
Incremental surrogate optimization [Mairal, 2013].
Stochastic variance-reduced gradient (SVRG) [Johnson & Zhang, 2013, Konecny & Richtarik, 2013, Mahdavi et al., 2013, Zhang et al., 2013].
SAGA [Defazio et al., 2014].

SVRG has a much lower memory requirement (later in the talk).
There are also non-smooth extensions (last part of the talk).
SAG Implementation Issues

Basic SAG algorithm (maintaining the running sum d = ∑_i y_i):

  while(1):
    Sample i from {1, 2, …, N}.
    Compute f′_i(x).
    d = d − y_i + f′_i(x).
    y_i = f′_i(x).
    x = x − (α/N) d.

A runnable sketch of this loop follows below. Practical variants of the basic algorithm allow:

Regularization.
Sparse gradients.
Automatic step-size selection.
A termination criterion.
Acceleration [Lin et al., 2015].
Adaptive non-uniform sampling [Schmidt et al., 2013]: sample the gradients that change quickly more often.
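A minimal NumPy version of the loop above for ℓ2-regularized least squares (the data, loss, and step-size choice are illustrative assumptions; the 1/(16L) step follows the earlier rate result):

```python
import numpy as np

def sag(A, b, lam, alpha, iters, rng):
    """Basic SAG for f(x) = (1/N) sum_i [0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2]."""
    N, P = A.shape
    x = np.zeros(P)
    y = np.zeros((N, P))   # y[i] = last gradient evaluated for example i
    d = np.zeros(P)        # d = sum_i y[i], maintained incrementally
    for _ in range(iters):
        i = rng.integers(N)
        g = (A[i] @ x - b[i]) * A[i] + lam * x  # f'_i(x)
        d += g - y[i]
        y[i] = g
        x -= (alpha / N) * d
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 10)); b = A @ rng.standard_normal(10)
L = (A ** 2).sum(axis=1).max() + 1e-3        # Lipschitz constant of the f'_i
x = sag(A, b, lam=1e-3, alpha=1 / (16 * L), iters=20000, rng=rng)
```

Note that storing y costs O(NP) memory, the issue addressed below.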
SAG with Adaptive Non-Uniform Sampling

protein (n = 145751, p = 74) and sido (n = 12678, p = 4932):

[Figure: objective minus optimum vs. effective passes on both datasets, comparing IAG, AGD, L−BFGS, SG−C, ASG−C, PCD−L, DCA, and SAG.]

These are the datasets where SAG had the worst relative performance.
SAG with Non-Uniform Sampling

protein (n = 145751, p = 74) and sido (n = 12678, p = 4932):

[Figure: the same comparison with SAG−LS (Lipschitz) added; objective minus optimum vs. effective passes.]

Lipschitz sampling helps a lot.
Minimizing Finite Sums: Dealing with the Memory

A major disadvantage of SAG is the memory requirement. There are several ways to avoid this:

Use mini-batches (only store the gradient of the mini-batch).
Use structure in the objective:
  For f_i(x) = L(a_i^T x), we only need to store the N values of a_i^T x (see the sketch below).
  For CRFs, we only need to store the marginals of the parts.

[Figure: objective minus optimal vs. passes on optical character and named-entity recognition tasks, comparing L−BFGS, Pegasos, SG, AdaGrad, ASG, Hybrid, SAG, SAG−NUS, SAG−NUS*, OEG, and SMD.]

If the above don't work, use SVRG...
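The linear-model trick works because ∇f_i(x) = L′(a_i^T x) a_i, so one stored scalar per example reconstructs the full P-dimensional memory vector. A hedged sketch (the oracle name loss_grad is an assumption):

```python
import numpy as np

def sag_linear_memory(A, b, alpha, iters, rng, loss_grad):
    """SAG for f_i(x) = L(a_i^T x, b_i), storing one scalar per example instead of a P-vector.
    loss_grad(z, b) returns L'(z), e.g. z - b for the squared loss."""
    N, P = A.shape
    x = np.zeros(P)
    s = np.zeros(N)        # s[i] = L'(a_i^T x) from example i's last visit
    d = np.zeros(P)        # d = sum_i s[i] * A[i], maintained incrementally
    for _ in range(iters):
        i = rng.integers(N)
        s_new = loss_grad(A[i] @ x, b[i])
        d += (s_new - s[i]) * A[i]
        s[i] = s_new
        x -= (alpha / N) * d
    return x
```

This needs O(N) memory for s instead of the O(NP) needed to store full gradients.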
Stochastic Variance-Reduced Gradient

SVRG algorithm:

  Start with x_0.
  for s = 0, 1, 2, …
    d_s = (1/N) ∑_{i=1}^N f′_i(x_s)
    x^0 = x_s
    for t = 1, 2, …, m
      Randomly pick i_t ∈ {1, 2, …, N}
      x^t = x^{t−1} − α_t (f′_{i_t}(x^{t−1}) − f′_{i_t}(x_s) + d_s)
    x_{s+1} = x^t for a random t ∈ {1, 2, …, m}.

Requires 2 gradients per iteration and occasional full passes, but only requires storing d_s and x_s.
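A direct transcription into Python (the gradient oracle and parameters are assumptions):

```python
import numpy as np

def svrg(grad_i, N, P, alpha, m, epochs, rng):
    """SVRG: a full gradient at each snapshot, variance-reduced inner steps."""
    x_s = np.zeros(P)
    for _ in range(epochs):
        d_s = sum(grad_i(x_s, i) for i in range(N)) / N  # full pass at the snapshot
        x = x_s.copy()
        keep = rng.integers(1, m + 1)  # random inner iterate kept as the next snapshot
        for t in range(1, m + 1):
            i = rng.integers(N)
            x -= alpha * (grad_i(x, i) - grad_i(x_s, i) + d_s)
            if t == keep:
                x_next = x.copy()
        x_s = x_next
    return x_s
```

Only x_s and d_s persist across inner iterations, which is the memory advantage over SAG.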
Outline
1 Motivation
2 Gradient Method
3 Stochastic Subgradient
4 Finite-Sum Methods
5 Non-Smooth Objectives
Motivation: Sparse Regularization

Recall the regularized empirical risk minimization problem:

min_{x∈R^P} (1/N) ∑_{i=1}^N L(x, a_i, b_i) + λr(x)
(data fitting term + regularizer)

Often, the regularizer r is used to encourage a sparsity pattern in x. For example, ℓ1-regularized least squares,

min_x ‖Ax − b‖² + λ‖x‖₁,

regularizes and encourages sparsity in x.

The objective is non-differentiable whenever any x_i = 0.
Subgradient methods are optimal (slow) black-box methods.
Are there faster methods for specific non-smooth problems?
Smoothing Approximations of Non-Smooth Functions

Smoothing: replace the non-smooth f with a smooth f_ε, then apply a fast method for smooth optimization.

Smooth approximation to the absolute value:

|x| ≈ √(x² + ν).

Smooth approximation to the max function:

max{a, b} ≈ log(exp(a) + exp(b)).

Smooth approximation to the hinge loss (quadratic smoothing below a threshold t):

max{0, 1 − x} ≈
  0                              x ≥ 1
  (1 − x)²                       t < x < 1
  (1 − t)² + 2(1 − t)(t − x)     x ≤ t

Generic smoothing strategy: strongly-convex regularization of the convex conjugate [Nesterov, 2005].
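For example, the smoothed absolute value and max are one-liners (a sketch; ν is the smoothing parameter, and the log-sum-exp is shifted for numerical stability):

```python
import numpy as np

def smooth_abs(x, nu=1e-2):
    """|x| ≈ sqrt(x^2 + nu); also returns the derivative x / sqrt(x^2 + nu)."""
    r = np.sqrt(x ** 2 + nu)
    return r, x / r

def smooth_max(a, b):
    """max{a, b} ≈ log(exp(a) + exp(b)), computed stably."""
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))
```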
Discussion of Smoothing Approach

Nesterov [2005] shows that:
The gradient method on the smoothed problem matches the O(1/√t) subgradient rate.
The accelerated gradient method has a faster O(1/t) rate.

In practice:
Slowly decrease the level of smoothing (often difficult to tune).
Use faster algorithms like L-BFGS, SAG, or SVRG.

You can get the O(1/t) rate for min_x max_i{f_i(x)}, for f_i convex and smooth, using the mirror-prox method [Nemirovski, 2004]. See also Chambolle & Pock [2010].
Converting to Constrained Optimization

Re-write the non-smooth problem as a constrained problem. The problem

min_x f(x) + λ‖x‖₁

is equivalent to the problem

min_{x⁺≥0, x⁻≥0} f(x⁺ − x⁻) + λ ∑_i (x⁺_i + x⁻_i),

or the problems

min_{−y≤x≤y} f(x) + λ ∑_i y_i,   min_{‖x‖₁≤γ} f(x) + λγ.

These are smooth objectives with 'simple' constraints:

min_{x∈C} f(x).
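To make the first reformulation concrete, here is a sketch of its smooth gradient (grad_f is an assumed oracle for ∇f; the feasible set is x⁺, x⁻ ≥ 0):

```python
import numpy as np

def split_gradient(grad_f, lam):
    """Gradient of g(xp, xm) = f(xp - xm) + lam * sum(xp + xm) over xp, xm >= 0."""
    def grad(xp, xm):
        g = grad_f(xp - xm)
        return g + lam, -g + lam  # partial gradients w.r.t. x+ and x-
    return grad
```

Any bound-constrained solver (such as the projected gradient method below) can then be applied, recovering x = x⁺ − x⁻.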
Optimization with Simple Constraints

Recall: gradient descent minimizes a quadratic approximation:

x^{t+1} = argmin_y { f(x^t) + ∇f(x^t)^T(y − x^t) + (1/2α_t)‖y − x^t‖² }.

Consider minimizing subject to simple constraints:

x^{t+1} = argmin_{y∈C} { f(x^t) + ∇f(x^t)^T(y − x^t) + (1/2α_t)‖y − x^t‖² }.

This is equivalent to projecting the gradient descent step:

x^t_{GD} = x^t − α_t ∇f(x^t),
x^{t+1} = argmin_{y∈C} ‖y − x^t_{GD}‖.
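Written as code, the method is two lines per iteration; a sketch with an assumed box constraint:

```python
import numpy as np

def projected_gradient(grad, project, x0, alpha, iters):
    """Projected gradient: take a gradient step, then project back onto C."""
    x = x0
    for _ in range(iters):
        x = project(x - alpha * grad(x))
    return x

# Example: minimize ||x - c||^2 over the box [0, 1]^3.
c = np.array([1.5, -0.3, 0.7])
x = projected_gradient(grad=lambda x: 2 * (x - c),
                       project=lambda z: np.clip(z, 0.0, 1.0),
                       x0=np.zeros(3), alpha=0.25, iters=100)
# x is approximately [1.0, 0.0, 0.7], i.e. c clipped to the box.
```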
Gradient Projection

[Figure: contours of f(x) over (x₁, x₂) with a feasible set; the gradient step x − αf′(x) leaves the feasible set, and projecting it back onto the set gives the next iterate x⁺.]
Discussion of Projected Gradient

Projected gradient has the same rate as the gradient method!
Can do many of the same tricks (e.g., line-search, acceleration, Barzilai-Borwein, SAG, SVRG).
For projected Newton, you need to do an expensive projection under ‖·‖_{H_t}.
Two-metric projection methods allow a Newton-like strategy for bound constraints.
Inexact Newton methods allow a Newton-like strategy for optimizing costly functions with simple constraints.
Projection Onto Simple Sets

Projections onto simple sets:
    argmin_{y ≥ 0} ‖y − x‖ = max{x, 0} (element-wise).
    argmin_{l ≤ y ≤ u} ‖y − x‖ = max{l, min{x, u}}.
    argmin_{a^T y = b} ‖y − x‖ = x + (b − a^T x) a/‖a‖².
    argmin_{a^T y ≥ b} ‖y − x‖ = x if a^T x ≥ b, and x + (b − a^T x) a/‖a‖² otherwise.
    argmin_{‖y‖ ≤ τ} ‖y − x‖ = τ x/‖x‖ if ‖x‖ > τ (and x otherwise).
Linear-time algorithm for the ℓ1-norm ball ‖y‖_1 ≤ τ.
Linear-time algorithm for the probability simplex y ≥ 0, ∑_i y_i = 1.
Intersection of simple sets: Dykstra's algorithm.
We can solve large instances of problems with these constraints.
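For reference, these closed-form projections translate directly into a few lines of numpy (a sketch; the function names are ours, and proj_simplex uses the simpler O(n log n) sort-based variant rather than the linear-time algorithm mentioned above):

```python
import numpy as np

def proj_nonneg(x):               # argmin_{y >= 0} ||y - x||
    return np.maximum(x, 0)

def proj_box(x, l, u):            # argmin_{l <= y <= u} ||y - x||
    return np.clip(x, l, u)

def proj_hyperplane(x, a, b):     # argmin_{a^T y = b} ||y - x||
    return x + (b - a @ x) * a / (a @ a)

def proj_halfspace(x, a, b):      # argmin_{a^T y >= b} ||y - x||
    return x if a @ x >= b else proj_hyperplane(x, a, b)

def proj_2norm_ball(x, tau):      # argmin_{||y|| <= tau} ||y - x||
    nrm = np.linalg.norm(x)
    return x if nrm <= tau else tau * x / nrm

def proj_simplex(x):              # argmin_{y >= 0, sum(y) = 1} ||y - x||
    u = np.sort(x)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u > cssv / np.arange(1, x.size + 1))[0][-1]
    return np.maximum(x - cssv[rho] / (rho + 1.0), 0)
```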
Proximal-Gradient Method

A generalization of projected gradient is proximal gradient.
The proximal-gradient method addresses problems of the form
    min_x f(x) + r(x),
where f is smooth but r is a general convex function.
It applies the proximity operator of r to the gradient-descent step on f:
    x^t_GD = x^t − α_t ∇f(x^t),
    x^{t+1} = argmin_y { (1/2) ‖y − x^t_GD‖² + α_t r(y) }.
Equivalent to using the approximation
    x^{t+1} = argmin_y { f(x^t) + ∇f(x^t)^T (y − x^t) + (1/(2α_t)) ‖y − x^t‖² + r(y) }.
Convergence rates are still the same as for minimizing f alone.
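The method is a handful of lines given the two problem-specific pieces; grad_f and prox_r below are assumed callables, not part of the slides:

```python
import numpy as np

def proximal_gradient(grad_f, prox_r, x0, alpha, iters=500):
    """Proximal gradient for min f(x) + r(x): gradient step on f,
    then the prox of r. prox_r(y, t) must return
    argmin_x { r(x) + (1/(2t)) * ||x - y||^2 }."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        x = prox_r(x - alpha * grad_f(x), alpha)
    return x
```

With prox_r equal to projection onto C this recovers projected gradient (the indicator-function view below); with the soft-threshold of the next slide it gives iterative soft-thresholding.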
Proximal Operator, Iterative Soft Thresholding

The proximal operator is the solution to
    prox_r[y] = argmin_{x ∈ R^P} { r(x) + (1/2) ‖x − y‖² }.
For ℓ1-regularization, we obtain iterative soft-thresholding:
    x^{t+1} = softThresh_{αλ}[x^t − α ∇f(x^t)].
Example with λ = 1:

    Input      Threshold   Soft-Threshold
     0.6715     0            0
    −1.2075    −1.2075      −0.2075
     0.7172     0            0
     1.6302     1.6302       0.6302
     0.4889     0            0
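The soft-threshold operator is one line of numpy; applying it to the example input reproduces the last column of the table:

```python
import numpy as np

def soft_threshold(y, t):
    """prox of t*||.||_1: shrink every entry toward zero by t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0)

y = np.array([0.6715, -1.2075, 0.7172, 1.6302, 0.4889])
print(soft_threshold(y, 1.0))  # [ 0.     -0.2075  0.      0.6302  0.    ]
```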
Special case of Projected-Gradient Methods

Projected-gradient methods are another special case:
    r(x) = 0 if x ∈ C, ∞ if x ∉ C,
gives
    x^{t+1} = project_C[x^t − α ∇f(x^t)].

[Figure: the same gradient-projection picture as before; the step x − α f′(x) leaves the feasible set and is projected back to x⁺.]
Exact Proximal-Gradient Methods

For what problems can we apply these methods?
We can efficiently compute the proximity operator for:
1. ℓ1-regularization.
2. Group ℓ1-regularization.
3. Lower and upper bounds.
4. Small numbers of linear constraints.
5. Probability constraints.
6. A few other simple regularizers/constraints.
We can solve these non-smooth/constrained problems as fast as smooth/unconstrained problems!
We can again do many of the same tricks (line-search, acceleration, Barzilai-Borwein, two-metric projection, inexact proximal operators, SAG, SVRG).
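As one more example from the list, the proximity operator of the group ℓ1-norm λ ∑_g ‖x_g‖ shrinks each group's norm as a unit, directly generalizing soft-thresholding (a sketch; the disjoint group partition is an assumption of the example):

```python
import numpy as np

def group_soft_threshold(y, groups, t):
    """prox of t * sum_g ||y_g||_2, where `groups` is a disjoint partition
    of the coordinates (each element is an array of indices)."""
    x = np.zeros_like(y)
    for g in groups:
        nrm = np.linalg.norm(y[g])
        if nrm > t:
            x[g] = (1 - t / nrm) * y[g]  # shrink the whole group toward zero
    return x
```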
Alternating Direction Method of Multipliers

The alternating direction method of multipliers (ADMM) solves:
    min_{Ax + By = c} f(x) + r(y).
Alternate between prox-like operators with respect to f and r.
Can introduce constraints to convert problems to this form:
    min_x f(Ax) + r(x)  ⇔  min_{x = Ay} f(x) + r(y),
    min_x f(x) + r(Bx)  ⇔  min_{y = Bx} f(x) + r(y).
If the prox cannot be computed exactly: linearized ADMM.
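As a concrete instance (not from the slides), ADMM for the lasso splits x from a copy z and alternates a ridge-like solve, a soft-threshold, and a dual update; the penalty parameter rho is an assumed default:

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM sketch for min (1/2)||Ax - b||^2 + lam*||z||_1  s.t.  x = z."""
    N, P = A.shape
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(P))   # factor once, reuse
    x, z, u = np.zeros(P), np.zeros(P), np.zeros(P)
    for _ in range(iters):
        # x-update: prox-like step on f (a ridge-regression solve)
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: prox of r (soft-threshold)
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0)
        # dual update for the constraint x = z
        u = u + x - z
    return z
```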
Frank-Wolfe Method

In some cases the projected-gradient step
    x^{t+1} = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)^T (y − x^t) + (1/(2α_t)) ‖y − x^t‖² },
may be hard to compute (e.g., the dual of max-margin Markov networks).
The Frank-Wolfe method simply drops the quadratic term:
    y^t = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)^T (y − x^t) },
then takes a convex combination of x^t and y^t; it requires C to be compact.
The iterate can be written as a convex combination of vertices of C.
O(1/t) rate for smooth convex objectives, with some linear convergence results for smooth and strongly-convex problems. [Jaggi, 2013]
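Over the ℓ1-ball ‖x‖_1 ≤ τ the linear subproblem is solved by a single signed, scaled coordinate vector, giving this sketch (the step size γ_t = 2/(t + 2) is the standard default):

```python
import numpy as np

def frank_wolfe_l1(grad_f, tau, P, iters=200):
    """Frank-Wolfe for min f(x) s.t. ||x||_1 <= tau. The linear oracle
    over the l1-ball returns +/- tau*e_i at the coordinate where the
    gradient is largest in magnitude."""
    x = np.zeros(P)
    for t in range(iters):
        g = grad_f(x)
        i = np.argmax(np.abs(g))
        y = np.zeros(P)
        y[i] = -tau * np.sign(g[i])       # vertex minimizing g^T y over C
        gamma = 2.0 / (t + 2.0)
        x = (1 - gamma) * x + gamma * y   # convex combination of x^t and y^t
    return x
```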
Alternatives to Quadratic/Linear Surrogates

Mirror descent uses the iterations [Beck & Teboulle, 2003]
    x^{t+1} = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)^T (y − x^t) + (1/(2α_t)) D(x^t, y) },
where D is a Bregman divergence:
    D = ‖x^t − y‖² (gradient method).
    D = ‖x^t − y‖²_H (Newton's method).
    D = ∑_i x^t_i log(x^t_i / y_i) − ∑_i (x^t_i − y_i) (exponentiated gradient).
Mairal [2013, 2014] considers general surrogate optimization:
    x^{t+1} = argmin_{y ∈ C} g(y),
where g upper-bounds f, g(x^t) = f(x^t), ∇g(x^t) = ∇f(x^t), and ∇g − ∇f is Lipschitz-continuous.
Get O(1/k) and linear convergence rates depending on g − f.
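When C is the probability simplex and D is the KL divergence above, the mirror-descent update has a closed form: a multiplicative update followed by normalization (a sketch):

```python
import numpy as np

def exponentiated_gradient(grad_f, x0, alpha, iters=200):
    """Mirror descent on the probability simplex with the KL Bregman
    divergence (exponentiated gradient). x0 must lie in the simplex, > 0."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        x = x * np.exp(-alpha * grad_f(x))  # multiplicative update
        x /= x.sum()                        # renormalize onto the simplex
    return x
```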
Dual Methods

Strongly-convex problems have smooth duals.
Solve the dual instead of the primal.
SVM non-smooth strongly-convex primal:
    min_x C ∑_{i=1}^N max{0, 1 − b_i a_i^T x} + (1/2) ‖x‖².
SVM smooth dual:
    min_{0 ≤ α ≤ C} (1/2) α^T A A^T α − ∑_{i=1}^N α_i.
This is a smooth bound-constrained problem:
    Two-metric projection (efficient Newton-like method).
    Randomized coordinate descent (part 2 of this talk).
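Since the dual is a smooth quadratic over a box, even plain projected gradient applies directly (a sketch; following the slide's notation, the rows of A are assumed to be b_i * a_i):

```python
import numpy as np

def svm_dual_projected_gradient(A, C, iters=500):
    """Projected gradient on min_{0 <= alpha <= C} (1/2) a^T A A^T a - sum(a)."""
    K = A @ A.T
    step = 1.0 / np.linalg.norm(K, 2)            # 1/L for the quadratic dual
    alpha = np.zeros(A.shape[0])
    for _ in range(iters):
        g = K @ alpha - 1.0                      # gradient of the dual objective
        alpha = np.clip(alpha - step * g, 0, C)  # project onto the box [0, C]^N
    return alpha
```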
Summary

Summary:
Part 1: Convex functions have special properties that allow us to efficiently minimize them.
Part 2: Gradient-based methods allow elegant scaling with the dimensionality of the problem.
Part 3: Stochastic-gradient methods allow scaling with the number of training examples, at the cost of a slower convergence rate.
Part 4: For finite datasets, SAG fixes the convergence rate of stochastic gradient methods, and SVRG fixes the memory problem of SAG.
Part 5: These building blocks can be extended to solve a variety of constrained and non-smooth problems.