ECS289: Scalable Machine Learning
Cho-Jui Hsieh, UC Davis
Oct 1, 2015
Outline
Convex vs Nonconvex Functions
Coordinate Descent
Gradient Descent
Newton’s method
Stochastic Gradient Descent
Numerical Optimization
Numerical Optimization:
min_X f(X)
Can be applied to computer science, economics, control engineering, operations research, . . .
Machine Learning: find a model that minimizes the prediction error.
Properties of the Function
Smooth function: a function whose derivative is continuous.
Example: ridge regression

argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²
Non-smooth function: Lasso, primal SVM

Lasso: argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁

SVM: argmin_w ∑_{i=1}^n max(0, 1 − y_i w^T x_i) + (λ/2)‖w‖²
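To make the smooth/non-smooth distinction concrete, here is a minimal numpy sketch evaluating the three objectives above (all data here is made up for illustration; `lam` plays the role of λ):

```python
import numpy as np

def ridge_obj(w, X, y, lam):
    # smooth: (1/2)||Xw - y||^2 + (lam/2)||w||^2
    r = X @ w - y
    return 0.5 * (r @ r) + 0.5 * lam * (w @ w)

def lasso_obj(w, X, y, lam):
    # non-smooth: (1/2)||Xw - y||^2 + lam * ||w||_1
    r = X @ w - y
    return 0.5 * (r @ r) + lam * np.abs(w).sum()

def svm_obj(w, X, y, lam):
    # non-smooth: sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2
    margins = 1.0 - y * (X @ w)
    return np.maximum(0.0, margins).sum() + 0.5 * lam * (w @ w)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.sign(rng.standard_normal(5))   # +/-1 labels for the SVM case
w = rng.standard_normal(3)
vals = [ridge_obj(w, X, y, 0.1), lasso_obj(w, X, y, 0.1), svm_obj(w, X, y, 0.1)]
```

The ridge objective is differentiable everywhere; the ℓ₁ norm and the hinge loss each have kinks, which is what makes the latter two problems non-smooth.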
Convex Functions
A function is convex if:
∀x_1, x_2, ∀t ∈ [0, 1]:  f(tx_1 + (1 − t)x_2) ≤ tf(x_1) + (1 − t)f(x_2)
Every local optimum is a global optimum (why?)
Figure from Wikipedia
Convex Functions
If f (x) is twice differentiable, then
f is convex if and only if ∇²f(x) ⪰ 0 for all x
Optimal solution may not be unique: f has a set S of optimal solutions.

Gradient: captures the first-order change of f:

f(x + αd) = f(x) + α∇f(x)^T d + O(α²)

If f is differentiable, we have the following optimality condition:

x* ∈ S if and only if ∇f(x*) = 0
Strongly Convex Functions
f is strongly convex if there exists an m > 0 such that

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖²  ∀x, y
A strongly convex function has a unique global optimum x∗ (why?)
If f is twice differentiable, then
f is strongly convex if and only if ∇²f(x) ⪰ mI ≻ 0 for all x
Gradient descent and coordinate descent will converge linearly (will see later)
Nonconvex Functions
If f is nonconvex, most algorithms can only converge to stationary points.

x is a stationary point if and only if ∇f(x) = 0.

Three types of stationary points:
(1) Global optimum (2) Local optimum (3) Saddle point

Examples: matrix completion, neural networks, . . .

Example: f(x, y) = (1/2)(xy − a)²
Coordinate Descent
Coordinate Descent
Update one variable at a time.

Coordinate Descent: repeatedly perform the following loop:

Step 1: pick an index i
Step 2: compute a step size δ* by (approximately) minimizing argmin_δ f(x + δe_i)
Step 3: x_i ← x_i + δ*
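For ridge regression the Step-2 subproblem is a one-dimensional quadratic, so the exact minimizer δ* = −∇_i f(x)/∇²_ii f(x) has a closed form and the loop above can be sketched directly. All data below is made up for illustration:

```python
import numpy as np

def coord_descent_ridge(X, y, lam, n_cycles=200):
    # cyclic coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                      # maintain the residual Xw - y
    col_sq = (X ** 2).sum(axis=0)      # ||X[:, i]||^2 = diagonal of the Hessian
    for _ in range(n_cycles):
        for i in range(d):                        # Step 1: cyclic index order
            g = X[:, i] @ r + lam * w[i]          # gradient along coordinate i
            delta = -g / (col_sq[i] + lam)        # Step 2: exact 1-D minimizer
            w[i] += delta                         # Step 3
            r += delta * X[:, i]                  # keep the residual in sync
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
lam = 0.5
w_cd = coord_descent_ridge(X, y, lam)
w_star = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)  # closed form
```

Maintaining the residual `r` is what makes each coordinate update cheap: it avoids recomputing Xw from scratch for every i.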
Coordinate Descent (update sequence)
Three types of updating order:

Cyclic: update sequence

x_1, x_2, . . . , x_n (1st outer iteration), x_1, x_2, . . . , x_n (2nd outer iteration), . . .

A more general setting: update each variable at least once within every T steps.
Randomly permuting the sequence for each outer iteration gives faster convergence in practice.

Random: each time pick a random coordinate to update.

Typical way: sample from the uniform distribution.
Sampling from the uniform distribution vs. sampling from a biased distribution:
P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015.
D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015.
Greedy Coordinate Descent
Greedy: choose the most “important” coordinate to update.

How to measure the importance?

By the first derivative: |∇_i f(x)|
By the first and second derivatives: |∇_i f(x)/∇²_ii f(x)|
By the maximum reduction of the objective function:

i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δe_i) )

Need to consider the time complexity of variable selection.

Useful for kernel SVM (see lecture 6)
Extension: block coordinate descent
Variables are divided into blocks X_1, . . . , X_p, where each X_i is a subset of variables and

X_1 ∪ X_2 ∪ · · · ∪ X_p = {1, . . . , n},  X_i ∩ X_j = ∅  ∀i ≠ j
Each time, update one block X_i by (approximately) solving the subproblem within the block.

Example: alternating minimization for matrix completion (2 blocks). (See lecture 7)
Coordinate Descent (convergence)
Converges to an optimum if f(·) is convex and smooth

Has a linear convergence rate if f(·) is strongly convex
Linear convergence: the error f(x^t) − f(x*) decays as

β, β², β³, . . .

for some β < 1.

Local linear convergence: an algorithm converges linearly once ‖x^t − x*‖ ≤ K for some K > 0
Coordinate Descent (nonconvex)
Block coordinate descent with 2 blocks:
converges to stationary points
With > 2 blocks:
converges to stationary points if each subproblem has a unique minimizer.
Coordinate Descent: other names
Alternating minimization (matrix completion)
Iterative scaling (for log-linear models)
Decomposition method (for kernel SVM)
Gauss-Seidel (for linear systems when the matrix is positive definite)
. . .
Gradient Descent
Gradient Descent
Gradient descent algorithm: repeatedly conduct the following update:
x^{t+1} ← x^t − α∇f(x^t)
where α > 0 is the step size.

It is a fixed-point iteration method:

x − α∇f(x) = x if and only if x is an optimal solution

Step size too large ⇒ divergence; too small ⇒ slow convergence
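The update can be sketched on a toy quadratic f(x) = (1/2)x^T Ax − b^T x, whose gradient is Ax − b and whose optimum solves Ax = b. Here L = 2 is the largest eigenvalue of A, and the fixed step α = 1/L converges (the matrix and vector are made-up illustration data):

```python
import numpy as np

def grad_descent(A, b, alpha, n_iter=500):
    # fixed-step gradient descent on f(x) = (1/2) x^T A x - b^T x
    x = np.zeros(len(b))
    for _ in range(n_iter):
        x = x - alpha * (A @ x - b)   # x^{t+1} = x^t - alpha * grad f(x^t)
    return x

A = np.array([[2.0, 0.0], [0.0, 1.0]])   # Hessian of f; L = 2
b = np.array([1.0, 1.0])
x_gd = grad_descent(A, b, alpha=0.5)     # alpha = 1/L
x_star = np.linalg.solve(A, b)           # optimum: A x = b
```

Rerunning this sketch with a much larger α (say α = 1.5) makes the iterates blow up, illustrating the step-size trade-off on the slide.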
Gradient Descent: successive approximation
At each iteration, form an approximation of f (·):
f(x^t + d) ≈ f_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/2)d^T((1/α)I)d
           = f(x^t) + ∇f(x^t)^T d + (1/(2α))d^T d

Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d)

d* = −α∇f(x^t) is the minimizer of argmin_d f_{x^t}(d)

d* may not decrease the original objective function f
Gradient Descent: successive approximation
However, the function value will decrease if
Condition 1: f_{x^t}(d) ≥ f(x^t + d) for all d
Condition 2: f_{x^t}(0) = f(x^t)

Why?

f(x^t + d*) ≤ f_{x^t}(d*) ≤ f_{x^t}(0) = f(x^t)

Condition 2 is satisfied by construction of f_{x^t}

Condition 1 is satisfied if (1/α)I ⪰ ∇²f(x) for all x (why?)
Gradient Descent: step size
A function has an L-Lipschitz continuous gradient if

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖  ∀x, y

If f is twice differentiable, this implies

∇²f(x) ⪯ LI  ∀x

In this case, Condition 1 is satisfied if α ≤ 1/L

Theorem: gradient descent converges if α ≤ 1/L

Theorem: gradient descent converges linearly with α ≤ 1/L if f is strongly convex
Gradient Descent
In practice, we do not know L . . .
Step size α too large: the algorithm diverges
Step size α too small: the algorithm converges very slowly
Gradient Descent: line search
d ∗ is a “descent direction” if and only if (d ∗)T∇f (x) < 0
Armijo-rule backtracking line search:

Try α = 1, 1/2, 1/4, . . . until it satisfies

f(x + αd*) ≤ f(x) + γα(d*)^T ∇f(x)

where 0 < γ < 1
Figure from http://ool.sourceforge.net/ool-ref.html
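The backtracking rule above can be sketched for the gradient direction d = −∇f(x) on a toy quadratic (γ = 0.1 is an arbitrary choice; the objective is made up):

```python
import numpy as np

def backtracking(f, grad, x, gamma=0.1):
    # Armijo backtracking line search along d = -grad f(x)
    g = grad(x)
    d = -g                                   # descent direction: d^T g < 0
    alpha = 1.0
    # try alpha = 1, 1/2, 1/4, ... until sufficient decrease holds
    while f(x + alpha * d) > f(x) + gamma * alpha * (d @ g):
        alpha *= 0.5
    return alpha

f = lambda x: 0.5 * (x @ x)                  # toy objective, minimum at 0
grad = lambda x: x
x0 = np.array([4.0, -2.0])
alpha = backtracking(f, grad, x0)
```

Since d is a descent direction, the term γα(d*)^T ∇f(x) on the right-hand side is negative, so the loop only accepts steps that actually decrease f by a sufficient amount.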
Gradient Descent: line search
Gradient descent with line search:
Converges to optimal solutions if f is smooth
Converges linearly if f is strongly convex
However, each iteration requires evaluating f several times
Several other step-size selection approaches
(an ongoing research topic, especially for stochastic gradient descent)
Gradient Descent: applying to ridge regression
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w* := argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²

1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y) + λw
4:   Choose a step size α_t
5:   Update w ← w − α_t g
6:   t ← t + 1
7: end while

Time complexity: O(nnz(X)) per iteration
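The pseudocode above, sketched in numpy with the fixed step α = 1/L, where L = ‖X‖₂² + λ is one valid Lipschitz constant for this objective (the slide leaves the step-size rule open; the data is made up):

```python
import numpy as np

def gd_ridge(X, y, lam, n_iter=2000):
    w = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2 + lam       # Lipschitz constant of the gradient
    for _ in range(n_iter):
        g = X.T @ (X @ w - y) + lam * w       # line 3 of the pseudocode
        w = w - (1.0 / L) * g                 # lines 4-5 with alpha_t = 1/L
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam = 1.0
w_gd = gd_ridge(X, y, lam)
w_star = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)  # exact solution
```

Computing g as X^T(Xw − y) rather than (X^T X)w − X^T y is what gives the O(nnz(X)) cost per iteration for sparse X.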
Proximal Gradient Descent
How can we apply gradient descent to solve the Lasso problem?
argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁   (the ‖w‖₁ term is non-differentiable)

General composite function minimization:

argmin_x f(x) := g(x) + h(x)

where g is smooth and convex, and h is convex but may be non-differentiable

Usually assume h is simple (for computational efficiency)
Proximal Gradient Descent: successive approximation
At each iteration, form an approximation of f (·):
f(x^t + d) ≈ f_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/2)d^T((1/α)I)d + h(x^t + d)
           = g(x^t) + ∇g(x^t)^T d + (1/(2α))d^T d + h(x^t + d)

Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d)

This is called “proximal” gradient descent

Sometimes d* = argmin_d f_{x^t}(d) has a closed-form solution
Proximal Gradient Descent: `1-regularization (*)
The subproblem:
x^{t+1} = x^t + argmin_d ∇g(x^t)^T d + (1/(2α))d^T d + λ‖x^t + d‖₁
        = argmin_u (1/2)‖u − (x^t − α∇g(x^t))‖² + λα‖u‖₁
        = S(x^t − α∇g(x^t), αλ),

where S is the soft-thresholding operator defined by

S(a, z) = a − z if a > z;  a + z if a < −z;  0 if a ∈ [−z, z]
Proximal Gradient: soft-thresholding
Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/
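The soft-thresholding operator S(a, z) defined above takes a few lines in numpy, vectorized over a:

```python
import numpy as np

def soft_threshold(a, z):
    # S(a, z): shrink a toward zero by z; exactly zero on [-z, z]
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

out = soft_threshold(np.array([3.0, -0.5, 0.2, -2.0]), 1.0)  # -> [2, 0, 0, -1]
```

The `np.maximum(..., 0.0)` clamp is what produces exact zeros, which is why proximal gradient descent yields sparse Lasso solutions.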
Proximal Gradient Descent for Lasso
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w* := argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁

1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y)
4:   Choose a step size α_t
5:   Update w ← S(w − α_t g, α_t λ)
6:   t ← t + 1
7: end while

Time complexity: O(nnz(X)) per iteration
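The pseudocode above (this scheme is often called ISTA), sketched with the fixed step α = 1/‖X‖₂², a valid Lipschitz constant for the smooth part; the data is synthetic with a sparse ground truth:

```python
import numpy as np

def soft_threshold(a, z):
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

def prox_grad_lasso(X, y, lam, n_iter=3000):
    w = np.zeros(X.shape[1])
    alpha = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L for the smooth part
    for _ in range(n_iter):
        g = X.T @ (X @ w - y)                 # gradient of (1/2)||Xw - y||^2
        w = soft_threshold(w - alpha * g, alpha * lam)   # line 5 of the pseudocode
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
w_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0, 0.0])   # sparse ground truth
y = X @ w_true
w_lasso = prox_grad_lasso(X, y, lam=1.0)
```

Each iteration is an ordinary gradient step on the smooth part followed by the prox of λ‖·‖₁, matching the derivation on the previous slide.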
Newton’s Method
Newton’s Method
Iteratively conduct the following updates:
x ← x − α∇²f(x)⁻¹∇f(x)
where α is the step size
If α = 1: converges quadratically when x^t is close enough to x*:

‖x^{t+1} − x*‖ ≤ K‖x^t − x*‖²

for some constant K. This means the error f(x^t) − f(x*) decays quadratically:

β, β², β⁴, β⁸, β¹⁶, . . .

Only a few iterations are needed to converge once inside this “quadratic convergence region”
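A minimal sketch of Newton's method with α = 1 for L2-regularized logistic regression, one of the examples this section mentions. The Hessian system is solved directly here, the O(d³) option discussed two slides later; the data is made up:

```python
import numpy as np

def newton_logistic(X, y, lam, n_iter=20):
    # minimize sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))        # sigma(y_i w^T x_i)
        g = -X.T @ (y * (1.0 - p)) + lam * w          # gradient
        D = p * (1.0 - p)                             # per-sample Hessian weights
        H = X.T @ (D[:, None] * X) + lam * np.eye(d)  # Hessian
        w = w - np.linalg.solve(H, g)                 # Newton step, alpha = 1
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(50))
w_nt = newton_logistic(X, y, lam=1.0)
p = 1.0 / (1.0 + np.exp(-y * (X @ w_nt)))
grad_norm = np.linalg.norm(-X.T @ (y * (1.0 - p)) + 1.0 * w_nt)
```

With this well-conditioned problem the full step α = 1 is safe; on harder problems a line search would be added, as the later slides discuss.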
Newton’s Method
However, Newton’s update rule is more expensive than gradient descent/coordinate descent
Newton’s Method
Need to compute ∇²f(x)⁻¹∇f(x)
Closed form solution: O(d3) for solving a d dimensional linear system
Usually solved by another iterative solver:
gradient descent
coordinate descent
conjugate gradient method
. . .
Useful for the cases where the quadratic subproblem can be solved more efficiently than the original problem

Examples: primal L2-SVM/logistic regression, ℓ₁-regularized logistic regression, . . .
Newton’s Method(*)
At each iteration, form an approximation of f (·):
f(x^t + d) ≈ f_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/(2α))d^T ∇²f(x^t) d

Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d)

When x is far away from x*, line search is needed to guarantee convergence.

Assume LI ⪰ ∇²f(x) ⪰ mI for all x; then α ≤ m/L guarantees that the objective function value decreases, because

(L/m)∇²f(x) ⪰ ∇²f(y)  ∀x, y

In practice, we often just use line search.
Proximal Newton (*)
What if f(x) = g(x) + h(x) and h(x) is non-smooth (e.g., h(x) = ‖x‖₁)?

At each iteration, form an approximation of f(·):

f(x^t + d) ≈ f_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/(2α))d^T ∇²g(x^t) d + h(x^t + d)

Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d)

Need another iterative solver to solve the subproblem
Stochastic Gradient
Stochastic Gradient: Motivation
Widely used for machine learning problems (with a large number of samples)

Given training samples x_1, . . . , x_n, we usually want to solve the following empirical risk minimization (ERM) problem:

argmin_w ∑_{i=1}^n ℓ_i(w),

where ℓ_i(·) is the loss function on sample x_i

Minimize the sum of the individual losses over all samples
Stochastic Gradient
Assume the objective function can be written as

f(x) = (1/n) ∑_{i=1}^n f_i(x)

Stochastic gradient method: iteratively conduct the following updates
1 Choose an index i uniformly at random
2 x^{t+1} ← x^t − η_t ∇f_i(x^t)

where η_t > 0 is the step size

Why does SG work? The sampled gradient is unbiased:

E_i[∇f_i(x)] = (1/n) ∑_{i=1}^n ∇f_i(x) = ∇f(x)

Is it a fixed-point method? No if η > 0, because in general x* − η∇f_i(x*) ≠ x*

Is it a descent method? No, because f(x^{t+1}) < f(x^t) need not hold at every step
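The two-step loop above can be sketched on a least-squares objective f(w) = (1/n)∑_i (w^T x_i − y_i)², with a decaying step η_t = C(t+1)^{−a}; the constants C and a are arbitrary choices and the data is synthetic and noiseless:

```python
import numpy as np

def sgd_least_squares(X, y, C=0.1, a=0.6, n_iter=20000):
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_iter):
        i = rng.integers(n)                      # 1: uniform random index
        g = 2.0 * (w @ X[i] - y[i]) * X[i]       # grad of f_i(w) = (w^T x_i - y_i)^2
        w = w - (C / (t + 1) ** a) * g           # 2: decaying step eta_t
    return w

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true                                   # noiseless synthetic data
w_sgd = sgd_least_squares(X, y)
```

Each step touches only one sample, so an epoch over the data costs about the same as a single full-gradient step, which is the point of the method.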
Stochastic Gradient
Step size η_t has to decay to 0 (e.g., η_t = C·t^{−a} for some constants a, C)
Many variants proposed recently (SVRG, SAGA, . . . )
Widely used in online setting
Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²

How to write this as argmin_w (1/n) ∑_{i=1}^n f_i(w)? How to decompose it into n components?

First approach: f_i(w) = (w^T x_i − y_i)² + λ‖w‖²

Update rule:

w^{t+1} ← w^t − 2η_t(w^T x_i − y_i)x_i − 2η_t λw
        = (1 − 2η_t λ)w − 2η_t(w^T x_i − y_i)x_i

Needs O(d) operations per iteration even if the data is sparse

Solution: store w = s·v, where s is a scalar
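A minimal sketch of the w = s·v trick: the dense shrink (1 − 2η_t λ)w becomes a scalar update to s, and only the coordinates where x_i is nonzero touch v. Dense arrays are used for brevity, and a constant η is used to keep the comparison simple; with sparse x_i each step costs O(nnz(x_i)):

```python
import numpy as np

def sgd_ridge_scaled(X, y, lam, eta=0.01, n_iter=5000):
    rng = np.random.default_rng(0)
    n, d = X.shape
    s, v = 1.0, np.zeros(d)                  # represent w as s * v
    for _ in range(n_iter):
        i = rng.integers(n)
        resid = s * (v @ X[i]) - y[i]        # w^T x_i - y_i, using w = s v
        s *= 1.0 - 2.0 * eta * lam           # the dense shrink, now O(1)
        v -= (2.0 * eta * resid / s) * X[i]  # only nonzeros of x_i are touched
    return s * v

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
w_fast = sgd_ridge_scaled(X, y, lam=0.1)
```

Dividing by the freshly shrunk s when correcting v is what keeps s·v exactly equal to the w produced by the plain update.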
Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²

Second approach:

define Ω_i = {j | X_ij ≠ 0} for i = 1, . . . , n
define n_j = |{i | X_ij ≠ 0}| for j = 1, . . . , d
define f_i(w) = (w^T x_i − y_i)² + ∑_{j∈Ω_i} (λn/n_j) w_j²

Update rule when selecting index i:

w_j^{t+1} ← w_j^t − 2η_t(x_i^T w^t − y_i)X_ij − (2η_t λn/n_j) w_j^t,  ∀j ∈ Ω_i

This update can be done in O(|Ω_i|) operations
Coming up
Next class: Parallel Optimization Methods
Questions?