Stochastic Optimization Study Group
Stochastic Gradient Descent Methods
Xuetong Wu & Viktoria Schram
Department of EEE, University of Melbourne
October 22, 2020

Overview
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods: Non-Smooth Optimization Problems; Optimization Considering Additional Information; Optimization in Case of Non-I.i.d. Data
5 Conclusion

Introduction

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods
5 Conclusion

Parameter Estimation Problems
Communications
Tracking
Control theory
System identification
Machine learning
...

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

Classification Problems
Consider the typical image classification problem:
[Figure: training examples of dogs and cats with labels, fed into models]
We wish to learn a good model h to minimise the prediction error:
$\min_h \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(X_i) \neq Y_i\}$

Regression Problems
Consider the simple regression problem: we wish to learn a good model to minimise the mean squared error.
[Figure: training examples fitted by the model $Y = aX + b$ to produce predicted labels]
Mathematically,
$\min_{a,b} \frac{1}{n} \sum_{i=1}^{n} (Y_i - a X_i - b)^2$

Optimization in Learning Problems
Many machine learning problems can be formulated as
$\min_{w \in \mathcal{W}} F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w, Z_i)$
$Z_i$: training sample / data pair $(X_i, Y_i)$
$w$: model parameters (e.g., $a, b$ in the least-squares problem)
$f$: loss function

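As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this empirical-risk objective, instantiated with the squared loss from the regression example; the function names and the synthetic data are illustrative assumptions.

    import numpy as np

    def empirical_risk(w, X, Y, loss):
        """F(w) = (1/n) * sum_i f(w, Z_i) with Z_i = (X_i, Y_i)."""
        return np.mean([loss(w, x, y) for x, y in zip(X, Y)])

    def squared_loss(w, x, y):
        """f(w, Z) = (y - a*x - b)^2 with w = (a, b)."""
        a, b = w
        return (y - a * x - b) ** 2

    # toy data: Y = 2X + 1 plus noise (illustrative only)
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=100)
    Y = 2 * X + 1 + 0.1 * rng.standard_normal(100)
    print(empirical_risk(np.array([0.0, 0.0]), X, Y, squared_loss))
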
Gradient Descent

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods
5 Conclusion

Gradient Descent
$\min_{w \in \mathcal{W}} F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w, Z_i)$
If $f$ is convex and differentiable w.r.t. $w$, the first-order Taylor approximation with $\eta > 0$ gives
$F(w + \eta \Delta w) \approx F(w) + \eta \Delta w^T \nabla_w F(w)$
The best $\Delta w$ that minimises the R.H.S. is
$\Delta w = -\nabla_w F(w)$
We choose an initial point $w_0$ and a step size $\eta_t$ at each time $t$.
(Batch) gradient descent:
$w_{t+1} = w_t - \eta_t \nabla_w F(w_t) = w_t - \frac{\eta_t}{n} \sum_{i=1}^{n} \nabla_w f(w_t, Z_i)$
Stop at a point such that
$F(w_t) - F(w^*) \leq \varepsilon$

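A minimal sketch of this batch update for the least-squares example, reusing the toy data from the earlier snippet; the step size and iteration count are arbitrary illustrative choices rather than tuned values.

    import numpy as np

    def grad_squared_loss(w, x, y):
        """Gradient of (y - a*x - b)^2 with respect to w = (a, b)."""
        a, b = w
        r = y - a * x - b
        return np.array([-2 * r * x, -2 * r])

    def gradient_descent(X, Y, w0, eta=0.1, iters=200):
        w = w0.copy()
        for _ in range(iters):
            # full-batch gradient: average over all n samples
            g = np.mean([grad_squared_loss(w, x, y) for x, y in zip(X, Y)], axis=0)
            w = w - eta * g
        return w
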
Gradient Descent
Figure: Visualization of Gradient Descent

Convergence Rate for GD
Assume $f$ is convex and differentiable w.r.t. $w$, and further assume the gradient $\nabla_w F(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w, Z_i)$ is $L$-Lipschitz continuous ($\nabla^2 F \preceq LI$). Then,

Theorem
Gradient descent with fixed step size $\eta \leq 1/L$ satisfies
$F(w_t) - F(w^*) \leq \frac{\|w_0 - w^*\|^2}{2 \eta t}$

Convergence rate $\sim O(1/t)$, iteration complexity $\sim O(1/\varepsilon)$.

R. Tibshirani, Convex Optimization 10-725

Convergence Rate for GD with Strong Convexity
Furthermore, if $F(w)$ is $\mu$-strongly convex ($\nabla^2 F \succeq \mu I$):

Theorem
Gradient descent with fixed step size $\eta \leq 2/(\mu + L)$ or with backtracking line search satisfies
$F(w_t) - F(w^*) \leq c^t \frac{L}{2} \|w_0 - w^*\|^2$
where $0 < c < 1$.

Convergence rate $\sim O(c^t)$, iterations needed for error $\varepsilon$ $\sim O(\log \frac{1}{\varepsilon})$.

R. Tibshirani, Convex Optimization 10-725

Problems
Two main drawbacks of gradient descent:
If $n$ is relatively large, computing the full gradient is memory- and time-consuming.
If the loss function is nonconvex, the iterates can get stuck at a stationary point (e.g., a saddle point).

Stochastic Gradient Descent

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods
5 Conclusion

Stochastic Gradient Descent
A practical alternative is to simulate a data stream by picking $Z_t$ uniformly at random from the training examples at each time $t$.
This gives the stochastic gradient descent update:
$w_{t+1} = w_t - \eta_t \nabla_w f(w_t, Z_t)$
Why does this work? By uniform sampling,
$\mathbb{E}_{Z_t}[\nabla_w f(w_t, Z_t)] = \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_t, Z_i)$
The estimate is unbiased but has high variance; SGD usually works well in large-scale problems.

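A sketch of the single-sample update, again reusing the toy least-squares setup; the schedule eta_t = eta0/(t+1) is just one illustrative choice of diminishing step size.

    import numpy as np

    def sgd(X, Y, w0, grad, eta0=0.5, iters=2000, seed=0):
        rng = np.random.default_rng(seed)
        w = w0.copy()
        n = len(X)
        for t in range(iters):
            i = rng.integers(n)        # pick Z_t uniformly at random
            eta_t = eta0 / (t + 1)     # diminishing step size
            w = w - eta_t * grad(w, X[i], Y[i])
        return w
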
Stochastic GD vs. GD
Figure: Stochastic GD vs. GD

Remarks on SGD
Computational cost for n samples and p iterations:
GD $\sim O(np)$
SGD $\sim O(p)$
SGD does not always produce descent directions, and the gradient estimate is very noisy.
With a constant step size, the SGD iterates bounce around the optimal value.
Convergence properties?

Convergence Rate Analysis
$\text{minimize}_w \; F(w) := \frac{1}{n} \sum_{i=1}^{n} f(w, Z_i)$
We wish to achieve $\varepsilon$-optimality,
$\mathbb{E}[F(w_t)] - F(w^*) \leq \varepsilon$
after $t$ iterations.

Assumptions
$F(w)$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz continuous, with $\mu/L \leq 1$.
$\nabla f(w_t, Z_t)$ is an unbiased estimate of $\nabla F(w_t)$.
For all $w$, the variance of the gradient estimate is bounded:
$\mathbb{E}_Z[\|\nabla f(w, Z)\|_2^2] - \|\mathbb{E}_Z[\nabla f(w, Z)]\|_2^2 \leq \sigma^2$

Constant Step Size

Theorem (Convergence with Fixed Stepsizes)
Under the assumptions, if $\eta_t = \eta \leq \frac{1}{L}$, then SGD achieves
$\mathbb{E}[F(w_t)] - F(w^*) \leq \frac{\eta L \sigma^2}{2\mu} + (1 - \eta\mu)^t \left( F(w_0) - F(w^*) \right)$

Linear convergence at the beginning.
As $t \to \infty$,
$\mathbb{E}[F(w_t) - F(w^*)] \leq \frac{\eta L \sigma^2}{2\mu}$
i.e., the iterates converge only to a neighborhood of $w^*$: the variance in the gradient computation prevents further progress.

Theorem 4.6 in Bottou et al., 2018, Optimization Methods for Large-Scale Machine Learning

Diminishing Step Size

Theorem (Convergence with Diminishing Stepsizes)
Under the assumptions, if $\eta_t = \frac{\theta}{t+1}$ for some $\theta > \frac{1}{\mu}$, then SGD achieves
$\mathbb{E}[F(w_t) - F(w^*)] \leq \frac{2\nu}{t+1}$
where
$\nu := \max\left\{ \frac{L\sigma^2}{4(\mu - 1)}, \; F(w_0) - F(w^*) \right\}$

Convergence rate is $O(1/t)$, so $O(1/\varepsilon)$ iterations are needed with the diminishing stepsize $\eta_t \asymp \frac{1}{t}$.

Theorem 4.7 in Bottou et al., 2018, Optimization Methods for Large-Scale Machine Learning

Convergence Rate and Time Comparison
Risk minimization under strong convexity and L-smoothness:

Method | Iteration complexity | Per-iteration cost | Total computational cost
GD     | log(1/ε)             | n                  | n log(1/ε)
SGD    | 1/ε                  | 1                  | 1/ε

Advantages
Compared to gradient descent, SGD has the following advantages:
Lower computational cost per iteration.
For larger datasets (large n and moderate ε) it can converge faster overall.
In non-convex cases, the gradient noise sometimes helps SGD escape saddle points.

Rong Ge et al., COLT 2015, Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition

Variance Reduction and Acceleration
To reduce the variance of the gradient estimate, we can use mini-batch SGD with batch size $k \ll n$:
$w_{t+1} = w_t - \frac{\eta_t}{k} \sum_{i \in I_t} \nabla f(w_t, Z_i)$
where $I_t$ is a mini-batch of $k$ indices drawn at random.
[Figure: SGD for a logistic regression problem with $n = 10000$]

R. Tibshirani, Convex Optimization 10-725

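A sketch of the mini-batch variant with the illustrative least-squares gradient from before; averaging over the batch is one common convention (the sum-form update above is the same up to a rescaling of eta_t).

    import numpy as np

    def minibatch_sgd(X, Y, w0, grad, k=32, eta=0.1, iters=500, seed=0):
        rng = np.random.default_rng(seed)
        w = w0.copy()
        n = len(X)
        for _ in range(iters):
            idx = rng.choice(n, size=k, replace=False)   # random mini-batch I_t
            g = np.mean([grad(w, X[i], Y[i]) for i in idx], axis=0)
            w = w - eta * g
        return w
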
Variance Reduction and Acceleration
Can we use only one gradient per iteration and still need only $O(\log(1/\varepsilon))$ iterations?
Yes: Stochastic Average Gradient (SAG):
$w_t = w_{t-1} - \frac{\eta_t}{n} \sum_{i=1}^{n} \nabla f_i^t$, with $\nabla f_i^t = \nabla f(w_{t-1}, Z_i)$ if $i = i(t)$ and $\nabla f_i^t = \nabla f_i^{t-1}$ otherwise.
SAG gradient estimates are no longer unbiased, but they have greatly reduced variance. With the fixed stepsize $\eta_t = \frac{1}{16L}$,
$\mathbb{E}[F(w_t)] - F(w^*) \leq O\left( \left( 1 - \min\left\{ \frac{\mu}{16L}, \frac{1}{8n} \right\} \right)^t \right)$
Iteration complexity $\sim O(\log \frac{1}{\varepsilon})$!
Other variants with similar convergence: SDCA, SVRG, SAGA.

Roux et al., NIPS 2012, A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Shalev-Shwartz & Zhang, JMLR 2013, Stochastic dual coordinate ascent methods for regularized loss minimization
Johnson & Zhang, NIPS 2013, Accelerating stochastic gradient descent using predictive variance reduction
Defazio & Bach, NIPS 2014, SAGA

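A sketch of the SAG bookkeeping: one fresh gradient per iteration, a table of stale gradients for the rest, and an update with the running average. The per-sample gradient function and the assumption that the Lipschitz constant L is known are illustrative; the fixed step 1/(16L) follows the slide.

    import numpy as np

    def sag(X, Y, w0, grad, L, iters=5000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = len(X), len(w0)
        w = w0.copy()
        table = np.zeros((n, d))           # stored (possibly stale) gradient per sample
        avg = np.zeros(d)                  # running average (1/n) * sum_i table[i]
        eta = 1.0 / (16 * L)
        for _ in range(iters):
            i = rng.integers(n)
            g_new = grad(w, X[i], Y[i])    # refresh only one gradient per iteration
            avg += (g_new - table[i]) / n  # update the average in O(d)
            table[i] = g_new
            w = w - eta * avg
        return w
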
More on SGD
If n is not very large, is it enough to drive the training error down,
$\frac{1}{n} \sum_{i=1}^{n} \left( f(w_t, Z_i) - f(w^*, Z_i) \right) \leq \varepsilon \; ?$
Sometimes simply minimising the training loss causes overfitting.

Caveats: Data Overfitting for $y = \sin 2\pi x + n$
[Figure: fitted curves after (a) t = 2, (b) t = 7, (c) t = 100, (d) t = 5000 iterations]
Often, small training error does not imply small testing error!

Early Stopping
If the samples are regarded as random variables drawn from some distribution P, we may instead consider minimising the true risk,
$F_{\text{true}}(w_t) = \mathbb{E}_{Z \sim P}[f(w_t, Z)]$
Figure: Early Stopping with SGD

SGD Applications in Supervised Learning
SGD methods are widely applied in machine learning problems.
For differentiable loss functions:
Adaline: $\sum_i \frac{1}{2}\left( y_i - w^\top \Phi(x_i) \right)^2$
Tikhonov (ridge) regression: $\sum_i \left( y_i - w^\top x_i \right)^2 + \lambda \|w\|_2^2$
Logistic regression: $-\sum_i \left[ y_i \log\left( \frac{1}{1 + e^{-w^\top x_i}} \right) + (1 - y_i) \log\left( 1 - \frac{1}{1 + e^{-w^\top x_i}} \right) \right]$
What if the loss is not differentiable?
SVM: $\frac{1}{2}\|w\|_2^2 + C \sum_i \max\left( 0, 1 - y_i \left( w^\top x_i + b \right) \right)$
Lasso: $\sum_i \left( y_i - w^\top x_i \right)^2 + \lambda \|w\|_1$
Perceptron: $\sum_i \max\left\{ 0, -y_i w^\top \Phi(x_i) \right\}$
Neural network: $\sum_i f(w, Z_i)$

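To make the smooth vs. non-smooth distinction concrete, here is an illustrative sketch of two of the per-sample losses above together with a (sub)gradient for each; for the hinge loss, the value returned on the flat part is one valid subgradient. The helper names are assumptions, not part of the slides.

    import numpy as np

    def ridge_loss_and_grad(w, x, y, lam=0.1):
        """Tikhonov term (y - w^T x)^2 + lam * ||w||_2^2 (differentiable)."""
        r = y - w @ x
        return r ** 2 + lam * (w @ w), -2 * r * x + 2 * lam * w

    def hinge_loss_and_subgrad(w, x, y, b=0.0):
        """SVM hinge term max(0, 1 - y(w^T x + b)) (non-smooth at the kink)."""
        margin = y * (w @ x + b)
        if margin < 1:
            return 1 - margin, -y * x
        return 0.0, np.zeros_like(w)   # 0 is a valid subgradient when margin >= 1
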
Stochastic Subgradient Methods

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods: Non-Smooth Optimization Problems; Optimization Considering Additional Information; Optimization in Case of Non-I.i.d. Data
5 Conclusion

Short note: to stay consistent with the notation used in the first part of this presentation, the following slides were adapted, so their nomenclature differs from the one used in the recorded presentation. Sorry for any inconvenience this might cause.

Optimization Problem - Optimal Scenario
Finding zeros (roots) of a real-valued smooth function $\nabla f(w): \mathbb{R}^d \to \mathbb{R}^d$, for $w \in \mathbb{R}^d$.
⇒ Newton's method:
$w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)$
$t$: iteration index
$w$: parameters of the function $f(\cdot)$
$\nabla f(w)$: gradient of $f(\cdot)$
$\nabla^2 f(w)$: derivative of $\nabla f(\cdot)$ w.r.t. $w$ (the Hessian)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

Optimization Problem - Reality
Finding zeros (roots) of an unknown real-valued function $\nabla f(w): \mathbb{R}^d \to \mathbb{R}^d$, for $w \in \mathbb{R}^d$, which can be observed, but whose observations may be corrupted by (i.i.d.) errors $(\varepsilon_t)_{t \geq 1}$.
⇒ Stochastic approximation (Robbins & Monro, 1951):
$w_{t+1} = w_t - \eta_t [\nabla f(w_t) + \varepsilon_t]$
$\eta_t$: stepsize / learning rate at iteration $t$
$\varepsilon_t$: zero-mean noise at iteration $t$
$\nabla f(w)$: gradient of $f(\cdot)$

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

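A minimal sketch of the Robbins-Monro iteration for a scalar root-finding problem; the target function, the noise level, and the 1/(t+1) step schedule are illustrative assumptions rather than part of the slide.

    import numpy as np

    def robbins_monro(noisy_grad, w0, iters=10000, theta=1.0, seed=0):
        """Drive an unknown function to zero from noisy observations of it."""
        rng = np.random.default_rng(seed)
        w = w0
        for t in range(iters):
            eta_t = theta / (t + 1)              # diminishing step size
            w = w - eta_t * noisy_grad(w, rng)   # observation = gradient + noise
        return w

    # example: true gradient is 2*(w - 3), observed with zero-mean noise
    print(robbins_monro(lambda w, rng: 2 * (w - 3) + rng.standard_normal(), w0=0.0))
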
Gradient Descent
$w_{t+1} = w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} \nabla f(w_t, Z_i)$
Stochastic Gradient Descent
$w_{t+1} = w_t - \eta_t \nabla f(w_t, Z_{i_t})$
$f(w_t, Z_{i_t})$: loss for the parameters $w_t$ and the sample $Z_{i_t}$
$n$: training-set (/batch) size
$\eta_t$: learning rate at iteration $t$
$t$: iteration, $t = 1, 2, \ldots$
$i_t \in \{1, \ldots, n\}$: index chosen uniformly at random at iteration $t$

R. Tibshirani, Convex Optimization 10-725

Stochastic Gradient Descent
$w_{t+1} = w_t - \eta_t \nabla f(w_t, Z_{i_t})$
Random selection of the sample makes the update an unbiased gradient step:
$\mathbb{E}[\nabla f(w_t, Z_{i_t})] = \nabla F(w_t)$
(notation as on the previous slide)

R. Tibshirani, Convex Optimization 10-725

Question
What if the optimization problem involves
a non-smooth objective,
additional information, or
non-i.i.d. input samples?
Goal:
$\min_w \frac{1}{n} \sum_{i=1}^{n} f(w, Z_i)$
$f(w, Z_i)$: loss function with parameters $w$ for the $i$th sample $Z_i$

H. Li et al., 2018, Visualizing the Loss Landscape of Neural Nets

Goal
Non-differentiable (non-smooth) ⇒ ?
Additional information ⇒ ?
Non-i.i.d. data ⇒ ?

Stochastic Subgradient Methods: Non-Smooth Optimization Problems

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods: Non-Smooth Optimization Problems; Optimization Considering Additional Information; Optimization in Case of Non-I.i.d. Data
5 Conclusion

Subgradient of a Function
$g$ is a subgradient of $f$ at $x_2$ if
$f(x) \geq f(x_2) + g^T (x - x_2) \quad \forall x$
* $\partial f(x)$: subdifferential, the set of all subgradients, i.e. $g_i \in \partial f(x)$
(** $x$ corresponds to $w$: slight abuse of notation compared to before)

S. Boyd, Stanford University, EE364b

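As a quick worked example (not on the slide), the absolute-value function shows how the subdifferential behaves at a kink:

    f(x) = |x|, \qquad
    \partial f(x) =
    \begin{cases}
    \{-1\}  & x < 0, \\
    [-1, 1] & x = 0, \\
    \{+1\}  & x > 0.
    \end{cases}

At $x_2 = 0$, any $g \in [-1, 1]$ satisfies $|x| \geq g \cdot x$ for all $x$, so every such $g$ is a valid subgradient even though no gradient exists there.
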
Subgradient of a Function
Goal: $\min_x f(x)$; use: gradient-descent-style updates along a negative subgradient.
[Figure: sequence of subgradient steps on a non-smooth convex function]
⇒ A negative subgradient doesn't necessarily give a descent direction.

D. S. Rosenberg, 2018, Foundations of Machine Learning, https://bloomberg.github.io/foml/#home

Subgradient Method
$w_{t+1} = w_t - \eta_t g_t$
⇒ Keep track of the best iterate $w^{\text{best}}_{t+1}$ among $w_1, \ldots, w_{t+1}$, i.e.,
$f(w^{\text{best}}_{t+1}) = \min_{j=1,\ldots,t+1} f(w_j)$
$w_t$: $t$th parameter estimate
$\eta_t$: learning rate
$g_t$: subgradient
$t$: iteration, $t = 1, 2, \ldots$
(* $w$ corresponds to $x$: switching back to the notation used in the beginning)

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725

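A sketch of this method with best-iterate tracking; the objective passed in is assumed to return both a function value and one valid subgradient at w (for instance built from the hinge-loss helper sketched earlier), and the fixed step size is an illustrative choice.

    import numpy as np

    def subgradient_method(F_and_g, w0, eta=0.01, iters=1000):
        """Subgradient method; returns the best iterate, not the last one."""
        w = w0.copy()
        w_best, f_best = w.copy(), F_and_g(w)[0]
        for _ in range(iters):
            f_val, g = F_and_g(w)
            if f_val < f_best:                 # keep track of the best iterate
                f_best, w_best = f_val, w.copy()
            w = w - eta * g                    # not necessarily a descent step
        return w_best, f_best
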
Stochastic Subgradient Method
Noisy subgradients: $\tilde{g} = g + v$, where $g \in \partial f(w)$ and $\mathbb{E}[v] = 0$:
$w_{t+1} = w_t - \eta_t \tilde{g}_t$
⇒ In the finite-sum setting, choose a (sample) index $i_t$ at random at iteration $t$ (out of a set (/batch) of samples of size $n$):
$w_{t+1} = w_t - \eta_t g_{i_t}$

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
J. Zhu, University of Melbourne, 2020, Discussion after IT lecture

Stochastic Subgradient Method
$w_{t+1} = w_t - \eta_t g_{i_t}$
⇒ Keep track of the best iterate $w^{\text{best}}_{t+1}$ among $w_1, \ldots, w_{t+1}$, i.e.,
$f(w^{\text{best}}_{t+1}) = \min_{j=1,\ldots,t+1} f(w_j)$
$g_{i_t}$: subgradient for the uniformly at random chosen sample $Z_{i_t}$
(other notation as before)

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725

Convergence Results
Fixed step size $\eta$:
Subgradient method (SGM):
$\lim_{t \to \infty} F(w^{(\text{best})}_t) \leq F^* + \frac{L^2 \eta}{2}$
Stochastic SGM:
$\lim_{t \to \infty} F(w^{(\text{best})}_t) \leq F^* + \frac{5 n^2 L^2 \eta}{2}$

S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
* Convergence results stated for $f(w)$ Lipschitz continuous with constant $L > 0$

Convergence Results
Diminishing step size (Robbins-Monro conditions):
(Stochastic) SGM:
$\lim_{t \to \infty} F(w^{(\text{best})}_t) \leq F^*$
provided
$\eta_t > 0, \quad \lim_{t \to \infty} \eta_t = 0, \quad \sum_{t=1}^{\infty} \eta_t = \infty$
e.g. square-summable but not summable:
$\eta_t > 0, \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty, \quad \sum_{t=1}^{\infty} \eta_t = \infty$

* More about how to choose $\eta$: S. Boyd et al., 2003, Subgradient Methods; R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
** Convergence results stated for $f(w)$ Lipschitz continuous with constant $L > 0$

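For instance, the schedule $\eta_t = \theta/(t+1)$ used earlier for SGD satisfies both sets of conditions (a standard check, not from the slide):

    \sum_{t=1}^{\infty} \frac{\theta}{t+1} = \infty \quad \text{(harmonic series)},
    \qquad
    \sum_{t=1}^{\infty} \frac{\theta^2}{(t+1)^2} \le \frac{\theta^2 \pi^2}{6} < \infty .
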
Applications
Algorithms for non-differentiable convex optimization
Convex analysis
ML/DL
⇒ Methods based on stochastic subgradients are used to (approximately) optimize (nonconvex, nonsmooth) deep neural networks (DNNs)
⇒ E.g.: Adagrad, ADAM, NADAM, RMSProp, ...

Udell, Cornell Operations Research and Information Engineering, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Stochastic Subgradient Methods: Optimization Considering Additional Information

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods: Non-Smooth Optimization Problems; Optimization Considering Additional Information; Optimization in Case of Non-I.i.d. Data
5 Conclusion

Adagrad
$w_{t+1} = w_t - \eta_t G_t^{-1/2} g_{i_t}$
⇒ Incorporates knowledge of the geometry of past iterations.
$w_t$: $t$th parameter estimate
$\eta_t$: learning rate
$g_{i_t}$: subgradient for the uniformly at random chosen sample $Z_{i_t}$
$t$: iteration, $t = 1, 2, \ldots$
$G_t$: outer product matrix of past gradients up to time step $t$, $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^T$

Udell, Cornell Operations Research and Information Engineering, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html

Adagrad
Diagonal (per-coordinate) form:
$w_{t+1,j} = w_{t,j} - \eta_t \left( \sum_{\tau=1}^{t} g_{\tau,j}^2 \right)^{-1/2} g_{i_t,j}$
Example: $\min_w f(w) = 100 w_1^2 + w_2^2$
$j$: index of the $j$th feature/parameter of $w$
$\sum_{\tau=1}^{t} g_{\tau,j}^2$: sum of squared past gradients of coordinate $j$ up to time step $t$
(other notation as before)

Udell, Cornell Operations Research and Information Engineering, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html

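A sketch of the diagonal (per-coordinate) Adagrad update on the quadratic example above; the epsilon term guarding against division by zero and the base learning rate are common but illustrative choices, not part of the slide.

    import numpy as np

    def adagrad(grad, w0, eta=1.0, eps=1e-8, iters=500):
        w = w0.copy()
        G = np.zeros_like(w)                   # per-coordinate sum of squared gradients
        for _ in range(iters):
            g = grad(w)
            G += g ** 2
            w = w - eta * g / (np.sqrt(G) + eps)
        return w

    # example from the slide: f(w) = 100*w1^2 + w2^2
    grad_f = lambda w: np.array([200 * w[0], 2 * w[1]])
    print(adagrad(grad_f, np.array([1.0, 1.0])))
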
Adagrad
⇒ Adagrad is a variable metric projected subgradient method.

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods

Variable Metric Projected Subgradient Method
The projection is carried out in the metric $H_t = \frac{1}{\eta_t} G_t^{1/2}$:
$w_{t+1} = w_t - \eta_t G_t^{-1/2} g_{i_t} = w_t - H_t^{-1} g_{i_t}$
With a constraint set $\mathcal{W}$,
$w_{t+1} = P^{H_t}_{\mathcal{W}}(w_t - H_t^{-1} g_{i_t}) = P^{H_t}_{\mathcal{W}}(y)$
where
$P^{H_t}_{\mathcal{W}}(y) = \arg\min_{w \in \mathcal{W}} \|w - y\|^2_{H_t}$
$P^{H_t}_{\mathcal{W}}(y)$: projection of a vector $y$ onto $\mathcal{W}$ in the $H_t$ metric
$\|w - y\|_{H_t} = \sqrt{(w - y)^T H_t (w - y)}$: Mahalanobis norm, a weighted $\ell_2$-distance

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science

But now, what if...
the parameter of interest lies on a non-Euclidean manifold
e.g., probability vectors

G. Raskutti, The information geometry of mirror descent
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science

Convergence Analysis
Basic inequality for the (projected) subgradient method:
$F(w^{(\text{best})}_t) - F^* \leq \frac{R^2 + L^2 \sum_{i=1}^{t} \eta_i^2}{2 \sum_{i=1}^{t} \eta_i}$
With $\eta_i = (R/L)/\sqrt{t}$,
$F(w^{(\text{best})}_t) - F^* \leq \frac{RL}{\sqrt{t}}$
for $L = \max_{w \in \mathcal{W}} \|g_{i_t}\|_2$ and $R = \max_{w, w^* \in \mathcal{W}} \|w - w^*\|_2$.
⇒ The analysis and the convergence results depend on the $\ell_2$ norm.

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods

Subgradient Method Update Rule(s)
$\min_{w \in \mathcal{W}} F(w)$
$w_{t+1} = w_t - \eta_t g_{i_t}$
$w_{t+1} = P_{\mathcal{W}}(w_t - \eta_t g_{i_t})$
$w_{t+1} = \arg\min_{w \in \mathcal{W}} \|w - (w_t - \eta_t g_{i_t})\|_2^2$
Using some math, the update can be rewritten as
$w_{t+1} = \arg\min_{w \in \mathcal{W}} \left\{ g_{i_t}^T w + \frac{1}{2\eta_t} \|w - w_t\|_2^2 \right\}$

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
J. Duchi et al., 2003, Proximal and First-Order Methods for Convex Optimization

Stochastic Mirror Descent
$w_{t+1} = \arg\min_w \left\{ g_{i_t}^T w + \frac{1}{2\eta_t} B_\phi(w, w_t) \right\}$
Bregman divergence:
$B_\phi(w, w_t) = \phi(w) - \phi(w_t) - \nabla\phi(w_t)^T (w - w_t)$
$\nabla\phi(\cdot)$: mirror map, an invertible map
$\phi(\cdot)$: potential function, strictly convex and differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity
N. Azizan et al., Stochastic Interpretation of SMD: Risk-Sensitive Optimality

Generalization / Examples
Gradient descent: $\phi(w) = \frac{1}{2}\|w\|_2^2$, so $B_\phi = \frac{1}{2}\|w - w_t\|_2^2$ and mirror descent = projected subgradient method
Negative entropy: $\phi(w) = \sum_{i=1}^{n} w_i \log w_i$ (for $w$ in the unit simplex) gives exponentiated gradient descent
p-norm algorithm: $\phi(w) = \frac{1}{2}\|w\|_p^2$
Sparse mirror descent
...

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.

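A sketch of stochastic mirror descent with the negative-entropy potential on the unit simplex (exponentiated gradient): the gradient step in the dual space becomes a multiplicative update followed by normalization. The toy linear objective and the step size are illustrative assumptions.

    import numpy as np

    def exponentiated_gradient(grad, w0, eta=0.1, iters=500):
        """Mirror descent with phi(w) = sum_i w_i log w_i on the simplex."""
        w = w0.copy()
        for _ in range(iters):
            g = grad(w)
            w = w * np.exp(-eta * g)      # dual-space step: grad(phi)(w) = log(w) + 1
            w = w / w.sum()               # Bregman projection back onto the simplex
        return w

    # toy objective on the simplex: f(w) = w^T c, minimized at the smallest c_i
    c = np.array([3.0, 1.0, 2.0])
    print(exponentiated_gradient(lambda w: c, np.ones(3) / 3))
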
Stochastic Mirror Descent: Update Rules
$w_{t+1} = \arg\min_w \left\{ g_{i_t}^T w + \frac{1}{2\eta_t} B_\phi(w, w_t) \right\}$
Using some math, this can be carried out in two steps: a gradient step in the dual (mirror) space,
$\nabla\phi(y_{t+1}) = \nabla\phi(w_t) - \eta_t g_{i_t}, \quad \text{i.e.} \quad y_{t+1} = \nabla\phi^{-1}\left( \nabla\phi(w_t) - \eta_t g_{i_t} \right)$
followed by a Bregman projection back onto the feasible set,
$w_{t+1} = \arg\min_{w \in \mathcal{W} \cap \mathcal{D}} B_\phi(w, y_{t+1}) = P^{\phi}_{\mathcal{W}}(y_{t+1})$
Alternative (unconstrained) update rule for stochastic mirror descent:
$\nabla\phi(w_{t+1}) = \nabla\phi(w_t) - \eta_t g_{i_t}$

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity

Stochastic Mirror Descent
$\nabla\phi(w_{t+1}) = \nabla\phi(w_t) - \eta_t g_{i_t}$
$\nabla\phi(\cdot)$: mirror map, an invertible map
$\phi(\cdot)$: potential function, strictly convex and differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
M.S. Alkousa et al., 2019, On some SMD methods for constrained online optimization problems
Z. Zhou et al., 2020, On the convergence of MD beyond stochastic convex programming

Applications
Non-smooth and/or non-convex stochastic optimization problems
Highly overparameterized nonlinear learning problems
Large-scale optimization problems
Online learning
Reinforcement learning

Z. Zhou et al., 2020, On the convergence of MD beyond stochastic convex programming
N. Azizan et al., 2019, SMD on Overparametrized Nonlinear Models: Conv., Implicit Regul., and General.
M. Raginsky et al., Sparse Q-learning with Mirror Descent
S. Mahadevan et al., Continuous-Time SMD on a Network: Variance Reduction, Consensus, Convergence

Stochastic Subgradient Methods: Optimization in Case of Non-I.i.d. Data

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods: Non-Smooth Optimization Problems; Optimization Considering Additional Information; Optimization in Case of Non-I.i.d. Data
5 Conclusion

Ergodic Mirror Descent
Update rule:
$w_{t+1} = \arg\min_w \left\{ g_t^T w + \frac{1}{2\eta_t} B_\phi(w, w_t) \right\}$
⇒ Based on stochastic mirror descent.

J. Duchi et al., 2012, Ergodic Mirror Descent

Ergodic Mirror Descent
$F(w) := \mathbb{E}_{\Pi}[f(w; Z_i)], \quad w \in \mathcal{W}$
Stochastic process $P_i$
Stationary distribution $\Pi$ such that $P_i \to \Pi$
Training samples $(Z_1, \ldots, Z_n) \sim P$
Loss function for $w$ on sample $Z_i$ is $f(w, Z_i)$

J. Duchi et al., 2012, Ergodic Mirror Descent

Ergodic Mirror Descent
Convergence in expectation and with high probability, shown for:
Distributed convex optimization
(Potentially nonlinear) ARMA processes
Learning ranking facts
Pseudo-random sanity

J. Duchi et al., 2012, Ergodic Mirror Descent
Microsoft Research, 2016, Learning and stochastic optimization with non-iid data, https://www.youtube.com/watch?v=_yRnHRQVMgw

Conclusion

Outline
1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods
5 Conclusion

Conclusion
Usual (smooth, i.i.d.) setting ⇒ Stochastic gradient descent
Non-differentiable (non-smooth) objective ⇒ Stochastic subgradient methods
Additional information ⇒ Stochastic mirror descent
Non-i.i.d. input data ⇒ Ergodic mirror descent

Thank you