Page 1
Training Neural Networks: Optimization (update rule)
M. Soleymani
Sharif University of Technology
Fall 2017
Some slides have been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017,
some from Andrew Ng's lectures, “Deep Learning Specialization”, Coursera, 2017,
and a few from Hinton's lectures, “Neural Networks for Machine Learning”, Coursera, 2015.
Page 2
Outline
• Mini-batch gradient descent
• Gradient descent with momentum
• RMSprop
• Adam
• Learning rate and learning rate decay
• Batch normalization
Page 3
Reminder: The error surface
• For a linear neuron with a squared error, it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
• For multi-layer nets the error surface is much more complicated.
– But locally, a piece of a quadratic bowl is usually a very good approximation.
Page 4
Mini-batch gradient descent
• Large datasets
– Divide the training set into smaller mini-batches, each containing a subset of the training examples
– Weights are updated after processing the training data in each of these mini-batches
• Vectorization provides efficiency
Page 5
Gradient descent methods
Mini-batches: (X^{1}, Y^{1}), (X^{2}, Y^{2}), …, (X^{m}, Y^{m})

Batch size = 1: Stochastic gradient descent
Batch size = n (the size of the training set): Batch gradient descent
In between (e.g., batch size = 32, 64, 128, 256): Stochastic mini-batch gradient descent

Notation: n: total number of training examples; bs: batch size; m = n / bs: the number of mini-batches
Page 6
Mini-batch gradient descent
For epoch = 1, …, k
  For t = 1, …, m   (1 epoch = a single pass over all training samples, i.e., over all m mini-batches)
    1. Forward propagation on X^{t} (vectorized computation):
       A^[0] = X^{t}
       For l = 1, …, L:
         Z^[l] = W^[l] A^[l−1]
         A^[l] = f^[l](Z^[l])
       Ŷ^{t} = A^[L]
    2. Compute the cost on this mini-batch:
       J^{t} = (1/bs) Σ_{n∈Batch_t} L(Ŷ_n^{t}, Y_n^{t}) + λR(W)
    3. Backpropagation on J^{t} to compute the gradients dW^[l]
    4. For l = 1, …, L:
       W^[l] = W^[l] − α dW^[l]
Mini-batches: (X^{1}, Y^{1}), (X^{2}, Y^{2}), …, (X^{m}, Y^{m})
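A minimal NumPy sketch of this loop for the simplest case (a single linear layer with squared error, the quadratic bowl from the earlier slide); the data, sizes, and hyperparameter values are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1024, 5                       # n training examples, input dimension (assumed)
X = rng.standard_normal((N, D))
Y = X @ rng.standard_normal((D, 1)) + 0.1 * rng.standard_normal((N, 1))

W = np.zeros((D, 1))                 # parameters of the linear neuron
alpha, lam, bs, epochs = 0.05, 1e-4, 64, 10
m = N // bs                          # number of mini-batches per epoch

for epoch in range(epochs):
    perm = rng.permutation(N)        # reshuffle once per epoch
    for t in range(m):
        idx = perm[t * bs:(t + 1) * bs]
        Xt, Yt = X[idx], Y[idx]      # one mini-batch (vectorized)
        Y_hat = Xt @ W               # forward propagation
        J = np.mean((Y_hat - Yt) ** 2) + lam * np.sum(W ** 2)
        dW = 2 * Xt.T @ (Y_hat - Yt) / bs + 2 * lam * W   # gradient (analytic here)
        W -= alpha * dW              # weight update after each mini-batch
    print(f"epoch {epoch}: J = {J:.4f}")
```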
Page 7
Gradient descent methods
Batch size = n (the size of the training set): Batch gradient descent
• Needs to process the whole training set for each weight update

Batch size = 1: Stochastic gradient descent
• Does not use the vectorized form and thus is not computationally efficient

Batch size in between (e.g., 32, 64, 128, 256): Stochastic mini-batch gradient descent
• Vectorization
• Fastest learning (for a proper batch size)
Page 8
Batch size
• Full batch (batch size = N)
• SGD (batch size = 1)
• Mini-batch (batch size = 128)
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 9
Choosing mini-batch size
• For small training sets (e.g., n < 2000), use full-batch gradient descent
• Typical mini-batch sizes for larger training sets:
– 64, 128, 256, 512
• Make sure that one mini-batch of training data, together with the forward/backward values that need to be cached, fits in CPU/GPU memory
Page 10
Mini-batch gradient descent: loss-#epoch curve
Page 11
Mini-batch gradient descent vs. full-batch
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 12
Problems with gradient descent
• Poor conditioning
• Saddle points and local minima
• Noisy gradients in SGD
Page 13
Problems with gradient descent
• What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?
Page 14
Problems with gradient descent
• What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?
• Very slow progress along the shallow dimension, jitter along the steep direction
Poor conditioning
Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large
Page 15
Convergence speed of gradient descent
• Learning rate
– If the learning rate is big, the weights slosh to and fro across the ravine.
– If the learning rate is too big, this oscillation diverges.
– What we would like to achieve:
• Move quickly in directions with small but consistent gradients.
• Move slowly in directions with big but inconsistent gradients.
• How to train your neural networks much more quickly?
Page 16
Optimization: Problems with SGD
• What if the loss function has a local minimum or saddle point?
Page 17
Optimization: Problems with SGD
• What if the loss function has a local minimum or saddle point?
Zero gradient, gradient descent gets stuck
Saddle points are much more common in high dimensions
Page 18
The problem of local optima
In high-dimensional parameter spaces, most points of zero gradient are saddle points rather than local optima.
What about very high-dimensional spaces?
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 19
The problem of local optima
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 20
The problem of local optima
• For a function on a very high-dimensional space:
– if the gradient is zero, then in each direction the function can locally curve either upward (convex-like) or downward (concave-like).
– for the point to be a local optimum, all directions need to curve the same way.
• So the chance of that happening is very small in high dimensions.
Page 21
Problem of plateaus
• The real problem is near saddle points
• A plateau is a region where the derivative is close to zero for a long time
– Very, very slow progress
– Eventually, the algorithm can find its way off the plateau.
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 22
Problems with stochastic gradient descent
• Noisy gradient estimates from SGD
• Might need a long time to converge
Page 23
SGD + Momentum
• v_dW = 0
• On each iteration:
– Compute ∇_W J on the current mini-batch
– v_dW = β v_dW + (1 − β) ∇_W J
– W = W − α v_dW
• We will show that v_dW is a moving average of recent gradients with exponentially decaying weights
• Builds up “velocity” as a running mean of gradients, e.g., β = 0.9 or 0.99
Page 24
Momentum helps to reduce problems with SGD
Page 25
Exponentially weighted averages
v_0 = 0
v_1 = 0.9 v_0 + 0.1 θ_1
…
v_t = 0.9 v_{t−1} + 0.1 θ_t
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 26
Exponentially weighted averages
• 𝑣0 = 0
• v_t = β v_{t−1} + (1 − β) θ_t
(figure: curves of v_t for β = 0.9 and β = 0.98)
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 27
Exponentially weighted averages
• 𝑣0 = 0
• v_t = β v_{t−1} + (1 − β) θ_t
• Expanding the recursion:
v_t = (1 − β)(θ_t + β θ_{t−1} + β² θ_{t−2} + ⋯ + β^{t−1} θ_1)
• Since (1 − ε)^{1/ε} ≈ 1/e, e.g., 0.9^{10} ≈ 1/e and 0.98^{50} ≈ 1/e,
v_t is roughly an average over the last 1/(1 − β) values of θ.
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 28
Bias correction
• Computation of these averages more accurately:
v_t^c = v_t / (1 − β^t)
• Example (β = 0.98):
v_0 = 0
v_1 = 0.98 v_0 + 0.02 θ_1 = 0.02 θ_1
v_2 = 0.98 v_1 + 0.02 θ_2 = 0.0196 θ_1 + 0.02 θ_2
v_2^c = (0.0196 θ_1 + 0.02 θ_2) / (1 − 0.98²) = (0.0196 θ_1 + 0.02 θ_2) / 0.0396
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
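A small NumPy sketch of the exponentially weighted average and its bias correction; the observation sequence θ is an illustrative assumption, chosen so the β = 0.98 numbers above can be checked directly.

```python
import numpy as np

beta = 0.98
theta = np.array([10.0, 12.0, 11.0, 13.0])   # assumed sequence of observations

v = 0.0
for t, th in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * th           # raw average: starts far too small
    v_corrected = v / (1 - beta ** t)        # bias-corrected estimate
    print(t, round(v, 4), round(v_corrected, 4))
```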
Page 29
Bias correction
v_t^c = v_t / (1 − β^t)
• While you are still warming up your estimate, the bias correction helps.
• When t is large enough, the bias correction makes almost no difference.
• In practice, people often don't bother to implement bias correction:
– they rather just wait out that initial period and accept a slightly more biased estimate.
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 30
Gradient descent with momentum
• v_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– v_dW = β v_dW + (1 − β) dW
– W = W − α v_dW
• Hyper-parameters: α and β (typically β = 0.9)
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 31
Gradient descent with momentum
• v_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– v_dW = β v_dW + (1 − β) dW
– W = W − α v_dW
• Hyper-parameters: α and β (typically β = 0.9)
• An alternative version omits the (1 − β) factor: v_dW = β v_dW + dW
– the learning rate would need to be tuned differently for these two different versions
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 32
Gradient descent with momentum
• v_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– v_dW = β v_dW + dW
– W = W − α v_dW
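A NumPy sketch of the momentum update on a toy ill-conditioned quadratic loss; the loss, β, and α are illustrative assumptions (the commented-out line shows the alternative version without the (1 − β) factor, which would need a different α).

```python
import numpy as np

def grad(W):
    # gradient of a toy ill-conditioned bowl J(W) = 0.5 * (100 * W1**2 + W2**2)
    return np.array([100.0, 1.0]) * W

W = np.array([1.0, 1.0])
v = np.zeros_like(W)
alpha, beta = 0.02, 0.9

for _ in range(300):
    dW = grad(W)
    v = beta * v + (1 - beta) * dW    # momentum with the (1 - beta) factor
    # v = beta * v + dW               # alternative version: alpha must be retuned
    W = W - alpha * v
print(W)                              # both coordinates end up close to the minimum at (0, 0)
```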
Page 33
Nesterov Momentum
• Nesterov momentum is a slightly different version of the momentum update
– It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum
Page 34
Nesterov Momentum
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983; Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004; Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013.
Page 35
Nesterov Momentum
(Figure: update directions for the simple update rule, momentum, and Nesterov momentum.)
Page 36
Nesterov Accelerate Gradient (NAG)
• Nesterov accelerated gradient
– is slightly annoying: it requires the gradient at x_t + ρv_t instead of at x_t
• We would like to evaluate the gradient and the loss at the same point
– With a change of variables, this is resolved (see the sketch below):
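A NumPy sketch of the change-of-variables form of the Nesterov update (the cs231n-style rewrite); the toy quadratic loss and the values of ρ and the learning rate are illustrative assumptions.

```python
import numpy as np

def grad(x):                         # gradient of a toy quadratic bowl
    return np.array([4.0, 1.0]) * x

rho, lr = 0.9, 0.1
x = np.array([3.0, 3.0])
v = np.zeros_like(x)

for _ in range(100):
    # direct form: evaluate the gradient at the "look-ahead" point x + rho * v
    #   v = rho * v - lr * grad(x + rho * v);  x = x + v
    # change of variables: gradient and loss are evaluated at the same point x
    v_prev = v
    v = rho * v - lr * grad(x)
    x = x + (-rho * v_prev + (1 + rho) * v)
print(x)                             # close to the minimum at the origin
```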
Page 37
Nesterov Momentum
Page 38
RMSprop (Root Mean Square prop)
• How to adjust the scale of each dimension of 𝑑𝑊?
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 39
RMSprop
• S_dW = 0
• On iteration t:
– Compute dW on the current mini-batch
– S_dW = β S_dW + (1 − β) dW²   (element-wise: (dW²)_i = dW_i ∗ dW_i)
– W = W − α dW / √S_dW
• S_dW: an exponentially weighted average of the squares of the derivatives (second-order moment)
• Adagrad instead accumulates S_dW = S_dW + dW² (the sum of the squares of the derivatives)
– so the step size becomes very small as that algorithm runs
Page 40
RMSprop
• Updates in the vertical (steep) direction are divided by a much larger number, while updates in the horizontal (shallow) direction are divided by a smaller number.
• Thus, you can use a larger learning rate α.
Page 41
RMSprop
• S_dW = 0
• On iteration t:
– Compute dW on the current mini-batch
– S_dW = β S_dW + (1 − β) dW²
– W = W − α dW / (√S_dW + ε),   ε = 10⁻⁸
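A NumPy sketch of the RMSprop update above, with the Adagrad accumulation shown as a comment for comparison; the toy gradient function and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def grad(W):
    # toy ill-conditioned quadratic: steep first coordinate, shallow second
    return np.array([10.0, 0.1]) * W

W = np.array([1.0, 1.0])
S = np.zeros_like(W)
alpha, beta, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    dW = grad(W)
    S = beta * S + (1 - beta) * dW ** 2      # element-wise square of the gradient
    # Adagrad would instead accumulate: S = S + dW ** 2 (step size keeps shrinking)
    W = W - alpha * dW / (np.sqrt(S) + eps)  # per-dimension rescaled step
print(W)                                     # both coordinates end up near the minimum
```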
Page 43
Adam (Adaptive Moment Estimation)
• Takes RMSprop and momentum idea together
• V_dW = 0, S_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– V_dW = β₁ V_dW + (1 − β₁) dW
– S_dW = β₂ S_dW + (1 − β₂) dW²
– W = W − α V_dW / (√S_dW + ε)
• Problem: may cause very large steps at the beginning
• Solution: bias correction, since the first- and second-moment estimates start at zero
Page 44
Adam (full form)
• Takes RMSprop and momentum idea together
• V_dW = 0, S_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– V_dW = β₁ V_dW + (1 − β₁) dW
– S_dW = β₂ S_dW + (1 − β₂) dW²
– V_dW^c = V_dW / (1 − β₁^t)
– S_dW^c = S_dW / (1 − β₂^t)
– W = W − α V_dW^c / (√(S_dW^c) + ε)
Page 45
Adam
• Takes RMSprop and momentum idea together
• V_dW = 0, S_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– V_dW = β₁ V_dW + (1 − β₁) dW   (momentum)
– S_dW = β₂ S_dW + (1 − β₂) dW²   (RMSprop)
– V_dW^c = V_dW / (1 − β₁^t),  S_dW^c = S_dW / (1 − β₂^t)   (bias correction)
– W = W − α V_dW^c / (√(S_dW^c) + ε)
• Hyperparameter choice:
– α: needs to be tuned
– β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
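A NumPy sketch of the full Adam update above with the recommended defaults; the toy gradient function and the number of iterations are illustrative assumptions.

```python
import numpy as np

def grad(W):
    return np.array([10.0, 0.1]) * W           # toy ill-conditioned quadratic

W = np.array([1.0, 1.0])
V = np.zeros_like(W)                            # first moment (momentum)
S = np.zeros_like(W)                            # second moment (RMSprop)
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    dW = grad(W)
    V = beta1 * V + (1 - beta1) * dW
    S = beta2 * S + (1 - beta2) * dW ** 2
    V_c = V / (1 - beta1 ** t)                  # bias correction
    S_c = S / (1 - beta2 ** t)
    W = W - alpha * V_c / (np.sqrt(S_c) + eps)
print(W)                                        # close to the minimum at the origin
```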
Page 47
Example
RMSprop, AdaGrad, and Adam can adaptively tune the update vector size (per parameter). They are also known as methods with adaptive learning rates.
Source: http://cs231n.github.io/neural-networks-3/
Page 48
Example
Source: http://cs231n.github.io/neural-networks-3/
Page 49
Learning rate
• SGD, SGD+Momentum, Adagrad, RMSProp, and Adam all have a learning rate as a hyperparameter.
• The learning rate is an important hyperparameter and is usually the first one to tune.
Page 50
Which one of these learning rates is best to use?
Page 51
Choosing learning rate parameter
• Loss not going down: learning rate too low
• Loss exploding: learning rate too high
• A rough range of learning rates to cross-validate over is somewhere in [1e-3 … 1e-5]
• Cost = NaN almost always means the learning rate is too high...
Page 52
Learning rate decay
• During the initial steps of learning, you can afford to take much bigger steps.
• But as learning approaches convergence, a slower learning rate allows you to take smaller steps.
Page 53
Learning rate decay
• Mini-batch gradient descent with a fixed learning rate won't exactly converge; it keeps oscillating around the minimum.
• Slowly reducing the learning rate over time lets it end up oscillating in a tighter region around the minimum.
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 54
Learning rate decay over time
• Step decay: decay the learning rate by half every few epochs.
• Exponential decay:
– α = α₀ e^{−decay_rate × epoch_num}, or α = α₀ · 0.95^{epoch_num}
• 1/t decay:
– α = α₀ / (1 + decay_rate × epoch_num)
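A small sketch of the three schedules above; α₀, decay_rate, and the step-decay interval are illustrative assumptions.

```python
import numpy as np

alpha0, decay_rate = 0.1, 0.1

def step_decay(epoch, drop_every=10):              # halve every few epochs
    return alpha0 * 0.5 ** (epoch // drop_every)

def exponential_decay(epoch):
    return alpha0 * np.exp(-decay_rate * epoch)    # or: alpha0 * 0.95 ** epoch

def one_over_t_decay(epoch):
    return alpha0 / (1 + decay_rate * epoch)

for epoch in [0, 1, 5, 10, 20]:
    print(epoch, step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))
```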
Page 57
Monitoring loss function during iterations
Page 58
Track the ratio of the weight-update magnitude to the weight magnitude:
e.g., 0.0002 / 0.02 = 0.01 (about okay); you want this ratio to be somewhere around 0.001 or so.
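A NumPy sketch of this check (following the cs231n note); W and dW here are random placeholder arrays rather than real network parameters and gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.02 * rng.standard_normal((100, 100))      # assumed parameter matrix
dW = 0.01 * rng.standard_normal((100, 100))     # assumed gradient
learning_rate = 1e-3

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW
update_scale = np.linalg.norm(update.ravel())
W += update                                     # the actual parameter update
print(update_scale / param_scale)               # want this to be around 1e-3 or so
```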
Page 59
First-order optimization
Page 60
First-order optimization
Page 61
Second-order optimization
Page 62
Second-order optimization
What is nice about this update? No hyperparameters! No learning rate!
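The slide's equations are not reproduced in this text; for reference, the update it refers to is presumably the standard Newton step obtained from a second-order Taylor approximation of the loss:

J(W) ≈ J(W₀) + (W − W₀)ᵀ ∇_W J(W₀) + ½ (W − W₀)ᵀ H (W − W₀)
⇒ W ← W₀ − H⁻¹ ∇_W J(W₀),   where H is the Hessian of J at W₀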
Page 63
Second-order optimization
Why is this bad for deep learning?
The Hessian has O(N²) elements; inverting it takes O(N³), where N = (tens or hundreds of) millions of parameters.
Page 64
Second-order optimization: L-BFGS
• Quasi-Newton methods (BFGS being the most popular):
– instead of inverting the Hessian (O(n³)), approximate the inverse Hessian with rank-1 updates over time (O(n²) each).
• L-BFGS (Limited-memory BFGS):
– Does not form/store the full inverse Hessian.
• L-BFGS usually works very well in full-batch, deterministic mode
– i.e., it works very well when you have a single, deterministic cost function
• But it does not transfer very well to the mini-batch setting.
– Gives bad results.
– Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
Page 65
In practice
• Adam is a good default choice in most cases
• If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise)
Page 66
Before going to generalization in the next lecture
We now look at an important technique that helps speed up training while also acting as a form of regularization.
Page 67
Recall: Normalizing inputs
• On the training set, compute the mean and variance of each input feature:
– μ_i = (1/N) Σ_{n=1}^{N} x_i^{(n)}
– σ_i² = (1/N) Σ_{n=1}^{N} (x_i^{(n)} − μ_i)²
• Remove the mean: subtract from each input feature the mean of that feature
– x_i ← x_i − μ_i
• Normalize the variance:
– x_i ← x_i / σ_i
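A NumPy sketch of this preprocessing; X is an assumed data matrix with one example per row, and the same training-set μ and σ should also be applied to the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# assumed training data: 1000 examples, 3 features with different scales and offsets
X = rng.standard_normal((1000, 3)) * np.array([5.0, 0.5, 50.0]) + np.array([1.0, -2.0, 10.0])

mu = X.mean(axis=0)                  # per-feature mean over the training set
sigma2 = X.var(axis=0)               # per-feature variance
X_norm = (X - mu) / np.sqrt(sigma2)  # zero mean, unit variance per feature
print(X_norm.mean(axis=0), X_norm.var(axis=0))
```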
Page 68
Batch normalization
• “you want unit gaussian activations? just make them so.”
• Can we normalize 𝑎[𝑙] too?
• Normalizing 𝑧[𝑙] is much more common
• Makes neural networks more robust to therange of hyperparameters
– Much more easily train a very deep network
(Network diagram: x → z^[1] = W^[1]x → a^[1] = f^[1](z^[1]) → ⋯ → z^[L] = W^[L]a^[L−1] → a^[L] = f^[L](z^[L]) = output)
Page 69
Batch normalization
• Consider a batch of activations at some layer.
• To make each dimension unit gaussian, apply:
μ^[l] = (1/b) Σ_i z^{(i)[l]}
σ²^[l] = (1/b) Σ_i (z^{(i)[l]} − μ^[l])²
z_norm^{(i)[l]} = (z^{(i)[l]} − μ^[l]) / √(σ²^[l] + ε)
[Ioffe and Szegedy, 2015]
Page 70
Batch normalization
• To make each dimension unit gaussian, apply:
μ^[l] = (1/b) Σ_i z^{(i)[l]}
σ²^[l] = (1/b) Σ_{i=1}^{b} (z^{(i)[l]} − μ^[l])²
z_norm^{(i)[l]} = (z^{(i)[l]} − μ^[l]) / √(σ²^[l] + ε)
• Problem: do we necessarily want a unit gaussian input to an activation layer?
z̃^{(i)[l]} = γ^[l] z_norm^{(i)[l]} + β^[l]
γ^[l] and β^[l] are learnable parameters of the model
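A NumPy sketch of this forward pass for one layer's pre-activations Z with one example per row; the shapes, γ, β, and ε values are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=0)                       # per-unit mean over the mini-batch
    var = Z.var(axis=0)                       # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # unit-gaussian normalization
    return gamma * Z_norm + beta              # learnable scale and shift

rng = np.random.default_rng(0)
Z = rng.standard_normal((64, 10)) * 3.0 + 2.0           # assumed pre-activations
out = batchnorm_forward(Z, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))   # ≈ 0 and ≈ 1
```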
Page 71
Batch normalization
z_norm^{(i)[l]} = (z^{(i)[l]} − μ^[l]) / √(σ²^[l] + ε)
• Allow the network to squash the range if it wants to:
z̃^{(i)[l]} = γ^[l] z_norm^{(i)[l]} + β^[l]
• The network can learn to recover the identity:
γ^[l] = √(σ²^[l] + ε),  β^[l] = μ^[l]  ⇒  z̃^{(i)[l]} = z^{(i)[l]}
(e.g., whether or not to stay in the linear region of tanh)
[Ioffe and Szegedy, 2015]
Page 72
Batch normalization at test time
• At test time the BatchNorm layer functions differently:
– The mean/std are not computed based on the batch.
– Instead, a single fixed empirical mean of activations from training is used (e.g., it can be estimated during training with running averages).
• Running-average estimate of the mean over training mini-batches i = 1, …, m:
– μ^[l] = 0
– for i = 1, …, m:  μ^[l] = β μ^[l] + (1 − β) · (1/b) Σ_{j=1}^{b} X_j^{{i}}
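A NumPy sketch of maintaining running statistics during training and using them at test time; the momentum value, shapes, and synthetic batches are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
running_mu = np.zeros(10)
running_var = np.ones(10)
momentum = 0.9                                    # beta in the update above

for _ in range(100):                              # loop over training mini-batches
    Z = rng.standard_normal((64, 10)) * 3.0 + 2.0
    running_mu = momentum * running_mu + (1 - momentum) * Z.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * Z.var(axis=0)

# at test time, normalize a single example with the fixed running statistics
z_test = rng.standard_normal(10) * 3.0 + 2.0
z_norm = (z_test - running_mu) / np.sqrt(running_var + 1e-8)
print(z_norm.round(3))
```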
Page 73
How does Batch Norm work?
• Learning on shifting input distributions
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 74
Why is this a problem for neural networks?
Batch normalization limits the amount by which updating the weights in earlier layers can change the distribution of values that the subsequent layers see.
The values become more stable, giving the later layers firmer ground to learn on.
(Diagram: a network with weights W^[1], W^[2], W^[3], W^[4], W^[5].)
[Andrew Ng, Deep Learning Specialization, 2017]
© 2017 Coursera Inc.
Page 75
Batch normalization: summary
• Speeds up the learning process
– Weakens the coupling between the parameters of earlier layers and later layers
• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the strong dependence on initialization
• Acts as a form of regularization in a funny way (as we will see)
Page 76
Resources
• Deep Learning Book, Chapter 8.
• Please see the following note:– http://cs231n.github.io/neural-networks-3/