Page 1: Optimization for Deep Networks

Optimization for Deep Networks
Ishan Misra

Page 2: Optimization for Deep Networks

Overview

• Vanilla SGD

• SGD + Momentum

• NAG

• Rprop

• AdaGrad

• RMSProp

• AdaDelta

• Adam

Page 3: Optimization for Deep Networks

More tricks

• Batch Normalization

• Natural Neural Networks

Page 4: Optimization for Deep Networks

Gradient (Steepest) Descent

• Move in the opposite direction of the gradient

Page 5: Optimization for Deep Networks

Conjugate Gradient Methods

• See Møller, 1993 [A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning] and Martens, 2010 [Deep Learning via Hessian-Free Optimization]

Page 6: Optimization for Deep Networks

Notation

θ: parameters of the network

f(θ): loss, a function of the network parameters

Page 7: Optimization for Deep Networks

Properties of Loss function for SGD

• The loss over all samples must decompose into a sum of per-sample losses
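In symbols, using the notation of the following slides (one common convention is to average over the N samples; ℬ denotes a mini-batch):

```latex
f(\theta) = \frac{1}{N}\sum_{i=1}^{N} f_i(\theta),
\qquad
\nabla f(\theta) \approx \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \nabla f_i(\theta)
```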

Page 8: Optimization for Deep Networks

Vanilla SGD

θ_{t+1} = θ_t − η ∇f(θ_t)

θ: parameters of the network
f(θ): loss, a function of the network parameters
t: iteration number
η: step size / learning rate
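A minimal numpy sketch of this update (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """Vanilla SGD: move against the mini-batch gradient by a fixed step size."""
    return theta - lr * grad

# usage (illustrative): theta = sgd_step(theta, grad_fn(theta, minibatch), lr=0.01)
```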

Page 9: Optimization for Deep Networks

SGD + Momentum

• Plain SGD can make erratic updates on non-smooth loss functions
  • Consider an outlier example which “throws off” the learning process

• Maintain some history of updates

• Physics example
  • A moving ball acquires “momentum”, at which point it becomes less sensitive to the direct force (the gradient)

Page 10: Optimization for Deep Networks

SGD + Momentum

v_{t+1} = μ v_t − η ∇f(θ_t)
θ_{t+1} = θ_t + v_{t+1}

θ: parameters of the network
f(θ): loss, a function of the network parameters
t: iteration number
η: step size / learning rate
μ: momentum
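A sketch of the classical momentum update in numpy (assuming the common formulation above; names are illustrative):

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    """SGD with momentum: keep a decaying history (velocity) of past updates."""
    velocity = mu * velocity - lr * grad   # accumulate past updates, decayed by mu
    theta = theta + velocity
    return theta, velocity

# velocity is initialized to np.zeros_like(theta); mu is the momentum coefficient.
```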

Page 11: Optimization for Deep Networks

SGD + Momentum

• At iteration 𝑡 you add the update from a previous iteration 𝑡−𝑛 with weight 𝜇^𝑛

• You effectively multiply your updates by 1/(1−𝜇)
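The 1/(1−𝜇) factor comes from summing the geometric series of decayed past contributions (assuming, for illustration, a constant gradient g):

```latex
\sum_{n=0}^{\infty} \mu^{n}\,(-\eta g) \;=\; \frac{-\eta g}{1-\mu},
\qquad 0 \le \mu < 1
```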

Page 12: Optimization for Deep Networks

Nesterov Accelerated Gradient (NAG)

• Nesterov, 1983; see also Sutskever et al., 2013

• First make a jump as directed by momentum

• Then depending on where you land, correct the parameters

Page 13: Optimization for Deep Networks

NAG

v_{t+1} = μ v_t − η ∇f(θ_t + μ v_t)
θ_{t+1} = θ_t + v_{t+1}

θ: parameters of the network
f(θ): loss, a function of the network parameters
t: iteration number
η: step size / learning rate
μ: momentum
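A sketch of the look-ahead form of NAG (assuming grad_fn returns the mini-batch gradient at a given parameter vector; names are illustrative):

```python
import numpy as np

def nag_step(theta, grad_fn, velocity, lr=0.01, mu=0.9):
    """Nesterov momentum: first jump along the momentum, then correct."""
    lookahead = theta + mu * velocity                    # jump as directed by momentum
    velocity = mu * velocity - lr * grad_fn(lookahead)   # correct using the gradient where you land
    return theta + velocity, velocity
```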

Page 14: Optimization for Deep Networks

NAG vs Standard Momentum

Page 15: Optimization for Deep Networks

Why anything new (beyond Momentum/NAG)?

• How to set learning rate and decay of learning rates?

• Ideally want adaptive learning rates

Page 16: Optimization for Deep Networks

Why anything new (beyond Momentum/NAG)?

• Neurons in each layer learn differently
  • Gradient magnitudes vary across layers
  • Early layers get “vanishing gradients”

• Should ideally use separate adaptive learning rates
  • One of the reasons for having “gain” or lr multipliers in Caffe

• Adaptive learning rate algorithms
  • Jacobs, 1989: agreement in sign between the current gradient for a weight and the velocity for that weight

• Use larger mini-batches

Page 17: Optimization for Deep Networks

Adaptive Learning Rates

Page 18: Optimization for Deep Networks

Resilient Propagation (Rprop)

• Riedmiller and Braun 1993

• Address the problem of adaptive learning rate

• Increase the learning rate for a weight multiplicatively if signs of last two gradients agree

• Else decrease learning rate multiplicatively

Page 19: Optimization for Deep Networks

Rprop Update
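A simplified sketch of the sign-based update (a Rprop variant without the backtracking step; constants such as 1.2 and 0.5 are common defaults, not taken from the slides):

```python
import numpy as np

def rprop_step(theta, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """Rprop: adapt a per-weight step size multiplicatively, update by the sign only."""
    agree = np.sign(grad) * np.sign(prev_grad)
    step = np.where(agree > 0, np.minimum(step * eta_plus, step_max), step)   # signs agree: grow
    step = np.where(agree < 0, np.maximum(step * eta_minus, step_min), step)  # signs flip: shrink
    theta = theta - np.sign(grad) * step   # the gradient magnitude is never used
    return theta, step, grad

# step is initialized to a small constant, e.g. np.full_like(theta, 0.1).
```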

Page 20: Optimization for Deep Networks

Rprop Initialization

• Initialize all updates at iteration 0 to a constant value
  • If you set both learning rates to 1, you get the “Manhattan update rule”

• Rprop effectively divides the gradient by its magnitude
  • You never update using the gradient itself, but by its sign

Page 21: Optimization for Deep Networks

Problems with Rprop

• Consider a weight that gets updates of +0.1 on nine mini-batches, and −0.9 on the tenth mini-batch

• SGD would keep this weight roughly where it started

• Rprop would increment the weight nine times by 𝛿, and then decrease it by 𝛿 on the tenth update
  • Effective update: 9𝛿 − 𝛿 = 8𝛿

• Across mini-batches we scale updates very differently

Page 22: Optimization for Deep Networks

Adaptive Gradient (AdaGrad)

• Duchi et al., 2010

• We need to scale updates across mini-batches similarly

• Use the history of gradient updates as an indicator for scaling
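A sketch of the AdaGrad scaling in numpy (eps is a small stability constant, an assumption not shown on the slides):

```python
import numpy as np

def adagrad_step(theta, grad, sq_sum, lr=0.01, eps=1e-8):
    """AdaGrad: per-weight learning rate scaled by the accumulated squared gradients."""
    sq_sum = sq_sum + grad ** 2                          # a running *sum*, it never decays
    theta = theta - lr * grad / (np.sqrt(sq_sum) + eps)
    return theta, sq_sum

# sq_sum starts at np.zeros_like(theta); the growing sum is what shrinks updates so aggressively.
```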

Page 23: Optimization for Deep Networks

Problems with AdaGrad

• Lowers the update size very aggressively

Page 24: Optimization for Deep Networks

RMSProp = Rprop + SGD

• Tieleman & Hinton, 2012 (Coursera Lecture 6, slide 29)

• Scale updates similarly across mini-batches

• Scale by a decaying average of the squared gradient
  • Rather than the sum of squared gradients as in AdaGrad
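A sketch of the RMSProp update (decay=0.9 is a common choice; eps is an added stability constant; names are illustrative):

```python
import numpy as np

def rmsprop_step(theta, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp: like Rprop's scaling, but with a decaying average of squared gradients."""
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2   # decaying average, not a sum
    theta = theta - lr * grad / (np.sqrt(mean_sq) + eps)
    return theta, mean_sq
```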

Page 25: Optimization for Deep Networks

RMSProp

• Has shown success for training Recurrent Models

• Using Momentum generally does not show much improvement

Page 26: Optimization for Deep Networks

Fancy RMSProp

• “No more pesky learning rates”, Schaul et al.
  • Computes a diagonal Hessian and uses something similar to RMSProp

• The diagonal Hessian computation requires an additional forward-backward pass
  • Double the time of SGD

Page 27: Optimization for Deep Networks

Units of update

• SGD update is in terms of gradient

Page 28: Optimization for Deep Networks

Unitless updates

• Updates are not in units of parameters

Page 29: Optimization for Deep Networks

Hessian updates

Page 30: Optimization for Deep Networks

Hessian gives correct units
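Sketching the units argument (assuming the loss f is unitless): the SGD step has the units of the inverse parameters, while a Newton-style step has the units of the parameters themselves:

```latex
\Delta\theta_{\text{SGD}} \propto \frac{\partial f}{\partial \theta} \propto \frac{1}{\text{units of } \theta},
\qquad
\Delta\theta_{\text{Newton}} \propto H^{-1}\nabla f \propto \frac{\partial f/\partial \theta}{\partial^2 f/\partial \theta^2} \propto \text{units of } \theta
```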

Page 31: Optimization for Deep Networks

AdaDelta

• Zeiler et al., 2012

• Get updates that match units

• Keep properties from RMSProp

• Updates should be of the form

Page 32: Optimization for Deep Networks

AdaDelta

• Approximate the denominator by the square root of a decaying average of squared gradients (an RMS)

• Approximate the numerator by the square root of a decaying average of squared updates

Page 34: Optimization for Deep Networks

AdaDelta Update Rule
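A sketch of the resulting update in numpy (rho and eps are typical defaults; both running averages start at zero):

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    """AdaDelta: RMS of past updates over RMS of past gradients, so units match theta."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # the numerator uses the average of squared updates from *previous* iterations (the time delay)
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    theta = theta + update
    return theta, avg_sq_grad, avg_sq_update
```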

Page 35: Optimization for Deep Networks

Problems with AdaDelta

• The moving averages are biased towards zero by their initialization

Page 36: Optimization for Deep Networks

Problems with AdaDelta

• Not the most intuitive first shot at fixing the “units of updates” problem

• This may just be me nitpicking

• Why do the updates have a time delay?
  • There is some explanation in the paper, but I am not convinced …

Page 37: Optimization for Deep Networks

Adam

• Kingma & Ba, 2015

• Keeps decaying averages of the gradient and the squared gradient

• Bias correction

Page 38: Optimization for Deep Networks

Adam update rule

• Updates are not in the correct units

• Simplification: does not include the decay on 𝛾₁
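A sketch of the Adam update with bias correction, written with the paper's β1, β2 notation (the slides use 𝛾); t is the 1-indexed iteration count and eps is a stability constant:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of the gradient (m) and squared gradient (v), bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction: m and v are initialized to zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```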

Page 39: Optimization for Deep Networks

AdaMax (an Adam variant based on the ℓ∞ norm)

Page 40: Optimization for Deep Networks

Adam Results – Logistic Regression

Page 41: Optimization for Deep Networks

Adam Results - MLP

Page 42: Optimization for Deep Networks

Adam Results – Conv Nets

Page 43: Optimization for Deep Networks

Visualization

Alec Radford

Page 44: Optimization for Deep Networks

Batch Normalization
Ioffe and Szegedy, 2015

Page 45: Optimization for Deep Networks

Distribution of input

• Having a fixed input distribution is known to help training of linear classifiers
  • Normalize inputs for SVMs

• Normalize inputs for Deep Networks

Page 46: Optimization for Deep Networks

Distribution of input at each layer

• Each layer would benefit if its input had constant distribution

Page 47: Optimization for Deep Networks

Normalize the input to each layer!

Page 48: Optimization for Deep Networks

Is this normalization a good idea?

• Consider inputs to a sigmoid layer

• If its inputs are normalized, the sigmoid may never “saturate”

Page 49: Optimization for Deep Networks

Modify normalization …

• Accommodate the identity transform

• Scale and shift with learned parameters (𝛾, 𝛽)
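A sketch of the training-time transform for a fully connected layer (names are illustrative; the running statistics used at test time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift with
    learned gamma and beta so that the identity transform can be recovered."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# x: (batch_size, features); gamma, beta: (features,), learned by backpropagation.
```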

Page 50: Optimization for Deep Networks

Batch Normalization Results

Page 51: Optimization for Deep Networks

Batch Normalization for ensembles

Page 52: Optimization for Deep Networks

Natural Gradients
Related work: PRONG, or Natural Neural Networks

Page 53: Optimization for Deep Networks

Gradients and orthonormal coordinates

• In Euclidean space, we define length using orthonormal coordinates

Page 54: Optimization for Deep Networks

What happens on a manifold?

• Use metric tensors (they generalize the dot product to manifolds)

• G is generally a symmetric PSD matrix

• It is a tensor because it transforms as G′ = JᵀGJ, where J is the Jacobian of the change of coordinates
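In symbols, the metric tensor G defines squared length on the manifold and transforms with the Jacobian J of a change of coordinates:

```latex
\|d\theta\|_G^2 = d\theta^{\top} G\, d\theta,
\qquad
G' = J^{\top} G\, J
```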

Page 55: Optimization for Deep Networks

Gradient descent revisited

• What exactly does a gradient descent update step do?

Page 56: Optimization for Deep Networks

Gradient descent revisited

• What exactly does a gradient descent update step do?

Distance is measured in orthonormal coordinates
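One standard way to make this precise (a sketch, not taken verbatim from the slides): the gradient step minimizes a linearization of f penalized by squared Euclidean distance, i.e. distance measured in orthonormal coordinates:

```latex
\theta_{t+1}
= \arg\min_{\theta}\; \nabla f(\theta_t)^{\top}(\theta - \theta_t) + \frac{1}{2\eta}\,\|\theta - \theta_t\|_2^2
= \theta_t - \eta\,\nabla f(\theta_t)
```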

Page 57: Optimization for Deep Networks

Relationship: Natural Gradient vs. Gradient

F is the Fisher information matrix
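Written out (standard definition, with F the Fisher information matrix):

```latex
\tilde{\nabla} f(\theta) = F^{-1}\nabla f(\theta),
\qquad
\theta_{t+1} = \theta_t - \eta\, F^{-1}\nabla f(\theta_t)
```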

Page 58: Optimization for Deep Networks

PRONG: Projected Natural Gradient Descent

• Main idea is to avoid computing the inverse of the Fisher matrix

• Reparametrize the network so that the Fisher matrix is the identity
  • Hint: whitening …

Page 59: Optimization for Deep Networks

Reparametrization

• Each layer has two components:
  • a weight matrix (learned), which contains all of the old 𝜃
  • a whitening matrix (estimated)

Page 60: Optimization for Deep Networks

Reparametrization

Page 61: Optimization for Deep Networks

Exact forms of weights

• 𝑈𝑖 is the normalized eigen-decomposition of Σ𝑖
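A sketch of estimating a whitening matrix from the layer-input covariance via its eigendecomposition (generic PCA-style whitening under standard assumptions, not the exact PRONG estimator):

```python
import numpy as np

def whitening_matrix(activations, eps=1e-5):
    """Estimate U, mu such that U @ (h - mu) has (approximately) identity covariance."""
    mu = activations.mean(axis=0)
    centered = activations - mu
    sigma = centered.T @ centered / len(activations)        # covariance of layer inputs
    eigvals, eigvecs = np.linalg.eigh(sigma)                # Sigma = E diag(lambda) E^T
    U = np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # U = Lambda^{-1/2} E^T
    return U, mu
```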

Page 62: Optimization for Deep Networks

PRONG Update rule

Page 63: Optimization for Deep Networks

Similarity to Batch Normalization

Page 64: Optimization for Deep Networks

PRONG: Results

Page 65: Optimization for Deep Networks

PRONG: Results

Page 66: Optimization for Deep Networks

Thanks!

Page 67: Optimization for Deep Networks

References

• RMSProp (climin documentation): http://climin.readthedocs.org/en/latest/rmsprop.html

• Lecture 6 slides, Tieleman and Hinton (Coursera): http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

• Training a 3-Node Neural Network is NP-Complete, Blum and Rivest, 1993

• Equilibrated Adaptive Learning Rates for Non-Convex Optimization, Dauphin et al.

• Practical Recommendations for Gradient-Based Training of Deep Architectures, Bengio

• Efficient BackProp, LeCun et al.

• Stochastic Gradient Descent Tricks, Leon Bottou