MIT 9.520/6.860, Fall 2018

Class 11: Neural networks – tips, tricks & software

Andrzej Banburski

Last time – Convolutional neural networks

source: github.com/vdumoulin/conv_arithmetic

More data and GPUs: large-scale datasets and general-purpose GPUs let AlexNet [Krizhevsky et al., 2012] outmatch the competition at ILSVRC 2012.


Overview

Initialization & hyper-parameter tuning

Optimization algorithms

Batchnorm & Dropout

Finite dataset woes

Software


Initialization & hyper-parameter tuning

Consider the problem of training a neural network fθ(x) by minimizing aloss

L(θ, x) =

N∑i=1

li(yi, fθ(xi)) + λ|θ|2

with SGD and mini-batch size b:

θt+1 = θt − η1

b

∑i∈B∇θL(θt, xi) (1)

I How should we choose the initial set of parameters θ?

I How about the hyper-parameters η, λ and b?

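For concreteness, here is a minimal PyTorch-style sketch of one mini-batch step of update (1); model, loss_fn and the mini-batch tensors x_batch, y_batch are assumed to be defined elsewhere, and the numeric values are illustrative:

    import torch

    eta, lam = 0.1, 1e-4   # learning rate η and regularization strength λ
    opt = torch.optim.SGD(model.parameters(), lr=eta, weight_decay=lam)

    opt.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)   # mean over the mini-batch gives the 1/b factor
    loss.backward()                           # autograd computes the averaged gradient over the batch
    opt.step()                                # θ_{t+1} = θ_t − η(gradient + λθ_t); weight_decay plays the role of the λ‖θ‖² term (up to a factor of 2)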

Weight Initialization

- First obvious observation: starting with 0 will make every weight update in the same way. Similarly, if the weights are too big we can run into NaN.
- What about θ₀ = ε × N(0, 1), with ε ≈ 10⁻²?
- For a few layers this would seem to work nicely.
- If we go deeper, however...
- Super slow update of the earlier layers (gradients of order 10^{−2L} after L layers) for sigmoid or tanh activations – vanishing gradients. ReLU activations do not suffer so much from this.

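A quick way to see the problem (a small illustrative sketch, not from the slides): push random inputs through a deep tanh network initialized with ε = 0.01 and watch the activations collapse layer by layer.

    import torch

    torch.manual_seed(0)
    d, L, eps = 512, 10, 1e-2
    x = torch.randn(1000, d)                  # a batch of random inputs
    for layer in range(L):
        W = eps * torch.randn(d, d)           # θ₀ = ε × N(0, 1)
        x = torch.tanh(x @ W)
        print(f"layer {layer + 1}: activation std = {x.std():.2e}")
    # the std shrinks geometrically with depth, so gradients backpropagated
    # to the early layers are suppressed by a similar factor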

Xavier & He initializations

- For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x₁, ..., x_{n_in}). To stop vanishing and exploding gradients we need

    Var(y) = Var(Wx) = Var(w₁x₁) + ··· + Var(w_{n_in} x_{n_in})

- If we assume that W and x are i.i.d. and have zero mean, then Var(y) = n_in Var(w_i) Var(x_i).
- If we want the inputs and outputs to have the same variance, this gives us Var(w_i) = 1/n_in.
- A similar analysis for the backward pass gives Var(w_i) = 1/n_out.
- The compromise is the Xavier initialization [Glorot et al., 2010]:

    Var(w_i) = 2/(n_in + n_out)    (2)

- Heuristically, ReLU is half of the linear function, so we can take

    Var(w_i) = 4/(n_in + n_out)    (3)

  An analysis in [He et al., 2015] confirms this.

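In PyTorch both schemes are available directly (a sketch; the layer sizes are arbitrary). Note that kaiming_normal_ uses the one-sided variant Var(w_i) = 2/n_in rather than the averaged form (3):

    import torch.nn as nn

    fc = nn.Linear(256, 128)
    nn.init.xavier_normal_(fc.weight)                            # Var(w_i) = 2/(n_in + n_out), eq. (2)

    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
    nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')    # He initialization for ReLU layers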

Hyper-parameter tuning

How do we choose the optimal hyper-parameters η, λ and b?

- Basic idea: split your training dataset into a smaller training set and a cross-validation set.
  – Run a coarse search (on a logarithmic scale) over the parameters for just a few epochs of SGD and evaluate on the cross-validation set.
  – Perform a finer search.
- Interestingly, [Bergstra and Bengio, 2012] shows that it is better to run the search randomly than on a grid.

source: [Bergstra and Bengio, 2012]

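A minimal sketch of such a coarse random search on a log scale; train_and_validate is an assumed helper that trains for a few epochs and returns cross-validation accuracy:

    import numpy as np

    rng = np.random.default_rng(0)
    best = None
    for _ in range(20):                                  # 20 random configurations
        eta = 10 ** rng.uniform(-4, -1)                  # learning rate η on a log scale
        lam = 10 ** rng.uniform(-6, -2)                  # regularization λ on a log scale
        b = int(rng.choice([32, 64, 128, 256]))          # mini-batch size
        acc = train_and_validate(eta, lam, b, epochs=3)  # assumed helper: short run, CV accuracy
        if best is None or acc > best[0]:
            best = (acc, eta, lam, b)
    print("best (accuracy, η, λ, b):", best)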

Decaying learning rate

- To improve convergence of SGD, we have to use a decaying learning rate.
- Typically we use a scheduler – decrease η after some fixed number of epochs.
- This allows the training loss to keep improving after it has plateaued.

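For example, a step-decay schedule in PyTorch (a sketch; model and train_one_epoch are assumed):

    import torch

    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)  # η ← 0.1 η every 30 epochs

    for epoch in range(90):
        train_one_epoch(model, opt)   # assumed helper: one pass over the mini-batches
        sched.step()                  # decay the learning rate on schedule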

Batch-size & learning rate

An interesting linear scaling relationship seems to exist between the learning rate η and the mini-batch size b:

- In the SGD update they appear as a ratio η/b, with an additional implicit dependence of the sum of gradients on b.
- If b ≪ N, we can approximate SGD by a stochastic differential equation with a noise scale g ≈ ηN/b [Smith & Le, 2017].
- This means that instead of decaying η, we can increase the batch size dynamically.
- As b approaches N the dynamics become more and more deterministic, and we would expect this relationship to vanish.

source: [Smith et al., 2018]


[Figure: batch-size & learning rate; source: Goyal et al., 2017]
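The linear scaling rule of [Goyal et al., 2017] – when the batch size is multiplied by k, multiply η by k – can be written as a one-line helper (a sketch; the base values are illustrative):

    base_lr, base_batch = 0.1, 256

    def scaled_lr(batch_size):
        # keep η proportional to the batch size, i.e. η/b roughly constant
        return base_lr * batch_size / base_batch

    print(scaled_lr(1024))   # 0.4 for a 4x larger batch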


SGD is kinda slow...

- GD – use all points each iteration to compute the gradient.
- SGD – use one point each iteration to compute the gradient.
- Faster: Mini-Batch – use a mini-batch of points each iteration to compute the gradient.

Alternatives to SGD

Are there reasonable alternatives outside of Newton's method?

Accelerations:
- Momentum
- Nesterov's method
- Adagrad
- RMSprop
- Adam
- ...

SGD with Momentum

We can try accelerating SGD

    θ_{t+1} = θ_t − η ∇f(θ_t)

by adding a momentum/velocity term:

    v_{t+1} = µ v_t − η ∇f(θ_t)
    θ_{t+1} = θ_t + v_{t+1}    (4)

µ is a new "momentum" hyper-parameter.

source: cs231n.github.io

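Update (4) on a toy one-dimensional objective f(θ) = ½θ² (an illustrative sketch):

    def grad_f(theta):        # ∇f(θ) = θ for f(θ) = ½θ²
        return theta

    theta, v = 5.0, 0.0
    eta, mu = 0.1, 0.9
    for t in range(100):
        v = mu * v - eta * grad_f(theta)   # v_{t+1} = µ v_t − η ∇f(θ_t)
        theta = theta + v                  # θ_{t+1} = θ_t + v_{t+1}
    print(theta)                           # close to the minimizer θ = 0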

Nesterov Momentum

- Sometimes the momentum update can overshoot.
- We can instead evaluate the gradient at the point where momentum takes us:

    v_{t+1} = µ v_t − η ∇f(θ_t + µ v_t)
    θ_{t+1} = θ_t + v_{t+1}    (5)

source: Geoff Hinton's lecture

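Both variants are exposed by torch.optim.SGD (a sketch; model is assumed, and PyTorch's parameterization of the updates differs slightly from (4)–(5)):

    import torch

    # classical momentum, as in update (4)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Nesterov momentum, as in update (5)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)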

AdaGrad

- An alternative way is to automate the decay of the learning rate.
- The Adaptive Gradient algorithm does this by accumulating the squared magnitudes of the gradients.
- AdaGrad accelerates in flat directions of the optimization landscape and slows down in steep ones.

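In code, the AdaGrad update looks like this (a sketch; theta, g and cache are NumPy arrays of the same shape):

    import numpy as np

    def adagrad_step(theta, g, cache, eta=0.01, eps=1e-8):
        # cache accumulates the squared gradient magnitudes, one entry per parameter
        cache = cache + g ** 2
        theta = theta - eta * g / (np.sqrt(cache) + eps)   # large accumulated gradients -> small steps
        return theta, cache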

RMSProp

Problem: the updates in AdaGrad always decrease the learning rate, so some of the parameters can become un-learnable.

- Fix by Hinton: use an exponentially weighted moving average of the square magnitudes instead.
- This assigns more weight to recent iterations. Useful if directions of steeper or shallower descent suddenly change.

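The RMSProp variant replaces the running sum by an exponential moving average (a sketch; ρ = 0.9 is a typical decay):

    import numpy as np

    def rmsprop_step(theta, g, cache, eta=1e-3, rho=0.9, eps=1e-8):
        cache = rho * cache + (1 - rho) * g ** 2           # weighted toward recent iterations
        theta = theta - eta * g / (np.sqrt(cache) + eps)
        return theta, cache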

Adam

Adaptive Moment estimation – a combination of the previous approaches. [Kingma and Ba, 2014]

- Ridiculously popular – more than 13K citations!
- Probably because it comes with recommended parameters and came with a proof of convergence (which was later shown to be wrong).

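A sketch of the Adam update with the recommended defaults (β₁ = 0.9, β₂ = 0.999); t counts iterations starting from 1:

    import numpy as np

    def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g          # first moment, momentum-like
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment, RMSProp-like
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v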

So what should I use in practice?

- Adam is a good default in many cases.
- There exist datasets on which Adam and other adaptive methods do not generalize to unseen data at all! [The Marginal Value of Adaptive Gradient Methods in Machine Learning]
- SGD with Momentum and a decay schedule often outperforms Adam (but requires tuning).

source: github.com/YingzhenLi



Data pre-processing

Since our non-linearities change their behavior around the origin, it makes sense to pre-process to zero mean and unit variance:

    x̂_i = (x_i − E[x_i]) / √Var[x_i]    (6)

source: cs231n.github.io

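For images this is typically done once per channel; a torchvision sketch (the mean/std shown are the commonly used ImageNet statistics):

    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.ToTensor(),                              # HWC uint8 -> CHW float in [0, 1]
        transforms.Normalize(mean=[0.485, 0.456, 0.406],    # per-channel E[x_i]
                             std=[0.229, 0.224, 0.225]),    # per-channel sqrt(Var[x_i])
    ])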

Batch Normalization

A common technique is to repeat this normalization throughout the deep network in a differentiable way. [Ioffe and Szegedy, 2015]

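In essence, the training-time transform normalizes each feature over the current mini-batch and then re-scales it with learnable parameters; a sketch (γ and β have one entry per feature):

    import torch

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features); gamma, beta: learnable scale and shift
        mu = x.mean(dim=0)                          # per-feature batch mean
        var = x.var(dim=0, unbiased=False)          # per-feature batch variance
        x_hat = (x - mu) / torch.sqrt(var + eps)    # normalize, as in eq. (6)
        return gamma * x_hat + beta                 # restore representational power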

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before the activations.

- In the original paper the authors claimed that this is meant to reduce covariate shift.
- More obviously, this reduces 2nd-order correlations between layers. It was recently shown that it actually doesn't change covariate shift! Instead it smooths out the loss landscape. [Santurkar, Tsipras, Ilyas, Madry, 2018]
- In practice this reduces the dependence on initialization and seems to stabilize the flow of gradient descent.
- Using BN usually nets you a gain of a few % in test accuracy.

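The "conv → batchnorm → activation" ordering as a PyTorch sketch:

    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
        nn.BatchNorm2d(128),      # normalize per channel over the batch
        nn.ReLU(inplace=True),    # activation after the normalization
    )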

Dropout

Another common technique: during the forward pass, set the output of each neuron to 0 randomly with probability p. A typical choice is p = 50%.

- The idea is to prevent co-adaptation of neurons.
- At test time we want to remove the randomness. A good approximation is to multiply the outputs by p.
- Dropout is more commonly applied to fully-connected layers, though its use is waning.

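Frameworks handle the test-time correction automatically (modern implementations use "inverted" dropout, scaling by 1/(1−p) during training instead of by p at test time); a sketch:

    import torch.nn as nn

    mlp = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(512, 10))

    mlp.train()   # dropout active: each unit is zeroed with probability p
    mlp.eval()    # dropout disabled: deterministic forward pass for evaluation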


Finite dataset woes

While we are entering the Big Data age, in practice we often find ourselves with insufficient data to properly train our deep neural networks.

- What if collecting more data is slow/difficult?
- Can we squeeze out more from what we already have?


Invariance problem

An often-repeated claim about CNNs is that they are invariant to small translations. Independently of whether this is true, they are not invariant to most other types of transformations:

source: cs231n.github.io

Data augmentation

- Can greatly increase the amount of data by performing:
  – Translations
  – Rotations
  – Reflections
  – Scaling
  – Cropping
  – Adding Gaussian noise
  – Adding occlusion
  – Interpolation
  – etc.
- Crucial for achieving state-of-the-art performance!
- For example, ResNet improves from 11.66% to 6.41% error on the CIFAR-10 dataset and from 44.74% to 27.22% on CIFAR-100.


[Figure: data augmentation examples; source: github.com/aleju/imgaug]
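A typical augmentation pipeline for CIFAR-sized images with torchvision (a sketch; these particular choices mirror common ResNet training recipes):

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),    # random translations via padded crops
        transforms.RandomHorizontalFlip(),       # reflections
        transforms.ToTensor(),
    ])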

Transfer Learning

What if you truly have too little data?

- If your data has sufficient similarity to a bigger dataset, then you're in luck!
- Idea: take a model trained, for example, on ImageNet.
- Freeze all but the last few layers and retrain on your small dataset. The bigger your dataset, the more layers you have to retrain.

source: [Haase et al., 2014]

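A sketch of the freeze-and-retrain recipe with torchvision (num_classes stands for the label set of your small dataset):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)        # weights learned on ImageNet
    for p in model.parameters():
        p.requires_grad = False                     # freeze the pretrained layers

    num_classes = 10                                # assumed size of the small dataset's label set
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # fresh, trainable last layer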


Software overview

Why use frameworks?

- You don't have to implement everything yourself.
- Many built-in modules allow quick iteration of ideas – building a neural network becomes putting simple blocks together, and computing backprop is a breeze.
- Someone else already wrote the CUDA code to efficiently run training on GPUs (or TPUs).


Main design difference

[Figure: static vs. dynamic computational graphs; source: Introduction to Chainer]

PyTorch concepts

Similar in code to numpy.

- Tensor: nearly identical to np.array, can run on GPU just by switching the device/datatype.
- Autograd: package for automatic computation of backprop and construction of computational graphs.
- Module: a neural network layer storing weights.
- DataLoader: a class for simplifying efficient data loading.

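A small sketch tying these pieces together (the dataset here is a random placeholder):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    x, y = torch.randn(1000, 20), torch.randint(0, 2, (1000,))            # placeholder data
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))  # Modules
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for xb, yb in loader:                 # Tensors; move to GPU with .to('cuda') if available
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                   # Autograd builds the graph and computes gradients
        opt.step()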

PyTorch - optimization

PyTorch - ResNet in one page

source: @jeremyphoward
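The slide showed @jeremyphoward's compact ResNet implementation; as a stand-in, here is a minimal residual block in the same spirit (an illustrative sketch, not the code from the slide):

    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.body(x))   # skip connection: output = x + F(x)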

Tensorflow static graphs

source: cs231n.github.io

Keras wrapper - closer to PyTorch

source: cs231n.github.io

Tensorboard - a very useful tool for visualization

Tensorflow overview

- Main difference – uses static graphs. Longer code, but more optimized. In practice, PyTorch is faster to experiment on.
- With the Keras wrapper, the code is more similar to PyTorch, however.
- Can use TPUs.


But

- Tensorflow has added dynamic batching, which makes dynamic graphs possible.
- PyTorch is merging with Caffe2, which will provide static graphs too!
- Which one to choose then?
  – PyTorch is more popular in the research community for easy development and debugging.
  – In the past the better choice for production was Tensorflow. It is still the only choice if you want to use TPUs.
