Tips for Deep Learning

Transcript
Page 1: Tips for Deep Learning

Tips for Deep Learning

Page 2: Tips for Deep Learning

Recipe of Deep Learning

Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function

Good results on training data? NO: go back and modify the three steps. YES: continue.
Good results on testing data? NO: that is overfitting; go back and modify the three steps. YES: done.

Page 3: Tips for Deep Learning

Do not always blame Overfitting

[Figure: error curves of a shallower and a deeper network on the testing data and on the training data, from "Deep Residual Learning for Image Recognition", http://arxiv.org/abs/1512.03385]

The deeper network is worse on the testing data. Overfitting? No: it is also worse on the training data, so it is simply not well trained.

Page 4: Tips for Deep Learning

Recipe of Deep Learning

Check separately: good results on training data? Good results on testing data?

Different approaches for different problems,
e.g. dropout is for good results on testing data.

Page 5: Tips for Deep Learning

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 6: Tips for Deep Learning

Hard to get the power of Deep …

Deeper usually does not imply better.

[Figure: results on training data for networks of increasing depth.]

Page 7: Tips for Deep Learning

Vanishing Gradient Problem

[Figure: a deep network with inputs x1, x2, …, xN and outputs y1, y2, …, yM.]

Layers near the input: smaller gradients, learn very slowly, stay almost random.
Layers near the output: larger gradients, learn very fast, already converge.
So the later layers converge based on features that are still almost random!?

Page 8: Tips for Deep Learning

Vanishing Gradient Problem

[Figure: a deep network with inputs x1, …, xN, outputs y1, …, yM, targets ŷ1, …, ŷM, and loss l.]

An intuitive way to compute the derivatives:

∂l/∂w ≈ Δl/Δw

Perturb a weight w near the input by +Δw and observe the resulting change +Δl of the loss. With a sigmoid activation, a large change of a neuron's input produces only a small change of its output, so the effect of Δw is attenuated at every layer it passes through: the gradients near the input are smaller.
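To make this concrete, here is a minimal NumPy sketch of the Δl/Δw estimate on a toy deep sigmoid network (the network size, weights, and data are illustrative placeholders, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy deep sigmoid network: 10 layers of width 8 (all values are placeholders).
layers = [rng.normal(scale=0.5, size=(8, 8)) for _ in range(10)]
x = rng.normal(size=8)
y_target = rng.normal(size=8)

def loss(weights):
    a = x
    for W in weights:
        a = sigmoid(W @ a)
    return np.sum((a - y_target) ** 2)

# dl/dw ~= delta_l / delta_w: perturb one weight in the first layer and one in
# the last layer by the same small delta_w and compare the change of the loss.
delta_w = 1e-4
for name, layer_idx in [("first layer", 0), ("last layer", -1)]:
    perturbed = [W.copy() for W in layers]
    perturbed[layer_idx][0, 0] += delta_w
    delta_l = loss(perturbed) - loss(layers)
    print(f"{name}: dl/dw ~= {delta_l / delta_w:.2e}")
```

Running this typically shows a much smaller estimated derivative for the first layer than for the last, which is exactly the vanishing gradient effect described above.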

Page 9: Tips for Deep Learning

ReLU

• Rectified Linear Unit (ReLU): a = z for z > 0, a = 0 for z ≤ 0
  [Figure: the ReLU function compared with the sigmoid σ(z).]

Reasons:
• Fast to compute
• Biological reason: only 1~4% of the neurons in the brain are active
• Infinite sigmoids with different biases: ∫_{−∞}^{0} σ(z + ξ) dξ = ln(1 + e^z) ≈ ReLU(z)
• Handles the vanishing gradient problem

[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
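To make the definition concrete, a minimal NumPy sketch (not from the slides) of ReLU, its derivative, and the softplus function ln(1 + e^z) produced by the "infinite sigmoids" view:

```python
import numpy as np

def relu(z):
    """ReLU: a = z for z > 0, a = 0 otherwise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 where the unit is active, 0 where it is off."""
    return (z > 0).astype(float)

def softplus(z):
    """ln(1 + e^z): the smooth approximation of ReLU from summing infinitely many sigmoids."""
    return np.log1p(np.exp(z))

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))       # [0. 0. 0. 0. 1. 2. 3.]
print(relu_grad(z))  # [0. 0. 0. 0. 1. 1. 1.]
print(softplus(z))   # close to relu(z) away from z = 0
```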

Page 10: Tips for Deep Learning

ReLU

[Figure: a network with inputs x1, x2 and outputs y1, y2; the ReLU units whose output is 0 are marked.]

For a given input, each ReLU unit is either linear (a = z) or outputs exactly 0 (a = 0).

Page 11: Tips for Deep Learning

ReLU

[Figure: removing the zero-output units leaves a thinner network from x1, x2 to y1, y2.]

The remaining network is a thinner linear network, so it does not have smaller gradients: the active units have slope 1 (a = z).

Page 12: Tips for Deep Learning

Activation Function Comparison

MNIST dataset, 4-layer feedforward NN, 100 nodes in each hidden layer, SGD with learning rate 0.01, batch size 32.

[Figure: moving average of the absolute gradient per layer vs. epochs, for Sigmoid and for ReLU; curves for ∇W0 (input layer), ∇W1, ∇W2, ∇W3.]

Wn: weights for the neurons in the n-th layer.
Courtesy of classmate 李維道.
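The training script behind this figure is not included; the Keras/TensorFlow sketch below only mirrors the stated configuration (MNIST, 100-unit hidden layers, one batch of size 32) and assumes the four gradient curves ∇W0 … ∇W3 correspond to four Dense weight matrices. It prints the average absolute gradient per layer for one mini-batch, which is the quantity plotted above:

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build(activation):
    # Assumed layout: four Dense weight matrices W0..W3 (three 100-unit hidden
    # layers plus the output layer), matching the curves labelled on the slide.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(100, activation=activation),
        tf.keras.layers.Dense(100, activation=activation),
        tf.keras.layers.Dense(100, activation=activation),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
model = build("sigmoid")                 # compare with build("relu")
x, y = x_train[:32], y_train[:32]        # one mini-batch of size 32

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))
grads = tape.gradient(loss, model.trainable_weights)
for i, g in enumerate(grads[::2]):       # kernels only; the biases are skipped
    print(f"mean |grad W{i}| = {tf.reduce_mean(tf.abs(g)).numpy():.2e}")
```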

Page 13: Tips for Deep Learning

ReLU - variant

Leaky ReLU: a = z for z > 0, a = 0.01z otherwise.

Parametric ReLU: a = z for z > 0, a = αz otherwise; α is also learned by gradient descent.
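A minimal NumPy sketch of the two variants (illustrative, not from the slides); treating alpha as a trainable parameter instead of the fixed 0.01 gives Parametric ReLU:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a = z for z > 0, a = alpha * z otherwise.
    With alpha fixed at 0.01 this is Leaky ReLU; if alpha is updated by
    gradient descent it becomes Parametric ReLU (PReLU)."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope 1 on the positive side, alpha on the negative side."""
    return np.where(z > 0, 1.0, alpha)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [-0.02 -0.005 0. 1.5]
```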

Page 14: Tips for Deep Learning

Maxout

• Learnable activation function [Ian J. Goodfellow, ICML’13]

[Figure: a maxout network with inputs x1, x2. In the first layer the linear units produce 5, 7, −1, 1; grouping them in pairs and taking the max within each group gives 7 and 1. The next layer produces 1, 2, 4, 3; the max within each pair gives 2 and 4.]

• ReLU is a special case of Maxout.
• You can have more than 2 elements in a group.
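A minimal NumPy sketch of a maxout layer under assumed shapes (the pairing of linear units matches the example above; the weight values are placeholders):

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    """Maxout layer sketch: compute all linear units z = W x + b, then take the
    max within each group of `group_size` units. Shapes are assumptions for
    illustration: W is (num_units, dim_in), b is (num_units,)."""
    z = W @ x + b                          # all linear pieces
    z = z.reshape(-1, group_size)          # one row per group
    return z.max(axis=1)                   # learnable piecewise-linear activation

# Tiny example: 4 linear units grouped in pairs -> 2 outputs
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
W = rng.normal(size=(4, 2))
b = rng.normal(size=4)
print(maxout(x, W, b))   # two values, the max of each pair

# ReLU as a special case: pair each unit z = w.x + b with a constant 0 unit,
# then max(z, 0) is exactly ReLU(z).
```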

Page 15: Tips for Deep Learning

Maxout

[Figure, left: a ReLU unit with input x and bias input 1: z = wx + b, a = ReLU(z).]
[Figure, right: a maxout unit with the same input: z1 = wx + b, z2 = 0, a = max(z1, z2).]

When the second element of the group is fixed to z2 = 0, the maxout unit computes max(wx + b, 0), which is exactly ReLU. So ReLU is a special case of Maxout.

Page 16: Tips for Deep Learning

Maxout

[Figure, left: a ReLU unit: z = wx + b, a = ReLU(z).]
[Figure, right: a maxout unit with two learnable pieces: z1 = wx + b, z2 = w′x + b′, a = max(z1, z2).]

With both pieces learnable (w, b, w′, b′), the maxout unit realizes a learnable activation function that is more general than ReLU.

Page 17: Tips for Deep Learning

Maxout

• Learnable activation function [Ian J. Goodfellow, ICML’13]

• The activation function in a maxout network can be any piecewise linear convex function.

• How many pieces it has depends on how many elements are in a group.

[Figure: example activation shapes with 2 elements in a group and with 3 elements in a group.]

Page 18: Tips for Deep Learning

Maxout - Training

• Given a training example x, we know which z in each group would be the max.

[Figure: a maxout network with inputs x1, x2. First layer: z1^1, z2^1, z3^1, z4^1 grouped in pairs, giving a1^1 = max(z1^1, z2^1) and a2^1 = max(z3^1, z4^1). Second layer: z1^2, z2^2, z3^2, z4^2, giving a1^2 and a2^2, and finally a1, a2.]

Page 19: Tips for Deep Learning

Maxout - Training

• Given a training example x, we know which z in each group would be the max.
• Train this thin and linear network: only the max element of each group is kept, so for this example the network is linear in the remaining weights.

[Figure: the same network with only the winning element of each group kept.]

• A different thin and linear network is selected for different examples, so over the whole training set every weight still gets trained.

Page 20: Tips for Deep Learning

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 21: Tips for Deep Learning

Review

[Figure: an error surface over parameters w1 and w2; the gently sloping direction needs a larger learning rate, the steep direction a smaller learning rate.]

Adagrad:

w^{t+1} ← w^t − η / sqrt( Σ_{i=0}^{t} (g^i)^2 ) · g^t

Use the first derivatives to estimate the second derivative.
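A minimal NumPy sketch of the Adagrad update above, applied to a toy quadratic loss (the eps term is a common implementation detail, not shown on the slide):

```python
import numpy as np

def adagrad_update(w, g, sum_sq_g, eta=0.01, eps=1e-8):
    """One Adagrad step: divide the learning rate by the root of the
    accumulated squared gradients. eps only avoids division by zero."""
    sum_sq_g = sum_sq_g + g ** 2
    w = w - eta / (np.sqrt(sum_sq_g) + eps) * g
    return w, sum_sq_g

w = np.array([1.0, -2.0])
acc = np.zeros_like(w)
for t in range(5):
    g = 2 * w                     # gradient of the toy loss L(w) = w1^2 + w2^2
    w, acc = adagrad_update(w, g, acc)
print(w)
```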

Page 22: Tips for Deep Learning

RMSProp

The error surface can be very complex when training a NN.

[Figure: an error surface over w1 and w2 where the same direction needs a larger learning rate in one region and a smaller learning rate in another.]

Page 23: Tips for Deep Learning

RMSProp

w^1 ← w^0 − (η / σ^0) g^0        σ^0 = g^0
w^2 ← w^1 − (η / σ^1) g^1        σ^1 = sqrt( α (σ^0)^2 + (1 − α) (g^1)^2 )
w^3 ← w^2 − (η / σ^2) g^2        σ^2 = sqrt( α (σ^1)^2 + (1 − α) (g^2)^2 )
……
w^{t+1} ← w^t − (η / σ^t) g^t    σ^t = sqrt( α (σ^{t−1})^2 + (1 − α) (g^t)^2 )

σ^t is the root mean square of the gradients, with the previous gradients being decayed.
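A minimal NumPy sketch of the RMSProp recursion above on a toy quadratic loss (the small eps added to the denominator is an assumption for numerical stability):

```python
import numpy as np

def rmsprop_update(w, g, sigma, eta=0.001, alpha=0.9):
    """One RMSProp step following the recursion above (a minimal sketch)."""
    if sigma is None:
        sigma = np.abs(g)                               # sigma^0 = g^0 (taken in magnitude)
    else:
        sigma = np.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    w = w - eta / (sigma + 1e-8) * g                    # eps added for stability
    return w, sigma

w = np.array([1.0, -2.0])
sigma = None
for t in range(5):
    g = 2 * w                       # gradient of the toy loss L(w) = w1^2 + w2^2
    w, sigma = rmsprop_update(w, g, sigma)
print(w)
```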

Page 24: Tips for Deep Learning

Hard to find optimal network parameters

[Figure: total loss vs. the value of a network parameter w.]

• Very slow at the plateau: ∂L/∂w ≈ 0
• Stuck at a saddle point: ∂L/∂w = 0
• Stuck at local minima: ∂L/∂w = 0

Page 25: Tips for Deep Learning

In the physical world ……

• Momentum

How about putting this phenomenon into gradient descent?

Page 26: Tips for Deep Learning

Review: Vanilla Gradient Descent

Start at position θ^0.
Compute the gradient at θ^0; move to θ^1 = θ^0 − η∇L(θ^0).
Compute the gradient at θ^1; move to θ^2 = θ^1 − η∇L(θ^1).
……
Stop when ∇L(θ^t) ≈ 0.

[Figure: the path θ^0 → θ^1 → θ^2 → θ^3; at each point the movement is along the negative gradient.]

Page 27: Tips for Deep Learning

Momentum

Start at point θ^0; movement v^0 = 0.
Compute the gradient at θ^0; movement v^1 = λv^0 − η∇L(θ^0); move to θ^1 = θ^0 + v^1.
Compute the gradient at θ^1; movement v^2 = λv^1 − η∇L(θ^1); move to θ^2 = θ^1 + v^2.
……

Movement is based not just on the gradient, but also on the previous movement.
Movement = movement of the last step minus the gradient at present.

[Figure: the path θ^0 → θ^1 → θ^2 → θ^3; at each point the gradient, the movement of the last step, and the actual movement are drawn.]

Page 28: Tips for Deep Learning

Momentum

v^i is actually a weighted sum of all the previous gradients ∇L(θ^0), ∇L(θ^1), …, ∇L(θ^{i−1}):

v^0 = 0
v^1 = −η∇L(θ^0)
v^2 = −λη∇L(θ^0) − η∇L(θ^1)
……

Start at point θ^0; movement v^0 = 0.
Compute the gradient at θ^0; movement v^1 = λv^0 − η∇L(θ^0); move to θ^1 = θ^0 + v^1.
Compute the gradient at θ^1; movement v^2 = λv^1 − η∇L(θ^1); move to θ^2 = θ^1 + v^2.

Movement is based not just on the gradient, but also on the previous movement.
Movement = movement of the last step minus the gradient at present.
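A minimal NumPy sketch of the momentum update above on a toy quadratic loss; setting λ = 0 recovers the vanilla gradient descent of the previous slide:

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, lam=0.9):
    """One gradient-descent-with-momentum step as on the slides:
    v^{t+1} = lam * v^t - eta * grad(theta^t), theta^{t+1} = theta^t + v^{t+1}.
    With lam = 0 this reduces to vanilla gradient descent."""
    v = lam * v - eta * grad
    theta = theta + v
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)           # v^0 = 0
for t in range(5):
    grad = 2 * theta               # gradient of the toy loss L(theta) = ||theta||^2
    theta, v = momentum_step(theta, v, grad)
print(theta)
```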

Page 29: Tips for Deep Learning

Momentum

Movement = negative of ∂L/∂w + momentum

[Figure: a cost curve with points where ∂L/∂w = 0; at each point the negative of ∂L/∂w, the momentum, and the real movement are drawn. The momentum can carry the parameters across a plateau or past a local minimum.]

Still no guarantee of reaching the global minimum, but it gives some hope ……

Page 30: Tips for Deep Learning

Adam = RMSProp + Momentum

[Figure: the Adam algorithm; the moving average of the gradient is the momentum part, and the moving average of the squared gradient is the RMSProp part.]
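A minimal NumPy sketch of the standard Adam update (the bias-correction terms come from the Adam paper and are not spelled out on the slide):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: m is the momentum-style moving average of the gradient,
    v is the RMSProp-style moving average of the squared gradient."""
    m = beta1 * m + (1 - beta1) * grad          # momentum part
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSProp part
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 6):                            # t starts at 1 for the bias correction
    grad = 2 * theta                             # toy loss ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)
```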

Page 31: Tips for Deep Learning

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 32: Tips for Deep Learning

Early Stopping

[Figure: total loss vs. epochs; the loss on the training set keeps decreasing while the loss on the testing set starts to increase. Stop at the point where the testing loss is lowest; in practice, use a validation set to decide when to stop.]

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
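A minimal sketch of early stopping with the Keras callback referenced above; the model, patience, and validation split are placeholder choices, not from the slides:

```python
import tensorflow as tf

# Placeholder model; only the EarlyStopping callback is the point here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the loss on the validation set
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch
)

(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x.reshape(-1, 784).astype("float32") / 255.0
model.fit(x, y, validation_split=0.1, epochs=50, batch_size=32,
          callbacks=[early_stop])   # training is interrupted when val_loss stops improving
```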

Page 33: Tips for Deep Learning

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 34: Tips for Deep Learning

Regularization

• New loss function to be minimized:

L′(θ) = L(θ) + λ (1/2) ‖θ‖₂²

• L(θ) is the original loss (e.g. square error, cross entropy …).
• L2 regularization: ‖θ‖₂² = (w1)² + (w2)² + …, where θ = {w1, w2, …} (biases are usually not included).
• Find a set of weights that not only minimizes the original cost but is also close to zero.

Page 35: Tips for Deep Learning

Regularization

• New loss function to be minimized:

L2 regularization: L′(θ) = L(θ) + λ (1/2) ‖θ‖₂², with ‖θ‖₂² = (w1)² + (w2)² + …

Gradient:  ∂L′/∂w = ∂L/∂w + λw

Update:    w^{t+1} ← w^t − η ∂L′/∂w
                   = w^t − η (∂L/∂w + λ w^t)
                   = (1 − ηλ) w^t − η ∂L/∂w

Since 1 − ηλ is slightly smaller than 1, every update pulls the weight closer to zero: this is Weight Decay.
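A minimal NumPy sketch of the weight-decay update derived above (the gradient of the original loss is a placeholder value):

```python
import numpy as np

def l2_update(w, grad_L, eta=0.01, lam=1e-4):
    """One gradient step on the L2-regularized loss L'(w) = L(w) + (lam/2)*||w||^2:
    w <- (1 - eta*lam) * w - eta * dL/dw.  The (1 - eta*lam) factor is the
    'weight decay': every step shrinks the weights slightly toward zero."""
    return (1 - eta * lam) * w - eta * grad_L

w = np.array([0.5, -1.5])
grad_L = np.array([0.1, -0.2])       # hypothetical gradient of the original loss
print(l2_update(w, grad_L))
```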

Page 36: Tips for Deep Learning

Regularization

• New loss function to be minimized:

L1 regularization (LASSO): L′(θ) = L(θ) + λ ‖θ‖₁, with ‖θ‖₁ = |w1| + |w2| + …

Gradient:  ∂L′/∂w = ∂L/∂w + λ sgn(w)

Update:    w^{t+1} ← w^t − η (∂L/∂w + λ sgn(w^t))
                   = w^t − η ∂L/∂w − ηλ sgn(w^t)

The ηλ sgn(w^t) term decreases the magnitude of w by a constant each step.
Compare to L2, w^{t+1} = (1 − ηλ) w^t − η ∂L/∂w, where the magnitude decays by a factor instead.
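A tiny numeric illustration of the two shrinkage behaviours (the gradient of the original loss is set to zero to isolate the regularization effect; the values are illustrative only):

```python
import numpy as np

eta, lam = 0.1, 0.1
w_l1 = w_l2 = 5.0

for step in range(50):
    w_l1 = w_l1 - eta * lam * np.sign(w_l1)   # L1: subtract the constant eta*lam
    w_l2 = (1 - eta * lam) * w_l2             # L2: multiply by the factor (1 - eta*lam)

print(w_l1)   # 5.0 - 50*0.01 = 4.5: steady, constant-size decrease
print(w_l2)   # 5.0 * 0.99^50 ~ 3.03: proportional decay, never exactly zero
```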

Page 37: Tips for Deep Learning

https://en.wikipedia.org/wiki/Lasso_(statistics)

L1 regularization forces some attributes to be EXACTLY zero, leaving fewer attributes and corresponding coefficients.

Page 38: Tips for Deep Learning

L1-Regularization and Exact Recovery

Page 39: Tips for Deep Learning

Regularization - Weight Decay

• Our brain prunes out the useless links between neurons.

Doing the same thing to a machine's "brain" improves the performance.

Page 40: Tips for Deep Learning

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 41: Tips for Deep Learning

Dropout - Training

Each time before updating the parameters:
• Each neuron has a p% chance to be dropped out.

Page 42: Tips for Deep Learning

Dropout - Training

Each time before updating the parameters:
• Each neuron has a p% chance to be dropped out. The structure of the network is changed: it becomes thinner!
• Use the new, thinner network for training.
• For each mini-batch, we resample the dropped-out neurons.

Page 43: Tips for Deep Learning

Dropout - Testing

• No dropout at testing time.
• If the dropout rate at training is p%, all the weights are multiplied by (1 − p%).
• Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
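A minimal NumPy sketch of the training-time and testing-time rules above (only the mask sampling and the (1 − p) weight scaling are implemented; the surrounding network is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    """Training-time dropout: each neuron's activation is dropped (set to 0)
    with probability p; the mask is resampled for every mini-batch."""
    mask = rng.random(a.shape) >= p       # keep each unit with probability 1 - p
    return a * mask

def dropout_test_weights(W, p=0.5):
    """Testing-time rule from the slide: no dropout, but all weights are
    multiplied by (1 - p) so the expected pre-activation matches training."""
    return (1 - p) * W

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(a, p=0.5))            # some activations are zeroed out
W = np.array([[1.0, -1.0], [0.5, 2.0]])
print(dropout_test_weights(W, p=0.5))     # every weight halved, e.g. w = 1 -> 0.5
```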

Page 44: Tips for Deep Learning

Dropout - Intuitive Reason

Training: dropout (like practicing with weights tied to your legs).

Testing: no dropout (after taking the weights off, you become much stronger).

Page 45: Tips for Deep Learning

Dropout - Intuitive Reason

When working in a team, if everyone expects their partner to do the work, nothing gets done in the end.

However, if you know your partner will drop out, you will do a better job yourself.

("My partner is going to slack off, so I have to work hard.")

At testing time, no one actually drops out, so we obtain good results in the end.

Page 46: Tips for Deep Learning

Dropout - Intuitive Reason

• Why should the weights be multiplied by (1 − p%) (p% = dropout rate) when testing?

Training of Dropout: a neuron computes z from its inputs through the weights w1, w2, w3, w4. Assume the dropout rate is 50%: for each update, about half of the inputs are dropped.

Testing of Dropout: no dropout. If the weights from training are used directly, z′ ≈ 2z.

Multiplying each weight by 0.5 (i.e. by 1 − p%) gives z′ ≈ z.

Page 47: Tips for Deep Learning

Dropout is a kind of ensemble.

Ensemble:

Train a bunch of networks with different structures (Network 1, 2, 3, 4), each on a different subset (Set 1, 2, 3, 4) sampled from the training set.

Page 48: Tips for Deep Learning

Dropout is a kind of ensemble.

Ensemble:

At testing time, feed the testing data x into Network 1, 2, 3, 4 to obtain y1, y2, y3, y4, then average them.

Page 49: Tips for Deep Learning

Dropout is a kind of ensemble.

Training of Dropout:

Use one mini-batch (minibatch 1, 2, 3, 4, ……) to train one sampled (thinner) network.

Some parameters in these networks are shared.

With M neurons, there are 2^M possible networks.

Page 50: Tips for Deep Learning

Dropout is a kind of ensemble.

Testing of Dropout:

Ideally, feed the testing data x into all of the sampled networks, obtain y1, y2, y3, ……, and average them.

In practice, we use the full network with all the weights multiplied by (1 − p%), and its output is approximately equal to the average y. Why?

Page 51: Tips for Deep Learning

Testing of Dropout

Consider a single neuron z = w1x1 + w2x2 with dropout rate 50%. The four possible dropout patterns give

z = w1x1 + w2x2        z = w2x2
z = w1x1               z = 0

and their average is (1/2) w1x1 + (1/2) w2x2.

The full network with the weights multiplied by 1/2 computes exactly z = (1/2) w1x1 + (1/2) w2x2, the same value. (For networks with nonlinear activations the equality is only approximate.)

Page 52: Tips for Deep Learning

Recipe of Deep Learning

Step 1: define a set of functions (a neural network)
Step 2: goodness of function
Step 3: pick the best function

Good results on training data? NO: go back and modify the three steps. YES: continue.
Good results on testing data? NO: that is overfitting; go back and modify the three steps. YES: done.

Page 53: Tips for Deep Learning

Live Demo