Page 1

Lecture 37: ConvNets (Cont’d) and Training
CS 4670/5670, Sean Bell

[http://bbabenko.tumblr.com/post/83319141207/convolutional-learnings-things-i-learned-by]

Pages 2-4

(Unrelated) Dog vs Food

[Karen Zack, @teenybiscuit]

Page 5

(Recap) Backprop

From Geoff Hinton’s seminar at Stanford yesterday

Page 6

(Recap) Backprop

Parameters: all of the weights and biases in the network, stacked together:

$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \end{bmatrix}$$

Gradient:

$$\frac{\partial L}{\partial \theta} = \begin{bmatrix} \partial L / \partial \theta_1 \\ \partial L / \partial \theta_2 \\ \vdots \end{bmatrix}$$

Intuition: “How fast would the error change if I change myself by a little bit?”

Page 7

(Recap) Backprop

[Figure: a chain of Function blocks maps the input x through activations h^(1), …, to the score s and the loss L; each block has parameters θ^(1), …, θ^(n).]

Forward Propagation: compute the activations and the loss.

Backward Propagation: compute the gradient (“error signal”): ∂L/∂s, then ∂L/∂h^(i) and ∂L/∂θ^(i) at each layer, back through ∂L/∂θ^(1) and ∂L/∂x.
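To make the two passes concrete, here is a minimal numpy sketch of forward and backward propagation through a two-layer network; the sizes, data, and squared-error loss are invented for illustration, not taken from the slides:

```python
import numpy as np

# Minimal sketch of forward/backward propagation for a 2-layer net.
np.random.seed(0)
x = np.random.randn(4, 8)                              # batch of 4 inputs
y = np.random.randn(4, 3)                              # targets
W1, b1 = np.random.randn(8, 16) * 0.01, np.zeros(16)   # theta^(1)
W2, b2 = np.random.randn(16, 3) * 0.01, np.zeros(3)    # theta^(2)

# Forward propagation: compute the activations and the loss
h = np.maximum(0, x @ W1 + b1)     # h^(1), ReLU activation
s = h @ W2 + b2                    # scores
L = 0.5 * np.mean((s - y) ** 2)    # loss (mean squared error)

# Backward propagation: compute the gradient ("error signal")
ds = (s - y) / y.size              # dL/ds
dW2 = h.T @ ds                     # dL/dW2
db2 = ds.sum(axis=0)               # dL/db2
dh = ds @ W2.T                     # dL/dh
dh[h <= 0] = 0                     # backprop through the ReLU
dW1 = x.T @ dh                     # dL/dW1
db1 = dh.sum(axis=0)               # dL/db1
dx = dh @ W1.T                     # dL/dx
```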

Page 8

(Recap)

A ConvNet is a sequence of convolutional layers, interspersed with activation functions (and possibly other layer types).

Pages 9-15

(Recap) [Figure-only slides recapping the convolutional layer stack.]

Page 16

Web demo 1: Convolution

http://cs231n.github.io/convolutional-networks/

[Karpathy 2016]

Page 17

Web demo 2: ConvNet in a Browser

http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

[Karpathy 2014]

Pages 18-23

Convolution: Stride

[Figure: Input, Weights, and Output grids, animated across six slides as the kernel moves one step at a time.]

During convolution, the weights “slide” along the input to generate each output.

Page 24

Convolution: Stride

[Figure: Input and Output grids.]

During convolution, the weights “slide” along the input to generate each output.

Recall that at each position, we are doing a 3D sum over (channel, row, column):

$$h_r = \sum_{ijk} x_{rijk} W_{ijk} + b$$
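In numpy terms, the sum at a single output position is just an elementwise product followed by a sum over the 3D patch; a tiny sketch with made-up shapes:

```python
import numpy as np

# One output value of a convolution: a 3D sum over (channel, row, column).
# Shapes are invented for illustration: 3 input channels, 3x3 kernel.
x_patch = np.random.randn(3, 3, 3)  # the input region under the kernel
W = np.random.randn(3, 3, 3)        # kernel weights
b = 0.1                             # bias
h_r = (x_patch * W).sum() + b       # h_r = sum_{ijk} x_{rijk} * W_{ijk} + b
```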

Pages 25-27

Convolution: Stride

[Figure: Input and Output grids.]

But we can also convolve with a stride, e.g. stride = 2.

Page 28

Convolution: Stride

[Figure: Input and Output grids.]

But we can also convolve with a stride, e.g. stride = 2:

- Notice that with certain strides, we may not be able to cover all of the input
- The output is also half the size of the input

Page 29: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

Convolution: PaddingWe can also pad the input with zeros. Here, pad = 1, stride = 2

Output

Input

Page 30: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

Convolution: PaddingWe can also pad the input with zeros. Here, pad = 1, stride = 2

Output

Input

Page 31: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

Convolution: Padding

Input

We can also pad the input with zeros. Here, pad = 1, stride = 2

Output

Page 32: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

Convolution: PaddingWe can also pad the input with zeros. Here, pad = 1, stride = 2

Output

Input

Page 33

Convolution: How big is the output?

[Figure: zero-padded input of width w_in, with padding p, stride s, and kernel size k.]

In general, the output has size:

$$w_{out} = \left\lfloor \frac{w_{in} + 2p - k}{s} \right\rfloor + 1$$

Page 34

Convolution: How big is the output?

[Figure: zero-padded input of width w_in, with padding p, stride s, and kernel size k.]

Example: k = 3, s = 1, p = 1:

$$w_{out} = \left\lfloor \frac{w_{in} + 2p - k}{s} \right\rfloor + 1 = \left\lfloor \frac{w_{in} + 2 - 3}{1} \right\rfloor + 1 = w_{in}$$

VGGNet [Simonyan 2014] uses filters of this shape.
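Tying the 3D sum and the size formula together, here is a naive single-filter convolution in numpy. It is a hypothetical sketch for clarity (the function names are made up, and real frameworks use much faster implementations):

```python
import numpy as np

def conv_out_size(w_in, k, s, p):
    # w_out = floor((w_in + 2p - k) / s) + 1
    return (w_in + 2 * p - k) // s + 1

def conv2d_single_filter(x, W, b, stride=1, pad=0):
    """Naive convolution of one filter over x: (C, H, W_in); W: (C, k, k)."""
    C, H, Win = x.shape
    k = W.shape[-1]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # zero padding
    Ho = conv_out_size(H, k, stride, pad)
    Wo = conv_out_size(Win, k, stride, pad)
    out = np.empty((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[:, i*stride:i*stride + k, j*stride:j*stride + k]
            out[i, j] = (patch * W).sum() + b          # the 3D sum from page 24
    return out

assert conv_out_size(224, 3, 1, 1) == 224   # k=3, s=1, p=1 preserves width (VGG-style)
out = conv2d_single_filter(np.random.randn(3, 8, 8), np.random.randn(3, 3, 3), 0.0,
                           stride=2, pad=1)
assert out.shape == (4, 4)                  # floor((8 + 2 - 3)/2) + 1 = 4
```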

Page 35

Pooling

Figure: Andrej Karpathy

For most ConvNets, convolution is often followed by pooling:
- Creates a smaller representation while retaining the most important information
- The “max” operation is the most common
- Why might “avg” be a poor choice?

Page 36

Pooling

Page 37

Max Pooling

Figure: Andrej Karpathy

What’s the backprop rule for max pooling?
- In the forward pass, store the index that took the max
- In the backward pass, route the output gradient to the input at that stored index; all other inputs receive zero gradient
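A minimal numpy sketch of this rule; the function names and the 2x2, stride-2 defaults are illustrative choices, not from the slides:

```python
import numpy as np

def maxpool_forward(x, k=2, s=2):
    """Max pooling over a 2D map x. Returns the output and, for each
    output cell, the input index that took the max (stored for backprop)."""
    H, W = x.shape
    Ho, Wo = (H - k) // s + 1, (W - k) // s + 1
    out = np.zeros((Ho, Wo))
    idx = np.zeros((Ho, Wo, 2), dtype=int)
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i*s:i*s+k, j*s:j*s+k]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            out[i, j] = patch[r, c]
            idx[i, j] = (i*s + r, j*s + c)   # index that took the max
    return out, idx

def maxpool_backward(dout, idx, x_shape):
    """Route each output gradient to the winning input position;
    every other input gets zero gradient."""
    dx = np.zeros(x_shape)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            r, c = idx[i, j]
            dx[r, c] += dout[i, j]
    return dx
```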

Pages 38-41

Example ConvNet

Figure: Andrej Karpathy

10 3x3 conv filters, stride 1, pad 1
2x2 pool filters, stride 2

Page 42

Example: AlexNet [Krizhevsky 2012]

Figure: [Karnowski 2015] (with corrections)

Input: 3x227x227 image. Output: 1000 class scores.

“max”: max pooling; “norm”: local response normalization; “full”/“FC”: fully connected

Page 43

Example: AlexNet [Krizhevsky 2012]

Figure: [Karnowski 2015]

[Zoomed-in view of the figure.]

Page 44

Questions?

Page 45

… why so many layers?

[Szegedy et al, 2014]

How do you actually train these things?

Page 46

How do you actually train these things?

Roughly speaking:

1. Gather labeled data
2. Find a ConvNet architecture
3. Minimize the loss

Page 47

Training a convolutional neural network

• Split and preprocess your data

• Choose your network architecture

• Initialize the weights

• Find a learning rate and regularization strength

• Minimize the loss and monitor progress

• Fiddle with knobs

Page 48

Mini-batch Gradient Descent

Loop:

1. Sample a batch of training data (~100 images)

2. Forward pass: compute loss (avg. over batch)

3. Backward pass: compute gradient

4. Update all parameters

Note: usually called “stochastic gradient descent” even though SGD has a batch size of 1
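For concreteness, here is a self-contained sketch of this loop on a made-up linear least-squares problem; the dataset, model, learning rate, and batch size are all invented for illustration, and a ConvNet would replace the forward/backward lines:

```python
import numpy as np

# Minimal mini-batch gradient descent sketch on a toy linear model.
np.random.seed(0)
X = np.random.randn(1000, 10)          # fake dataset: 1000 examples
y = X @ np.random.randn(10) + 0.1      # fake targets
W = np.zeros(10)                       # parameters
lr, batch_size = 0.1, 100

for step in range(200):
    # 1. Sample a batch of training data (~100 examples)
    i = np.random.choice(len(X), batch_size, replace=False)
    Xb, yb = X[i], y[i]
    # 2. Forward pass: compute loss (averaged over the batch)
    pred = Xb @ W
    loss = 0.5 * np.mean((pred - yb) ** 2)
    # 3. Backward pass: compute gradient
    dW = Xb.T @ (pred - yb) / batch_size
    # 4. Update all parameters
    W -= lr * dW
```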

Page 49

Regularization

Regularization reduces overfitting:

$$L = L_{data} + L_{reg}, \qquad L_{reg} = \lambda \, \tfrac{1}{2} \| W \|_2^2$$

[Andrej Karpathy http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html]
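In code, the L2 term adds λ·½‖W‖² to the loss, so its gradient contributes λW; a sketch, with the value of λ and the stand-in data terms made up:

```python
import numpy as np

# L = L_data + L_reg, with L_reg = lambda * (1/2) * ||W||_2^2.
W = np.random.randn(10)
lam = 1e-3                                    # regularization strength (made-up value)
data_loss, data_grad = 1.0, np.zeros_like(W)  # stand-ins for the data term

loss = data_loss + lam * 0.5 * np.sum(W ** 2)
dW = data_grad + lam * W                      # the L2 term adds lam * W to the gradient
```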

Page 50

Overfitting

[Image: https://en.wikipedia.org/wiki/File:Overfitted_Data.png]

Overfitting: modeling noise in the training set instead of the “true” underlying relationship

Underfitting: insufficiently modeling the relationship in the training set

General rule: models that are “bigger” or have more capacity are more likely to overfit

Page 51

(0) Dataset split

Split your data into “train”, “validation”, and “test”:

Dataset

Train

Validation

Test

Page 52

(0) Dataset split

Train / Validation / Test

Train: gradient descent and fine-tuning of parameters

Validation: determining hyper-parameters (learning rate, regularization strength, etc.) and picking an architecture

Test: estimate real-world performance (e.g. accuracy = fraction correctly classified)
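A minimal sketch of such a split with numpy; the 80/10/10 ratio is a made-up choice, not from the slides:

```python
import numpy as np

# Random train/validation/test split over N examples (80/10/10 here).
N = 1000
idx = np.random.permutation(N)
n_train, n_val = int(0.8 * N), int(0.1 * N)
train_idx = idx[:n_train]
val_idx   = idx[n_train:n_train + n_val]
test_idx  = idx[n_train + n_val:]
```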

Page 53

(0) Dataset split

Train / Validation / Test

Be careful with false discovery: once we have used the test set, we should not use it again (but nobody follows this rule, since collecting datasets is expensive). Instead, try to avoid looking at the test score until the very end.

Page 54

(0) Dataset split

Cross-validation: cycle which data is used as validation.

Split 1: Val   | Train | Train | Test
Split 2: Train | Val   | Train | Test
Split 3: Train | Train | Val   | Test

Average scores across validation splits.
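A sketch of this cycling in code; `train_and_score` is a hypothetical stand-in for training a model on one split and returning its validation score:

```python
import numpy as np

# k-fold cross-validation: cycle which chunk of the training data serves
# as the validation set, then average the scores across splits.
def cross_validate(X, y, train_and_score, k=3):
    folds = np.array_split(np.random.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return np.mean(scores)   # average scores across validation splits
```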

Page 55

(1) Data preprocessing

Figure: Andrej Karpathy

Preprocess the data so that learning is better conditioned:

Page 56

(1) Data preprocessing

Slide: Andrej Karpathy

In practice, you may also see PCA and Whitening of the data:

Page 57

(1) Data preprocessing

Figure: Alex Krizhevsky

For ConvNets, typically only the mean is subtracted.

A per-channel mean also works (one value per R,G,B).
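A sketch of per-channel mean subtraction; the data shapes are made up, and the same training-set mean must also be applied to validation and test images:

```python
import numpy as np

# Subtract the per-channel mean (one value per R, G, B), computed over
# the training set, from every image.
train_images = np.random.rand(100, 32, 32, 3)     # made-up data: (N, H, W, C)
channel_mean = train_images.mean(axis=(0, 1, 2))  # shape (3,)
train_images -= channel_mean                      # apply to train ...
# ... and subtract the same training-set mean from val/test images too.
```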

Page 58

(1) Data preprocessing

Augment the data: extract random crops from the input, with slightly jittered offsets. Without this, typical ConvNets (e.g. [Krizhevsky 2012]) overfit the data.

E.g. 224x224 patches extracted from 256x256 images

Randomly reflect horizontally

Perform the augmentation live during training

Figure: Alex Krizhevsky
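A sketch of this recipe as live augmentation during training; the helper name `augment` and the random stand-in images are made up:

```python
import numpy as np

# Live augmentation: random 224x224 crops from 256x256 images,
# plus a random horizontal reflection.
def augment(image, crop=224):
    h, w, _ = image.shape                      # e.g. 256x256x3
    top = np.random.randint(0, h - crop + 1)   # jittered offset
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:                 # randomly reflect horizontally
        patch = patch[:, ::-1]
    return patch

# Apply fresh augmentation to each batch as it is sampled.
batch = [augment(np.random.rand(256, 256, 3)) for _ in range(8)]
```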