Page 1:

Deep Learning Theory and Practice
Lecture 15

Recurrent Neural Networks

Dr. Ted Willke [email protected]

Tuesday, February 25, 2020

Page 2:

Review of Lecture 14

• GoogLeNet's Inception module

- Multiple filter sizes in one module

- Dimensionality reduction to reduce ops

Page 3:

Review of Lecture 14

• Full GoogLeNet architecture
- Eliminated expensive FC layers for classification
- Global Average Pooling
- Used auxiliary classification units to inject gradient
- Better choice: Batch Normalization

Batch Normalization: given a batch of activations x, normalize x̂ = (x − E[x]) / √(Var[x] + ε), then apply the desired transform y = γx̂ + β (Ioffe and Szegedy 2015). Let the model learn the best distribution! (not forced to have mean 0 and variance 1, but still standardized)
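As a simplified sketch of that transform, here is the training-time batch-norm computation written as plain PyTorch tensor ops; the sizes and eps value are illustrative, and in practice you would use nn.BatchNorm1d/2d, which also tracks running statistics for test time.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: a batch of activations, shape (N, D)
    mu = x.mean(dim=0)                         # per-feature batch mean
    var = x.var(dim=0, unbiased=False)         # per-feature batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)   # standardize: zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale/shift: the "desired transform"

# gamma and beta are learned, so the model can pick the distribution it wants
x = torch.randn(32, 100)
gamma = torch.ones(100, requires_grad=True)
beta = torch.zeros(100, requires_grad=True)
y = batch_norm_train(x, gamma, beta)
```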

Page 4:

Review of Lecture 14

- Implementing Batch Norm in PyTorch
- By convention, normalize the signal at the neural input.

• ResNet: Residual connections, deep networks
- Fit a residual mapping to combat optimization issues
- Also added "bottleneck" layers for dimensionality reduction
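For illustration, a rough PyTorch sketch of a ResNet-style bottleneck block with an identity skip connection; the channel counts and the BatchNorm/ReLU placement here are assumptions for a minimal example, not the exact configuration from the slides.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual block: 1x1 reduce -> 3x3 -> 1x1 expand, plus an identity skip."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False),  # reduce dims
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False),  # restore dims
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # learn a residual mapping F(x) and add it to the identity
        return self.relu(x + self.body(x))

block = Bottleneck(256, 64)
out = block(torch.randn(1, 256, 32, 32))   # same shape in and out
```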

Page 5:

Review of Lecture 14

- Other practicalities
- Xavier, Xavier/2 initialization:
  1. Zero mean
  2. Want same variance into and out of a neuron
  3. Gaussian or uniform distribution
- Want this to hold for both the forward and backward pass
- Initialize the weights with zero mean and variance var(w_i) = 1/n_in, where n_in is the number of inputs (Xavier)
- Take the average of fan-in and fan-out to include backprop: var(w_i) = 2/(n_in + n_out) (Glorot & Bengio)
- Multiply by 2 for ReLU (since the output is 0 for half of its inputs): var(w_i) = 2/n_in (He et al. 2015)

See torch.nn.init!
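A few of the corresponding calls in torch.nn.init (the layer shapes below are arbitrary):

```python
import torch.nn as nn
import torch.nn.init as init

fc = nn.Linear(256, 128)
# Xavier/Glorot: zero mean, var(w) = 2 / (n_in + n_out)
init.xavier_uniform_(fc.weight)
init.zeros_(fc.bias)

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
# He initialization for ReLU layers: var(w) = 2 / n_in
init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
```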

Page 6:

Review of Lecture 14

• Rapid advancement in accuracy and efficiency
• Work needed on processing rates and power

Page 7:

Review of Lecture 14

Many other notable architectures and advancements…

• Network in Network: Influenced both GoogLeNet and ResNet
• More general use of "identity mappings" from ResNet
• Wider and aggregated networks
• Networks with stochastic depth
• Fractal networks
• Densely connected convolutional networks
• Efficient networks (compression)

… to name a few!

Page 8:

Today's Lecture

• Introduction to recurrent neural networks

(Many slides adapted from Stanford's excellent CS231n course. Thank you Fei-Fei Li, Justin Johnson, and Serena Yeung!)

Page 9:

Feedforward neural networks

One input

One output

Page 10:

Recurrent neural networks process sequences

One-to-one One-to-many

E.g., image captioning: image to sequence of words

Page 11:

Recurrent neural networks process sequences

One-to-one One-to-many Many-to-one

E.g., sentiment classification: sequence of words to sentiment

Page 12:

Recurrent neural networks process sequences

One-to-one One-to-many Many-to-one Many-to-many

E.g., machine translation: sequence of words to sequence of words

Page 13:

Recurrent neural networks process sequences

One-to-one One-to-many Many-to-one Many-to-many Many-to-many

E.g., video classification at the frame level

Page 14:

Sequential processing of non-sequential data

Classify images by “sequential spatial attention”

(Gregor et al., “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015)

Page 15:

Sequential processing of non-sequential data

Generate images one portion at a time!

(Gregor et al., “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015)

Page 16:

Recurrent neural network

Want to predict a vector at some time steps

Page 17:

Recurrent neural network

Can process a sequence of vectors x_t by applying a recurrence formula at each time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time t, and f_W is some function with parameters W.

The same function and parameters are used at every time step!

Page 18:

A "plain" recurrent neural network

The state consists of a single "hidden" vector h:

h_t = f_W(h_{t-1}, x_t)
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
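A minimal sketch of this recurrence as code; the dimensions are arbitrary, and bias terms are omitted to match the equations above.

```python
import torch

D_in, D_h, D_out = 10, 20, 5            # input, hidden, output sizes (arbitrary)
W_xh = 0.01 * torch.randn(D_h, D_in)    # input-to-hidden weights
W_hh = 0.01 * torch.randn(D_h, D_h)     # hidden-to-hidden weights (reused every step)
W_hy = 0.01 * torch.randn(D_out, D_h)   # hidden-to-output weights

def rnn_step(h_prev, x_t):
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t)  # new hidden state
    y_t = W_hy @ h_t                              # output at time t
    return h_t, y_t

h = torch.zeros(D_h)                    # initial hidden state
for x_t in torch.randn(7, D_in):        # a length-7 sequence of input vectors
    h, y = rnn_step(h, x_t)
```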

Page 19:

The RNN computation graph

Recurrent feedback can be ‘unrolled’ into a computation graph.

Page 20:

The RNN computation graph

Recurrent feedback can be ‘unrolled’ into a computation graph.

Page 21:

The RNN computation graph

Recurrent feedback can be ‘unrolled’ into a computation graph.

Page 22:

The RNN computation graph

Reuse the same weight matrix at each time step!

When thinking about backprop, a separate gradient will flow back for each time step.

Page 23:

The RNN computation graph (many-to-many)

A second network can generate the output that we desire.

Page 24:

The RNN computation graph (many-to-many)

Can compute an individual loss at every time step as well (e.g., a softmax loss over the labels).

Page 25:

The RNN computation graph (many-to-many)

The total loss for the entire training step is the sum of the per-step losses (compute the gradient of this loss w.r.t. W).
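As a sketch of how the per-step losses combine, assuming a softmax (cross-entropy) loss at each step; y_logits and targets here are stand-in tensors, not the outputs of an actual model.

```python
import torch
import torch.nn.functional as F

T, C = 7, 4                                        # sequence length, number of classes (assumed)
y_logits = torch.randn(T, C, requires_grad=True)   # one score vector per time step (stand-in for RNN outputs)
targets = torch.randint(0, C, (T,))                # one label per time step

# individual (softmax/cross-entropy) loss at each step, summed into the total loss
loss = sum(F.cross_entropy(y_logits[t].unsqueeze(0), targets[t].unsqueeze(0)) for t in range(T))
loss.backward()                                    # gradient of the total loss flows back through every step
```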

Page 26:

The RNN computation graph (many-to-one)

The final hidden state summarizes all of the context (and can be used to predict, e.g., sentiment).

Page 27:

The RNN computation graph (one-to-many)

Fixed-size input, variably-sized output (e.g., image captioning)

Page 28:

Sequence-to-sequence

Variably-sized input and output (e.g., machine translation)

Many-to-one: Encode the input sequence into a single vector…

One-to-many: Decode the output sequence from that single vector…
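A rough sketch of this encode/decode split using PyTorch's built-in nn.RNN; the feature sizes, sequence lengths, and the way decoder inputs are fed are placeholder assumptions.

```python
import torch
import torch.nn as nn

hidden = 128
encoder = nn.RNN(input_size=32, hidden_size=hidden, batch_first=True)  # many-to-one
decoder = nn.RNN(input_size=32, hidden_size=hidden, batch_first=True)  # one-to-many

src = torch.randn(1, 9, 32)              # source sequence of length 9 (placeholder features)
_, h_summary = encoder(src)              # final hidden state summarizes the whole input

tgt_in = torch.randn(1, 6, 32)           # decoder inputs for a length-6 output sequence
dec_out, _ = decoder(tgt_in, h_summary)  # decode, conditioned on the summary vector
```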

Page 29:

Example: Character-level language model

Predicting the next character…

Vocabulary: [h,e,l,o]

Training sequence: “hello”

Page 30:

Example: Character-level language model

Predicting the next character…

Vocabulary: [h,e,l,o]

Training sequence: “hello”

h_t = tanh(W_hh h_{t-1} + W_xh x_t)

Page 31:

Example: Character-level language model

Predicting the next character…

Vocabulary: [h,e,l,o]

Training sequence: “hello”
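Putting the pieces together, a compact training sketch for this character-level example with one-hot inputs; the hidden size, optimizer, learning rate, and iteration count are arbitrary choices.

```python
import torch
import torch.nn.functional as F

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}
seq = "hello"
inputs = torch.tensor([char_to_ix[c] for c in seq[:-1]])   # "hell" -> input indices
targets = torch.tensor([char_to_ix[c] for c in seq[1:]])   # "ello" -> next-character labels

D_h, V = 16, len(vocab)

def param(*shape):
    return (0.01 * torch.randn(*shape)).requires_grad_()

W_xh, W_hh, W_hy = param(D_h, V), param(D_h, D_h), param(V, D_h)
b_h, b_y = torch.zeros(D_h, requires_grad=True), torch.zeros(V, requires_grad=True)
opt = torch.optim.Adam([W_xh, W_hh, W_hy, b_h, b_y], lr=1e-2)

for step in range(300):
    h = torch.zeros(D_h)
    loss = 0.0
    for t in range(len(inputs)):
        x = F.one_hot(inputs[t], num_classes=V).float()    # one-hot input character
        h = torch.tanh(W_hh @ h + W_xh @ x + b_h)          # recurrence
        logits = W_hy @ h + b_y                            # scores over the vocabulary
        loss = loss + F.cross_entropy(logits.unsqueeze(0), targets[t].unsqueeze(0))
    opt.zero_grad()
    loss.backward()
    opt.step()
```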

Page 32:

Example: Character-level language model (sampling)

Vocabulary: [h,e,l,o]

At test time, sample characters one at a time and feed them back to the model.

Page 33:

Example: Character-level language model (sampling)

Vocabulary: [h,e,l,o]

At test time, sample characters one at a time and feed them back to the model.

Page 34:

Example: Character-level language model (sampling)

Vocabulary: [h,e,l,o]

At test time, sample characters one at a time and feed them back to the model.

Page 35:

Example: Character-level language model (sampling)

Vocabulary: [h,e,l,o]

At test time, sample characters one at a time and feed them back to the model.
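A sketch of that sampling loop, continuing from the training sketch above (it reuses the vocab, char_to_ix, and trained weights defined there):

```python
import torch
import torch.nn.functional as F

def sample(seed_char, n_chars):
    """Sample characters one at a time, feeding each one back in as the next input."""
    with torch.no_grad():
        h = torch.zeros(D_h)
        ix = char_to_ix[seed_char]
        out = [seed_char]
        for _ in range(n_chars):
            x = F.one_hot(torch.tensor(ix), num_classes=V).float()
            h = torch.tanh(W_hh @ h + W_xh @ x + b_h)
            probs = F.softmax(W_hy @ h + b_y, dim=0)     # distribution over the vocabulary
            ix = torch.multinomial(probs, 1).item()      # sample the next character
            out.append(vocab[ix])
    return "".join(out)

print(sample('h', 4))   # ideally "hello" once training has converged
```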

Page 36:

Backpropagation through time

Run forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

Page 37:

Truncated backpropagation through time

Run forward and backward through 'chunks' of the sequence instead of the entire sequence.

Page 38:

Truncated backpropagation through time

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
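A sketch of truncated BPTT in PyTorch: the hidden state is carried across chunks but detached, so gradients only flow back through the current chunk. The model, data, and chunk length are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 8)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(1, 1000, 8)           # one long input sequence (placeholder data)
y = torch.randint(0, 8, (1, 1000))    # per-step targets (placeholder labels)
chunk = 50                            # backpropagate through at most 50 steps

h = torch.zeros(1, 1, 32)             # (num_layers, batch, hidden)
for start in range(0, x.size(1), chunk):
    xc = x[:, start:start + chunk]
    yc = y[:, start:start + chunk]
    out, h = rnn(xc, h)                               # forward through one chunk
    loss = F.cross_entropy(head(out).reshape(-1, 8), yc.reshape(-1))
    opt.zero_grad()
    loss.backward()                                   # gradients stop at the chunk boundary
    opt.step()
    h = h.detach()                                    # carry the state forward, cut the gradient
```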

Page 39:

Truncated backpropagation through time

Page 40:

Further reading

• Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2020) Dive into Deep Learning, Release 0.7.1. https://d2l.ai/

• Stanford CS231n Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/

• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.

• Goodfellow et al. (2016) Deep Learning. https://www.deeplearningbook.org/

• Boyd, S., and Vandenberghe, L. (2018) Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares. http://vmls-book.stanford.edu/

• VanderPlas, J. (2016) Python Data Science Handbook. https://jakevdp.github.io/PythonDataScienceHandbook/
