Page 1

Neural Network Part 4: Recurrent Neural Networks

CS 760@UW-Madison

Page 2

Goals for the lecture

You should understand the following concepts:
• sequential data
• computational graph
• recurrent neural networks (RNN) and their advantages
• training recurrent neural networks
• LSTM and GRU
• encoder-decoder RNNs


Page 3

Introduction

Page 4

Recurrent neural networks

• Dates back to (Rumelhart et al., 1986)
• A family of neural networks for handling sequential data, which involves variable-length inputs or outputs
• Especially useful for natural language processing (NLP)

Page 5

Sequential data

• Each data point: a sequence of vectors $x^{(t)}$, for $1 \le t \le \tau$
• Batch data: many sequences with different lengths $\tau$
• Label: can be a scalar, a vector, or even a sequence

• Examples
  • Sentiment analysis
  • Machine translation

Page 6

Example: machine translation

Figure from: devblogs.nvidia.com

Page 7

More complicated sequential data

• Data point: two-dimensional sequences, like images
• Label: a different type of sequence, like text sentences

• Example: image captioning

Page 8

Image captioning

Figure from the paper “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, by Justin Johnson, Andrej Karpathy, Li Fei-Fei

Page 9

Computational graphs

Page 10

A typical dynamic system

$$s^{(t+1)} = f\big(s^{(t)}; \theta\big)$$

Figure from Deep Learning, Goodfellow, Bengio and Courville

Page 11

A system driven by external data

$$s^{(t+1)} = f\big(s^{(t)}, x^{(t+1)}; \theta\big)$$

Figure from Deep Learning, Goodfellow, Bengio and Courville

Page 12

Compact view

$$s^{(t+1)} = f\big(s^{(t)}, x^{(t+1)}; \theta\big)$$

Figure from Deep Learning, Goodfellow, Bengio and Courville

Page 13

Compact view

$$s^{(t+1)} = f\big(s^{(t)}, x^{(t+1)}; \theta\big)$$

Figure from Deep Learning, Goodfellow, Bengio and Courville

Key: the same $f$ and $\theta$ for all time steps

Square: one-step time delay

Page 14

Recurrent neural networks (RNN)

Page 15

Recurrent neural networks

• Use the same computational function and parameters across different time steps of the sequence

• Each time step: takes the input entry and the previous hidden state to compute the output entry

• Loss: typically computed at every time step
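As a concrete illustration of these points, here is a minimal NumPy sketch of an RNN forward pass. The parameter names $U$, $W$, $V$, $b$, $c$ and the tanh/softmax choices follow the standard formulation in the Deep Learning book rather than anything shown explicitly on this slide.

```python
import numpy as np

def rnn_forward(x_seq, y_seq, U, W, V, b, c, s0):
    """Minimal RNN forward pass: the same function and parameters at every step.

    x_seq: list of input vectors x^(t); y_seq: list of integer labels y^(t).
    """
    s = s0
    total_loss = 0.0
    outputs = []
    for x, y in zip(x_seq, y_seq):
        # new state from the previous state and the current input (shared U, W, b)
        s = np.tanh(b + W @ s + U @ x)
        # output entry from the current state (shared V, c)
        o = c + V @ s
        y_hat = np.exp(o - o.max()); y_hat /= y_hat.sum()   # softmax
        total_loss += -np.log(y_hat[y])                     # loss at every time step
        outputs.append(y_hat)
    return outputs, total_loss

# toy usage: 3 time steps, 4-dim inputs, 5-dim state, 2 output classes
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
b, c, s0 = np.zeros(5), np.zeros(2), np.zeros(5)
xs = [rng.normal(size=4) for _ in range(3)]
ys = [0, 1, 0]
_, loss = rnn_forward(xs, ys, U, W, V, b, c, s0)
print(loss)
```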

Page 16

Recurrent neural networks

Figure from Deep Learning, by Goodfellow, Bengio and Courville

[Figure: unrolled RNN computational graph, annotated at each time step with the input, state, output, loss, and label.]

Page 17

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Math formula:
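One standard way to write this update (following the Deep Learning book's notation, with the hidden state written as $s^{(t)}$ to match this deck; the tanh state update and softmax output are assumptions) is:

$$
\begin{aligned}
a^{(t)} &= b + W s^{(t-1)} + U x^{(t)}\\
s^{(t)} &= \tanh\big(a^{(t)}\big)\\
o^{(t)} &= c + V s^{(t)}\\
\hat{y}^{(t)} &= \operatorname{softmax}\big(o^{(t)}\big), \qquad L^{(t)} = -\log \hat{y}^{(t)}_{\,y^{(t)}}
\end{aligned}
$$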

Page 18

Advantage

• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce the capacity, which is good for generalization in learning
• Explicitly use the prior knowledge that the sequential data can be processed in the same way at different time steps (e.g., NLP)

Page 19

Advantage

• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce the capacity, which is good for generalization in learning
• Explicitly use the prior knowledge that the sequential data can be processed in the same way at different time steps (e.g., NLP)

• Yet still powerful (actually universal): any function computable by a Turing machine can be computed by such a recurrent network of a finite size (see, e.g., Siegelmann and Sontag (1995))

Page 20

Training RNN

• Principle: unfold the computational graph, and use backpropagation

• Called the back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques

Page 21

Training RNN

• Principle: unfold the computational graph, and use backpropagation

• Called the back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques

• Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters

Page 22

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Math formula:

Page 23

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Gradient at $L^{(t)}$ (the total loss is the sum of the per-step losses):
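With the total loss $L = \sum_t L^{(t)}$, the gradient with respect to each per-step loss is simply:

$$\frac{\partial L}{\partial L^{(t)}} = 1$$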

Page 24

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Gradient at $o^{(t)}$:
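Assuming the softmax output and cross-entropy loss written above, the standard expression (as in the Deep Learning book's derivation) is:

$$\big(\nabla_{o^{(t)}} L\big)_i = \frac{\partial L}{\partial o^{(t)}_i} = \hat{y}^{(t)}_i - \mathbf{1}_{\,i = y^{(t)}}$$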

Page 25

Page 26

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Gradient at $s^{(\tau)}$:
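At the final time step the state $s^{(\tau)}$ affects the loss only through $o^{(\tau)}$, so under the assumed model:

$$\nabla_{s^{(\tau)}} L = V^\top\, \nabla_{o^{(\tau)}} L$$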

Page 27

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Gradient at $s^{(t)}$:
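For $t < \tau$, the state $s^{(t)}$ affects the loss both through $o^{(t)}$ and through the next state $s^{(t+1)}$; with the assumed tanh update this gives the back-recursion:

$$\nabla_{s^{(t)}} L = W^\top \operatorname{diag}\!\Big(1 - \big(s^{(t+1)}\big)^2\Big)\, \nabla_{s^{(t+1)}} L \;+\; V^\top\, \nabla_{o^{(t)}} L$$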

Page 28

Recurrent neural networks

Figure from Deep Learning, Goodfellow, Bengio and Courville

Gradient at parameter $V$:
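Since $V$ is shared across time steps, its gradient accumulates the per-step contributions (again under the assumed model above):

$$\nabla_{V} L = \sum_{t} \big(\nabla_{o^{(t)}} L\big)\, \big(s^{(t)}\big)^\top$$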

Page 29

The problem of exploding/vanishing gradients

• What happens to the magnitude of the gradients as we backpropagate through many layers?
  – If the weights are small, the gradients shrink exponentially.
  – If the weights are big, the gradients grow exponentially.

• Typical feed-forward neural nets can cope with these exponential effects because they only have a few hidden layers.

• In an RNN trained on long sequences (e.g., 100 time steps) the gradients can easily explode or vanish.
  – We can avoid this by initializing the weights very carefully.

• Even with good initial weights, it is very hard to detect that the current target output depends on an input from many time steps ago.
  – So RNNs have difficulty dealing with long-range dependencies.
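To see where the exponential behaviour comes from, note that the gradient reaching an early state is a product of per-step Jacobians (a sketch under the tanh-RNN assumption used above):

$$\frac{\partial s^{(\tau)}}{\partial s^{(t)}} = \prod_{k=t+1}^{\tau} \frac{\partial s^{(k)}}{\partial s^{(k-1)}} = \prod_{k=t+1}^{\tau} \operatorname{diag}\!\Big(1-\big(s^{(k)}\big)^2\Big)\, W$$

Its norm therefore scales roughly like the largest singular value of $W$ raised to the power $\tau - t$: it shrinks toward zero when that value is below 1 and blows up when it is above 1.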

Page 30

Long Short-Term Memory (LSTM) Cell

[Figure: LSTM cell with input gate $i_t$, output gate $o_t$, and forget gate $f_t$, each fed by $x_t$ and $h_{t-1}$ through weights $W_i$, $W_o$, $W_f$; the cell input uses weights $W$, the cell state runs from $c_{t-1}$ to $c_t$, and the output is $h_t$.]

$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\!\left(W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$$

$$f_t = \sigma\!\left(W_f \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_f\right)$$

$$h_t = o_t \otimes \tanh(c_t)$$

Similarly for $i_t$ and $o_t$.

* Dashed line indicates time-lag
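A minimal NumPy sketch that transcribes the cell equations above; the input- and output-gate biases $b_i$, $b_o$ are assumed by analogy with $b_f$, and all gates read the stacked vector $[x_t; h_{t-1}]$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, Wi, Wf, Wo, bi, bf, bo):
    """One LSTM step: gated update of the cell state c and hidden output h."""
    z = np.concatenate([x_t, h_prev])              # stacked (x_t, h_{t-1})
    i_t = sigmoid(Wi @ z + bi)                     # input gate
    f_t = sigmoid(Wf @ z + bf)                     # forget gate
    o_t = sigmoid(Wo @ z + bo)                     # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)      # new cell state
    h_t = o_t * np.tanh(c_t)                       # new hidden output
    return h_t, c_t
```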

Page 31

Gated Recurrent Unit (GRU) Cell

Figure modified from Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/
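The slide gives only the figure; for reference, the standard GRU equations (taken from the usual formulation of Cho et al., not transcribed from the slide, with $[\cdot;\cdot]$ denoting stacking as in the LSTM slide and biases omitted) are:

$$
\begin{aligned}
z_t &= \sigma\big(W_z [x_t; h_{t-1}]\big)\\
r_t &= \sigma\big(W_r [x_t; h_{t-1}]\big)\\
\tilde{h}_t &= \tanh\big(W [x_t;\, r_t \otimes h_{t-1}]\big)\\
h_t &= (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t
\end{aligned}
$$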

Page 32

Seq2seq (Encoder-Decoder)

[Figure: two RNNs. An encoder reads the source sentence 知 识 就 是 力 量 <end> ("knowledge is power") as inputs x1…x7, producing states e1…e7 from an initial state e0; a decoder with states d0…d4 then generates y1…y4: "knowledge is power <end>".]

• Good for machine translation
• Two RNNs
• Encoding ends when reading <end>
• Decoding ends when generating <end>
• All input encoded in e7 (difficult)
• Encoder can be a CNN on an image instead (image captioning)
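A minimal sketch of the two-RNN structure described above, using a generic tanh state update; the greedy decoding loop, the embedding table, and the output layer ($V$, $c$) are simplifying assumptions, not details taken from the slide:

```python
import numpy as np

def rnn_step(s_prev, x, W, U, b):
    """Generic recurrent update: s_t = tanh(W s_{t-1} + U x_t + b)."""
    return np.tanh(W @ s_prev + U @ x + b)

def seq2seq_greedy(src_vecs, enc, dec, V, c, embed, start_vec, end_id, max_len=20):
    """Encoder reads the whole source; decoder generates until it emits <end>."""
    # encoder: e0 -> e1 -> ... -> eT; only the final state is passed on
    e = np.zeros(enc["W"].shape[0])
    for x in src_vecs:
        e = rnn_step(e, x, enc["W"], enc["U"], enc["b"])
    # decoder: starts from the final encoder state and feeds back its own outputs
    d, y_vec, tokens = e, start_vec, []
    for _ in range(max_len):
        d = rnn_step(d, y_vec, dec["W"], dec["U"], dec["b"])
        y_id = int(np.argmax(V @ d + c))     # greedy choice of the next token
        if y_id == end_id:
            break                            # decoding ends when <end> is generated
        tokens.append(y_id)
        y_vec = embed[y_id]                  # feed the chosen token back in
    return tokens
```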

Page 33

Seq2seq with Attention

[Figure: the same encoder-decoder; while generating y1, the decoder attends to the encoder states e1…e7 with attention weights a11, a12, a13, …, a17, which are combined into a context vector c1.]

$$c_1 = \sum_{t=1}^{7} a_{1t}\, e_t, \qquad \sum_{t=1}^{7} a_{1t} = 1$$

($c_1$: context; $a_{1t}$: attention weights)

Page 34

Seq2seq with Attention

[Figure: the same diagram; while generating y2, the decoder uses attention weights a21, a22, a23, a24, …, a27 over e1…e7, combined into a context vector c2.]

$$c_2 = \sum_{t=1}^{7} a_{2t}\, e_t, \qquad \sum_{t=1}^{7} a_{2t} = 1$$

… and so on for later decoding steps.

Page 35

Attention weights

[Figure: the same diagram, highlighting how the attention weights a21, …, a27 and the context c2 are computed.]

• $a_{st}$: the amount of attention $y_s$ should pay to $e_t$
• A score $z_{st}$ is computed from the previous decoder state $d_{s-1}$ and the encoder state $e_t$
• $a_{s\cdot} = \operatorname{softmax}(z_{s\cdot})$

https://arxiv.org/abs/1409.0473
https://arxiv.org/abs/1409.3215
https://arxiv.org/abs/1406.1078
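A minimal sketch of one attention step; the additive scoring form $z_{st} = v^\top \tanh(W_a d_{s-1} + U_a e_t)$ is an assumption in the spirit of the first paper linked above, not a transcription of the slide:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(d_prev, enc_states, v, Wa, Ua):
    """Context c_s = sum_t a_st * e_t, with a_s. = softmax(z_s.)."""
    # score each encoder state e_t against the previous decoder state d_{s-1}
    z = np.array([v @ np.tanh(Wa @ d_prev + Ua @ e_t) for e_t in enc_states])
    a = softmax(z)                                            # attention weights, sum to 1
    c = sum(a_st * e_t for a_st, e_t in zip(a, enc_states))   # context vector
    return c, a
```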

Page 36

THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Yingyu Liang, Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, Pedro Domingos, and Geoffrey Hinton.