
Sequence-to-Sequence Models

Kevin Duh, Johns Hopkins University

May 2019


Outline

1. Problem Definition

2. Recurrent Model with Attention

3. Transformer Model

Machine Learning Abstractions

• Training data

• Input: x / Output: y

• Lots of pairs {(x_i, y_i)}, i = 1, 2, …, N

• Goal: Build model F(x) on training data that generalizes to test data: y_prediction = F(x_test), then compare y_prediction vs. y_truth

• What is the structure of x and y?



Standard classification problem

• x is a vector in R^D

• y is a label from {class1, class2, class3, … classK}

• A neural net for F(x):

• x=[x1; x2; x3; x4]

• h=nonlinear(W*x)

• y=softmax(M*h)
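
A minimal NumPy sketch of this classifier (the tanh nonlinearity and the sizes D=4, H=8, K=5 are illustrative assumptions, not from the slides):

```python
import numpy as np

def F(x, W, M):
    """Feed-forward classifier: h = nonlinear(W*x), y = softmax(M*h)."""
    h = np.tanh(W @ x)                     # hidden layer with tanh nonlinearity
    logits = M @ h
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()                     # probabilities over the K classes

rng = np.random.default_rng(0)
D, H, K = 4, 8, 5                          # assumed sizes: input, hidden, classes
x = rng.normal(size=D)                     # x = [x1; x2; x3; x4]
W = rng.normal(size=(H, D))
M = rng.normal(size=(K, H))
print(F(x, W, M))                          # one probability per class
```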



Image classification example

• Image feature: x = 960x720 256 RGB vector

• y = {dog, cat, squirrel, alligator, dinosaur}

Image from: https://commons.wikimedia.org/wiki/File:This_is_a_very_cute_dog.jpg

More complex problems

• Complex Input:

• x is a sequence of L vectors/words: R^(D×L)

• y is a label from {class1, class2, class3, … classK}

• Example: mention span to NE type classification

• Complex Input and Output:

• x is a sequence of L vectors/words

• y is a sequence of J vectors/words



Sequence Output Example: Image Captioning

• Image feature: x = 960x720 256 RGB vector

• Caption text generation output space: { all possible English sentences }

• Example captions: "a cute dog", "a very cute dog", "super cute puppy", "adorable puppy looking at me", …


Sequence-to-Sequence Example: Machine Translation

das Haus ist gross  →  the house is big


Sequence-to-Sequence Example: Named Entity Recognition

• Input: Albert lives in Baltimore

• Output (NER Tagger): PER NONE NONE LOC

Handling sequences

• For sequence input:

• We need an “encoder” to convert arbitrary length input to some fixed-length hidden representation

• Without this, it may be hard to apply matrix operations

• For sequence output:

• We need a “decoder” to generate arbitrary length output

• One method: generate one word at a time, until a special <stop> token is produced (see the sketch below)
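
A minimal sketch of this decoding loop (the decoder_step interface, the toy vocabulary, and the token ids below are hypothetical, just to make the loop runnable):

```python
def greedy_decode(decoder_step, sentence_vector, bos_id, stop_id, max_len=20):
    """Generate one word at a time until the special <stop> token."""
    state, prev, output = sentence_vector, bos_id, []
    for _ in range(max_len):
        probs, state = decoder_step(prev, state)   # softmax over the whole vocab
        prev = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        if prev == stop_id:
            break
        output.append(prev)
    return output

# Toy decoder_step that deterministically emits "the house is big <stop>"
vocab = ["<bos>", "<stop>", "the", "house", "is", "big"]
script = {0: 2, 2: 3, 3: 4, 4: 5, 5: 1}
def decoder_step(prev, state):
    probs = [0.0] * len(vocab)
    probs[script[prev]] = 1.0
    return probs, state

print([vocab[i] for i in greedy_decode(decoder_step, None, bos_id=0, stop_id=1)])
# -> ['the', 'house', 'is', 'big']
```

Beam search would keep several candidate outputs per step instead of the single argmax.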

Example: Machine Translation

• Input: das Haus ist gross

• Encoder → "Sentence Vector" → Decoder

• Decoder output: step 1: the, step 2: house, step 3: is, step 4: big, step 5: <stop>

• Each step applies a softmax over all vocab


Outline

1. Problem Definition

2. Recurrent Model with Attention

3. Transformer Model


Sequence modeling with a recurrent network

• Example sequence: the house is big .

(The following animations in the original slides are courtesy of Philipp Koehn: http://mt-class.org/jhu)
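
A minimal sketch of the recurrence over this example sentence, using a plain Elman-style RNN cell with random toy parameters (real systems use learned embeddings and LSTM/GRU cells):

```python
import numpy as np

rng = np.random.default_rng(0)
words = "the house is big .".split()
D, H = 6, 8                                   # assumed embedding and hidden sizes
emb = {w: rng.normal(size=D) for w in words}  # toy word embeddings
Wx, Wh = rng.normal(size=(H, D)), rng.normal(size=(H, H))

h = np.zeros(H)                               # initial hidden state
for w in words:                               # one recurrent step per word
    h = np.tanh(Wx @ emb[w] + Wh @ h)         # h_t = tanh(Wx*x_t + Wh*h_{t-1})
print(h)                                      # final state summarizes the sequence
```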


Recurrent models for sequence-to-sequence problems

• We can use these models for both input and output

• For output, there is the constraint of left-to-right generation

• For input, we are given the whole sentence at once, so we can do both left-to-right and right-to-left modeling

• The recurrent units may be based on LSTM, GRU, etc.


Bidirectional Encoder for Input Sequence

• Word embedding: word meaning in isolation

• Hidden state of each Recurrent Neural Net (RNN): word meaning in this sentence
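
A sketch of such an encoder in PyTorch (vocabulary and layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 1000, 32, 64    # assumed sizes
embedding = nn.Embedding(vocab_size, emb_dim) # word meaning in isolation
encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                  bidirectional=True)         # left-to-right + right-to-left RNNs

src = torch.randint(0, vocab_size, (1, 4))    # e.g. a 4-word source sentence
states, _ = encoder(embedding(src))           # word meaning in this sentence
print(states.shape)                           # (1, 4, 2*hidden): one h_j per word
```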


Left-to-Right Decoder

• Input context comes from encoder

• Each output is informed by current hidden state and previous output word

• Hidden state is updated at every step


In detail: each step (simplified view)

• Context contains information from the encoder/input
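
One common simplified formulation of this step (a sketch consistent with the attention notation below, not taken verbatim from the slides):

• s_i = f(s_{i-1}, y_{i-1}, c_i)   (new hidden state from the previous state, the previous output word, and the input context c_i)

• y_i = softmax(W*s_i)   (probability distribution over the output vocabulary)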


What connects the encoder and decoder

• Input context c_i is a fixed-dimensional vector: a weighted average of all L encoder RNN states h_j

• How to compute the weighting? Attention mechanism: relevance weights α_j are computed from the previous decoder state s_{i-1} and each encoder state h_j, then c_i = Σ_j α_j * h_j

• Note this changes at each step i: what is paid attention to has more influence on the next prediction
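
A NumPy sketch of this attention-weighted context (sizes are illustrative; a dot product is used as the relevance score here, one of several possible choices):

```python
import numpy as np

def attention_context(s_prev, H):
    """c_i = sum_j alpha_j * h_j, with alpha = softmax over relevance scores."""
    scores = H @ s_prev                  # relevance of each h_j to decoder state s_{i-1}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # attention weights alpha_0 ... alpha_{L-1}
    return alpha @ H, alpha              # fixed-dimensional context vector c_i

rng = np.random.default_rng(0)
L, d = 7, 16                             # assumed: 7 encoder states of size 16
H = rng.normal(size=(L, d))              # encoder hidden states h_j
s_prev = rng.normal(size=d)              # previous decoder state s_{i-1}
c_i, alpha = attention_context(s_prev, H)
print(alpha.round(2), c_i.shape)         # weights sum to 1; c_i has fixed size d
```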


To wrap up: Recurrent models with attention

1. Encoder takes in arbitrary-length input

2. Decoder generates output one word at a time, using current hidden state, input context (from attention), and previous output

Note: we can add layers to make this model “deeper”


Outline

1. Problem Definition

2. Recurrent Model with Attention

3. Transformer Model

Motivation of Transformer Model

• RNNs are great, but have two drawbacks:

• Sequential structure is hard to parallelize, which may slow down GPU computation

• Still hard to model some kinds of long-term dependency (though partly addressed by LSTM/GRU)

• Transformers solve the sequence-to-sequence problem using only attention mechanisms, no RNN


Long-term dependency

• Dependencies between:

• Input-output words

• Two input words

• Two output words

• The attention mechanism "shortens" the path between input and output words. What about the others?


Attention, more abstractly

• Previous attention formulation: the decoder state s_{i-1} acts as the query, the encoder states h_j are the keys and values, and the weights α_j measure relevance

• Abstract formulation: scaled dot-product attention for queries Q, keys K, values V:

  Attention(Q, K, V) = softmax(Q*K^T / sqrt(d_k)) * V
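
A NumPy sketch of scaled dot-product attention (single example, no batching or masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q*K^T / sqrt(d_k)) * V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each key to each query
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over the keys
    return w @ V                                    # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))                         # 3 queries, d_k = 8
K = rng.normal(size=(7, 8))                         # 7 keys
V = rng.normal(size=(7, 8))                         # 7 values
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```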


Multi-head Attention

• For expressiveness, apply scaled dot-product attention multiple times in parallel

• Add a different linear transform for each key, query, and value
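
A sketch of the multi-head idea, reusing scaled_dot_product_attention and the toy Q, K, V, rng from the sketch above (real models learn the per-head projection matrices and add a final output projection):

```python
def multi_head_attention(Q, K, V, num_heads, rng):
    """Run scaled dot-product attention once per head, each head with its own
    random linear transforms of Q, K, V, then concatenate the results."""
    d_model = Q.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    return np.concatenate(heads, axis=-1)           # back to d_model dimensions

print(multi_head_attention(Q, K, V, num_heads=2, rng=rng).shape)   # (3, 8)
```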


Putting it together

• Multiple (N) layers

• For encoder-decoder attention, Q: previous decoder layer, K and V: output of encoder

• For encoder self-attention, Q/K/V all come from previous encoder layer

• For decoder self-attention, allow each position to attend to all positions up to that position

• Positional encoding for word order
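
As one concrete example of positional encoding, a NumPy sketch of the sinusoidal scheme from the original Transformer paper (sizes are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: gives each position a distinct pattern,
    so an attention-only model can recover word order."""
    pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]            # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims: cosine
    return pe

print(positional_encoding(max_len=6, d_model=8).shape)   # (6, 8): added to word embeddings
```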


From: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html


Summary

1. Problem Definition:

• Sequence-to-sequence problems are more complex, but can be solved by (a) encoding the input to a fixed representation and (b) decoding the output one word at a time

2. Recurrent Model with Attention

• Bidirectional RNN encoder, RNN decoder, and an attention-based context vector tying them together

3. Transformer Model

• Another way to solve sequence problems, without using sequential models


Research directions

• Lots!!