Lecture 12: Sequence to sequence models (fall97.class.vision/slides/12.pdf)

Lecture 12 · SRTTU – A. Akhavan · Saturday, 10 Azar 1397

Lecture 12: Sequence to sequence models

Alireza Akhavan Pour

CLASS.VISION

https://akhavanpour.ir/

http://class.vision/


Sequence to sequence model: Introduction and concepts






Sequence to sequence model

Input (French): Jane visite l’Afrique en septembre
$x^{<1>}\; x^{<2>}\; x^{<3>}\; x^{<4>}\; x^{<5>}$

Output (English): Jane is visiting Africa in September.
$y^{<1>}\; y^{<2>}\; y^{<3>}\; y^{<4>}\; y^{<5>}\; y^{<6>}$

[Figure: an encoder RNN starts from $a^{<0>}$ and reads $x^{<1>}, \dots, x^{<T_x>}$; a decoder RNN then emits $y^{<1>}, \dots, y^{<6>}$.]

[Cho et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation]

[Sutskever et al., 2014. Sequence to sequence learning with neural networks]
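The encoder/decoder split can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights are random and untrained, the sizes `HIDDEN`, `EMBED` and `VOCAB_OUT` are hypothetical, a vanilla tanh RNN cell stands in for whatever cell the real model uses, and the decoder simply picks the most likely token at each step.

```python
import math
import random

random.seed(0)

HIDDEN = 4     # size of the RNN state (hypothetical)
EMBED = 3      # size of each input word vector (hypothetical)
VOCAB_OUT = 5  # size of the output vocabulary (hypothetical)


def rand_matrix(rows, cols):
    """Random, untrained weights; a real model would learn these."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(W_a, W_x, a_prev, x):
    """One vanilla RNN step: a = tanh(W_a a_prev + W_x x)."""
    return [math.tanh(p + q) for p, q in zip(matvec(W_a, a_prev), matvec(W_x, x))]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def encode(xs, W_a, W_x):
    a = [0.0] * HIDDEN              # a<0> is the zero vector
    for x in xs:                    # read x<1>, ..., x<Tx>
        a = rnn_step(W_a, W_x, a, x)
    return a                        # final state summarises the input sentence


def decode(a, W_a, W_y, W_out, steps):
    """Emit `steps` token ids, feeding each chosen token back as the next input."""
    y_prev = [0.0] * VOCAB_OUT      # no previous word at the first step
    out = []
    for _ in range(steps):
        a = rnn_step(W_a, W_y, a, y_prev)
        probs = softmax(matvec(W_out, a))
        k = max(range(VOCAB_OUT), key=probs.__getitem__)
        out.append(k)
        y_prev = [1.0 if i == k else 0.0 for i in range(VOCAB_OUT)]  # one-hot
    return out


enc_Wa, enc_Wx = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, EMBED)
dec_Wa, dec_Wy = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, VOCAB_OUT)
W_out = rand_matrix(VOCAB_OUT, HIDDEN)

source = [[random.uniform(-1, 1) for _ in range(EMBED)] for _ in range(5)]  # x<1..5>
state = encode(source, enc_Wa, enc_Wx)
tokens = decode(state, dec_Wa, dec_Wy, W_out, steps=6)  # y<1..6> as token ids
print(tokens)
```

The only link between the two halves is `state`: the decoder never sees the input words directly, only the vector the encoder produced.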



Image captioning

Example caption: A cat sitting on a chair ($y^{<1>}, \dots, y^{<6>}$)

[Figure: a CNN (AlexNet-style layer sizes) encodes the image $x$, and an RNN decoder emits $\hat{y}^{<1>}, \hat{y}^{<2>}, \dots, \hat{y}^{<T_y>}$. CNN layers as drawn: 11×11 conv, s = 4 → 55×55×96; 3×3 MAX-POOL, s = 2 → 27×27×96; 5×5 conv, same → 27×27×256; 3×3 MAX-POOL, s = 2 → 13×13×256; 3×3 conv, same → 13×13×384; 3×3 conv → 13×13×384; 3×3 conv → 13×13×256; 3×3 MAX-POOL, s = 2 → 6×6×256 = 9216 → 4096 → 4096 → Softmax (1000).]

[Mao et al., 2014. Deep captioning with multimodal recurrent neural networks]

[Vinyals et al., 2014. Show and tell: Neural image caption generator]

[Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions]



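Machine translation as building a conditional language model

Language model: $P(y^{<1>}, \dots, y^{<T_y>})$

[Figure: an RNN starts from $a^{<0>}$ (the zero vector) and outputs $\hat{y}^{<1>}, \hat{y}^{<2>}, \dots, \hat{y}^{<T_y>}$; each output is fed back as the next input, $x^{<2>} = \hat{y}^{<1>}$.]

Machine translation: $P(y^{<1>}, \dots, y^{<T_y>} \mid x^{<1>}, \dots, x^{<T_x>})$ (conditional language model)

[Figure: an encoder RNN starts from $a^{<0>}$ and reads $x^{<1>}, \dots, x^{<T_x>}$; the decoder is the same language model, but it starts from the state vector produced by the encoder instead of the zero vector, and outputs $\hat{y}^{<1>}, \dots, \hat{y}^{<T_y>}$.]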



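Finding the most likely translation

French: Jane visite l’Afrique en septembre.

The model $P(y^{<1>}, \dots, y^{<T_y>} \mid x)$ assigns a probability to every candidate English translation, for example:

• Jane is going to be visiting Africa in September.
• In September, Jane will visit Africa.
• Her African friend welcomed Jane in September.

The goal is the most likely one:

$$\arg\max_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} \mid x)$$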




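Why not a greedy search?

[Figure: the encoder reads $x^{<1>}, \dots, x^{<T_x>}$ starting from $a^{<0>}$; the decoder outputs $\hat{y}^{<1>}, \dots, \hat{y}^{<T_y>}$.]

Objective:

$$\arg\max_{y} P(y^{<1>}, y^{<2>}, \dots, y^{<T_y>} \mid x)$$

Greedy search picks the single most likely word at each step, which does not necessarily maximize the joint probability of the whole sentence. It can commit early to a wordier translation such as:

Jane is going to be visiting Africa in September.

because

$$P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$$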
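A toy comparison makes the failure concrete. The sketch below uses made-up probabilities for the two words that follow "jane is" (all numbers and the word lists are assumptions for illustration, not model outputs) and compares the greedy choice with an exhaustive search over every two-word continuation.

```python
# Hypothetical probabilities of the next word after the prefix "jane is".
FIRST = {"going": 0.6, "visiting": 0.4}

# Hypothetical probabilities of the word after that, given the previous choice.
SECOND = {
    "going": {"to": 0.55, "home": 0.45},
    "visiting": {"africa": 0.9, "rome": 0.1},
}


def greedy():
    """Pick the most likely word at each step, never looking ahead."""
    w1 = max(FIRST, key=FIRST.get)
    w2 = max(SECOND[w1], key=SECOND[w1].get)
    return (w1, w2), FIRST[w1] * SECOND[w1][w2]


def exhaustive():
    """Score every two-word continuation and return the most likely one."""
    candidates = [
        ((w1, w2), FIRST[w1] * p2)
        for w1 in FIRST
        for w2, p2 in SECOND[w1].items()
    ]
    return max(candidates, key=lambda c: c[1])


greedy_words, greedy_prob = greedy()
best_words, best_prob = exhaustive()
print("greedy:    ", greedy_words, round(greedy_prob, 2))
print("exhaustive:", best_words, round(best_prob, 2))
```

Greedy commits to "going" because 0.6 > 0.4, yet the best continuation of "visiting" gives a higher joint probability (0.36 versus 0.33). Exhaustive search is only feasible here because the toy vocabulary is tiny.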



Beam search




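Beam search algorithm
B = 3 (beam width)

Step 1

[Figure: the encoder reads the French input $x^{<1>}, \dots, x^{<T_x>}$ starting from $a^{<0>}$; the decoder's first softmax gives $P(y^{<1>} \mid x)$ over the 10000-word English vocabulary (a, …, in, …, jane, …, september, …, zulu).]

The B = 3 most likely first words are kept (in this example: in, jane, september).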



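Beam search algorithm (B = 3)

Step 1 (recap): $P(y^{<1>} \mid x)$ over the 10000-word vocabulary (a, …, in, …, jane, …, september, …, zulu); the kept first words $\hat{y}^{<1>}$ are in, jane and september.

Step 2

[Figure: three copies of the network, one per kept first word. Each copy starts from $a^{<0>}$, is fed its $\hat{y}^{<1>}$, and scores every vocabulary word (a, aaron, …, september, …, zulu) as the second word $\hat{y}^{<2>}$; for "jane" the figure highlights is and visiting.]

$P(y^{<2>} \mid x, \text{"in"})$

$P(y^{<2>} \mid x, \text{"jane"})$

and likewise for "september".

$$P(y^{<1>}, y^{<2>} \mid x) = P(y^{<1>} \mid x)\, P(y^{<2>} \mid x, y^{<1>})$$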






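Beam search (B = 3)

After step 2 the three kept prefixes are:

• in september
• jane is
• jane visits

For each of these 3 outputs we have also stored its probability $P(y^{<1>}, y^{<2>} \mid x)$.

[Figure: three copies of the network, one per kept prefix, each starting from $a^{<0>}$ and predicting $\hat{y}^{<3>}$.]

The search continues one word at a time until <EOS>, for example: jane visits africa in september. <EOS>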
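The three steps can be sketched as a small generic routine. This is one common variant, not necessarily the exact procedure behind the slides: scores are kept as log-probabilities, hypotheses that end in `<EOS>` leave the beam, and no length normalization is applied. The `TOY` table of next-word probabilities and the helper `toy_model` are hypothetical stand-ins for the decoder softmax.

```python
import math


def beam_search(next_probs, beam_width, max_len, eos="<EOS>"):
    """Return the best (prefix, log_prob) found with the given beam width.

    next_probs(prefix) must return a dict {word: P(word | x, prefix)}.
    """
    beams = [((), 0.0)]   # (prefix, log-probability); log 1 = 0
    finished = []
    for _ in range(max_len):
        # Expand every kept prefix by every possible next word.
        candidates = []
        for prefix, score in beams:
            for word, p in next_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (word,), score + math.log(p)))
        # Keep only the beam_width most likely extended prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])


# Hypothetical next-word probabilities, keyed by the prefix generated so far.
TOY = {
    (): {"jane": 0.5, "in": 0.35, "september": 0.15},
    ("jane",): {"is": 0.6, "visits": 0.4},
    ("in",): {"september": 0.5, "africa": 0.5},
    ("september",): {"jane": 1.0},
    ("jane", "is"): {"visiting": 0.5, "going": 0.5},
    ("jane", "visits"): {"africa": 1.0},
}


def toy_model(prefix):
    # Any prefix not listed above is assumed to end the sentence.
    return TOY.get(prefix, {"<EOS>": 1.0})


best3, score3 = beam_search(toy_model, beam_width=3, max_len=5)
best1, score1 = beam_search(toy_model, beam_width=1, max_len=5)
print("B=3:", " ".join(best3), round(math.exp(score3), 3))
print("B=1:", " ".join(best1), round(math.exp(score1), 3))
```

With these toy numbers, B = 3 keeps "jane visits" alive at step 2 and ends with "jane visits africa <EOS>" (probability 0.2), while B = 1, which is greedy search, follows "jane is" and ends with probability 0.15.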



Refinements to beam search




Length normalization

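$$P(y^{<1>}, \dots, y^{<T_y>} \mid x) = P(y^{<1>} \mid x)\, P(y^{<2>} \mid x, y^{<1>}) \cdots P(y^{<T_y>} \mid x, y^{<1>}, \dots, y^{<T_y-1>})$$

Beam search objective:

$$\arg\max_{y} \prod_{t=1}^{T_y} P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})$$

Taking logs gives the same arg max and avoids multiplying many small numbers:

$$\arg\max_{y} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})$$

This score depends on the output length! Every extra term is at most 0, so longer outputs are penalized. Normalizing by length:

$$\arg\max_{y} \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})$$

$\alpha = 0.7$

$\alpha = 0$? (no normalization) $\alpha = 1$? (full normalization by $T_y$)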
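The effect of $\alpha$ can be checked on two hand-made hypotheses. The per-step probabilities below are assumptions chosen for illustration; `normalized_score` is a hypothetical helper that implements the normalized objective above.

```python
import math


def normalized_score(step_probs, alpha=0.7):
    """(1 / T_y**alpha) * sum over t of log P(y<t> | x, y<1>, ..., y<t-1>)."""
    total = sum(math.log(p) for p in step_probs)
    return total / (len(step_probs) ** alpha)


# Hypothetical per-step probabilities for two candidate outputs.
short = [0.4, 0.4]             # T_y = 2, each word moderately likely
long_ = [0.6, 0.6, 0.6, 0.6]   # T_y = 4, each word more likely

for alpha in (0.0, 0.7, 1.0):
    s = normalized_score(short, alpha)
    l = normalized_score(long_, alpha)
    winner = "short" if s > l else "long"
    print(f"alpha={alpha}: short={s:.3f} long={l:.3f} -> {winner}")
```

With $\alpha = 0$ the short output wins only because it has fewer terms; with $\alpha = 0.7$ or $\alpha = 1$ the longer output, whose individual words are more probable, scores higher.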



Beam search discussion


Beam width B?

Large B: better result, slower.
Small B: worse result, faster.

In a production setting you might see B = 10. B = 100 and B = 1000 are uncommon (sometimes used in research settings).

Unlike exact search algorithms such as BFS (breadth-first search) or DFS (depth-first search), beam search runs faster but is not guaranteed to find the exact maximum of $P(y \mid x)$, that is, $\arg\max_{y} P(y \mid x)$.
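A rough count shows why exact search is out of reach and how B trades cost for coverage. The sketch assumes the 10000-word vocabulary from the slides and a hypothetical output length of 10; it counts scored candidates only and ignores the cost of each RNN step.

```python
def exhaustive_candidates(vocab_size, length):
    """Number of distinct word sequences of the given length."""
    return vocab_size ** length


def beam_scored_candidates(vocab_size, length, beam_width):
    """Candidates scored by beam search: the first step scores vocab_size
    words, every later step scores beam_width * vocab_size extensions."""
    return vocab_size + (length - 1) * beam_width * vocab_size


V, T = 10_000, 10  # vocabulary size from the slides; T is an assumed length
print("exhaustive:", exhaustive_candidates(V, T))
for B in (1, 3, 10, 100, 1000):
    print(f"B={B}:", beam_scored_candidates(V, T, B))
```

The beam search count grows linearly in B, while the exhaustive count grows exponentially in the output length.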



Error analysis on beam search




Example

Jane visite l’Afrique en septembre.

H: Jane visits Africa in September. ($y^*$)

Algorithm: Jane visited Africa last September. ($\hat{y}$)

[Figure: the encoder-decoder RNN, starting from $a^{<0>}$, decoding "Jane visits Africa …". The system has two components: the RNN and beam search.]



Error analysis on beam search

Human: Jane visits Africa in September. ($y^*$)

Algorithm: Jane visited Africa last September. ($\hat{y}$)

Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$

Beam search chose $\hat{y}$, but $y^*$ attains higher $P(y \mid x)$.

Conclusion: beam search is at fault.

Case 2: $P(y^* \mid x) \le P(\hat{y} \mid x)$

$y^*$ is a better translation than $\hat{y}$, but the RNN predicted $P(y^* \mid x) \le P(\hat{y} \mid x)$.

Conclusion: the RNN model is at fault.



Error analysis process

Human | Algorithm | $P(y^* \mid x)$ | $P(\hat{y} \mid x)$ | At fault?
Jane visits Africa in September. | Jane visited Africa last September. | | |

This process figures out what fraction of errors are “due to” beam search vs. the RNN model.
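The attribution rule and the final tally can be sketched as follows. The probability pairs are hypothetical placeholders; in practice each pair would come from scoring the human translation $y^*$ and the algorithm's output $\hat{y}$ with the trained RNN on a development-set example.

```python
from collections import Counter


def at_fault(p_star, p_hat):
    """Case 1: P(y*|x) > P(y_hat|x)  -> beam search missed the better output.
    Case 2: P(y*|x) <= P(y_hat|x) -> the RNN prefers the worse output."""
    return "beam search" if p_star > p_hat else "RNN"


def fault_fractions(examples):
    """examples: list of (P(y*|x), P(y_hat|x)) pairs for mistranslated inputs."""
    counts = Counter(at_fault(p_star, p_hat) for p_star, p_hat in examples)
    n = len(examples)
    return {name: counts[name] / n for name in ("beam search", "RNN")}


# Hypothetical (P(y*|x), P(y_hat|x)) pairs for four erroneous translations.
examples = [(2e-10, 1e-10), (1e-12, 3e-12), (5e-9, 5e-9), (4e-11, 9e-11)]
fractions = fault_fractions(examples)
print(fractions)
```

If most errors are attributed to beam search, increasing B is the natural next step; if most are attributed to the RNN, effort is better spent on the model (more data, regularization, a different architecture).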



References

• https://www.coursera.org/specializations/deep-learning

• https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d

