Lecture 12, SRTTU – A. Akhavan, Saturday, 10 Azar 1397
Lecture 12: Sequence to sequence models
Alireza Akhavan Pour
CLASS.VISION
Sequence to sequence model: Introduction and concepts
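Sequence to sequence model
Input (French): Jane visite l’Afrique en septembre
$x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, x^{\langle 3 \rangle}, x^{\langle 4 \rangle}, x^{\langle 5 \rangle}$
Output (English): Jane is visiting Africa in September.
$y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, y^{\langle 3 \rangle}, y^{\langle 4 \rangle}, y^{\langle 5 \rangle}, y^{\langle 6 \rangle}$
Encoder: an RNN that starts from $a^{\langle 0 \rangle}$ and reads $x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle}$.
Decoder: an RNN that continues from the encoder's final state and outputs the translation one word at a time.
[Cho et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation]
[Sutskever et al., 2014. Sequence to sequence learning with neural networks]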
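The encoder and decoder roles can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not on the slide: the layer sizes are arbitrary toy values, the weights are random and untrained, biases are omitted, and all function names (`encode`, `decoder_step`, and the helpers) are hypothetical.

```python
import math
import random

rng = random.Random(0)
VOCAB_IN, VOCAB_OUT, HIDDEN = 6, 7, 8  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    """Small random weights; the network is untrained."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_ax_enc, W_aa_enc = rand_matrix(HIDDEN, VOCAB_IN), rand_matrix(HIDDEN, HIDDEN)
W_ax_dec, W_aa_dec = rand_matrix(HIDDEN, VOCAB_OUT), rand_matrix(HIDDEN, HIDDEN)
W_ya = rand_matrix(VOCAB_OUT, HIDDEN)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def rnn_step(W_aa, W_ax, a_prev, x):
    """a<t> = tanh(W_aa a<t-1> + W_ax x<t>), bias omitted for brevity."""
    return [math.tanh(p + q)
            for p, q in zip(matvec(W_aa, a_prev), matvec(W_ax, x))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def encode(x_ids):
    """Run the encoder over the input token ids, starting from a<0> = 0."""
    a = [0.0] * HIDDEN
    for i in x_ids:
        a = rnn_step(W_aa_enc, W_ax_enc, a, one_hot(i, VOCAB_IN))
    return a

def decoder_step(a_prev, y_prev_id):
    """Return (new state, distribution over the next output token)."""
    if y_prev_id is None:               # first step: no previous word yet
        x = [0.0] * VOCAB_OUT
    else:
        x = one_hot(y_prev_id, VOCAB_OUT)
    a = rnn_step(W_aa_dec, W_ax_dec, a_prev, x)
    return a, softmax(matvec(W_ya, a))

enc_state = encode([1, 2, 3, 4, 5])                # five input tokens x<1..5>
dec_state, probs = decoder_step(enc_state, None)   # distribution P(y<1> | x)
```

The decoder's initial state is simply the encoder's final state, which is the only link between the two networks in this sketch.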
Image captioning
[Figure: a CNN encodes the input image $x$, and an RNN decoder generates the caption $\hat{y}^{\langle 1 \rangle}, \hat{y}^{\langle 2 \rangle}, \dots, \hat{y}^{\langle T_y \rangle}$.]
CNN layers shown in the figure: 11 × 11 conv, s = 4 → 55×55×96; 3 × 3 MAX-POOL, s = 2 → 27×27×96; 5 × 5 conv, same → 27×27×256; 3 × 3 MAX-POOL, s = 2 → 13×13×256; 3 × 3 conv, same → 13×13×384; 3 × 3 conv → 13×13×384; 3 × 3 conv → 13×13×256; 3 × 3 MAX-POOL, s = 2 → 6×6×256 = 9216 → 4096 → 4096 → Softmax, 1000
Example caption: A cat sitting on a chair ($y^{\langle 1 \rangle}, \dots, y^{\langle 6 \rangle}$)
[Mao et al., 2014. Deep captioning with multimodal recurrent neural networks]
[Vinyals et al., 2014. Show and tell: Neural image caption generator]
[Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions]
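Machine translation as building a conditional language model
Language model: starting from $a^{\langle 0 \rangle}$ and $x^{\langle 1 \rangle}$, the RNN outputs $\hat{y}^{\langle 1 \rangle}, \hat{y}^{\langle 2 \rangle}, \dots, \hat{y}^{\langle T_y \rangle}$, with each output fed back as the next input ($x^{\langle 2 \rangle} = \hat{y}^{\langle 1 \rangle}$). It models $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle})$.
Machine translation: the encoder reads $x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle}$ and the decoder outputs $\hat{y}^{\langle 1 \rangle}, \dots, \hat{y}^{\langle T_y \rangle}$. The decoder is the language model above, except that its $a^{\langle 0 \rangle}$ vector is the state produced by the encoder.
Conditional language model: $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle})$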
Finding the most likely translation
French: Jane visite l’Afrique en septembre.
The model defines $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$ over English sentences, for example:
• Jane is visiting Africa in September.
• Jane is going to be visiting Africa in September.
• In September, Jane will visit Africa.
• Her African friend welcomed Jane in September.
Goal: $\arg\max_{y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle}} P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$
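Why not a greedy search?
Goal: $\arg\max_{y} P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$
Greedy search picks the single most likely word at each step instead of maximizing the probability of the whole sentence.
• Jane is visiting Africa in September.
• Jane is going to be visiting Africa in September.
Because $P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$, greedy search commits to "going" as the third word and ends up with the second, wordier translation.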
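A tiny worked example can make the failure mode concrete. The sketch below assumes a made-up two-step model whose probabilities are invented for illustration (they do not come from the lecture), and the names `COND`, `joint_prob`, `greedy` and `exhaustive` are hypothetical.

```python
# Hypothetical toy model: P(next word | x, prefix). The numbers are made up.
COND = {
    (): {"going": 0.6, "visiting": 0.4},
    ("going",): {"to": 0.55, "away": 0.45},
    ("visiting",): {"africa": 0.9, "home": 0.1},
}

def joint_prob(seq):
    """Chain rule: P(y<1>, y<2> | x) = P(y<1> | x) * P(y<2> | x, y<1>)."""
    p = 1.0
    for t, word in enumerate(seq):
        p *= COND[tuple(seq[:t])][word]
    return p

def greedy(length=2):
    """Pick the locally most likely word at every step."""
    seq = []
    for _ in range(length):
        dist = COND[tuple(seq)]
        seq.append(max(dist, key=dist.get))
    return tuple(seq)

def exhaustive():
    """Score every complete sequence; only feasible for tiny vocabularies."""
    candidates = [(w1, w2) for w1 in COND[()] for w2 in COND[(w1,)]]
    return max(candidates, key=joint_prob)

greedy_seq = greedy()        # starts with "going" because 0.6 > 0.4
best_seq = exhaustive()      # the sequence with the highest joint probability
```

In this toy model greedy search commits to "going" and reaches a joint probability of 0.6 × 0.55, while the sequence starting with the less likely first word reaches 0.4 × 0.9, which is higher.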
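Beam search algorithm
B = 3 (beam width)
Step 1: the encoder reads the French input $x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle}$, and the first decoder step outputs $P(y^{\langle 1 \rangle} \mid x)$ over the 10000-word English vocabulary (a, …, in, …, jane, …, september, …, zulu). Keep the B = 3 most likely first words, here "in", "jane" and "september".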
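Beam search algorithm, Step 2 (B = 3)
For each of the three first words kept in Step 1 (in, jane, september), a copy of the network is run with $\hat{y}^{\langle 1 \rangle}$ fixed to that word to get the distribution over the second word $\hat{y}^{\langle 2 \rangle}$:
• $P(y^{\langle 2 \rangle} \mid x, \text{"in"})$ over a, aaron, …, september, …, zulu
• $P(y^{\langle 2 \rangle} \mid x, \text{"jane"})$ over a, …, is, …, visiting, …, zulu
• $P(y^{\langle 2 \rangle} \mid x, \text{"september"})$ over a, …, zulu
Each pair of words is scored with
$P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle} \mid x) = P(y^{\langle 1 \rangle} \mid x)\, P(y^{\langle 2 \rangle} \mid x, y^{\langle 1 \rangle})$
and only the B = 3 most likely of the 3 × 10000 = 30000 pairs are kept.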
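Beam search (B = 3)
After Step 2 the three kept hypotheses are "in september", "jane is" and "jane visits". For each of these 3 outputs, the probability $P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle} \mid x)$ has also been stored.
Step 3: a copy of the network is run for each hypothesis to get $\hat{y}^{\langle 3 \rangle}$, and again only the 3 most likely three-word hypotheses are kept. This continues one word at a time until <EOS>, for example:
jane visits africa in september. <EOS>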
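The three steps can be condensed into a short beam search routine. This is a minimal sketch, not the lecture's implementation: `next_log_probs` stands in for the decoder and is a hypothetical caller-supplied function, the toy probability table is invented, and hypotheses are ranked by their raw (un-normalized) sum of log-probabilities.

```python
import math

def beam_search(next_log_probs, beam_width=3, max_len=10, eos="<EOS>"):
    """Return (tokens, score) for the best hypothesis found.

    next_log_probs(prefix) is a hypothetical stand-in for the decoder: it
    returns {token: log P(token | x, prefix)} for a tuple of tokens.
    """
    beams = [((), 0.0)]   # (prefix, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        # Expand every kept hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the B most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_next_log_probs(prefix):
    """Made-up conditional distributions; any two-word prefix ends the sentence."""
    table = {
        (): {"jane": 0.5, "in": 0.3, "september": 0.2},
        ("jane",): {"visits": 0.6, "is": 0.35, "a": 0.05},
        ("in",): {"september": 0.9, "a": 0.1},
        ("september",): {"jane": 0.5, "a": 0.5},
    }
    dist = table.get(prefix, {"<EOS>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

best_tokens, best_score = beam_search(toy_next_log_probs, beam_width=3)
```

With this toy table the hypotheses kept after the second step are "jane visits", "in september" and "jane is", mirroring the example on the slide. Storing the running score next to each prefix is what makes the next expansion cheap.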
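Length normalization
$P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x) = P(y^{\langle 1 \rangle} \mid x)\, P(y^{\langle 2 \rangle} \mid x, y^{\langle 1 \rangle}) \cdots P(y^{\langle T_y \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle T_y - 1 \rangle})$
Beam search objective:
$\arg\max_{y} \prod_{t=1}^{T_y} P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$
Taking logs gives the same maximizer:
$\arg\max_{y} \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$
This score depends on the output length! Every extra term is at most 0, so longer outputs are penalized.
Normalized by length:
$\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$
$\alpha = 0.7$
$\alpha = 0$: no normalization. $\alpha = 1$: full normalization by $T_y$.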
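The effect of $\alpha$ is easy to check numerically. The sketch below assumes two hypothetical hypotheses with made-up per-token probabilities, and `normalized_score` is an illustrative name rather than anything from the lecture.

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """(1 / Ty**alpha) * sum over t of log P(y<t> | x, y<1>, ..., y<t-1>)."""
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (len(token_probs) ** alpha)

# Made-up per-token probabilities for two competing hypotheses.
short_hyp = [0.5, 0.5]                   # Ty = 2
long_hyp = [0.7, 0.7, 0.7, 0.7, 0.7]     # Ty = 5

raw_prefers_short = normalized_score(short_hyp, 0.0) > normalized_score(long_hyp, 0.0)
norm_prefers_long = normalized_score(long_hyp, 0.7) > normalized_score(short_hyp, 0.7)
```

With $\alpha = 0$ the short hypothesis wins even though each of its words is less likely. With $\alpha = 0.7$ or $\alpha = 1$ the longer hypothesis is ranked higher.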
Beam search discussion
Beam width B?
• Large B: better result, slower.
• Small B: worse result, faster.
In a production setting you might see B = 10. B = 100 and B = 1000 are uncommon (sometimes used in research settings).
Unlike exact search algorithms such as BFS (Breadth First Search) or DFS (Depth First Search), beam search runs faster but is not guaranteed to find the exact maximizer $\arg\max_{y} P(y \mid x)$.
Error analysis on beam search
Example
Jane visite l’Afrique en septembre.
Human: Jane visits Africa in September. ($y^*$)
Algorithm: Jane visited Africa last September. ($\hat{y}$)
The system has two components, the RNN (encoder and decoder) and beam search. To find out which one is at fault, use the RNN to compute $P(y^* \mid x)$ and $P(\hat{y} \mid x)$, for example by feeding "Jane visits Africa …" through the decoder.
Error analysis on beam search
Human: Jane visits Africa in September. ($y^*$)
Algorithm: Jane visited Africa last September. ($\hat{y}$)
Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$
Beam search chose $\hat{y}$, but $y^*$ attains higher $P(y \mid x)$.
Conclusion: beam search is at fault.
Case 2: $P(y^* \mid x) \le P(\hat{y} \mid x)$
$y^*$ is a better translation than $\hat{y}$, but the RNN predicted $P(y^* \mid x) \le P(\hat{y} \mid x)$.
Conclusion: the RNN model is at fault.
Error analysis process
Human | Algorithm | $P(y^* \mid x)$ | $P(\hat{y} \mid x)$ | At fault?
Jane visits Africa in September. | Jane visited Africa last September. | | |
Figure out what fraction of errors are "due to" beam search vs. the RNN model.
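The attribution rule and the tally over a development set can be sketched as follows. The probabilities are made up for illustration, the function names are hypothetical, and the list of errors is assumed to be non-empty.

```python
from collections import Counter

def at_fault(p_star, p_hat):
    """Case 1: P(y*|x) > P(y_hat|x) -> beam search. Otherwise -> RNN model."""
    return "beam search" if p_star > p_hat else "RNN model"

def fault_fractions(examples):
    """examples: (P(y*|x), P(y_hat|x)) pairs, one per mistranslated sentence."""
    counts = Counter(at_fault(p_star, p_hat) for p_star, p_hat in examples)
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("beam search", "RNN model")}

# Made-up probabilities for four mistranslated dev-set sentences.
dev_errors = [(2e-10, 1e-10), (1e-10, 3e-10), (5e-12, 9e-12), (4e-11, 4e-11)]
fractions = fault_fractions(dev_errors)
```

Here one error out of four is attributed to beam search and three to the RNN model, which would suggest working on the model before increasing the beam width.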
References
• https://www.coursera.org/specializations/deep-learning
• https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d