Neural Machine Translation: A Machine Learning Perspective
Tie-Yan Liu
Principal Researcher, Microsoft Research
IEEE Fellow, ACM Distinguished Member
Neural Machine Translation
• Encoder: maps the input word sequence to an intermediate context
• Decoder: maps the intermediate context to a distribution over output word sequences
• Various choices for implementing the encoder or decoder: FNN, CNN, or RNN (see the sketch below)
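To make the interface concrete, here is a minimal PyTorch-style sketch of the shared encoder-decoder contract. This is my illustration rather than the speaker's code; the class layout, layer sizes, and the choice of GRUs are assumptions.

```python
# Minimal encoder-decoder sketch (illustrative; any of FNN/CNN/RNN
# could implement either side of this contract).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a source token sequence to an intermediate context."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # one possible choice

    def forward(self, src):               # src: (batch, T_x) token ids
        h, _ = self.rnn(self.emb(src))    # h: (batch, T_x, dim)
        return h                          # the "intermediate context"

class Decoder(nn.Module):
    """Maps the context to a distribution over output words, step by step."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, prev_tokens, context):
        # Initialize the decoder state from the mean-pooled context.
        state = context.mean(dim=1).unsqueeze(0)        # (1, batch, dim)
        x, _ = self.rnn(self.emb(prev_tokens), state)
        return self.out(x).log_softmax(-1)              # log P(y_t | y_<t, x)

enc, dec = Encoder(1000), Decoder(1000)
src = torch.randint(0, 1000, (2, 5))                    # batch of 2 sentences
log_probs = dec(torch.randint(0, 1000, (2, 3)), enc(src))
```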
Neural Machine Translation
• Example: RNN-based implementation
• Attention mechanism: use a personalized context vector $c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$, where $\alpha_{tj}$ is the importance of $x_j$ to $y_t$
(Bahdanau et al., ICLR 2015)
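A tiny NumPy sketch of this computation (my illustration; the dot-product scoring below is an assumption, as Bahdanau et al. actually learn a small alignment network for the scores):

```python
import numpy as np

def attention_context(h, s_t):
    """Compute c_t = sum_j alpha_tj * h_j with softmax attention weights.

    h:   (T_x, d) encoder hidden states h_1..h_{T_x}
    s_t: (d,)     current decoder state for output position t
    """
    scores = h @ s_t                        # relevance of each x_j to y_t
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()                    # alpha_tj: importance weights
    return alpha @ h                        # personalized context vector c_t

h = np.random.randn(5, 8)                   # T_x = 5 source positions
c_t = attention_context(h, np.random.randn(8))
```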
Fast Development of NMT – GNMT
• RNN as encoder/decoder
  • Stacked LSTM-RNN (8 layers for the encoder and the decoder, respectively)
  • Each layer is trained on a separate GPU for speed-up
• Standard attention model
• Residual connections for better gradient flow
• Significant improvement over shallow models
  • 39.92 vs. 31.3 (Bahdanau et al., ICLR 2015) on En→Fr
Fast Development of NMT – ConvS2S
• CNN as encoder/decoder
  • Convolutional block structure
  • Gated linear units + residual connections
  • 15 layers for the encoder and the decoder, respectively
• Multi-step attention
  • A separate attention mechanism for each decoding layer
• Comparable to (slightly better than) RNN-based NMT models
  • 40.46 vs. 39.92 (GNMT) on En→Fr
Fast Development of NMT – Transformer
• FNN as encoder/decoder
  • 6 layers (each with two sub-layers) for the encoder and the decoder, respectively
• Relies entirely on attention (including multi-head self-attention) to draw global dependencies between input and output
• Comparable to (slightly better than) RNN-based and CNN-based NMT models
  • 41.0 vs. 40.46 (ConvS2S) vs. 39.92 (GNMT) on En→Fr
Fast Development of NMT – Summary

| Algorithm | Framework | Model | #layers (encoder-decoder) | En→Fr (36M pairs) BLEU | En→Fr training cost | En→De (4.5M pairs) BLEU | En→De training cost |
|---|---|---|---|---|---|---|---|
| Bahdanau et al., ICLR 2015 | Theano (open source) | GRU-RNN | 1-1 | 31.3 | - | - | - |
| GNMT | TensorFlow (no code) | LSTM-RNN | 8-8 | 39.92 | 96 K80, 6 days | 24.6 | - |
| Transformer | TensorFlow (open source) | FNN + attention | 12-12 | 41.0 | 8 P100, 4.5 days | 28.4 | 8 P100, 3.5 days |
| ConvS2S | Torch (open source) | CNN | 15-15 | 40.46 | 8 M40, 37 days | 25.16 | 1 M40, 18.5 days |
What’s Done?
• These works verified the strong representation power of deep neural networks:
  • Whether FNN, CNN, or RNN, all can be used to fit bilingual training data and achieve good translation performance when sufficiently large training data are given.
• However, this is not surprising at all:
  • It was already indicated by the universal approximation theorem, decades ago.
What’s Missing?
• Many unique challenges of machine translation have not been addressed:
  • Reliance on a huge amount of bilingual training data
  • Reliance on myopic beam search during inference
  • Use of likelihood maximization for both training and inference, which differs from the true evaluation measure (BLEU)
  • …
Leveraging Reinforcement Learning to Tackle These Challenges
• Dual learning
  • Leverages the symmetric structure of machine translation to enable effective learning from monolingual data through reinforcement learning
• Predictive inference
  • Uses end-to-end BLEU as a delayed reward to train value networks
  • Uses value networks to guide forward-looking search along the decoding tree
Dual Learning for NMT (NIPS 2016, IJCAI 2017, ICML 2017)
Traditional Solutions to Insufficient Training Data
• Label propagation
• Transductive learning
• Multi-task learning
• Transfer learning
A New View: The Beauty of Symmetry
• Symmetry is almost everywhere in our world, and also in machine translation!
Hello! 你好!
Dual Learning
• A new learning framework that leverages the symmetric (primal-dual) structure of AI tasks to obtain effective feedback or regularization signals to enhance the learning process, especially when labeled training data are lacking.
Dual Learning for Machine Translation
The dual-learning loop: an English sentence $x$ is translated by the primal task $f: x \to y$ (En→Ch translation) into a Chinese sentence $y = f(x)$, and the dual task $g: y \to x$ (Ch→En translation) translates it back into a new English sentence $x' = g(y)$. In the reinforcement-learning view, each translation model is an agent, and the rest of the loop is its environment.

Feedback signals during the loop:
• $R(x, x'; f, g)$: BLEU of $x'$ given $x$
• $L(y; f)$, $L(x'; g)$: likelihood and syntactic correctness of $y$ and $x'$
• $R(x, y; f)$, $R(y, x'; g)$: dictionary-based translation correspondence, etc.

Policy gradient is used to improve both the primal and dual models according to these feedback signals (see the sketch below).

(NIPS 2016)
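A self-contained toy sketch of one round of this loop. It is entirely my illustration: the stub "models", reward shapes, and the mixing weight alpha are assumptions; the NIPS 2016 algorithm uses real NMT models, a language-model reward for the intermediate translation, and reconstruction likelihood.

```python
import numpy as np
rng = np.random.default_rng(0)

# Stand-ins for the real components (NMT models, language model, BLEU):
def sample_translation(theta, sentence):
    """Stub: a real system would sample a translation from an NMT model."""
    return sentence[::-1]                      # toy "translation": reversal

def lm_fluency(sentence):
    """Stub language-model reward for the intermediate translation."""
    return rng.uniform()

def reconstruction_reward(x_rec, x):
    """Stub for the closed-loop reward, e.g. BLEU(x', x)."""
    return float(x_rec == x)

def grad_log_prob(theta, output, given):
    """Stub for the policy gradient of log P(output | given; theta)."""
    return rng.normal(size=theta.shape)

def dual_learning_round(x_en, theta_f, theta_g, alpha=0.5, lr=0.01):
    y_ch = sample_translation(theta_f, x_en)   # primal: En -> Ch
    x_rec = sample_translation(theta_g, y_ch)  # dual:   Ch -> En
    reward = (alpha * lm_fluency(y_ch)
              + (1 - alpha) * reconstruction_reward(x_rec, x_en))
    # REINFORCE: push both agents toward actions that earned a high reward.
    theta_f = theta_f + lr * reward * grad_log_prob(theta_f, y_ch, x_en)
    theta_g = theta_g + lr * reward * grad_log_prob(theta_g, x_rec, y_ch)
    return theta_f, theta_g

theta_f, theta_g = rng.normal(size=8), rng.normal(size=8)
theta_f, theta_g = dual_learning_round("hello !", theta_f, theta_g)
```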
Experimental Setting
• Baseline:
  • State-of-the-art NMT model, trained using 100% of the bilingual data
  • "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., ICLR 2015, from Bengio's group)
• Our algorithm:
  • Step 1 (initialization): start from a weak NMT model learned from only 10% of the training data
  • Step 2 (dual learning): use the policy gradient algorithm to update the dual models based on monolingual data
[Chart: BLEU score, French→English, for three settings: NMT with 10% bilingual data, dual learning with 10% bilingual data, and NMT with 100% bilingual data. Relative to the 100% baseline, NMT trained on 10% of the data loses about 5.0 BLEU, while dual learning with the same 10% gains about 0.3 BLEU.]

Starting from initial models obtained from only 10% of the bilingual data, dual learning achieves accuracy similar to the NMT model learned from 100% of the bilingual data!
Probabilistic Nature
• The primal-dual structure implies strong probabilistic connections between the two tasks.
• This can also be used to improve supervised learning, and perhaps even inference:
  • Structural regularizer to enhance supervised learning
  • Additional criterion to improve inference
$P(x, y) = P(x)P(y|x; f)$ (primal view) $= P(y)P(x|y; g)$ (dual view)
“Dual” Supervised Learning
The loop: labeled data $x$ → predicted label $y = f(x)$ (primal task $f: x \to y$) → reconstructed data $x' = g(y)$ (dual task $g: y \to x$).

Feedback signal during the loop:
• $R(x; f, g) = |P(x)P(y|x; f) - P(y)P(x|y; g)|$: the gap between the joint probability $P(x, y)$ computed in the two directions

Training objective: maximize $\log P(y|x; f)$ and $\log P(x|y; g)$, while minimizing the duality gap $|P(x)P(y|x; f) - P(y)P(x|y; g)|$ (see the sketch below).
(ICML 2017)
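A hedged sketch of turning the duality constraint into a training regularizer. Working in log space and the penalty weight lambda_dual are my assumptions; the ICML 2017 paper relaxes the constraint into a similar penalty term.

```python
import torch

def dual_supervised_loss(log_p_y_given_x,   # log P(y|x; f), primal model
                         log_p_x_given_y,   # log P(x|y; g), dual model
                         log_p_x, log_p_y,  # marginals from language models
                         lambda_dual=0.01):
    """Both directions' negative log-likelihood plus a duality penalty.
    Driving the squared log-space gap to zero enforces
    P(x) P(y|x; f) = P(y) P(x|y; g)."""
    nll = -(log_p_y_given_x + log_p_x_given_y)
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return nll + lambda_dual * gap ** 2

# Toy usage with scalar sentence-level log-probabilities:
loss = dual_supervised_loss(torch.tensor(-2.3), torch.tensor(-2.5),
                            torch.tensor(-10.0), torch.tensor(-9.5))
```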
Experimental Results
Theoretical analysis:
• Dual supervised learning generalizes better than standard supervised learning: the hypothesis space shrinks to the product space of model pairs satisfying probabilistic duality, $P(x)P(y|x; f) = P(y)P(x|y; g)$.

BLEU gains of dual supervised learning over standard NMT:

| | En→Fr | Fr→En | En→De | De→En |
|---|---|---|---|---|
| Gain over NMT | ↑2.1 | ↑0.9 | ↑1.4 | ↑0.1 |
“Dual” Inference
The loop: test data $x$ → predicted label $y = f(x)$ (primal task $f: x \to y$) → reconstructed data $x' = g(y)$ (dual task $g: y \to x$).

By Bayes' rule, $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$.

• Standard inference: choose the $y$ that maximizes $P(y|x; f)$
• Dual inference: choose the $y$ that maximizes both $P(y|x; f)$ and $\frac{P(y)P(x|y; g)}{P(x)}$, i.e., leverage both the primal model and the dual model at test time (see the sketch below)
(IJCAI 2017)
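A minimal sketch of dual inference as re-ranking of beam candidates; the interpolation weight alpha and the candidate-dictionary interface are my assumptions.

```python
def choose_by_dual_inference(candidates, alpha=0.5):
    """Re-rank candidate translations y of a source x by combining the
    primal score log P(y|x; f) with the dual (Bayes) score
    log P(y) + log P(x|y; g) - log P(x). Since log P(x) is constant in y,
    it can be dropped from the argmax."""
    def score(c):
        primal = c["log_p_y_given_x"]                # from the primal model f
        dual = c["log_p_y"] + c["log_p_x_given_y"]   # from an LM and the dual g
        return alpha * primal + (1 - alpha) * dual
    return max(candidates, key=score)

best = choose_by_dual_inference([
    {"log_p_y_given_x": -2.1, "log_p_x_given_y": -2.4, "log_p_y": -8.0},
    {"log_p_y_given_x": -1.9, "log_p_x_given_y": -3.8, "log_p_y": -9.5},
])
```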
Experimental Results
Theoretical analysis:
• Dual inference has a generalization guarantee, even though training and inference become slightly inconsistent: the generalization bound for dual inference is comparable to that of standard inference.

BLEU gains of dual inference over standard NMT:

| | En→Fr | Fr→En | En→De | De→En |
|---|---|---|---|---|
| Gain over NMT | ↑0.5 | ↑0.4 | ↑1.2 | ↑0.5 |
Inference with Predicted Reward
Standard Inference Process in NMT
• Beam search + likelihood maximization
[Diagram: an RNN encoder-decoder (embedding + LSTM/GRU layers) translating "I love China"; at each decoding step the beam keeps several Chinese candidates, such as 我 (I), 你 (you), 他 (he), 爱 (love), 喜欢 (like), and 中国 (China).]
At each step, select the top-$k$ words with the largest translation probability (see the beam-search sketch below).
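A self-contained sketch of this procedure over a toy next-word distribution; the model interface (a function returning per-word log-probabilities) is an assumption.

```python
import math

def beam_search(step_log_probs, k=3, max_len=10, eos=0):
    """Keep the top-k partial translations by accumulated log-likelihood.
    step_log_probs(prefix) -> list of (word_id, log_prob) for the next word."""
    beams = [([], 0.0)]                        # (prefix, total log-prob)
    for _ in range(max_len):
        expanded = []
        for prefix, lp in beams:
            if prefix and prefix[-1] == eos:   # finished hypothesis: keep as-is
                expanded.append((prefix, lp))
                continue
            for w, wlp in step_log_probs(prefix):
                expanded.append((prefix + [w], lp + wlp))
        # At each step, keep the top-k candidates with the largest probability.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams

# Toy next-word distribution: word 1 is likely, word 0 acts as <eos>.
toy_model = lambda prefix: [(0, math.log(0.2)),
                            (1, math.log(0.5)),
                            (2, math.log(0.3))]
print(beam_search(toy_model, k=2))
```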
Inference Errors
• Likelihood ≠ BLEU!
• Myopic local search ≠ global optimum
We need a method to predict long-term rewards (e.g., BLEU) during inference.
Inspired by AlphaGo
[Diagram: as in AlphaGo, the system pairs a policy network with a value network. The NMT model (encoder states $h_1 \dots h_{T_x}$, attention, contexts $c_1, c_2, \dots$) plays the role of the policy network and outputs word probabilities; a value network with the same encoder-attention structure also reads the decoded prefix $y_1, y_2, \dots$ and predicts the future BLEU.]
Value Networks to Predict Long-term Reward
• Value function in NMT
  • The value function $v_\pi(x, y_{<t})$ estimates the (delayed) BLEU score of the final translation for source sentence $x$, if we continue decoding from the partially decoded sentence $y_{<t}$ according to NMT model $\pi$.
• Information for estimating the long-term BLEU score $v_\pi(x, y_{<t})$:
  • Semantic correlation between the source $x$ and the partially decoded target $y_{<t}$
  • Effectiveness/coverage of the attention mechanism between encoder and decoder
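A hedged sketch of how such a value estimate can steer beam search, scoring each candidate by a blend of log-likelihood and predicted future BLEU; the mixing weight and function names are my assumptions.

```python
def value_guided_score(log_prob, predicted_bleu, beta=0.7):
    """Combine likelihood-so-far with the value network's long-term estimate.
    log_prob: accumulated log P(y_<t | x); predicted_bleu: v_pi(x, y_<t)."""
    return beta * log_prob + (1 - beta) * predicted_bleu

# During beam search, rank candidates by the blended score instead of
# pure likelihood:
candidates = [(-1.2, 0.35), (-0.9, 0.20)]   # (log_prob, v_pi) per candidate
best = max(candidates, key=lambda c: value_guided_score(*c))
```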
Design of Value Networks
[Diagram: the value network builds two modules on top of the encoder-decoder, a Semantic Matching (SM) module producing $u_{sm}$ and a Context-Coverage (CC) module producing $u_{cc}$, whose outputs are combined to predict the value.]

• Semantic Matching (SM) module: computes the semantic correlation between the source and target sentences, based on mean pooling of the encoder and decoder hidden states.
• Context-Coverage (CC) module: computes the effectiveness/coverage of the attention mechanism, based on mean pooling of the encoder hidden states and the context vectors. (See the sketch below.)
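A NumPy sketch of the two pooled feature vectors as described; the concatenation layout and the final scoring MLP are my assumptions.

```python
import numpy as np

def value_network_features(enc_h, dec_h, contexts):
    """SM module: mean-pooled encoder states + mean-pooled decoder states.
    CC module: mean-pooled encoder states + mean-pooled attention contexts.

    enc_h:    (T_x, d) encoder hidden states h_1..h_{T_x}
    dec_h:    (t, d)   decoder hidden states for the partial translation y_<t
    contexts: (t, d)   attention context vectors c_1..c_t
    """
    u_sm = np.concatenate([enc_h.mean(0), dec_h.mean(0)])     # semantic matching
    u_cc = np.concatenate([enc_h.mean(0), contexts.mean(0)])  # context coverage
    return np.concatenate([u_sm, u_cc])  # fed to a small MLP that outputs v

feats = value_network_features(np.random.randn(6, 4),
                               np.random.randn(3, 4),
                               np.random.randn(3, 4))
```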
Training of Value Networks
• Training data
  • Generated by a Monte-Carlo method (steps below):
1. Randomly pick a source sentence $x$ from the original training dataset.
2. Generate a partial translation $y_{<t}$ for $x$ using $\pi$, with randomly selected $t$.
3. Generate $K$ complete translations for each $y_{<t}$ using $\pi$.
4. Compute the BLEU score of each translation against the ground-truth sentence $y$.
5. Use the averaged BLEU score as the labeled value for $v_\pi(x, y_{<t})$.
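A hedged sketch of this labeling loop; the sampling and BLEU helpers are stubs I made up so the example runs, while a real implementation would roll out the actual NMT model and a proper sentence-level BLEU.

```python
import random
random.seed(0)

def make_value_training_example(x, y_ref, policy_sample, sentence_bleu, K=20):
    """One (x, y_<t) -> averaged-BLEU training pair for the value network.
    policy_sample(x, prefix, complete) draws a (continuation of a)
    translation from the NMT model pi; sentence_bleu scores a hypothesis."""
    t = random.randint(1, 8)                        # randomly selected t
    y_prefix = policy_sample(x, prefix=None, complete=False)[:t]
    rollouts = [policy_sample(x, prefix=y_prefix, complete=True)
                for _ in range(K)]                  # K complete translations
    label = sum(sentence_bleu(y, y_ref) for y in rollouts) / K
    return (x, y_prefix), label

# Toy stand-ins so the sketch runs end to end:
toy_sample = lambda x, prefix, complete: (list(prefix or []) + ["w"] * 10)[:10]
toy_bleu = lambda hyp, ref: len(set(hyp) & set(ref)) / max(len(ref), 1)
example, label = make_value_training_example("I love China",
                                             ["我", "爱", "中国"],
                                             toy_sample, toy_bleu)
```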
Training of Value Networks
• Pairwise ranking loss minimization
  • For two partial sentences $y_{p,1}$ and $y_{p,2}$ of the same $x$, where $y_{p,1}$ has a larger BLEU score than $y_{p,2}$, we define a loss that penalizes the value network whenever it ranks $y_{p,2}$ above $y_{p,1}$ (see the sketch below).
• Why not directly optimize the BLEU score?
  • BLEU is computed from $n$-gram precision ($n = 1, 2, 3, 4$): regressing exact BLEU scores is more sensitive than pairwise classification, which only needs to differentiate good candidates from bad ones.
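The slide defers the exact formula; below is a plausible sketch using an exponential pairwise loss, which is an assumption on my part (a margin or logistic loss would serve the same purpose).

```python
import torch

def pairwise_ranking_loss(v_good, v_bad):
    """Penalize the value network when it scores the lower-BLEU partial
    sentence y_{p,2} above the higher-BLEU one y_{p,1}.
    v_good = v_pi(x, y_{p,1}), v_bad = v_pi(x, y_{p,2}).
    The exponential form is one common choice, not quoted from the paper."""
    return torch.exp(v_bad - v_good)

loss = pairwise_ranking_loss(torch.tensor(0.7), torch.tensor(0.4))
```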
Experimental Results
• Baselines:
  • BS: standard beam search
  • BSO: beam search guided by the predicted instant (rather than delayed) BLEU score of the partial decoding result
• Observations:
  • Our proposed approach is consistently better than the baselines
More Challenges to NMT
• Black magic in algorithm tuning
• High computational load
• Latent semantics in texts
• Greedy one-pass decoding
• Decreased diversity in the tech roadmap
• Insufficient emphasis on teaching
• Little attention to the beauty of translation
Black Magic in Algorithm Tuning
• Hyper-parameter tuning
  • Structure: #layers, #nodes, activation types, skip connections, attention, …
  • Learning rate, momentum, initialization, dropout, batch normalization, …
• Unreliable results
  • Many published results cannot be readily reproduced
  • Good empirical performance might result from "overfitting" the test data (especially considering that test sets for NMT are too small to be statistically robust)
High Computational Load
• Current NMT models are clumsy and require long training times and huge computational power
  • GNMT: 96 GPUs for one week (WMT En→Fr)
  • ConvS2S: 8 GPUs for 5+ weeks (WMT En→Fr)
• Lightweight NMT is desirable
  • Train an NMT model in one hour!
Latent Semantics in Texts
• The whole ≠ the sum of its parts
  • 杀鸡取卵 ("kill the hen to get the eggs") ≠ "kill" + "hen" + "get" + "eggs"
  • Other examples: 春风化雨 ("spring wind becoming rain", i.e., gentle, nurturing influence), 登堂入室 ("entering the hall, then the inner chamber", i.e., attaining mastery), 饮鸩止渴 ("drinking poison to quench thirst"), …
  • Handling such low-frequency phrases is almost mission impossible (not enough context) if we use statistical learning only.
• Semantic NMT is sorely needed
  • Combination with linguistics-based methods or external dictionaries
Greedy One-pass Decoding
• One-pass sequential inference can hardly be optimal
  • 僧推月下门 vs. 僧敲月下门 ("the monk pushes the gate under the moon" vs. "the monk knocks at the gate under the moon": the classic poetic deliberation behind the idiom 推敲, "weighing one's words")
  • 《红楼梦》 (Dream of the Red Chamber): "perused over ten years, with five rounds of additions and deletions; every word cost blood, and those ten years were no ordinary toil"
• Multi-round refinement in the decoder would be better
Decreased Diversity in Tech Roadmap
• Deep neural networks are crowding out other types of learning algorithms
  • Google's MultiModel is used to address different workloads
  • Even for small-sample problems, people tend to consider DNNs their first choice
  • Other technologies, such as SVMs and Bayesian networks, are gradually being ignored
• Richness and diversity are necessary for the healthy development of science and technology
  • Investigation of non-DNN algorithms for NMT
Insufficient Emphasis on Teaching
• Learning vs. teaching
  • "There are no students who cannot be taught well, only teachers who cannot teach well." (没有教不好的学生,只有不会教的老师)
  • "Teach students according to their aptitude; teaching and learning advance each other." (因材施教、教学相长)
  • Today, almost all effort is put into "how to learn", but little into "how to teach"!
Little Attention to the Beauty of Translation
• "Translation involves three difficulties: faithfulness (信), expressiveness (达), and elegance (雅)." (译事三难:信、达、雅)
  • Faithfulness (信): the translation does not deviate from the original: it is accurate, without distortion, omission, or arbitrary addition or removal of meaning
  • Expressiveness (达): the translation is not confined to the form of the original and reads smoothly and clearly
  • Elegance (雅): the words chosen are apt, pursuing the classical grace of the text itself: concise and elegant
• How can we appropriately model elegance (雅) in the training process?
  • Culture, customs, allusions, parallelism, artistry, novelty, …
http://research.microsoft.com/users/tyliu/