Neural Machine Translation: A Machine Learning Perspective
Tie-Yan Liu
Principal Researcher, Microsoft Research
IEEE Fellow, ACM Distinguished Member
Neural Machine Translation
• Encoder: maps the input word sequence to an intermediate context
• Decoder: maps the intermediate context to a distribution over output word sequences
• Various choices for implementing the encoder or decoder: FNN, CNN, or RNN (see the sketch below)
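To make the interface concrete, here is a minimal PyTorch-style sketch of the shared encoder-decoder contract. This is my illustration rather than the speaker's code; the class layout, layer sizes, and the choice of GRUs are assumptions.

```python
# Minimal encoder-decoder sketch (illustrative; any of FNN/CNN/RNN
# could implement either side of this contract).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a source token sequence to an intermediate context."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # one possible choice

    def forward(self, src):               # src: (batch, T_x) token ids
        h, _ = self.rnn(self.emb(src))    # h: (batch, T_x, dim)
        return h                          # the "intermediate context"

class Decoder(nn.Module):
    """Maps the context to a distribution over output words, step by step."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, prev_tokens, context):
        # Initialize the decoder state from the mean-pooled context.
        state = context.mean(dim=1).unsqueeze(0)        # (1, batch, dim)
        x, _ = self.rnn(self.emb(prev_tokens), state)
        return self.out(x).log_softmax(-1)              # log P(y_t | y_<t, x)

enc, dec = Encoder(1000), Decoder(1000)
src = torch.randint(0, 1000, (2, 5))                    # batch of 2 sentences
log_probs = dec(torch.randint(0, 1000, (2, 3)), enc(src))
```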
Neural Machine Translation
• Example: RNN-based implementation
• Attention mechanism: use a personalized context vector $c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$, where $\alpha_{tj}$ is the importance of $x_j$ to $y_t$
(Bahdanau et al., ICLR 2015)
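A tiny NumPy sketch of this computation (my illustration; the dot-product scoring below is an assumption, as Bahdanau et al. actually learn a small alignment network for the scores):

```python
import numpy as np

def attention_context(h, s_t):
    """Compute c_t = sum_j alpha_tj * h_j with softmax attention weights.

    h:   (T_x, d) encoder hidden states h_1..h_{T_x}
    s_t: (d,)     current decoder state for output position t
    """
    scores = h @ s_t                        # relevance of each x_j to y_t
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()                    # alpha_tj: importance weights
    return alpha @ h                        # personalized context vector c_t

h = np.random.randn(5, 8)                   # T_x = 5 source positions
c_t = attention_context(h, np.random.randn(8))
```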
Fast Development of NMT – GNMT
• RNN as encoder/decoder
  • Stacked LSTM-RNN (8 layers for the encoder and the decoder, respectively)
  • Each layer is trained on a separate GPU for speed-up
• Standard attention model
• Residual connections for better gradient flow
• Significant improvement over shallow models
  • 39.92 vs. 31.3 (Bahdanau et al., ICLR 2015) on En→Fr
Fast Development of NMT – ConvS2S
• CNN as encoder/decoder
  • Convolutional block structure
  • Gated linear units + residual connections
  • 15 layers for the encoder and the decoder, respectively
• Multi-step attention
  • A separate attention mechanism for each decoding layer
• Comparable to (slightly better than) RNN-based NMT models
  • 40.46 vs. 39.92 (GNMT) on En→Fr
Fast Development of NMT – Transformer
• FNN as encoder/decoder
  • 6 layers (each with two sub-layers) for the encoder and the decoder, respectively
• Relies entirely on attention (including multi-head self-attention) to draw global dependencies between input and output
• Comparable to (slightly better than) RNN-based and CNN-based NMT models
  • 41.0 vs. 40.46 (ConvS2S) vs. 39.92 (GNMT) on En→Fr
Fast Development of NMT – Summary

| Algorithm | Framework | Model | #layers (encoder-decoder) | En→Fr (36M pairs) BLEU | En→Fr training cost | En→De (4.5M pairs) BLEU | En→De training cost |
|---|---|---|---|---|---|---|---|
| Bahdanau et al., ICLR 2015 | Theano (open source) | GRU-RNN | 1-1 | 31.3 | - | - | - |
| GNMT | TensorFlow (no code) | LSTM-RNN | 8-8 | 39.92 | 96 K80, 6 days | 24.6 | - |
| Transformer | TensorFlow (open source) | FNN + attention | 12-12 | 41.0 | 8 P100, 4.5 days | 28.4 | 8 P100, 3.5 days |
| ConvS2S | Torch (open source) | CNN | 15-15 | 40.46 | 8 M40, 37 days | 25.16 | 1 M40, 18.5 days |
What’s Done?
• These works verified the strong representation power of deep neural networks:
  • Whether FNN, CNN, or RNN, all can be used to fit bilingual training data and achieve good translation performance when sufficiently large training data are given.
• However, this is not surprising at all:
  • It was already indicated by the universal approximation theorem, decades ago.
What’s Missing?
• Many unique challenges of machine translation have not been addressed:
  • Reliance on a huge amount of bilingual training data
  • Reliance on myopic beam search during inference
  • Use of likelihood maximization for both training and inference, which differs from the true evaluation measure (BLEU)
  • …
Leveraging Reinforcement Learning to Tackle These Challenges
• Dual learning
  • Leverages the symmetric structure of machine translation to enable effective learning from monolingual data through reinforcement learning
• Predictive inference
  • Uses end-to-end BLEU as a delayed reward to train value networks
  • Uses value networks to guide forward-looking search along the decoding tree
Dual Learning for NMT (NIPS 2016, IJCAI 2017, ICML 2017)
Traditional Solutions to Insufficient Training Data
• Label propagation
• Transductive learning
• Multi-task learning
• Transfer learning
A New View: The Beauty of Symmetry
• Symmetry is almost everywhere in our world, and also in machine translation!
Hello! 你好!
Dual Learning
• A new learning framework that leverages the symmetric (primal-dual) structure of AI tasks to obtain effective feedback or regularization signals to enhance the learning process, especially when labeled training data are lacking.
Dual Learning for Machine Translation
The dual-learning loop: an English sentence $x$ is translated by the primal task $f: x \to y$ (En→Ch translation) into a Chinese sentence $y = f(x)$, and the dual task $g: y \to x$ (Ch→En translation) translates it back into a new English sentence $x' = g(y)$. In the reinforcement-learning view, each translation model is an agent, and the rest of the loop is its environment.

Feedback signals during the loop:
• $R(x, x'; f, g)$: BLEU of $x'$ given $x$
• $L(y; f)$, $L(x'; g)$: likelihood and syntactic correctness of $y$ and $x'$
• $R(x, y; f)$, $R(y, x'; g)$: dictionary-based translation correspondence, etc.

Policy gradient is used to improve both the primal and dual models according to these feedback signals (see the sketch below).

(NIPS 2016)
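A self-contained toy sketch of one round of this loop. It is entirely my illustration: the stub "models", reward shapes, and the mixing weight alpha are assumptions; the NIPS 2016 algorithm uses real NMT models, a language-model reward for the intermediate translation, and reconstruction likelihood.

```python
import numpy as np
rng = np.random.default_rng(0)

# Stand-ins for the real components (NMT models, language model, BLEU):
def sample_translation(theta, sentence):
    """Stub: a real system would sample a translation from an NMT model."""
    return sentence[::-1]                      # toy "translation": reversal

def lm_fluency(sentence):
    """Stub language-model reward for the intermediate translation."""
    return rng.uniform()

def reconstruction_reward(x_rec, x):
    """Stub for the closed-loop reward, e.g. BLEU(x', x)."""
    return float(x_rec == x)

def grad_log_prob(theta, output, given):
    """Stub for the policy gradient of log P(output | given; theta)."""
    return rng.normal(size=theta.shape)

def dual_learning_round(x_en, theta_f, theta_g, alpha=0.5, lr=0.01):
    y_ch = sample_translation(theta_f, x_en)   # primal: En -> Ch
    x_rec = sample_translation(theta_g, y_ch)  # dual:   Ch -> En
    reward = (alpha * lm_fluency(y_ch)
              + (1 - alpha) * reconstruction_reward(x_rec, x_en))
    # REINFORCE: push both agents toward actions that earned a high reward.
    theta_f = theta_f + lr * reward * grad_log_prob(theta_f, y_ch, x_en)
    theta_g = theta_g + lr * reward * grad_log_prob(theta_g, x_rec, y_ch)
    return theta_f, theta_g

theta_f, theta_g = rng.normal(size=8), rng.normal(size=8)
theta_f, theta_g = dual_learning_round("hello !", theta_f, theta_g)
```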
Experimental Setting
• Baseline:
  • State-of-the-art NMT model, trained using 100% of the bilingual data
  • "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., ICLR 2015, from Bengio's group)
• Our algorithm:
  • Step 1 (initialization): start from a weak NMT model learned from only 10% of the training data
  • Step 2 (dual learning): use the policy gradient algorithm to update the dual models based on monolingual data
[Chart: BLEU score, French→English, for three settings: NMT with 10% bilingual data, dual learning with 10% bilingual data, and NMT with 100% bilingual data. Relative to the 100% baseline, NMT trained on 10% of the data loses about 5.0 BLEU, while dual learning with the same 10% gains about 0.3 BLEU.]

Starting from initial models obtained from only 10% of the bilingual data, dual learning achieves accuracy similar to the NMT model learned from 100% of the bilingual data!
Probabilistic Nature
• The primal-dual structure implies strong probabilistic connections between the two tasks.
• This can also be used to improve supervised learning, and perhaps even inference:
  • Structural regularizer to enhance supervised learning
  • Additional criterion to improve inference
$P(x, y) = P(x)P(y|x; f)$ (primal view) $= P(y)P(x|y; g)$ (dual view)
“Dual” Supervised Learning
The loop: labeled data $x$ → predicted label $y = f(x)$ (primal task $f: x \to y$) → reconstructed data $x' = g(y)$ (dual task $g: y \to x$).

Feedback signal during the loop:
• $R(x; f, g) = |P(x)P(y|x; f) - P(y)P(x|y; g)|$: the gap between the joint probability $P(x, y)$ computed in the two directions

Training objective: maximize $\log P(y|x; f)$ and $\log P(x|y; g)$, while minimizing the duality gap $|P(x)P(y|x; f) - P(y)P(x|y; g)|$ (see the sketch below).
(ICML 2017)
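A hedged sketch of turning the duality constraint into a training regularizer. Working in log space and the penalty weight lambda_dual are my assumptions; the ICML 2017 paper relaxes the constraint into a similar penalty term.

```python
import torch

def dual_supervised_loss(log_p_y_given_x,   # log P(y|x; f), primal model
                         log_p_x_given_y,   # log P(x|y; g), dual model
                         log_p_x, log_p_y,  # marginals from language models
                         lambda_dual=0.01):
    """Both directions' negative log-likelihood plus a duality penalty.
    Driving the squared log-space gap to zero enforces
    P(x) P(y|x; f) = P(y) P(x|y; g)."""
    nll = -(log_p_y_given_x + log_p_x_given_y)
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return nll + lambda_dual * gap ** 2

# Toy usage with scalar sentence-level log-probabilities:
loss = dual_supervised_loss(torch.tensor(-2.3), torch.tensor(-2.5),
                            torch.tensor(-10.0), torch.tensor(-9.5))
```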
Experimental Results
Theoretical analysis:
• Dual supervised learning generalizes better than standard supervised learning: the hypothesis space shrinks to the product space of model pairs satisfying probabilistic duality, $P(x)P(y|x; f) = P(y)P(x|y; g)$.

BLEU gains of dual supervised learning over standard NMT:

| | En→Fr | Fr→En | En→De | De→En |
|---|---|---|---|---|
| Gain over NMT | ↑2.1 | ↑0.9 | ↑1.4 | ↑0.1 |
“Dual” Inference
The loop: test data $x$ → predicted label $y = f(x)$ (primal task $f: x \to y$) → reconstructed data $x' = g(y)$ (dual task $g: y \to x$).

By Bayes' rule, $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$.

• Standard inference: choose the $y$ that maximizes $P(y|x; f)$
• Dual inference: choose the $y$ that maximizes both $P(y|x; f)$ and $\frac{P(y)P(x|y; g)}{P(x)}$, i.e., leverage both the primal model and the dual model at test time (see the sketch below)
(IJCAI 2017)
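A minimal sketch of dual inference as re-ranking of beam candidates; the interpolation weight alpha and the candidate-dictionary interface are my assumptions.

```python
def choose_by_dual_inference(candidates, alpha=0.5):
    """Re-rank candidate translations y of a source x by combining the
    primal score log P(y|x; f) with the dual (Bayes) score
    log P(y) + log P(x|y; g) - log P(x). Since log P(x) is constant in y,
    it can be dropped from the argmax."""
    def score(c):
        primal = c["log_p_y_given_x"]                # from the primal model f
        dual = c["log_p_y"] + c["log_p_x_given_y"]   # from an LM and the dual g
        return alpha * primal + (1 - alpha) * dual
    return max(candidates, key=score)

best = choose_by_dual_inference([
    {"log_p_y_given_x": -2.1, "log_p_x_given_y": -2.4, "log_p_y": -8.0},
    {"log_p_y_given_x": -1.9, "log_p_x_given_y": -3.8, "log_p_y": -9.5},
])
```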
Experimental Results
Theoretical analysis:
• Dual inference has a generalization guarantee, even though training and inference become slightly inconsistent: the generalization bound for dual inference is comparable to that of standard inference.

BLEU gains of dual inference over standard NMT:

| | En→Fr | Fr→En | En→De | De→En |
|---|---|---|---|---|
| Gain over NMT | ↑0.5 | ↑0.4 | ↑1.2 | ↑0.5 |
Inference with Predicted Reward
Standard Inference Process in NMT
• Beam search + likelihood maximization
[Diagram: an RNN encoder-decoder (embedding + LSTM/GRU layers) translating "I love China"; at each decoding step the beam keeps several Chinese candidates, such as 我 (I), 你 (you), 他 (he), 爱 (love), 喜欢 (like), and 中国 (China).]
At each step, select the top-$k$ words with the largest translation probability (see the beam-search sketch below).
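A self-contained sketch of this procedure over a toy next-word distribution; the model interface (a function returning per-word log-probabilities) is an assumption.

```python
import math

def beam_search(step_log_probs, k=3, max_len=10, eos=0):
    """Keep the top-k partial translations by accumulated log-likelihood.
    step_log_probs(prefix) -> list of (word_id, log_prob) for the next word."""
    beams = [([], 0.0)]                        # (prefix, total log-prob)
    for _ in range(max_len):
        expanded = []
        for prefix, lp in beams:
            if prefix and prefix[-1] == eos:   # finished hypothesis: keep as-is
                expanded.append((prefix, lp))
                continue
            for w, wlp in step_log_probs(prefix):
                expanded.append((prefix + [w], lp + wlp))
        # At each step, keep the top-k candidates with the largest probability.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams

# Toy next-word distribution: word 1 is likely, word 0 acts as <eos>.
toy_model = lambda prefix: [(0, math.log(0.2)),
                            (1, math.log(0.5)),
                            (2, math.log(0.3))]
print(beam_search(toy_model, k=2))
```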
Inference Errors
• Likelihood ≠ BLEU!
• Myopic local search ≠ global optimum
We need a method to predict long-term rewards (e.g., BLEU) during inference.
Inspired by AlphaGo
[Diagram: as in AlphaGo, the system pairs a policy network with a value network. The NMT model (encoder states $h_1 \dots h_{T_x}$, attention, contexts $c_1, c_2, \dots$) plays the role of the policy network and outputs word probabilities; a value network with the same encoder-attention structure also reads the decoded prefix $y_1, y_2, \dots$ and predicts the future BLEU.]
Value Networks to Predict Long-term Reward
• Value function in NMT
  • The value function $v_\pi(x, y_{<t})$ estimates the (delayed) BLEU score of the final translation for source sentence $x$, if we continue decoding from the partially decoded sentence $y_{<t}$ according to NMT model $\pi$.
• Information for estimating the long-term BLEU score $v_\pi(x, y_{<t})$:
  • Semantic correlation between the source $x$ and the partially decoded target $y_{<t}$
  • Effectiveness/coverage of the attention mechanism between encoder and decoder
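A hedged sketch of how such a value estimate can steer beam search, scoring each candidate by a blend of log-likelihood and predicted future BLEU; the mixing weight and function names are my assumptions.

```python
def value_guided_score(log_prob, predicted_bleu, beta=0.7):
    """Combine likelihood-so-far with the value network's long-term estimate.
    log_prob: accumulated log P(y_<t | x); predicted_bleu: v_pi(x, y_<t)."""
    return beta * log_prob + (1 - beta) * predicted_bleu

# During beam search, rank candidates by the blended score instead of
# pure likelihood:
candidates = [(-1.2, 0.35), (-0.9, 0.20)]   # (log_prob, v_pi) per candidate
best = max(candidates, key=lambda c: value_guided_score(*c))
```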
Design of Value Networks
[Diagram: the value network builds two modules on top of the encoder-decoder, a Semantic Matching (SM) module producing $u_{sm}$ and a Context-Coverage (CC) module producing $u_{cc}$, whose outputs are combined to predict the value.]

• Semantic Matching (SM) module: computes the semantic correlation between the source and target sentences, based on mean pooling of the encoder and decoder hidden states.
• Context-Coverage (CC) module: computes the effectiveness/coverage of the attention mechanism, based on mean pooling of the encoder hidden states and the context vectors. (See the sketch below.)
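A NumPy sketch of the two pooled feature vectors as described; the concatenation layout and the final scoring MLP are my assumptions.

```python
import numpy as np

def value_network_features(enc_h, dec_h, contexts):
    """SM module: mean-pooled encoder states + mean-pooled decoder states.
    CC module: mean-pooled encoder states + mean-pooled attention contexts.

    enc_h:    (T_x, d) encoder hidden states h_1..h_{T_x}
    dec_h:    (t, d)   decoder hidden states for the partial translation y_<t
    contexts: (t, d)   attention context vectors c_1..c_t
    """
    u_sm = np.concatenate([enc_h.mean(0), dec_h.mean(0)])     # semantic matching
    u_cc = np.concatenate([enc_h.mean(0), contexts.mean(0)])  # context coverage
    return np.concatenate([u_sm, u_cc])  # fed to a small MLP that outputs v

feats = value_network_features(np.random.randn(6, 4),
                               np.random.randn(3, 4),
                               np.random.randn(3, 4))
```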
Training of Value Networks
• Training data
  • Generated by a Monte-Carlo method (steps below):
1. Randomly pick a source sentence $x$ from the original training dataset.
2. Generate a partial translation $y_{<t}$ for $x$ using $\pi$, with randomly selected $t$.
3. Generate $K$ complete translations for each $y_{<t}$ using $\pi$.
4. Compute the BLEU score of each translation against the ground-truth sentence $y$.
5. Use the averaged BLEU score as the labeled value for $v_\pi(x, y_{<t})$.
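A hedged sketch of this labeling loop; the sampling and BLEU helpers are stubs I made up so the example runs, while a real implementation would roll out the actual NMT model and a proper sentence-level BLEU.

```python
import random
random.seed(0)

def make_value_training_example(x, y_ref, policy_sample, sentence_bleu, K=20):
    """One (x, y_<t) -> averaged-BLEU training pair for the value network.
    policy_sample(x, prefix, complete) draws a (continuation of a)
    translation from the NMT model pi; sentence_bleu scores a hypothesis."""
    t = random.randint(1, 8)                        # randomly selected t
    y_prefix = policy_sample(x, prefix=None, complete=False)[:t]
    rollouts = [policy_sample(x, prefix=y_prefix, complete=True)
                for _ in range(K)]                  # K complete translations
    label = sum(sentence_bleu(y, y_ref) for y in rollouts) / K
    return (x, y_prefix), label

# Toy stand-ins so the sketch runs end to end:
toy_sample = lambda x, prefix, complete: (list(prefix or []) + ["w"] * 10)[:10]
toy_bleu = lambda hyp, ref: len(set(hyp) & set(ref)) / max(len(ref), 1)
example, label = make_value_training_example("I love China",
                                             ["我", "爱", "中国"],
                                             toy_sample, toy_bleu)
```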
Training of Value Networks
• Pairwise ranking loss minimization
  • For two partial sentences $y_{p,1}$ and $y_{p,2}$ of the same $x$, where $y_{p,1}$ has a larger BLEU score than $y_{p,2}$, we define a loss that penalizes the value network whenever it ranks $y_{p,2}$ above $y_{p,1}$ (see the sketch below).
• Why not directly optimize the BLEU score?
  • BLEU is computed from $n$-gram precision ($n = 1, 2, 3, 4$): regressing exact BLEU scores is more sensitive than pairwise classification, which only needs to differentiate good candidates from bad ones.
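The slide defers the exact formula; below is a plausible sketch using an exponential pairwise loss, which is an assumption on my part (a margin or logistic loss would serve the same purpose).

```python
import torch

def pairwise_ranking_loss(v_good, v_bad):
    """Penalize the value network when it scores the lower-BLEU partial
    sentence y_{p,2} above the higher-BLEU one y_{p,1}.
    v_good = v_pi(x, y_{p,1}), v_bad = v_pi(x, y_{p,2}).
    The exponential form is one common choice, not quoted from the paper."""
    return torch.exp(v_bad - v_good)

loss = pairwise_ranking_loss(torch.tensor(0.7), torch.tensor(0.4))
```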
Experimental Results
• Baselines:
  • BS: standard beam search
  • BSO: beam search guided by the predicted instant (rather than delayed) BLEU score of the partial decoding result
• Observations:
  • Our proposed approach is consistently better than the baselines
More Challenges to NMT
• Black magic in algorithm tuning
• High computational load
• Latent semantics in texts
• Greedy one-pass decoding
• Decreased diversity in the tech roadmap
• Insufficient emphasis on teaching
• Little attention to the beauty of translation
Black Magic in Algorithm Tuning
• Hyper-parameter tuning
  • Structure: #layers, #nodes, activation types, skip connections, attention, …
  • Learning rate, momentum, initialization, dropout, batch normalization, …
• Unreliable results
  • Many published results cannot be readily reproduced
  • Good empirical performance might result from "overfitting" the test data (especially considering that test sets for NMT are too small to be statistically robust)
High Computational Load
• Current NMT models are clumsy and require long training times and huge computational power
  • GNMT: 96 GPUs for one week (WMT En→Fr)
  • ConvS2S: 8 GPUs for 5+ weeks (WMT En→Fr)
• Lightweight NMT is desirable
  • Train an NMT model in one hour!
Latent Semantics in Texts
• The whole ≠ the sum of its parts
  • 杀鸡取卵 ("kill the hen to get the eggs") ≠ "kill" + "hen" + "get" + "eggs"
  • Other examples: 春风化雨 ("spring wind becoming rain", i.e., gentle, nurturing influence), 登堂入室 ("entering the hall, then the inner chamber", i.e., attaining mastery), 饮鸩止渴 ("drinking poison to quench thirst"), …
  • Handling such low-frequency phrases is almost mission impossible (not enough context) if we use statistical learning only.
• Semantic NMT is sorely needed
  • Combination with linguistics-based methods or external dictionaries
Greedy One-pass Decoding
• One-pass sequential inference can hardly be optimal
  • 僧推月下门 vs. 僧敲月下门 ("the monk pushes the gate under the moon" vs. "the monk knocks at the gate under the moon": the classic poetic deliberation behind the idiom 推敲, "weighing one's words")
  • 《红楼梦》 (Dream of the Red Chamber): "perused over ten years, with five rounds of additions and deletions; every word cost blood, and those ten years were no ordinary toil"
• Multi-round refinement in the decoder would be better
Decreased Diversity in Tech Roadmap
• Deep neural networks are crowding out other types of learning algorithms
  • Google's MultiModel is used to address different workloads
  • Even for small-sample problems, people tend to consider DNNs their first choice
  • Other technologies, such as SVMs and Bayesian networks, are gradually being ignored
• Richness and diversity are necessary for the healthy development of science and technology
  • Investigation of non-DNN algorithms for NMT
Insufficient Emphasis on Teaching
• Learning vs. teaching
  • "There are no students who cannot be taught well, only teachers who cannot teach well." (没有教不好的学生,只有不会教的老师)
  • "Teach students according to their aptitude; teaching and learning advance each other." (因材施教、教学相长)
  • Today, almost all effort is put into "how to learn", but little into "how to teach"!
Little Attention to the Beauty of Translation
• "Translation involves three difficulties: faithfulness (信), expressiveness (达), and elegance (雅)." (译事三难:信、达、雅)
  • Faithfulness (信): the translation does not deviate from the original: it is accurate, without distortion, omission, or arbitrary addition or removal of meaning
  • Expressiveness (达): the translation is not confined to the form of the original and reads smoothly and clearly
  • Elegance (雅): the words chosen are apt, pursuing the classical grace of the text itself: concise and elegant
• How can we appropriately model elegance (雅) in the training process?
  • Culture, customs, allusions, parallelism, artistry, novelty, …
http://research.microsoft.com/users/tyliu/