Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder
Proceedings of the Association for Computational Linguistics, 2017
Huadong Chen, Shujian Huang, David Chiang, Jiajun Chen
3 May 2019
Presented by: Kevin Liang
Motivation
Over the past few years, neural machine translation (NMT) models have set new states of the art across many language pairs, mostly by using an encoder-decoder structure.
Can we use source-side syntax to improve model performance?
Bidirectional tree encoder
Tree-coverage model
Neural Machine Translation (NMT)
Given a source sentence $\mathbf{x} = x_1, \ldots, x_i, \ldots, x_I$ and a target sentence $\mathbf{y} = y_1, \ldots, y_j, \ldots, y_J$, NMT seeks to model:

$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, \mathbf{x}; \theta)$ (1)

where $\theta$ are the model parameters and $y_{<j}$ are the words generated before $y_j$.
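As a quick illustration of Eq. (1), here is a minimal numpy sketch that sums per-step log-probabilities into the sequence log-probability; all names are illustrative, not from the paper.

import numpy as np

def sequence_log_prob(step_log_probs, target_ids):
    """Eq. (1) in log space: log P(y | x) = sum_j log P(y_j | y_<j, x).

    step_log_probs: (J, V) array; row j is the model's log-distribution
    over the target vocabulary after conditioning on y_<j and x.
    target_ids: the J reference word indices y_1 .. y_J.
    """
    return sum(step_log_probs[j, y] for j, y in enumerate(target_ids))

# Toy usage: J = 2 target words, vocabulary size V = 3.
log_p = np.log([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1]])
print(sequence_log_prob(log_p, [0, 1]))  # = log 0.7 + log 0.8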
Encoder-Decoder Structure with Attention
Gated Recurrent Units (GRUs)
GRUs are a common choice of gated recurrent neural network unit. GRUs are a simpler version of long short-term memory (LSTM) units, and often perform about as well.
Notation used in the paper (and these slides):
$h_t = \mathrm{GRU}(h_{t-1}, x_t, \ldots)$ (2)
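For reference, a minimal numpy sketch of the standard GRU update abbreviated by Eq. (2); the weight names (W_z, U_z, ...) are the conventional ones, not notation from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, p):
    """One standard GRU update h_t = GRU(h_{t-1}, x_t).
    p holds weight matrices W_* (hidden x input), U_* (hidden x hidden)
    and bias vectors b_*."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])              # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])              # reset gate
    h_tilde = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde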
Encoder Model
Use a bidirectional GRU to encode each word of the input sequence:

$\overrightarrow{h}_i = \mathrm{GRU}(\overrightarrow{h}_{i-1}, s_i)$ (3)

$\overleftarrow{h}_i = \mathrm{GRU}(\overleftarrow{h}_{i+1}, s_i)$ (4)

where $s_i$ is the word embedding for $x_i$.

The annotation for each source word $x_i$ is the concatenation of both the forward and backward hidden states:

$\overleftrightarrow{h}_i = \begin{bmatrix} \overrightarrow{h}_i \\ \overleftarrow{h}_i \end{bmatrix}$ (5)
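A minimal sketch of Eqs. (3)-(5), assuming gru_fwd and gru_bwd are two-argument step functions (previous state, input) -> new state with separate parameter sets; this is an illustration, not the authors' implementation.

import numpy as np

def bidirectional_encode(embeddings, gru_fwd, gru_bwd, d_h):
    """Eqs. (3)-(5): run a forward and a backward GRU over the word
    embeddings s_1..s_I and concatenate the two states for each word."""
    I = len(embeddings)
    h_fwd, h_bwd = [None] * I, [None] * I

    h = np.zeros(d_h)
    for i in range(I):                      # left to right, Eq. (3)
        h = gru_fwd(h, embeddings[i])
        h_fwd[i] = h

    h = np.zeros(d_h)
    for i in reversed(range(I)):            # right to left, Eq. (4)
        h = gru_bwd(h, embeddings[i])
        h_bwd[i] = h

    # Annotation for word i: [forward; backward], Eq. (5)
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]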
Decoder Model
The decoder hidden states $d_j$ are computed as:

$d_j = \mathrm{GRU}(d_{j-1}, t_{j-1}, c_j)$ (6)

where $t_{j-1}$ is the word embedding of the $(j-1)$-th target word, $d_j$ is the decoder's hidden state at time $j$, and $c_j$ is the context vector at time $j$.

The probability of generating the $j$-th word $y_j$ is:

$P(y_j \mid y_{<j}, \mathbf{x}; \theta) = \mathrm{softmax}(W_V d_j)$ (7)

where $W_V$ is either the transposed word embedding matrix or learned separately.
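A minimal sketch of Eq. (7) with a numerically stabilized softmax; here W_V is simply any matrix mapping the decoder state to vocabulary-sized logits.

import numpy as np

def output_distribution(d_j, W_V):
    """Eq. (7): P(y_j | y_<j, x) = softmax(W_V d_j).
    d_j: decoder state; W_V: (V, d_h) output matrix (possibly the
    transposed word embeddings)."""
    logits = W_V @ d_j
    logits = logits - logits.max()   # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()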
Attention mechanism
Attention weights are computed using the decoder state and the encoder states:

$e_{j,i} = v_a^\top \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_i)$ (8)

$\alpha_{j,i} = \dfrac{\exp(e_{j,i})}{\sum_{i'=1}^{I} \exp(e_{j,i'})}$ (9)

These attention weights are used to compute a context vector $c_j$, a weighted sum of the source encodings:

$c_j = \sum_{i=1}^{I} \alpha_{j,i} \overleftrightarrow{h}_i$ (10)
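Putting Eqs. (8)-(10) together, a minimal numpy sketch of one attention read; the weight names mirror the slides, but the code itself is only an illustration.

import numpy as np

def attention_context(d_prev, annotations, W_a, U_a, v_a):
    """Eqs. (8)-(10): score each source annotation against the previous
    decoder state, softmax-normalize, and return the weighted sum."""
    scores = np.array([v_a @ np.tanh(W_a @ d_prev + U_a @ h_i)
                       for h_i in annotations])                    # Eq. (8)
    scores = scores - scores.max()                                  # stabilize the softmax
    alphas = np.exp(scores) / np.exp(scores).sum()                  # Eq. (9)
    context = sum(a * h_i for a, h_i in zip(alphas, annotations))   # Eq. (10)
    return context, alphas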
Encoder-Decoder Structure with Attention (revisited)
Syntactic trees
Assume we have source-side syntactic trees, which can be computed before translation.

Each node is given an index, with the leaf nodes labeled $1, \ldots, I$. For any node with index $k$, let $p(k)$ denote the index of node $k$'s parent, and let $L(k)$ and $R(k)$ denote the indices of node $k$'s left and right children, respectively.
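One simple way to hold this indexing in code; this is a hypothetical helper, not a structure from the paper.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """A node of a binarized source parse, using the slide's indexing:
    leaves are numbered 1..I; parent is p(k); left/right are L(k)/R(k)."""
    index: int
    parent: Optional[int] = None   # p(k); None at the root
    left: Optional[int] = None     # L(k); None at leaves
    right: Optional[int] = None    # R(k); None at leaves

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None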
Tree-GRU Encoder
Build a tree encoder on top of the sequential encoder.
If node $k$ is a leaf node:
Node $k$'s hidden state is just the sequential encoder's annotation
$h^{\uparrow}_k = \overleftrightarrow{h}_k$ (11)
Else (node $k$ is an interior node):
Node $k$'s hidden state is a function of the hidden states of its left child $h^{\uparrow}_{L(k)}$ and right child $h^{\uparrow}_{R(k)}$
$h^{\uparrow}_k = f(h^{\uparrow}_{L(k)}, h^{\uparrow}_{R(k)})$ (12)
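A minimal sketch of this bottom-up recursion, Eqs. (11)-(12); `combine` stands in for the function $f$ (e.g. the Tree-GRU on the next slide), and all argument names are illustrative.

def bottom_up_states(root, left, right, seq_annotations, combine):
    """Eqs. (11)-(12): leaf nodes copy their sequential annotation;
    interior nodes combine their children's bottom-up states.
    left/right: dicts index -> child index (None at leaves);
    seq_annotations: dict leaf index -> sequential encoding."""
    memo = {}

    def h_up(k):
        if k not in memo:
            if left.get(k) is None and right.get(k) is None:    # leaf, Eq. (11)
                memo[k] = seq_annotations[k]
            else:                                                # interior, Eq. (12)
                memo[k] = combine(h_up(left[k]), h_up(right[k]))
        return memo[k]

    h_up(root)
    return memo   # index -> bottom-up state for every node under root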
Tree-GRU
To make the tree encoder consistent with the GRU sequential encoders, the authors use Tree-GRU units:

$r_L = \sigma(U_L^{(r_L)} h^{\uparrow}_{L(k)} + U_R^{(r_L)} h^{\uparrow}_{R(k)} + b^{(r_L)})$ (13)

$r_R = \sigma(U_L^{(r_R)} h^{\uparrow}_{L(k)} + U_R^{(r_R)} h^{\uparrow}_{R(k)} + b^{(r_R)})$ (14)

$z_L = \sigma(U_L^{(z_L)} h^{\uparrow}_{L(k)} + U_R^{(z_L)} h^{\uparrow}_{R(k)} + b^{(z_L)})$ (15)

$z_R = \sigma(U_L^{(z_R)} h^{\uparrow}_{L(k)} + U_R^{(z_R)} h^{\uparrow}_{R(k)} + b^{(z_R)})$ (16)

$z = \sigma(U_L^{(z)} h^{\uparrow}_{L(k)} + U_R^{(z)} h^{\uparrow}_{R(k)} + b^{(z)})$ (17)

$\tilde{h}^{\uparrow}_k = \tanh\big(U_L (r_L \odot h^{\uparrow}_{L(k)}) + U_R (r_R \odot h^{\uparrow}_{R(k)})\big)$ (18)

$h^{\uparrow}_k = z_L \odot h^{\uparrow}_{L(k)} + z_R \odot h^{\uparrow}_{R(k)} + z \odot \tilde{h}^{\uparrow}_k$ (19)

where $r_L, r_R$ are the reset gates and $z_L, z_R$ are the update gates for the left and right children, and $z$ is the update gate for the internal hidden state $\tilde{h}^{\uparrow}_k$.
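A minimal numpy sketch of this node composition, Eqs. (13)-(19); the parameter-dictionary keys simply mirror the slide's superscripts and are not the authors' naming.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_gru_node(h_left, h_right, P):
    """Combine the bottom-up states of the left and right children
    into the parent's bottom-up state, Eqs. (13)-(19)."""
    r_L = sigmoid(P["U_L_rL"] @ h_left + P["U_R_rL"] @ h_right + P["b_rL"])    # Eq. (13)
    r_R = sigmoid(P["U_L_rR"] @ h_left + P["U_R_rR"] @ h_right + P["b_rR"])    # Eq. (14)
    z_L = sigmoid(P["U_L_zL"] @ h_left + P["U_R_zL"] @ h_right + P["b_zL"])    # Eq. (15)
    z_R = sigmoid(P["U_L_zR"] @ h_left + P["U_R_zR"] @ h_right + P["b_zR"])    # Eq. (16)
    z   = sigmoid(P["U_L_z"]  @ h_left + P["U_R_z"]  @ h_right + P["b_z"])     # Eq. (17)
    h_tilde = np.tanh(P["U_L"] @ (r_L * h_left) + P["U_R"] @ (r_R * h_right))  # Eq. (18)
    return z_L * h_left + z_R * h_right + z * h_tilde                          # Eq. (19)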
Bottom-up tree encoder
Bidirectional tree encoder
The learned representation of a node in the bottom-up encoder is based only on its subtree; information above it in the tree is missing. This can be addressed by adding a top-down encoder as follows:

If node $k$ is the root:
Node $k$'s top-down encoding is a function of its bottom-up encoding
$h^{\downarrow}_k = \tanh(W h^{\uparrow}_k + b)$ (20)
Else:
Node $k$'s top-down encoding is produced by a sequential GRU running from the root down the tree to node $k$
$h^{\downarrow}_k = \mathrm{GRU}(h^{\downarrow}_{p(k)}, h^{\uparrow}_k)$ (21)

The final encoding for each node is obtained by concatenating the bottom-up and top-down hidden states:

$h^{\updownarrow}_k = \begin{bmatrix} h^{\uparrow}_k \\ h^{\downarrow}_k \end{bmatrix}$ (22)
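A minimal sketch of the top-down pass and concatenation, Eqs. (20)-(22); here gru_step is any two-argument (previous state, input) -> new state function, and `order` is assumed to list node indices with parents before children.

import numpy as np

def top_down_pass(order, parent, h_up, gru_step, W, b):
    """Eqs. (20)-(22). order[0] is the root; parent maps index -> p(k);
    h_up maps index -> bottom-up state from the previous slides."""
    root = order[0]
    h_down = {root: np.tanh(W @ h_up[root] + b)}             # Eq. (20)
    for k in order[1:]:
        h_down[k] = gru_step(h_down[parent[k]], h_up[k])      # Eq. (21)
    # Eq. (22): final per-node encoding = [bottom-up; top-down]
    return {k: np.concatenate([h_up[k], h_down[k]]) for k in order}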
Bidirectional tree encoder
Issues using tree encoder
Syntactic phrases in the source are often translated into discontinuous words in the output.
Non-leaf nodes, which contain more information, are attended to more often than leaf nodes, which may lead to over-translation.
Word Coverage Model
Coverage vectors have previously been proposed to make attention time-dependent; they affect the calculation of the attention score as follows:

$e_{j,i} = v_a^\top \tanh(W_a d_{j-1} + U_a h_i + V_a C_{j-1,i})$ (23)

The authors propose incorporating additional source-tree information by adding the coverage vectors and attention weights of each child:

$C_{j,i} = \mathrm{GRU}(C_{j-1,i}, \alpha_{j,i}, d_{j-1}, h_i, C_{j-1,L(i)}, \alpha_{j,L(i)}, C_{j-1,R(i)}, \alpha_{j,R(i)})$ (24)
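A minimal sketch of how the inputs of Eq. (24) might be assembled before a single GRU update; packing them into one concatenated input vector is an assumption for illustration, not necessarily the authors' exact implementation.

import numpy as np

def tree_coverage_input(C_prev, alpha, d_prev, h,
                        C_prev_L, alpha_L, C_prev_R, alpha_R):
    """Eq. (24), input side: the node's own previous coverage and attention
    weight, the decoder state, the node's encoding, and both children's
    previous coverage vectors and attention weights feed one GRU update."""
    return np.concatenate([
        C_prev, np.atleast_1d(alpha), d_prev, h,
        C_prev_L, np.atleast_1d(alpha_L),
        C_prev_R, np.atleast_1d(alpha_R),
    ])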
Chinese sentences are parsed with the Berkeley Parser [1].
Compare against three models/techniques:
NMT: standard attentional NMT model [2]
Tree-LSTM: attentional NMT model with a Tree-LSTM encoder [3]
Coverage: attentional NMT model with word coverage [4]
[1] Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proc. NAACL HLT, pages 404-411. http://www.aclweb.org/anthology/N/N07/N07-1051.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR. http://arxiv.org/abs/1409.0473.
[3] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proc. ACL, pages 823-833. http://www.aclweb.org/anthology/P16-1078.
[4] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proc. ACL, pages 76-85. http://www.aclweb.org/anthology/P16-1008.
Chinese-English BLEU-4 Scores
Tree-LSTM vs Tree-GRU encoder
Previous table:
Experiments with LSTM sequence encoder:
Encoding size
Previous table:
Experiments with bidirectional embedding size halved:
Takeaways
For the encoder, tree encoders using syntax do better than purely sequential ones, and bidirectional tree encoders are better than bottom-up-only ones.
Coverage helps. Tree-coverage helps more.
Using the same type of cell (GRU vs. LSTM) for both the sequential and tree encodings is better. LSTM-LSTM is slightly better than GRU-GRU, but more expensive.
The gains of the bidirectional tree encoding are due to more than just the larger embedding size.