Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder
Proceedings of the Association for Computational Linguistics, 2017
Huadong Chen, Shujian Huang, David Chiang, Jiajun Chen
3 May 2019
Presented by: Kevin Liang
Motivation
Over the past few years, neural machine translation (NMT) models have set new states of the art across many language pairs, mostly by using an encoder-decoder structure.
Can we use source-side syntax to improve model performance?
Bidirectional tree encoder
Tree-coverage model
Neural Machine Translation (NMT)
Given a source sentence $\mathbf{x} = x_1, \ldots, x_i, \ldots, x_I$ and a target sentence $\mathbf{y} = y_1, \ldots, y_j, \ldots, y_J$, NMT seeks to model:

$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, \mathbf{x}; \theta)$ (1)

where $\theta$ are the model parameters and $y_{<j}$ are the words generated before $y_j$.
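As a quick illustration of Eq. (1), here is a minimal numpy sketch that sums per-step log-probabilities into the sequence log-probability; all names are illustrative, not from the paper.

import numpy as np

def sequence_log_prob(step_log_probs, target_ids):
    """Eq. (1) in log space: log P(y | x) = sum_j log P(y_j | y_<j, x).

    step_log_probs: (J, V) array; row j is the model's log-distribution
    over the target vocabulary after conditioning on y_<j and x.
    target_ids: the J reference word indices y_1 .. y_J.
    """
    return sum(step_log_probs[j, y] for j, y in enumerate(target_ids))

# Toy usage: J = 2 target words, vocabulary size V = 3.
log_p = np.log([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1]])
print(sequence_log_prob(log_p, [0, 1]))  # = log 0.7 + log 0.8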
Encoder-Decoder Structure with Attention
Gated Recurrent Units (GRUs)
GRUs are a common choice of gated recurrent neural network unit. GRUs are a simpler version of long short-term memory (LSTM) units, and often perform about as well.
Notation used in the paper (and these slides):
$h_t = \mathrm{GRU}(h_{t-1}, x_t, \ldots)$ (2)
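For reference, a minimal numpy sketch of the standard GRU update abbreviated by Eq. (2); the weight names (W_z, U_z, ...) are the conventional ones, not notation from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, p):
    """One standard GRU update h_t = GRU(h_{t-1}, x_t).
    p holds weight matrices W_* (hidden x input), U_* (hidden x hidden)
    and bias vectors b_*."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])              # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])              # reset gate
    h_tilde = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde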
Encoder Model
Use a bidirectional GRU to encode each word of the input sequence:

$\overrightarrow{h}_i = \mathrm{GRU}(\overrightarrow{h}_{i-1}, s_i)$ (3)

$\overleftarrow{h}_i = \mathrm{GRU}(\overleftarrow{h}_{i+1}, s_i)$ (4)

where $s_i$ is the word embedding for $x_i$.

The annotation for each source word $x_i$ is the concatenation of both the forward and backward hidden states:

$\overleftrightarrow{h}_i = \begin{bmatrix} \overrightarrow{h}_i \\ \overleftarrow{h}_i \end{bmatrix}$ (5)
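A minimal sketch of Eqs. (3)-(5), assuming gru_fwd and gru_bwd are two-argument step functions (previous state, input) -> new state with separate parameter sets; this is an illustration, not the authors' implementation.

import numpy as np

def bidirectional_encode(embeddings, gru_fwd, gru_bwd, d_h):
    """Eqs. (3)-(5): run a forward and a backward GRU over the word
    embeddings s_1..s_I and concatenate the two states for each word."""
    I = len(embeddings)
    h_fwd, h_bwd = [None] * I, [None] * I

    h = np.zeros(d_h)
    for i in range(I):                      # left to right, Eq. (3)
        h = gru_fwd(h, embeddings[i])
        h_fwd[i] = h

    h = np.zeros(d_h)
    for i in reversed(range(I)):            # right to left, Eq. (4)
        h = gru_bwd(h, embeddings[i])
        h_bwd[i] = h

    # Annotation for word i: [forward; backward], Eq. (5)
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]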
Decoder Model
The decoder hidden states $d_j$ are computed as:

$d_j = \mathrm{GRU}(d_{j-1}, t_{j-1}, c_j)$ (6)

where $t_{j-1}$ is the word embedding of the $(j-1)$-th target word, $d_j$ is the decoder's hidden state at time $j$, and $c_j$ is the context vector at time $j$.

The probability of generating the $j$-th word $y_j$ is:

$P(y_j \mid y_{<j}, \mathbf{x}; \theta) = \mathrm{softmax}(W_V d_j)$ (7)

where $W_V$ is either the transposed word embedding matrix or learned separately.
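A minimal sketch of Eq. (7) with a numerically stabilized softmax; here W_V is simply any matrix mapping the decoder state to vocabulary-sized logits.

import numpy as np

def output_distribution(d_j, W_V):
    """Eq. (7): P(y_j | y_<j, x) = softmax(W_V d_j).
    d_j: decoder state; W_V: (V, d_h) output matrix (possibly the
    transposed word embeddings)."""
    logits = W_V @ d_j
    logits = logits - logits.max()   # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()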
Attention mechanism
Attention weights are computed using the decoder state and the encoder states:

$e_{j,i} = v_a^\top \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_i)$ (8)

$\alpha_{j,i} = \dfrac{\exp(e_{j,i})}{\sum_{i'=1}^{I} \exp(e_{j,i'})}$ (9)

These attention weights are used to compute a context vector $c_j$, a weighted sum of the source encodings:

$c_j = \sum_{i=1}^{I} \alpha_{j,i} \overleftrightarrow{h}_i$ (10)
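Putting Eqs. (8)-(10) together, a minimal numpy sketch of one attention read; the weight names mirror the slides, but the code itself is only an illustration.

import numpy as np

def attention_context(d_prev, annotations, W_a, U_a, v_a):
    """Eqs. (8)-(10): score each source annotation against the previous
    decoder state, softmax-normalize, and return the weighted sum."""
    scores = np.array([v_a @ np.tanh(W_a @ d_prev + U_a @ h_i)
                       for h_i in annotations])                    # Eq. (8)
    scores = scores - scores.max()                                  # stabilize the softmax
    alphas = np.exp(scores) / np.exp(scores).sum()                  # Eq. (9)
    context = sum(a * h_i for a, h_i in zip(alphas, annotations))   # Eq. (10)
    return context, alphas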
Encoder-Decoder Structure with Attention (revisited)
Syntactic trees
Assume we have source-side syntactic trees, which can be computed before translation.

Each node is given an index, with the leaf nodes labeled $1, \ldots, I$. For any node with index $k$, let $p(k)$ denote the index of node $k$'s parent, and let $L(k)$ and $R(k)$ denote the indices of node $k$'s left and right children, respectively.
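One simple way to hold this indexing in code; this is a hypothetical helper, not a structure from the paper.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """A node of a binarized source parse, using the slide's indexing:
    leaves are numbered 1..I; parent is p(k); left/right are L(k)/R(k)."""
    index: int
    parent: Optional[int] = None   # p(k); None at the root
    left: Optional[int] = None     # L(k); None at leaves
    right: Optional[int] = None    # R(k); None at leaves

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None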
Tree-GRU Encoder
Build a tree encoder on top of the sequential encoder.
If node $k$ is a leaf node:
Node $k$'s hidden state is just the sequential encoder's annotation
$h^{\uparrow}_k = \overleftrightarrow{h}_k$ (11)
Else (node $k$ is an interior node):
Node $k$'s hidden state is a function of the hidden states of its left child $h^{\uparrow}_{L(k)}$ and right child $h^{\uparrow}_{R(k)}$
$h^{\uparrow}_k = f(h^{\uparrow}_{L(k)}, h^{\uparrow}_{R(k)})$ (12)
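A minimal sketch of this bottom-up recursion, Eqs. (11)-(12); `combine` stands in for the function $f$ (e.g. the Tree-GRU on the next slide), and all argument names are illustrative.

def bottom_up_states(root, left, right, seq_annotations, combine):
    """Eqs. (11)-(12): leaf nodes copy their sequential annotation;
    interior nodes combine their children's bottom-up states.
    left/right: dicts index -> child index (None at leaves);
    seq_annotations: dict leaf index -> sequential encoding."""
    memo = {}

    def h_up(k):
        if k not in memo:
            if left.get(k) is None and right.get(k) is None:    # leaf, Eq. (11)
                memo[k] = seq_annotations[k]
            else:                                                # interior, Eq. (12)
                memo[k] = combine(h_up(left[k]), h_up(right[k]))
        return memo[k]

    h_up(root)
    return memo   # index -> bottom-up state for every node under root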
Tree-GRU
To make the tree encoder consistent with the GRU sequential encoders, the authors use Tree-GRU units:

$r_L = \sigma(U_L^{(r_L)} h^{\uparrow}_{L(k)} + U_R^{(r_L)} h^{\uparrow}_{R(k)} + b^{(r_L)})$ (13)

$r_R = \sigma(U_L^{(r_R)} h^{\uparrow}_{L(k)} + U_R^{(r_R)} h^{\uparrow}_{R(k)} + b^{(r_R)})$ (14)

$z_L = \sigma(U_L^{(z_L)} h^{\uparrow}_{L(k)} + U_R^{(z_L)} h^{\uparrow}_{R(k)} + b^{(z_L)})$ (15)

$z_R = \sigma(U_L^{(z_R)} h^{\uparrow}_{L(k)} + U_R^{(z_R)} h^{\uparrow}_{R(k)} + b^{(z_R)})$ (16)

$z = \sigma(U_L^{(z)} h^{\uparrow}_{L(k)} + U_R^{(z)} h^{\uparrow}_{R(k)} + b^{(z)})$ (17)

$\tilde{h}^{\uparrow}_k = \tanh\big(U_L (r_L \odot h^{\uparrow}_{L(k)}) + U_R (r_R \odot h^{\uparrow}_{R(k)})\big)$ (18)

$h^{\uparrow}_k = z_L \odot h^{\uparrow}_{L(k)} + z_R \odot h^{\uparrow}_{R(k)} + z \odot \tilde{h}^{\uparrow}_k$ (19)

where $r_L, r_R$ are the reset gates and $z_L, z_R$ are the update gates for the left and right children, and $z$ is the update gate for the internal hidden state $\tilde{h}^{\uparrow}_k$.
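A minimal numpy sketch of this node composition, Eqs. (13)-(19); the parameter-dictionary keys simply mirror the slide's superscripts and are not the authors' naming.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_gru_node(h_left, h_right, P):
    """Combine the bottom-up states of the left and right children
    into the parent's bottom-up state, Eqs. (13)-(19)."""
    r_L = sigmoid(P["U_L_rL"] @ h_left + P["U_R_rL"] @ h_right + P["b_rL"])    # Eq. (13)
    r_R = sigmoid(P["U_L_rR"] @ h_left + P["U_R_rR"] @ h_right + P["b_rR"])    # Eq. (14)
    z_L = sigmoid(P["U_L_zL"] @ h_left + P["U_R_zL"] @ h_right + P["b_zL"])    # Eq. (15)
    z_R = sigmoid(P["U_L_zR"] @ h_left + P["U_R_zR"] @ h_right + P["b_zR"])    # Eq. (16)
    z   = sigmoid(P["U_L_z"]  @ h_left + P["U_R_z"]  @ h_right + P["b_z"])     # Eq. (17)
    h_tilde = np.tanh(P["U_L"] @ (r_L * h_left) + P["U_R"] @ (r_R * h_right))  # Eq. (18)
    return z_L * h_left + z_R * h_right + z * h_tilde                          # Eq. (19)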
Bottom-up tree encoder
Bidirectional tree encoder
The learned representation of a node in the bottom-up encoder is based only on its subtree; information above it in the tree is missing. This can be addressed by adding a top-down encoder as follows:

If node $k$ is the root:
Node $k$'s top-down encoding is a function of its bottom-up encoding
$h^{\downarrow}_k = \tanh(W h^{\uparrow}_k + b)$ (20)
Else:
Node $k$'s top-down encoding is produced by a sequential GRU running from the root down the tree to node $k$
$h^{\downarrow}_k = \mathrm{GRU}(h^{\downarrow}_{p(k)}, h^{\uparrow}_k)$ (21)

The final encoding for each node is obtained by concatenating the bottom-up and top-down hidden states:

$h^{\updownarrow}_k = \begin{bmatrix} h^{\uparrow}_k \\ h^{\downarrow}_k \end{bmatrix}$ (22)
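A minimal sketch of the top-down pass and concatenation, Eqs. (20)-(22); here gru_step is any two-argument (previous state, input) -> new state function, and `order` is assumed to list node indices with parents before children.

import numpy as np

def top_down_pass(order, parent, h_up, gru_step, W, b):
    """Eqs. (20)-(22). order[0] is the root; parent maps index -> p(k);
    h_up maps index -> bottom-up state from the previous slides."""
    root = order[0]
    h_down = {root: np.tanh(W @ h_up[root] + b)}             # Eq. (20)
    for k in order[1:]:
        h_down[k] = gru_step(h_down[parent[k]], h_up[k])      # Eq. (21)
    # Eq. (22): final per-node encoding = [bottom-up; top-down]
    return {k: np.concatenate([h_up[k], h_down[k]]) for k in order}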
Bidirectional tree encoder
Issues using tree encoder
Syntactic phrases in the source are often translated into discontinuous words in the output.
Non-leaf nodes, which contain more information, are attended to more often than leaf nodes, which may lead to over-translation.
Word Coverage Model
Coverage vectors have previously been proposed to make attention time-dependent; they affect the calculation of the attention score as follows:

$e_{j,i} = v_a^\top \tanh(W_a d_{j-1} + U_a h_i + V_a C_{j-1,i})$ (23)

The authors propose incorporating additional source-tree information by adding the coverage vectors and attention weights of each child:

$C_{j,i} = \mathrm{GRU}(C_{j-1,i}, \alpha_{j,i}, d_{j-1}, h_i, C_{j-1,L(i)}, \alpha_{j,L(i)}, C_{j-1,R(i)}, \alpha_{j,R(i)})$ (24)
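A minimal sketch of how the inputs of Eq. (24) might be assembled before a single GRU update; packing them into one concatenated input vector is an assumption for illustration, not necessarily the authors' exact implementation.

import numpy as np

def tree_coverage_input(C_prev, alpha, d_prev, h,
                        C_prev_L, alpha_L, C_prev_R, alpha_R):
    """Eq. (24), input side: the node's own previous coverage and attention
    weight, the decoder state, the node's encoding, and both children's
    previous coverage vectors and attention weights feed one GRU update."""
    return np.concatenate([
        C_prev, np.atleast_1d(alpha), d_prev, h,
        C_prev_L, np.atleast_1d(alpha_L),
        C_prev_R, np.atleast_1d(alpha_R),
    ])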
Chinese sentences are parsed with the Berkeley Parser [1].
Compare against three models/techniques:
NMT: standard attentional NMT model [2]
Tree-LSTM: attentional NMT model with a Tree-LSTM encoder [3]
Coverage: attentional NMT model with word coverage [4]
[1] Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proc. NAACL HLT, pages 404-411. http://www.aclweb.org/anthology/N/N07/N07-1051.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR. http://arxiv.org/abs/1409.0473.
[3] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proc. ACL, pages 823-833. http://www.aclweb.org/anthology/P16-1078.
[4] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proc. ACL, pages 76-85. http://www.aclweb.org/anthology/P16-1008.
Chinese-English BLEU-4 Scores
Tree-LSTM vs Tree-GRU encoder
Previous table:
Experiments with LSTM sequence encoder:
Encoding size
Previous table:
Experiments with bidirectional embedding size halved:
Takeaways
For the encoder, tree encoders using syntax do better than purely sequential ones, and bidirectional tree encoders are better than bottom-up-only ones.
Coverage helps. Tree-coverage helps more.
Using the same type of cell (GRU vs. LSTM) for both the sequential and tree encodings is better. LSTM-LSTM is slightly better than GRU-GRU, but more expensive.
The gains of the bidirectional tree encoding are due to more than just the larger embedding size.