On Using Monolingual Corpora in Neural Machine Translation
Presentation by: Ander Martinez Sanchez (D1), Matsumoto Lab
Apr 15, 2017

Transcript
Page 1: On using monolingual corpora in neural machine translation

On Using Monolingual Corpora in Neural Machine Translation

Presentation by: Ander Martinez Sanchez (D1), Matsumoto Lab

Page 2: On using monolingual corpora in neural machine translation

Abstract

● Recent NMT has shown promising results
  ○ Largely thanks to good parallel corpora
● This work investigates how to leverage abundant monolingual corpora
● Up to +1.96 BLEU on a low-resource pair (Turkish-English)
  ○ +1.59 BLEU on a focused-domain pair (Chinese-English SMS/chat messages)
● High-resource pairs also benefit
  ○ +0.39 BLEU on Czech-English
  ○ +0.47 BLEU on German-English

Page 3: On using monolingual corpora in neural machine translation

Introduction

● Goal: improve NMT by using monolingual data
● By: integrating a Language Model (LM) for the target language (English)
● For:
  a. A resource-limited pair: Turkish-English
  b. Domain-restricted translation: Chinese-English SMS chats
  c. High-resource pairs: German-English and Czech-English
● Article structure:
  a. Recent work
  b. Basic model architecture
  c. Shallow and deep fusion approaches
  d. Datasets
  e. Experiments and results

Page 4: On using monolingual corpora in neural machine translation

Background: Neural Machine Translation

SMT
● Theory: maximize p(y|x). By Bayes' rule, the target language model p(y) appears as a factor
● Reality: systems tend to model
  ○ fj(x, y) ← a feature, e.g. pair-wise statistics
  ○ C ← a normalization constant, often ignored
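Neither equation survived in the transcript, so here is a sketch of the standard noisy-channel and log-linear formulations these bullets allude to; the feature weights λ_j are my notation, not named on the slide:

```latex
% Bayes' rule: the target-side language model p(y) enters as a separate factor
p(y \mid x) \;\propto\; p(x \mid y)\, p(y)

% In practice, a log-linear model over features f_j(x, y) with weights \lambda_j
% and a normalization constant C (often ignored during decoding):
p(y \mid x) \;=\; \frac{1}{C}\, \exp\!\Big( \sum_j \lambda_j f_j(x, y) \Big)
```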

NMT
● A single network optimizes log p(y|x), including feature extraction and C
● Typically an encoder-decoder framework
● Once the conditional distribution is learnt,
  ○ find a translation using, for instance, a beam search algorithm
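As a concrete illustration of the last point, a minimal beam-search sketch over an arbitrary conditional distribution; `next_log_probs` is a hypothetical stand-in for the NMT decoder step, not the paper's implementation:

```python
import math
from typing import Callable, Dict, List, Tuple

def beam_search(next_log_probs: Callable[[List[str]], Dict[str, float]],
                beam_size: int = 5,
                max_len: int = 50,
                eos: str = "</s>") -> Tuple[List[str], float]:
    """Return the highest-scoring hypothesis under the model's log-probabilities."""
    # Each hypothesis is (tokens, cumulative log-probability).
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [word], score + lp))
        # Keep only the K best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy usage with a fixed next-word distribution (purely illustrative):
toy = {(): {"the": -0.5, "a": -1.0},
       ("the",): {"cat": -0.3, "</s>": -2.0},
       ("a",): {"cat": -0.7, "</s>": -1.5},
       ("the", "cat"): {"</s>": -0.1},
       ("a", "cat"): {"</s>": -0.2}}
print(beam_search(lambda toks: toy.get(tuple(toks), {"</s>": 0.0}), beam_size=2))
```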

Page 5: On using monolingual corpora in neural machine translation

Model Description

Figure from [Ling et al. 2015]

1. Word embeddings
2. Annotation vectors (encoding)
3. Target (y) word embeddings
4. Decoder hidden state
5. Alignment model (see the sketch after this list)
6. Context vector
7. Deep output layer and softmax
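A numpy sketch of one decoder step covering steps 4-7 above, assuming the Bahdanau-style additive attention this model family uses; the weight names, the plain tanh recurrence, and all dimensions are illustrative assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(h, s_prev, y_prev_emb, params):
    """One decoder step: alignment -> context -> new state -> output distribution.

    h          : (T_x, 2d) annotation vectors from the bidirectional encoder (step 2)
    s_prev     : (d,)      previous decoder hidden state (step 4)
    y_prev_emb : (e,)      embedding of the previously generated target word (step 3)
    """
    W_a, U_a, v_a = params["W_a"], params["U_a"], params["v_a"]
    # 5. Alignment model: score each source position against the decoder state.
    energies = np.tanh(h @ U_a.T + s_prev @ W_a.T) @ v_a        # (T_x,)
    alpha = softmax(energies)                                    # attention weights
    # 6. Context vector: attention-weighted sum of the annotation vectors.
    c = alpha @ h                                                # (2d,)
    # 4. New decoder hidden state (a plain tanh RNN stands in for the GRU here).
    s = np.tanh(params["W_s"] @ np.concatenate([s_prev, y_prev_emb, c]))
    # 7. Deep output layer + softmax over the target vocabulary.
    logits = params["W_o"] @ np.concatenate([s, y_prev_emb, c])
    return softmax(logits), s, alpha

# Toy shapes: T_x=6 source positions, 2d=8, d=4, e=3, vocab=10.
rng = np.random.default_rng(0)
params = {"W_a": rng.normal(size=(8, 4)), "U_a": rng.normal(size=(8, 8)),
          "v_a": rng.normal(size=8),
          "W_s": rng.normal(size=(4, 4 + 3 + 8)),
          "W_o": rng.normal(size=(10, 4 + 3 + 8))}
p, s, alpha = decoder_step(rng.normal(size=(6, 8)), np.zeros(4), np.zeros(3), params)
```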

Optimize:
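The objective itself did not survive in the transcript; a sketch of the standard maximum log-likelihood objective that this family of attention-based NMT models optimizes:

```latex
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n}
    \log p\!\left(y_t^{(n)} \mid y_{<t}^{(n)},\, x^{(n)};\, \theta\right)
```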

Page 6: On using monolingual corpora in neural machine translation

Integrating a Language Model into the Decoder

● Two methods for integrating an LM: “shallow fusion” and “deep fusion”
● Both the Language Model (LM) and the Translation Model (TM) are pre-trained
● The LM is a recurrent neural network LM (RNNLM) [Mikolov et al. 2011]
  ○ Very similar to the TM, but without steps (5) and (6) from the previous slide

Shallow Fusion
● The NMT model computes a probability p for each candidate next word
● The new score is the sum of the word score and the score of the hypothesis up to t-1
● The top K hypotheses are selected as candidates
● Then, the candidates are rescored using a weighted sum of the TM and LM scores
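A sketch of the weighted-sum rescoring described above, in the standard log-linear form; β is the LM weight tuned on the dev set (referred to as Eq. 5 later in the slides):

```latex
\log p(y_t = k \mid y_{<t}, x) \;=\;
    \log p_{\mathrm{TM}}(y_t = k \mid y_{<t}, x) \;+\;
    \beta \, \log p_{\mathrm{LM}}(y_t = k \mid y_{<t})
```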

Deep Fusion
● Concatenate the hidden states of the LM and TM before the deep output layer
● The model is fine-tuned
  ○ Only the parameters involved in the fusion are updated

Page 7: On using monolingual corpora in neural machine translation

Integrating a Language Model into the Decoder

Deep Fusion - Balancing the LM and the TM

● In some cases the LM is more informative than in others. Examples:
  ○ Articles: because Chinese doesn't have them, in Zh-En the LM is more informative than the TM
  ○ Nouns: the LM is less informative in this case

● A controller mechanism is added
  ○ The hidden state of the LM, s_t^LM, is multiplied by a gate g_t = σ(v_g · s_t^LM + b_g) (sketched below)
  ○ v_g and b_g are learnt parameters

● Intuitively, this decides the importance of the LM for each word.
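A numpy sketch of the gated deep-fusion output described above; the weight names and the exact inputs to the deep output layer are illustrative assumptions, not the paper's precise parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deep_fusion_output(s_tm, s_lm, params):
    """Gate the LM state, concatenate it with the TM state, and feed the deep output layer.

    s_tm : (d_tm,) hidden state of the translation model's decoder
    s_lm : (d_lm,) hidden state of the pre-trained RNNLM
    """
    # Controller: a scalar gate deciding how much the LM contributes for this word.
    g = sigmoid(params["v_g"] @ s_lm + params["b_g"])
    # Concatenate the TM state with the gated LM state before the deep output layer.
    fused = np.concatenate([s_tm, g * s_lm])
    return softmax(params["W_o"] @ fused), g

# Toy shapes: d_tm=4, d_lm=6, vocab=10 (purely illustrative).
rng = np.random.default_rng(1)
params = {"v_g": rng.normal(size=6), "b_g": 0.0,
          "W_o": rng.normal(size=(10, 4 + 6))}
probs, gate = deep_fusion_output(rng.normal(size=4), rng.normal(size=6), params)
```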

Page 8: On using monolingual corpora in neural machine translation
Page 9: On using monolingual corpora in neural machine translation

Datasets

1. Chinese-English (Zh-En)
   a. From the NIST OpenMT'15 challenge
      i. SMS/Chat
      ii. Conversational Telephone Speech (CTS)
      iii. Newsgroups/weblogs from the DARPA BOLT Project
   b. Chinese side processed at character level
   c. Restricted to CTS (presenter's note: why?)
2. Turkish-English (Tr-En)
   a. WIT and SETimes parallel corpora (TEDx talks)
   b. Turkish tokenized into subword units (Zemberek)
3. German-English (De-En)
4. Czech-English (Cs-En)
   a. WMT'15; ill-formed sentences dropped
5. Monolingual corpora: English Gigaword (LDC)

Page 10: On using monolingual corpora in neural machine translation

Datasets

Page 11: On using monolingual corpora in neural machine translation

Settings: NMT

● Vocabulary sizes for Zh-En and Tr-En: Zh (10k), Tr (30k), En (40k)
● Vocabulary sizes for Cs-En and De-En: 200k, using a sampled softmax
  ○ [Jean et al. 2014]
● Size of recurrent units: Zh-En (1200), Tr-En (1000); not stated for the others
● Adadelta with minibatches of 80
● Gradients clipped when their L2 norm exceeds 5 (sketched after this list)
● Non-recurrent layers use dropout [Hinton et al. 2012]
  ○ and Gaussian weight noise (mean 0, std 0.001) to prevent overfitting [Graves, 2011]
● Early stopping on development-set BLEU
● Weight matrices initialized as random orthonormal matrices
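A plain-numpy sketch of two of the regularization tricks listed above, gradient clipping at L2 norm 5 and Gaussian weight noise with std 0.001; this illustrates the idea rather than the authors' training code:

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale the full gradient if its global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

def add_weight_noise(weights, std=0.001, rng=None):
    """Add zero-mean Gaussian noise to each weight matrix (as in Graves, 2011)."""
    if rng is None:
        rng = np.random.default_rng()
    return [w + rng.normal(0.0, std, size=w.shape) for w in weights]
```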

Page 12: On using monolingual corpora in neural machine translation

Settings: LM

● For each English vocabulary (3 variations), an LM was constructed with
  ○ an LSTM of 2,400 units (Zh-En and Tr-En)
  ○ an LSTM of 2,000 units (Cs-En and De-En)
● Optimized with
  ○ RMSProp (Zh-En and Tr-En) [Tieleman and Hinton, 2012]
  ○ Adam (Cs-En and De-En) [Kingma and Ba, 2014]
● Sentences with more than 10% UNK tokens were discarded (a filtering sketch follows this list)
● Early stopping on perplexity
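A small sketch of the monolingual-data filter described above (discarding sentences with more than 10% out-of-vocabulary tokens); the vocabulary and tokenization here are placeholders:

```python
def keep_sentence(tokens, vocab, max_unk_ratio=0.10):
    """Keep a sentence only if at most 10% of its tokens fall outside the vocabulary."""
    if not tokens:
        return False
    unk = sum(1 for t in tokens if t not in vocab)
    return unk / len(tokens) <= max_unk_ratio

vocab = {"the", "cat", "sat", "on", "mat", "."}
print(keep_sentence("the cat sat on the mat .".split(), vocab))   # True
print(keep_sentence("the zyx qwv sat".split(), vocab))            # False (50% UNK)
```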

Page 13: On using monolingual corpora in neural machine translation

Settings: Fusion

● Shallow Fusion
  ○ β (Eq. 5) selected to maximize translation performance on the dev set
  ○ Searched in the range (0.001, 0.1)
  ○ The LM softmax is renormalized without the EOS and OOV symbols
    ■ Possibly because of domain differences between the LM and the TM
● Deep Fusion
  ○ Fine-tuned only the parameters of the deep output layer and the controller
    ■ RMSProp: dropout probability 0.56; weight-noise std 0.005; regularization reduced after 10K updates
    ■ Adadelta: update steps scaled down by 0.01
● Handling rare words: for De-En and Cs-En, UNK tokens are copied from the source using the attention mechanism (improved BLEU by about +1.0)
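A sketch of the attention-based UNK replacement from the last bullet: each UNK in the output is replaced by the source token with the highest attention weight at that step. The toy attention matrix below stands in for the trained model's alignments:

```python
import numpy as np

def replace_unk(target_tokens, source_tokens, attention, unk="UNK"):
    """Replace each UNK output token with the source token it attends to most.

    attention : (len(target_tokens), len(source_tokens)) matrix of attention weights.
    """
    out = []
    for t, word in enumerate(target_tokens):
        if word == unk:
            out.append(source_tokens[int(np.argmax(attention[t]))])
        else:
            out.append(word)
    return out

# Toy example: the first target token is UNK and attends mostly to "Praha".
src = ["Praha", "ist", "schön"]
tgt = ["UNK", "is", "beautiful"]
attn = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
print(replace_unk(tgt, src, attn))   # ['Praha', 'is', 'beautiful']
```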

Page 14: On using monolingual corpora in neural machine translation

Results
Zh-En: OpenMT'15

● Phrase-Based (PB) SMT [Koehn et al. 2003]
  ○ Rescoring with an external neural LM (+CSLM)
    ■ [Schwenk]
● Hierarchical Phrase-Based SMT (HPB) [Chiang, 2005]
  ○ +CSLM
● NMT, NMT+Shallow, NMT+Deep
● Except for CTS, +Deep helps
● NMT outperformed phrase-based SMT

Page 15: On using monolingual corpora in neural machine translation

Results
Tr-En: IWSLT'14

● Using deep fusion
  ○ +1.19 BLEU
  ○ Outperformed the best previously reported result [Yilmaz et al. 2013]

Page 16: On using monolingual corpora in neural machine translation

Results
Cs-En and De-En: WMT'15

● Shallow fusion: +0.09 and +0.29 BLEU
● Deep fusion: +0.39 and +0.47 BLEU

Page 17: On using monolingual corpora in neural machine translation

Analysis

● The improvement depends heavily on domain similarity
● For Zh, the domains are very different (conversational vs. news)
  ○ This is supported by the high perplexity of the LM on that data
● Perplexity is lower for Tr, which led to a larger improvement for both shallow and deep fusion
  ○ Perplexity is even lower for De- and Cs-, and the improvement is larger still
● For deep fusion, the weight of the LM is regulated through the controller
  ○ For more similar domains the controller will be more active
  ○ For De- and Cs- the controller was more active
    ■ This correlates with the BLEU gains
    ■ Deep fusion can adapt better to domain mismatch

Page 18: On using monolingual corpora in neural machine translation

Conclusion and Future Work

● Two methods were presented and empirically evaluated
● For Chinese and Turkish, the deep fusion approach achieved better results than existing SMT systems
● Improvements were also observed for high-resource pairs
● The improvement depends heavily on the domain match between the LM and the TM
  ○ Where the domains matched, there was improvement for both the shallow and the deep fusion approaches
● This suggests that domain adaptation for the LM may improve translations further