On Using Monolingual Corpora in Neural Machine Translation
Presentation by: Ander Martinez Sanchez (D1), Matsumoto Lab
Apr 15, 2017

Transcript
Page 1: On using monolingual corpora in neural machine translation

On Using Monolingual Corpora in Neural Machine Translation

Presentation by: Ander Martinez Sanchez (D1), Matsumoto Lab

Page 2: On using monolingual corpora in neural machine translation

Abstract

● Recent NMT has shown promising results
  ○ Largely thanks to good parallel corpora
● This work investigates how to leverage abundant monolingual corpora
● Up to +1.96 BLEU on a low-resource pair (Turkish-English)
  ○ +1.59 BLEU on a focused-domain pair (Chinese-English SMS/chat messages)
● High-resource pairs also benefit
  ○ +0.39 BLEU on Czech-English
  ○ +0.47 BLEU on German-English

Page 3: On using monolingual corpora in neural machine translation

Introduction

● Goal: improve NMT by using monolingual data
● By: integrating a Language Model (LM) for the target language (English)
● For:
  a. A resource-limited pair: Turkish-English
  b. Domain-restricted translation: Chinese-English SMS chats
  c. High-resource pairs: German-English and Czech-English
● Article structure:
  a. Recent work
  b. Basic model architecture
  c. Shallow and deep fusion approaches
  d. Datasets
  e. Experiments and results

Page 4: On using monolingual corpora in neural machine translation

Background: Neural Machine Translation

SMT
● Theory: maximize p(y|x). By Bayes' rule, the target language model p(y) appears as a factor
● Reality: systems tend to model
  ○ fj(x, y) ← a feature, e.g. pair-wise statistics
  ○ C ← a normalization constant, often ignored
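Neither equation survived in the transcript, so here is a sketch of the standard noisy-channel and log-linear formulations these bullets allude to; the feature weights λ_j are my notation, not named on the slide:

```latex
% Bayes' rule: the target-side language model p(y) enters as a separate factor
p(y \mid x) \;\propto\; p(x \mid y)\, p(y)

% In practice, a log-linear model over features f_j(x, y) with weights \lambda_j
% and a normalization constant C (often ignored during decoding):
p(y \mid x) \;=\; \frac{1}{C}\, \exp\!\Big( \sum_j \lambda_j f_j(x, y) \Big)
```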

NMT
● A single network optimizes log p(y|x), including feature extraction and C
● Typically an encoder-decoder framework
● Once the conditional distribution is learnt,
  ○ find a translation using, for instance, a beam search algorithm
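As a concrete illustration of the last point, a minimal beam-search sketch over an arbitrary conditional distribution; `next_log_probs` is a hypothetical stand-in for the NMT decoder step, not the paper's implementation:

```python
import math
from typing import Callable, Dict, List, Tuple

def beam_search(next_log_probs: Callable[[List[str]], Dict[str, float]],
                beam_size: int = 5,
                max_len: int = 50,
                eos: str = "</s>") -> Tuple[List[str], float]:
    """Return the highest-scoring hypothesis under the model's log-probabilities."""
    # Each hypothesis is (tokens, cumulative log-probability).
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [word], score + lp))
        # Keep only the K best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy usage with a fixed next-word distribution (purely illustrative):
toy = {(): {"the": -0.5, "a": -1.0},
       ("the",): {"cat": -0.3, "</s>": -2.0},
       ("a",): {"cat": -0.7, "</s>": -1.5},
       ("the", "cat"): {"</s>": -0.1},
       ("a", "cat"): {"</s>": -0.2}}
print(beam_search(lambda toks: toy.get(tuple(toks), {"</s>": 0.0}), beam_size=2))
```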

Page 5: On using monolingual corpora in neural machine translation

Model Description

Figure from [Ling et al. 2015]

1. Word embeddings
2. Annotation vectors (encoding)
3. Target (y) word embeddings
4. Decoder hidden state
5. Alignment model (see the sketch after this list)
6. Context vector
7. Deep output layer and softmax
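A numpy sketch of one decoder step covering steps 4-7 above, assuming the Bahdanau-style additive attention this model family uses; the weight names, the plain tanh recurrence, and all dimensions are illustrative assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(h, s_prev, y_prev_emb, params):
    """One decoder step: alignment -> context -> new state -> output distribution.

    h          : (T_x, 2d) annotation vectors from the bidirectional encoder (step 2)
    s_prev     : (d,)      previous decoder hidden state (step 4)
    y_prev_emb : (e,)      embedding of the previously generated target word (step 3)
    """
    W_a, U_a, v_a = params["W_a"], params["U_a"], params["v_a"]
    # 5. Alignment model: score each source position against the decoder state.
    energies = np.tanh(h @ U_a.T + s_prev @ W_a.T) @ v_a        # (T_x,)
    alpha = softmax(energies)                                    # attention weights
    # 6. Context vector: attention-weighted sum of the annotation vectors.
    c = alpha @ h                                                # (2d,)
    # 4. New decoder hidden state (a plain tanh RNN stands in for the GRU here).
    s = np.tanh(params["W_s"] @ np.concatenate([s_prev, y_prev_emb, c]))
    # 7. Deep output layer + softmax over the target vocabulary.
    logits = params["W_o"] @ np.concatenate([s, y_prev_emb, c])
    return softmax(logits), s, alpha

# Toy shapes: T_x=6 source positions, 2d=8, d=4, e=3, vocab=10.
rng = np.random.default_rng(0)
params = {"W_a": rng.normal(size=(8, 4)), "U_a": rng.normal(size=(8, 8)),
          "v_a": rng.normal(size=8),
          "W_s": rng.normal(size=(4, 4 + 3 + 8)),
          "W_o": rng.normal(size=(10, 4 + 3 + 8))}
p, s, alpha = decoder_step(rng.normal(size=(6, 8)), np.zeros(4), np.zeros(3), params)
```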

Optimize:
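The objective itself did not survive in the transcript; a sketch of the standard maximum log-likelihood objective that this family of attention-based NMT models optimizes:

```latex
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n}
    \log p\!\left(y_t^{(n)} \mid y_{<t}^{(n)},\, x^{(n)};\, \theta\right)
```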

Page 6: On using monolingual corpora in neural machine translation

Integrating a Language Model into the Decoder

● Two methods for integrating an LM: “shallow fusion” and “deep fusion”
● Both the Language Model (LM) and the Translation Model (TM) are pre-trained
● The LM is a recurrent neural network LM (RNNLM) [Mikolov et al. 2011]
  ○ Very similar to the TM, but without steps (5) and (6) from the previous slide

Shallow Fusion
● The NMT model computes a probability p for each candidate next word
● The new score is the sum of the word score and the score of the hypothesis up to t-1
● The top K hypotheses are selected as candidates
● Then, the candidates are rescored using a weighted sum of the TM and LM scores
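A sketch of the weighted-sum rescoring described above, in the standard log-linear form; β is the LM weight tuned on the dev set (referred to as Eq. 5 later in the slides):

```latex
\log p(y_t = k \mid y_{<t}, x) \;=\;
    \log p_{\mathrm{TM}}(y_t = k \mid y_{<t}, x) \;+\;
    \beta \, \log p_{\mathrm{LM}}(y_t = k \mid y_{<t})
```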

Deep Fusion
● Concatenate the hidden states of the LM and TM before the deep output layer
● The model is fine-tuned
  ○ Only the parameters involved in the fusion are updated

Page 7: On using monolingual corpora in neural machine translation

Integrating a Language Model into the Decoder

Deep Fusion - Balancing the LM and the TM

● In some cases the LM is more informative than in others. Examples:
  ○ Articles: because Chinese doesn't have them, in Zh-En the LM is more informative than the TM
  ○ Nouns: the LM is less informative in this case

● A controller mechanism is added
  ○ The hidden state of the LM, s_t^LM, is multiplied by a gate g_t = σ(v_g · s_t^LM + b_g) (sketched below)
  ○ v_g and b_g are learnt parameters

● Intuitively, this decides the importance of the LM for each word.
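A numpy sketch of the gated deep-fusion output described above; the weight names and the exact inputs to the deep output layer are illustrative assumptions, not the paper's precise parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deep_fusion_output(s_tm, s_lm, params):
    """Gate the LM state, concatenate it with the TM state, and feed the deep output layer.

    s_tm : (d_tm,) hidden state of the translation model's decoder
    s_lm : (d_lm,) hidden state of the pre-trained RNNLM
    """
    # Controller: a scalar gate deciding how much the LM contributes for this word.
    g = sigmoid(params["v_g"] @ s_lm + params["b_g"])
    # Concatenate the TM state with the gated LM state before the deep output layer.
    fused = np.concatenate([s_tm, g * s_lm])
    return softmax(params["W_o"] @ fused), g

# Toy shapes: d_tm=4, d_lm=6, vocab=10 (purely illustrative).
rng = np.random.default_rng(1)
params = {"v_g": rng.normal(size=6), "b_g": 0.0,
          "W_o": rng.normal(size=(10, 4 + 6))}
probs, gate = deep_fusion_output(rng.normal(size=4), rng.normal(size=6), params)
```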

Page 8: On using monolingual corpora in neural machine translation
Page 9: On using monolingual corpora in neural machine translation

Datasets

1. Chinese-English (Zh-En)
   a. From the NIST OpenMT'15 challenge
      i. SMS/Chat
      ii. Conversational Telephone Speech (CTS)
      iii. Newsgroups/weblogs from the DARPA BOLT Project
   b. Chinese side processed at character level
   c. Restricted to CTS (presenter's note: why?)
2. Turkish-English (Tr-En)
   a. WIT and SETimes parallel corpora (TEDx talks)
   b. Turkish tokenized into subword units (Zemberek)
3. German-English (De-En)
4. Czech-English (Cs-En)
   a. WMT'15; ill-formed sentences dropped
5. Monolingual corpora: English Gigaword (LDC)

Page 10: On using monolingual corpora in neural machine translation

Datasets

Page 11: On using monolingual corpora in neural machine translation

Settings: NMT

● Vocabulary sizes for Zh-En and Tr-En: Zh (10k), Tr (30k), En (40k)
● Vocabulary sizes for Cs-En and De-En: 200k, using a sampled softmax
  ○ [Jean et al. 2014]
● Size of recurrent units: Zh-En (1200), Tr-En (1000); not stated for the others
● Adadelta with minibatches of 80
● Gradients clipped when their L2 norm exceeds 5 (sketched after this list)
● Non-recurrent layers use dropout [Hinton et al. 2012]
  ○ and Gaussian weight noise (mean 0, std 0.001) to prevent overfitting [Graves, 2011]
● Early stopping on development-set BLEU
● Weight matrices initialized as random orthonormal matrices
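A plain-numpy sketch of two of the regularization tricks listed above, gradient clipping at L2 norm 5 and Gaussian weight noise with std 0.001; this illustrates the idea rather than the authors' training code:

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale the full gradient if its global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

def add_weight_noise(weights, std=0.001, rng=None):
    """Add zero-mean Gaussian noise to each weight matrix (as in Graves, 2011)."""
    if rng is None:
        rng = np.random.default_rng()
    return [w + rng.normal(0.0, std, size=w.shape) for w in weights]
```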

Page 12: On using monolingual corpora in neural machine translation

Settings: LM

● For each English vocabulary (3 variations), an LM was constructed with
  ○ an LSTM of 2,400 units (Zh-En and Tr-En)
  ○ an LSTM of 2,000 units (Cs-En and De-En)
● Optimized with
  ○ RMSProp (Zh-En and Tr-En) [Tieleman and Hinton, 2012]
  ○ Adam (Cs-En and De-En) [Kingma and Ba, 2014]
● Sentences with more than 10% UNK tokens were discarded (a filtering sketch follows this list)
● Early stopping on perplexity
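A small sketch of the monolingual-data filter described above (discarding sentences with more than 10% out-of-vocabulary tokens); the vocabulary and tokenization here are placeholders:

```python
def keep_sentence(tokens, vocab, max_unk_ratio=0.10):
    """Keep a sentence only if at most 10% of its tokens fall outside the vocabulary."""
    if not tokens:
        return False
    unk = sum(1 for t in tokens if t not in vocab)
    return unk / len(tokens) <= max_unk_ratio

vocab = {"the", "cat", "sat", "on", "mat", "."}
print(keep_sentence("the cat sat on the mat .".split(), vocab))   # True
print(keep_sentence("the zyx qwv sat".split(), vocab))            # False (50% UNK)
```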

Page 13: On using monolingual corpora in neural machine translation

Settings: Fusion

● Shallow Fusion
  ○ β (Eq. 5) selected to maximize translation performance on the dev set
  ○ Searched in the range (0.001, 0.1)
  ○ The LM softmax is renormalized without the EOS and OOV symbols
    ■ Possibly because of domain differences between the LM and the TM
● Deep Fusion
  ○ Fine-tuned only the parameters of the deep output layer and the controller
    ■ RMSProp: dropout probability 0.56; weight-noise std 0.005; regularization reduced after 10K updates
    ■ Adadelta: update steps scaled down by 0.01
● Handling rare words: for De-En and Cs-En, UNK tokens are copied from the source using the attention mechanism (improved BLEU by about +1.0)
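A sketch of the attention-based UNK replacement from the last bullet: each UNK in the output is replaced by the source token with the highest attention weight at that step. The toy attention matrix below stands in for the trained model's alignments:

```python
import numpy as np

def replace_unk(target_tokens, source_tokens, attention, unk="UNK"):
    """Replace each UNK output token with the source token it attends to most.

    attention : (len(target_tokens), len(source_tokens)) matrix of attention weights.
    """
    out = []
    for t, word in enumerate(target_tokens):
        if word == unk:
            out.append(source_tokens[int(np.argmax(attention[t]))])
        else:
            out.append(word)
    return out

# Toy example: the first target token is UNK and attends mostly to "Praha".
src = ["Praha", "ist", "schön"]
tgt = ["UNK", "is", "beautiful"]
attn = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
print(replace_unk(tgt, src, attn))   # ['Praha', 'is', 'beautiful']
```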

Page 14: On using monolingual corpora in neural machine translation

Results
Zh-En: OpenMT'15

● Phrase-Based (PB) SMT [Koehn et al. 2003]
  ○ Rescoring with an external neural LM (+CSLM)
    ■ [Schwenk]
● Hierarchical Phrase-Based SMT (HPB) [Chiang, 2005]
  ○ +CSLM
● NMT, NMT+Shallow, NMT+Deep
● Except for CTS, +Deep helps
● NMT outperformed phrase-based SMT

Page 15: On using monolingual corpora in neural machine translation

Results
Tr-En: IWSLT'14

● Using deep fusion
  ○ +1.19 BLEU
  ○ Outperformed the best previously reported result [Yilmaz et al. 2013]

Page 16: On using monolingual corpora in neural machine translation

Results
Cs-En and De-En: WMT'15

● Shallow fusion: +0.09 and +0.29 BLEU
● Deep fusion: +0.39 and +0.47 BLEU

Page 17: On using monolingual corpora in neural machine translation

Analysis

● The improvement depends heavily on domain similarity
● For Zh, the domains are very different (conversational vs. news)
  ○ This is supported by the high perplexity of the LM on that data
● Perplexity is lower for Tr, which led to a larger improvement for both shallow and deep fusion
  ○ Perplexity is even lower for De- and Cs-, and the improvement is larger still
● For deep fusion, the weight of the LM is regulated through the controller
  ○ For more similar domains the controller will be more active
  ○ For De- and Cs- the controller was more active
    ■ This correlates with the BLEU gains
    ■ Deep fusion can adapt better to domain mismatch

Page 18: On using monolingual corpora in neural machine translation

Conclusion and Future Work

● Two methods were presented and empirically evaluated
● For Chinese and Turkish, the deep fusion approach achieved better results than existing SMT systems
● Improvements were also observed for high-resource pairs
● The improvement depends heavily on the domain match between the LM and the TM
  ○ Where the domains matched, there was improvement for both the shallow and the deep fusion approaches
● This suggests that domain adaptation for the LM may improve translations further