SYSTRAN's Pure Neural Machine Translation Systems

Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart
Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou

[email protected]
Abstract
Since the first online demonstration of Neural Machine Translation (NMT) by LISA (Bahdanau et al., 2014), NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing the roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations and the training process of such systems is usually very long, often a few weeks, so the role of experimentation is critical and important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss the evaluation methodology, present our first findings, and finally outline further work.
Our ultimate goal is to share our expertise to build competitive production systems for generic translation. We aim at contributing to set up a collaborative framework to speed up adoption of the technology, foster further research efforts, and enable the delivery to and adoption by industry of use-case-specific engines integrated in real production workflows. Mastering the technology would allow us to build translation engines suited for particular needs, outperforming current simplest/uniform systems.
1 Introduction
Neural MT has recently achieved state-of-the-art performance in several large-scale translation tasks. As a result, the deep learning approach to MT has received exponential attention, not only from the MT research community but also from a growing number of private entities that have begun to include NMT engines in their production systems.
In the last decade, several open-source MT toolkits have emerged. Moses (Koehn et al., 2007) is probably the best-known out-of-the-box MT system, coexisting with commercial alternatives, lowering the entry barriers and bringing new opportunities in both research and business areas. Following this direction, our NMT system is based on the open-source project seq2seq-attn1 initiated by the Harvard NLP group2 with the main contributor Yoon Kim. We are contributing to the project by sharing several features described in this technical report, which are available to the MT community.
Neural MT systems have the ability to directly model, in an end-to-end fashion, the association from an input text (in a source language) to its translation counterpart (in a target language). A major strength of Neural MT lies in the fact that all the necessary knowledge, such as syntactic and semantic information, is learned by taking the global sentence context into consideration when modeling translation. However, Neural MT engines are known to be computationally very expensive, sometimes needing several weeks to accomplish the training phase, even making use of cutting-edge hardware to accelerate computations. Since our interest is in a large variety of languages, and since, based on our long experience with machine translation, we do not believe that a one-fits-all approach would work optimally for languages as different as Korean, Arabic, Spanish or Russian, we did run hundreds of experiments, and particularly explored language-specific behaviors. One of our goals would indeed be to be able to inject existing language knowledge in the training process.

1 https://github.com/harvardnlp/seq2seq-attn
2 http://nlp.seas.harvard.edu
In this work we share our recipes and experience to build our first generation of production-ready systems for generic translation, setting a starting point to build specialized systems. We also report on extending the baseline NMT engine with several features that in some cases increase performance accuracy and/or efficiency while others boost the learning curve and/or model speed. As a machine translation company, in addition to decoding accuracy for the generic domain, we also pay special attention to features such as:
- Training time
- Customization possibility: user terminology, domain adaptation
- Preserving and leveraging internal format tags and miscellaneous placeholders
- Practical integration in business applications: for instance online translation box, but also translation batch utilities, post-editing environment...
- Multiple deployment environments: cloud-based, customer-hosted environment or embedded for mobile applications
- etc.
More important than unique and uniform translation options, or reaching state-of-the-art research systems, our focus is to reveal language-specific settings and practical tricks to deliver this technology to the largest number of users.
The remainder of this report is organized as follows: Section 2 covers basic details of the NMT system employed in this work. A description of the translation resources is given in section 3. We report on the different experiments for trying to improve the system by guiding the training process in section 4, and in section 5 we discuss performance. In sections 6 and 7, we report on the evaluation of the models and on practical findings. And we finish by describing work in progress for the next release.
2 System Description
We base our NMT system on the encoder-decoder framework made available by the open-source project seq2seq-attn. With its roots in a number of established open-source projects such as Andrej Karpathy's char-rnn,3 Wojciech Zaremba's standard long short-term memory (LSTM)4 and the rnn library from Element-Research,5 the framework provides a solid NMT basis consisting of LSTM as the recurrent module and faithful reimplementations of the global-general-attention model and input-feeding at each time step of the RNN decoder as described by Luong et al. (2015).
It also comes with a variety of features such as the ability to train with bidirectional encoders and pre-trained word embeddings, the ability to handle unknown words during decoding by substituting them either by copying the source word with the most attention or by looking up the source word in an external dictionary, and the ability to switch between CPU and GPU for both training and decoding. The project is actively maintained by the Harvard NLP group6.
Over the course of the development of our own NMT system, we have implemented additional features as described in Section 4, and contributed back to the open-source community by making many of them available in the seq2seq-attn repository.
seq2seq-attn is implemented on top of the popular scientific computing library Torch.7 Torch uses Lua, a powerful and lightweight scripting language, as its front-end and uses the C language where efficient implementations are needed. The combination results in a fast and efficient system both at development and at run time. As an extension, to fully benefit from multi-threading, optimize CPU and GPU interactions, and to have finer control on the objects at runtime (sparse matrix, quantized tensor, ...), we developed a C-based decoder using the C APIs of Torch, called C-torch, explained in detail in section 5.4.
The number of parameters within an NMT model can grow to hundreds of millions, but there are also a handful of meta-parameters that need to be manually determined.
3 https://github.com/karpathy/char-rnn
4 https://github.com/wojzaremba/lstm
5 https://github.com/Element-Research/rnn
6 http://nlp.seas.harvard.edu
7 http://torch.ch
Model         Embedding dimension: 400-1000
              Hidden layer dimension: 300-1000
              Number of layers: 2-4
              Uni-/bi-directional encoder

Training      Optimization method
              Learning rate
              Decay rate
              Epoch to start decay
              Number of epochs
              Dropout: 0.2-0.3

Text unit     Vocabulary selection
(Section 4.1) Word vs. subword (e.g. BPE)

Train data    Size (quantity vs. quality)
(Section 3)   Max sentence length
              Selection and mixture of domains

Table 1: There are a large number of meta-parameters to be considered during training. The optimal set of configurations differs from language pair to language pair.
For some of the meta-parameters, many previous works present clear choices regarding their effectiveness, such as using the attention mechanism or feeding the previous prediction as input to the current time step in the decoder. However, there are still many more meta-parameters that have different optimal values across datasets, language pairs, and the configurations of the rest of the meta-parameters. In table 1, we list the meta-parameter space that we explored during the training of our NMT systems.
In appendix B, we detail the parameters used for the online systems of this first release.
3 Training Resources
Training generic engines is a challenge, because there is no such notion as a generic translation, which is what online translation service users are expecting from these services. Indeed, online translation covers a very large variety of use cases, genres and domains. Also, available open-source corpora are domain specific: Europarl (Koehn, 2005), JRC (Steinberger et al., 2006) or MultiUN (Chen and Eisele, 2012) are legal texts, TED talks are scientific presentations, OpenSubtitles (Tiedemann, 2012) is colloquial, etc. As a result, the training corpora we used for this release were built by doing a weighted mix of all of the available sources. For languages with large resources, we did reduce the ratio of the institutional (Europarl, UN-type) and colloquial types, giving preference to news-type texts and mixes of webpages (like Gigaword).
Our strategy, in order to enable more experiments, was to define 3 sizes of corpora for each language pair: a baseline corpus (1 million sentences) for quick experiments (day-scale), a medium corpus (2-5M) for real-scale systems (week-scale) and a very large corpus with more than 10M segments.
The amount of data used to train the online systems is reported in table 2, while most of the individual experimental results reported in this report are obtained with baseline corpora.
Note that the size of the corpus needs to be considered together with the number of training periods, since the neural network is continuously fed by sequences of sentence batches until the network is considered trained. In Junczys-Dowmunt et al. (2016), the authors mention using a corpus of 5M sentences and training on 1.2M batches, each having 40 sentences, meaning basically that each sentence of the full corpus is presented 10 times to the training. In Wu et al. (2016), the authors mention 2M steps of 128 examples for English-French, for a corpus of 36M sentences, meaning about 7 iterations on the complete corpus. In our framework, for this release, we systematically extended the training up to 18 epochs and for some languages up to 22 epochs.
Selection of the optimal system is made after the complete training by calculating scores on independent test sets. As an outcome, we have seen different behaviours for different language pairs with similar training corpus sizes, apparently connected to the language pair complexity. For instance, English-Korean training perplexity still decreases significantly between epochs 13 and 19, while Italian-English perplexity decreases marginally after epoch 10. For most languages, in our set-up, optimal systems are achieved around epoch 15.
We also did some experiments on the corpus size. Intuitively, since NMT systems do not have the memorizing capacity of PBMT engines, the fact that the training uses 10 times a 10M-sentence corpus, or 20 times a 5M corpus, should not make a huge difference.
Pair  | Training                                      | Testing                                       | OOV
      | #Sents  #Tokens (src/tgt)   Vocab (src/tgt)   | #Sents  #Tokens (src/tgt)   Vocab (src/tgt)   | src/tgt
enbr  | 2.7M    74.0M / 76.6M       150k / 213k       | 2k      51k / 53k           6.7k / 8.1k       | 47 / 64
enit  | 3.6M    98.3M / 100M        225k / 312k       | 2k      52k / 53k           7.3k / 8.8k       | 66 / 85
enar  | 5.0M    126M / 155M         295k / 357k       | 2k      51k / 62k           7.5k / 8.7k       | 43 / 47
enes  | 3.5M    98.8M / 105M        375k / 487k       | 2k      53k / 56k           8.5k / 9.8k       | 110 / 119
ende  | 2.6M    72.0M / 69.1M       150k / 279k       | 2k      53k / 51k           7.0k / 9.6k       | 30 / 77
ennl  | 2.1M    57.3M / 57.4M       145k / 325k       | 2k      52k / 53k           6.7k / 7.9k       | 50 / 141
enko  | 3.5M    57.5M / 46.4M       98.9k / 58.4k     | 2k      30k / 26k           7.1k / 11k        | 0 / -
enfr  | 9.3M    220M / 250M         558k / 633k       | 2k      48k / 55k           8.2k / 8.6k       | 77 / 63
frbr  | 1.6M    53.1M / 47.9M       112k / 135k       | 2k      62k / 56k           7.4k / 8.1k       | 55 / 59
frit  | 3.1M    108M / 96.5M        202k / 249k       | 2k      69k / 61k           8.2k / 8.8k       | 47 / 57
frar  | 5.0M    152M / 152M         290k / 320k       | 2k      60k / 60k           8.5k / 8.6k       | 42 / 61
fres  | 2.8M    99.0M / 91.7M       170k / 212k       | 2k      69k / 64k           8.0k / 8.6k       | 37 / 55
frde  | 2.4M    73.4M / 62.3M       172k / 253k       | 2k      57k / 48k           7.5k / 9.0k       | 59 / 104
frzh  | 3.0M    98.5M / 76.3M       199k / 168k       | 2k      67k / 51k           8.0k / 5.9k       | 51 / -
jako  | 1.4M    14.0M / 13.9M       61.9k / 55.6k     | 2k      19k / 19k           9.3k / 8.5k       | 0 / 0
nlfr  | 3.0M    74.8M / 84.7M       446k / 260k       | 2k      49k / 55k           7.9k / 7.5k       | 150 / -
faen  | 795k    21.7M / 20.2M       166k / 147k       | 2k      54k / 51k           7.7k / 8.7k       | 197 / -
jaen  | 1.3M    28.0M / 22.0M       24k / 87k         | 2k      41k / 32k           6.2k / 7.3k       | 3 / -
zhen  | 5.8M    145M / 154M         246k / 225k       | 2k      48k / 51k           5.5k / 6.9k       | 34 / -

Table 2: Corpora statistics for each language pair (ISO 639-1 2-letter codes, except for Brazilian Portuguese noted as br). All language pairs are bidirectional except nlfr, frzh, jaen, faen, enko, zhen. Columns 2-6 indicate the number of sentences, running words and vocabularies of the training datasets, while columns 7-11 indicate the number of sentences, running words and vocabularies of the test datasets. Columns 12 and 13 indicate respectively the OOV vocabulary of the source and target test sets (M stands for millions, k for thousands). Since jako and enko are trained using BPE tokenization (see section 4.1), there is no OOV.
In one of the experiments, we compared training on a 5M corpus over 20 epochs for English to/from French with training on the same 5M corpus for only 10 epochs, followed by 10 additional epochs on an additional 5M corpus, the full 10M corpus being completely homogeneous. In both directions, we observe that the 5Mx10 + 5Mx10 training completes with a score improvement of 0.8-1.2 compared to the 5Mx20 training, showing that the additional corpus manages to bring a meaningful improvement. This observation leads to a more general question about how much corpus is needed to actually build a high-quality NMT engine (learn the language), the role and timing of diversity in the training, and whether the incremental gain could not be substituted by terminology feeding (learn the lexicon).
4 Technology
In this section we account for several experiments that improved different aspects of our translation engines. Experiments range from preprocessing techniques to extending the network with the ability to handle named entities, to using multiple word features and to enforcing the attention module to be more like word alignments. We also report on different levels of translation customization.
4.1 Tokenization
All corpora are preprocessed with an in-house toolkit. We use standard token separators (spaces, tabs, etc.) as well as a set of language-dependent linguistic rules. Several kinds of entities are recognized (URLs and numbers), replacing their content by the appropriate placeholder. A postprocess is used to detokenize translation hypotheses, where the original raw text format is regenerated following equivalent techniques.
For each language, we have access to language-specific tokenization and normalization rules. However, our preliminary experiments showed that there was no obvious gain from using these language-specific tokenization patterns, and that some of the hardcoded rules were actually degrading the performance. This would need more investigation, but for the release of our first batch of systems, we used a generic tokenization model for most of the languages except Arabic, Chinese and German. In our past experience with Arabic, separating the segmentation of clitics was beneficial, and we retained the same procedure. For German and Chinese, we used in-house compound splitting and word segmentation models, respectively.
In our current NMT approach, vocabulary size is an important factor that determines the efficiency and the quality of the translation system; a larger vocabulary size correlates directly with greater computational cost during decoding, whereas low vocabulary coverage leads to severe out-of-vocabulary (OOV) problems, hence lowering translation quality.
In most language pairs, our strategy combines a vocabulary shortlist and a placeholder mechanism, as described in Sections 4.2 and 4.3. This approach, in general, is a practical and linguistically robust option for addressing the fixed-vocabulary issue, since we can take full advantage of internal manually crafted dictionaries and customized user dictionaries (UDs).
A number of previous works such as character-level (Chung et al., 2016), hybrid word-character-based (Luong and Manning, 2016) and subword-level (Sennrich et al., 2016b) approaches address issues that arise with morphologically rich languages such as German, Korean and Chinese. These approaches either build accurate open-vocabulary word representations on the source side or improve the translation model's generative capacity on the target side. Among those approaches, subword tokenization yields competitive results, achieving excellent vocabulary coverage and good efficiency at the same time.
For two language pairs, enko and jaen, we used source and target sub-word tokenization (BPE, see (Sennrich et al., 2016b)) to reduce the vocabulary size but also to deal with the rich morphology and spacing flexibility that can be observed in Korean. Although this approach is very seducing by its simplicity, and is also used systematically in (Wu et al., 2016) and (Junczys-Dowmunt et al., 2016), it does have side effects (for instance the generation of impossible words) and is not optimal for dealing with actual word morphology, since the same suffix (josa in Korean), depending on the frequency of the word ending it is integrated with, will be split into multiple representations. Also, in Korean, these josa agree with the previous syllable based on its final ending: however, such simple information is not explicitly or implicitly reachable by the neural network.
The sub-word encoding algorithm Byte Pair Encoding (BPE) described by Sennrich et al. (2016b) was re-implemented in C++ for further speed optimization.
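To make the BPE procedure concrete, the following minimal Python sketch (not SYSTRAN's C++ reimplementation; the toy word frequencies and number of merges are placeholders) shows the merge-learning loop of Sennrich et al. (2016b): the most frequent adjacent symbol pair is repeatedly merged into a new subword unit.

import collections

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict (illustrative sketch)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = {tuple(word) + ('</w>',): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = collections.Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the selected merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy usage: frequencies would normally come from the tokenized training corpus.
print(learn_bpe({'lower': 5, 'newest': 6, 'widest': 3}, num_merges=10))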
4.2 Word Features
Sennrich and Haddow (2016) showed that using additional input features improves translation quality. Similarly to this work, we introduced in the framework the support for an arbitrary number of discrete word features as additional inputs to the encoder. Because we do not constrain the number of values these features can take at the same time, we represent them with continuous and normalized vectors. For example, the representation of a feature f at time-step t is:
x_i^{(t)} = \begin{cases} \frac{1}{n_f} & \text{if } f \text{ takes the } i\text{-th value} \\ 0 & \text{otherwise} \end{cases}    (1)

where n_f is the number of possible values the feature f can take, and with x^{(t)} \in \mathbb{R}^{n_f}.
These representations are then concatenated with the word embedding to form the new input to the encoder.
We extended this work by also supporting additional features on the target side, which are predicted by the decoder. We used the same input representation as on the encoder side but shifted the feature sequence compared to the word sequence, so that the prediction of the features at time-step t depends on the word at time-step t that they annotate. Practically, we generate the features at time t+1 for the word generated at time t.
To learn these target features, we added a linear layer to the decoder followed by the softmax function and used the mean square error criterion to learn the correct representation.
For this release, we only used case information as an additional feature. It allows us to work with a lowercased vocabulary and to treat recasing as a separate problem. We observed that the use of this simple case feature in source and target does improve the translation quality, as illustrated in Figure 1. Also, we compared the accuracy of the induced recasing with other recasing frameworks (SRI disamb, and an in-house recasing tool based on n-gram language models) and observed that the prediction of case by the NN was more accurate than using an external recaser, which was expected since the NN has access to the source sentence context in addition to the target sentence history.
Figure 1: Comparison of training progress (perplexity/BLEU) with/without source (src) and target (tgt) case features, with/without feature embedding (embed) on the WMT2013 test corpus for English-French. The score is calculated on lowercased output. The perplexity increases when the target features are introduced because of the additional classification problem. We also notice a noticeable increase in the score when introducing the features, in particular the target features. So these features do not simply help to reduce the vocabulary, but also by themselves help to structure the NN decoding model.
4.3 Named Entities (NE)
SYSTRAN's RBMT and SMT translation engines utilize a number and named entity (NE) module to recognize, protect, and translate such entities. Similarly, we used the same internal NE module to recognize numbers and named entities in the source sentence and temporarily replaced them with their corresponding placeholders (Table 3).
Both the source and the target side of the training dataset need to be processed for NE placeholders. To ensure correct entity recognition, we cross-validate the recognized entities across the parallel dataset, that is: a valid entity recognition in a sentence should have the same type of entity in its parallel pair, and the words or phrases covered by the entities need to be aligned to each other. We used fast_align (Dyer et al., 2013) to automatically align source words to target words.
In our datasets, generally about one fifth of the training instances contained one or more NE placeholders. Our training dataset consists of sentences with NE placeholders as well as sentences without them, to be able to handle both instantiated and recognized entity types.
NE type             Placeholder
Number              ent_numeric
Measurement         ent_numex_measurement
Money               ent_numex_money
Person (any)        ent_person
Title               ent_person_title
First name          ent_person_firstname
Initial             ent_person_initials
Last name           ent_person_lastname
Middle name         ent_person_middlename
Location            ent_location
Organization        ent_organization
Product             ent_product
Suffix              ent_suffix
Time expressions    ent_timex_expression
Date                ent_date
Date (day)          ent_date_day
Date (month)        ent_date_month
Date (year)         ent_date_year
Hour                ent_hour

Table 3: Named entity placeholder types
Per source sentence, a list of all entities, along with their translations in the target language, if available, is returned by our internal NE recognition module. The entities in the source sentence are then replaced with their corresponding NE placeholders. During beam search, we make sure that an entity placeholder is translated by itself in the target sentence. When the entire target sentence is produced along with the attention weights that provide soft alignments back to the original source tokens, placeholders in the target sentence are replaced with either the original source string or its translation.
The substitution of the NE placeholders with their correct values needs language-pair-specific considerations. In Table 4, we show that even the handling of Arabic numerals cannot be as straightforward as copying the original value from the source text.
4.4 Guided Alignments
We re-implemented the guided alignment strategy described in Chen et al. (2016). Guided alignment enforces the attention weights to be more like alignments in the traditional sense (e.g. IBM-4 Viterbi alignments), where the word alignments explicitly indicate that source words aligned to a target word are translation equivalents.
Similarly to the previous work, we created an additional criterion on the attention weights, Lga, such that the difference between the attention weights and the reference alignments is treated as an error used to directly and additionally optimize the output of the
                         En                      Ko
Train
  train data             25 billion              250억
  entity-replaced        ent_numeric billion     ent_numeric억
Decode
  input data             1.4 billion             -
  entity-replaced        ent_numeric billion     -
  translated             -                       ent_numeric억
  naive substitution     -                       1.4억
  expected result        -                       14억

Table 4: Examples of English and Korean number expressions where naive recognition and substitution fails. Even if the model produces the correct placeholders, simply copying the original value will result in an incorrect translation. These kinds of structural entities need language-pair-specific treatment.
attention module:

L_{ga}(A, \alpha) = \frac{1}{T} \sum_{t} \sum_{s} (A_{st} - \alpha_{st})^2

The final loss function for the optimization is then:

L_{total} = w_{ga} \cdot L_{ga}(A, \alpha) + (1 - w_{ga}) \cdot L_{dec}(y, x)

where A is the alignment matrix, \alpha the attention weights, s and t the indices in the source and target sentences, and w_{ga} the linear combination weight for the guided alignment loss.
Chen et al. (2016) also report that decaying wga, thereby gradually reducing the influence of guided alignment over the course of training, was found to be helpful on certain datasets. When guided alignment decay is enabled, wga is gradually reduced at this rate after each epoch from the beginning of the training.
Without searching for the optimal parameter values, we simply took the following configurations from the literature: mean square error (MSE) as the loss function, 0.5 as the linear combination weight between the guided alignment loss and the cross-entropy loss of the decoder, and 0.9 as the decay factor for the guided alignment loss.
For the alignment, we again utilized the fast_align tool. We stored alignments in a sparse format8 to save memory, and for each minibatch a dense matrix is created for faster computation.
8 Compressed Column Storage (CCS)
Figure 2: Effect of guided alignment. This particular learning curve is from training an attention-based model with 4 layers of bidirectional LSTMs with 800 dimensions on a 5-million-sentence French to English dataset.
Applying such a constraint on the attention weights can help locate the original source word more accurately, which we hope benefits NE placeholder substitution and especially unknown word handling.
In Figure 2, we see the effects of guided alignment and its decay rate on our English-to-French generic model. Unfortunately, this full comparative run was disrupted by a power outage and we did not have time to relaunch it; however, we can still clearly observe that up to the initial 5 epochs, guided alignment, with or without decay, provides rather big boosts over the baseline training. After 5 epochs, the training with decay slows down compared to the training without, which is rather intuitive: the guided alignment is indeed in conflict with the attention learning. What would remain to be seen is whether, by the end of the training, the baseline and the guided alignment with decay are converging.
4.5 Politeness Mode
Many languages have ways to express politeness and deference towards the people being referred to in sentences. In Indo-European languages, there are two pronouns corresponding to the English You; this is called the T-V distinction between the informal Latin pronoun tu (T) and the polite Latin pronoun Vos (V). Asian languages, such as Japanese and Korean, make extensive use of honorifics (respectful words), words that are usually appended to the ends of names or pronouns to indicate the relative ages and social positions of the speakers. Expressing politeness can also impact the vocabulary of verbs, adjectives, and nouns used, as well as sentence structures.
Following the work of Sennrich et al. (2016a), we implemented a politeness feature in our NMT engine: a special token is added to each source sentence during training, where the token indicates the politeness mode observed in the target sentence. Having such an ability to specify the politeness mode is very useful especially when translating from a language where politeness is not expressed, e.g. English, into one where such expressions are abundant, e.g. Korean, because it provides a way of customizing the politeness mode of the translation output.
Table 5 presents examples from our English-to-Korean NMT model trained with politeness mode, and it is clear that the proper verb endings are generated according to the user selection. After a preliminary evaluation on a small test set from English to Korean, we observed 70 to 80% accuracy of the politeness generation (Table 6). We also noticed that 86% of sentences (43 out of 50) have exactly the same meaning preserved across different politeness modes.
This simple approach, however, comes at a small price: sometimes the unknown-word replacement scheme tries to copy the special token into the target generation. A more appropriate approach, which we plan to switch to in our future trainings, is to directly feed the politeness mode into the sentential representation of the decoder.
4.6 Customization
Domain adaptation is a key feature for our customers; it generally encompasses terminology, domain and style adaptation, but can also be seen as an extension of translation memory for human post-editing workflows.
SYSTRAN engines integrate multiple techniques for domain adaptation: training full new in-domain engines, automatically post-editing an existing translation model using translation memories, and extracting and re-using terminology. With Neural Machine Translation, a new notion of specialization comes close to the concept of incremental translation as developed for statistical machine translation, e.g. (Ortiz-Martínez et al., 2010).
4.6.1 Generic Specialization
Domain adaptation techniques have successfully been used in Statistical Machine Translation. It is well known that a system optimized on a specific text genre obtains higher accuracy results than a generic system. The adaptation process can be done before, during or after the training process. Our preliminary experiments follow the latter approach. We incrementally adapt a Neural MT generic system to a specific domain by running additional training epochs over newly available in-domain data.
Adaptation proceeds incrementally when new in-domain data becomes available, generated by human translators while post-editing, which is similar to the Computer Aided Translation framework described in (Cettolo et al., 2014).
We experiment on an English-to-French translation task. The generic model is trained on a subsample of the corpora made available for the WMT15 translation task (Bojar et al., 2015). The source and target NMT vocabularies are the 60k most frequent words of the source and target training datasets. The in-domain data is extracted from the European Medicines Agency (EMEA) corpus. Table 7 shows some statistics of the corpora used in this experiment.
Our preliminary results show that incremental adaptation is effective even for limited amounts of in-domain data (nearly 50k additional words). Constrained to use the original generic vocabulary, adaptation of the models can be run in a few seconds, showing clear quality improvements on in-domain test sets.
Figure 3: Adaptation with In-Domain data.
Figure 3 compares the accuracy (BLEU) of two systems: full is trained after concatenation of the generic and in-domain data; adapt is initially trained over generic data (showing a BLEU score of 29.01 at epoch 0) and adapted after running several training epochs over only the in-domain
En: A senior U.S. Treasury official is urging China to move faster on making its currency more flexible.
Ko with Formal mode: [...]
Ko with Informal mode: [...]

Table 5: A translation example from an En-Ko system where the choice of different politeness modes affects the output.
Mode       Correct   Incorrect   Accuracy
Formal     30        14          68.2%
Informal   35        9           79.5%

Table 6: Accuracy of generating the correct politeness mode with an English-to-Korean NMT system. The evaluation was carried out on a set of only 50 sentences; 6 sentences were excluded from the evaluation because neither the originals nor their translations contained any verbs.
Type    Corpus    # lines   # src tok (EN)   # tgt tok (FR)
Train   Generic   1M        24M              26M
Train   EMEA      4,393     48k              56k
Test    EMEA      2,787     23k              27k

Table 7: Data used to train and adapt the generic model to a specific domain. The test corpus also belongs to the specific domain.
training data. Both systems share the same generic NMT vocabularies. As can be seen, the adapt system improves its accuracy drastically after a single additional training epoch, obtaining a BLEU score similar to the full system (separated by 0.91 BLEU). Note also that each additional epoch using the in-domain training data takes less than 50 seconds to be processed, while training the full system needs more than 17 hours.
Results validate the utility of the adaptation approach. A human post-editor would take advantage of using new training data as soon as it becomes available, without needing to wait for a long full training process. However, the comparison is not entirely fair, since a full training would allow including the in-domain vocabulary in the new full model, which would surely result in an additional accuracy improvement.
4.6.2 Post-editing Engine
The recent success of Pure Neural Machine Translation has led to the application of this technology to various related tasks and in particular to Automatic Post-Editing (APE). The goal of this task
Figure 4: Accuracy results of RBMT, NMT, NPE and NPE multi-source.
is to simulate the behavior of a human post-editor, correcting translation errors made by an MT system.
Until recently, most APE approaches have been based on phrase-based SMT systems, either monolingual (MT output to human post-edition) (Simard et al., 2007) or source-aware (Bechara et al., 2011). For many years now, SYSTRAN has been offering a hybrid Statistical Post-Editing (SPE) solution to enhance the translation provided by its rule-based MT system (RBMT) (Dugast et al., 2007).
Following the success of Neural Post-Editing (NPE) in the APE task of WMT16 (Junczys-Dowmunt and Grundkiewicz, 2016), we have run a series of experiments applying the neural approach in order to improve the RBMT system output. As a first experiment, we compared the performance of our English-to-Korean SPE system trained on technical (IT) domain data to two NPE systems trained on the same data: monolingual NPE and multi-source NPE, where the input-language and MT-hypothesis sequences have been concatenated together into one input sequence (separated by a special token).
Figure 4 illustrates the accuracy (BLEU)
results of four different systems at different training epochs. The RBMT system performs poorly, confirming the importance of post-editing. Both NPE systems clearly outperform SPE. It can also be observed that adding source information, even in the simplest way possible (NPE multi-source), without any source-target alignment, considerably improves NPE translation results.
The system performing NPE multi-source obtains accuracy results similar to pure NMT. What can be seen is that NPE multi-source essentially employs the information from the original sentence to produce translations. However, notice that the benefit of utilizing multiple inputs from different sources is clear at earlier epochs, while once the model parameters converge, the difference in performance between the NMT and NPE multi-source models becomes negligible.
Further experiments are currently being conducted, aiming at finding more sophisticated ways of combining the original source and the MT translation in the context of NPE.
5 Performance
As previously outlined, one of the major drawbacks of NMT engines is the need for cutting-edge hardware to face the enormous computational requirements at training and runtime.
Regarding training, there are two major issues: the full training time and the required computational power, i.e. the server investment. For this release, most of our trainings have been running on a single GeForce GTX 1080 GPU (about $2.5k), while in (Wu et al., 2016) the authors mention using 96 K80 GPUs for a full week to train one single language pair (about $250k). On our hardware, a full training on 2x5M sentences (see section 3) took a bit less than one month.
A reasonable target is to keep the training time for any language pair under one week with a reasonable investment, so that the full research community can run competitive trainings, but also, indirectly, so that all of our customers can benefit from the training technology. To do so, we need to better leverage multiple GPUs on a single server, which is ongoing engineering work. We also need to continue exploring how to learn more with less data. For instance, we are convinced that injecting terminology as part of the training data should be competitive with continuing to add full sentences.
Model                      BLEU
baseline                   49.24
40% pruned                 48.56
50% pruned                 47.96
60% pruned                 45.90
70% pruned                 38.38
60% pruned and retrained   49.26

Table 8: BLEU scores of pruned models on an internal test set.
The training time can also be shortened by better control of the training cycle. We have shown that multiple features boost the training pace, and that going to bigger networks clearly improves performance. For instance, we are using a bidirectional 4-layer RNN in addition to our regular RNN, while in Wu et al. (2016) the authors mention using a bidirectional RNN only for the first layer. We need to understand these phenomena better and restrict the network to the minimum needed, in order to reduce the model size.
Finally, the work on specialization described in sections 4.6.1 and 5.2 is promising for long-term maintenance: we could reach a point where we do not need to retrain from scratch but continuously improve an existing model, and use teacher models to boost initial trainings.
Regarding runtime performance, we have been exploring the following areas and are reaching today throughputs compatible with production requirements, not only using GPUs but also using CPUs, and we report our different strategies in the following sub-sections.
5.1 Pruning
Pruning the parameters of a neural network is a common technique to reduce its memory footprint. This approach has been proven efficient for NMT tasks in See et al. (2016). Inspired by this work, we introduced similar pruning techniques in seq2seq-attn. We reproduced the finding that model parameters can be pruned by up to 60% without any performance loss after retraining, as shown in Table 8.
With a large pruning factor, the neural network weights can also be represented with sparse matrices. This implementation can lead to lower computation time but, more importantly, to a smaller memory footprint that allows us to target more environments. Figures 5 and 6 show experiments
Figure 5: Processing time to perform a 1000 x 1000 matrix multiplication on a single thread.

Figure 6: Memory needed to perform a 1000 x 1000 matrix multiplication.
involving sparse matrices using Eigen9. For example, when using float precision, a multiplication with a sparse matrix already begins to take less memory when 35% of its parameters are pruned.
Related to this work, we present in Section 5.4 our alternative Eigen-based decoder that allows us to support sparse matrices.
5.2 Distillation
Despite being surprisingly accurate, NMT systems need deep networks in order to perform well. Typically, a 4-layer LSTM with 1000 hidden units per layer (4 x 1000) is used to obtain state-of-the-art results. Such models require cutting-edge hardware for training in reasonable time, while inference also becomes challenging on standard setups or on small devices such as mobile phones. Hence, compressing deep models into smaller
9 http://eigen.tuxfamily.org
networks has been an active area of research.
Following the work in (Kim and Rush, 2016), we experimented with sequence-level knowledge distillation in the context of an English-to-French NMT task. Knowledge distillation relies on training a smaller student network to perform better by learning the relevant parts of a larger teacher network, hence not wasting parameters on trying to model the entire space of translations. Sequence-level is the knowledge distillation variant where the student model mimics the teacher's actions at the sequence level.
The experiment is summarized in 3 steps:
- train a teacher model on a source/reference training set,
- use the teacher model to produce translations for the source training set,
- train a student model on the new source/translation training set.
For our initial experiments, we produced 35-best translations for each sentence of the source training set, and used a normalized n-gram matching score, computed at the sentence level, to select the closest translation to each reference sentence. The original training source sentences and their translated hypotheses were used as training data to learn a 2 x 300 LSTM network.
Results showed slightly higher accuracy for a 70% reduction in the number of parameters and a 30% increase in decoding speed. In a second experiment, we learned a student model with the same structure as the teacher model. Surprisingly, the student clearly outperformed the teacher model by nearly 1.5 BLEU.
We hypothesize that the translation performed over the target side of the training set produces a sort of normalization of the language, which is by construction very heterogeneous. Such normalization eases the translation task, which can then be learned by not-so-deep networks with similar accuracy levels.
5.3 Batch Translation
To increase the translation speed for large texts, we support batch translation that works in addition to the beam search. It means that for a beam of size K and a batch of size B, we forward K x B sequences into the model. Then, the decoder output is split across each batch and the beam path for each sentence is updated sequentially.
Figure 7: Token throughput when decoding from a student model (see section 5.2) with a beam of size 2 and using float precision. The experiment was run using a standard Torch + OpenBLAS install and 4 threads on a desktop Intel i7 CPU.
As sentences within a batch can have large variations in size, extra care is needed to mask accordingly the output of the encoder and the attention softmax over the source sequences.
Figure 7 shows the speedup obtained using batch decoding in a typical setup.
5.4 C++ Decoder
While Torch is a powerful and easy-to-use framework, we chose to develop an alternative C++ implementation for decoding on CPU. It increases our control over the decoding process and opens the path to further memory and speed improvements, while making deployment easier.
Our implementation is graph-based and uses Eigen for efficient matrix computation. It can load and infer from Torch models.
For this release, experiments show that the decoding speed is on par with or faster than the Torch-based implementation, especially in a multi-threaded context. Figure 8 shows the better use of parallelization of the Eigen-based implementation.
6 Evaluation
Evaluation of machine translation has always been a challenge and the subject of many papers and dedicated workshops (Bojar et al., 2016). While automatic metrics are now used as a standard in the research world and have shown good correlation with human evaluation, ad-hoc human evaluation or productivity analysis metrics are rather used in the industry (Blain et al., 2011).
Figure 8: Token throughput with a batch of size 1 in the same conditions as Figure 7.
As a translation solution company, even if automatic metrics are used throughout the training process (and we give scores in section 6.1), we care about human evaluation of the results. Wu et al. (2016) mention human evaluation but simultaneously cast a doubt on the humans referenced to translate or evaluate. In this context, the claim "almost indistinguishable from human translation" is at the same time strong but also very vague. On our side, we have observed during all our experiments and the preparation of specialized models an unprecedented level of quality, and contexts where we could claim super-human translation quality.
However, we need to define very carefully the tasks, the humans being compared to, and the nature of the evaluation. For evaluating technical translation, the nature of the evaluation is somewhat easy and really depends on the user expectation: is the meaning properly conveyed, or is the sentence faster to post-edit than to translate? Also, to avoid doubts about the integrity or competency of the evaluation, we sub-contracted the task to CrossLang, a company specialized in machine translation evaluation. The test protocol was defined collaboratively and, for this first release, we decided to perform a ranking of the different systems; we present in section 6.2 the results obtained on two very different language pairs: English to/from French, and English to Korean.
Finally, in section 6.3, we also present some qualitative evaluation results showing specificities of Neural Machine Translation.
6.1 Automatic Scoring and System Comparison
Figure 9 plots automatic accuracy results (BLEU) and perplexities for all language pairs. The high correlation between perplexity and BLEU scores is remarkable, showing that language pairs with lower perplexity yield higher BLEU scores. Note also that different amounts of training data were used for each system (see Table 2). BLEU scores were calculated over an internal test set.
From the beginning of this report we have used internal validation and test sets, which makes it difficult to compare the performance of our systems to other research engines. However, we must keep in mind that our goal is to account for improvements in our production systems. We focus on human evaluations rather than on any automatic evaluation score.
6.2 Human Ranking Evaluation
To evaluate translation outputs and compare them with human translation, we have defined the following protocol.
1. For each language pair, 100 in-domain sentences (*) are collected.
2. These sentences are sent to human translation (**) and translated with the candidate model and with available online translation services (***).
3. Without notification of the mix of human and machine translation, a team of 3 professional translators or linguists fluent in both source and target languages is then asked to rank 3 random outputs for each sentence based on their preference as translations. Preference includes accuracy as a priority, but also fluency of the generated translation. They have the choice to give the 3 outputs different ranks, or can also decide to give 2 or all 3 of them the same rank if they cannot decide.
(*) For the Generic domain, sentences from recent news articles were selected online; for Technical (IT), sentences from the translation memory defined in section 4.6 were kept apart from the training.
(**) For human translation, we used a translation agency (human-pro) and an online collaborative translation platform (human-casual).
(***) We used Naver Translator10 (Naver), Google Translate11 (Google) and Bing Translator12 (Bing).

10 http://translate.naver.com
For this first release, we experimented with the evaluation protocol on 2 extremely different categories of language pairs. On one hand, English-French, which is probably the most studied language pair for MT and for which resources are very large: (Wu et al., 2016) mention about 36M sentence pairs used in their NMT training, and the equivalent PBMT is completed by a web-scale target-side language model13. Also, as English is a lowly inflected language, the current phrase-based technology with English as target language is more competitive, due to the relative simplicity of the generation and the weight of gigantic language models.
On the other hand, English-Korean is one of the toughest language pairs due to the large distance between the English and Korean languages, the small availability of training corpora, and the rich agglutinative morphology of Korean. For a real comparison, we ran the evaluation against the Naver translation service from English into Korean, Naver being the main South Korean search engine.
Tables 9 and 10 describe the different evaluations and their results.
Several interesting outcomes:
- Our vanilla English→French model outperforms existing online engines and our best-of-breed technology.
- For French→English, while the model slightly outperforms (human-casual) and our best-of-breed technology, it stays behind Google Translate, and more significantly behind Microsoft Translator.
- The generic English→Korean model shows closer results to human translation and clearly outperforms existing online engines.
- The in-domain specialized model surprisingly outperforms the reference human translation.
We are aware that far more evaluations are necessary and we will be launching a larger evaluation plan for our next release.

11 http://translate.google.com
12 http://translator.bing
13 In 2007, Google already mentioned using 2 trillion words in their language models for machine translation (Brants et al., 2007).
Figure 9: Perplexity and BLEU scores obtained by NMT engines for all language pairs. Perplexity and BLEU scores were calculated respectively on validation and test sets. Note that the English-to-Korean and Farsi-to-English systems are not shown in this plot, achieving respectively (18.94, 16.50) and (3.84, 34.19) for perplexity and BLEU.
Language Pair      Domain (*)       Human Translation (**)      Online Translation (***)
English→French     Generic          human-pro, human-casual     Bing, Google
French→English     Generic          human-pro, human-casual     Bing, Google
English→Korean     News             human-pro                   Naver, Google, Bing
English→Korean     Technical (IT)   human-pro                   N/A

Table 9: Performed evaluations.
Informally, we do observe that the biggest performance jumps are observed on complicated language pairs like English-Korean or Japanese-English, showing that NMT engines are better able to handle major structural differences, but also on languages with lower resources like Farsi-English, demonstrating that NMT is able to learn better with less data, and we will explore this even more.
Finally, we are also launching, in parallel, a real-life beta-testing program with our customers so that we can also obtain formal feedback from their use cases related to specialized models.
6.3 Qualitative Evaluation
In Table 11, we report the results of an error analysis of NMT, SMT and RBMT for the English-French language pair. This evaluation confirms the translation ranking performed previously, but also exhibits some interesting facts:
- The most salient error category comes from missing words or parts of sentences. It is interesting to see, though, that half of these omissions were considered okay by the reviewers and most of the time not counted as errors; it indeed shows the ability of the system not only to translate but to summarize and get to the point, as we would expect from human translation. Of course, we need to fix the cases where the omissions are not okay.
- Another finding is that the engine manages quotes badly, and we will make sure to specifically teach that in our next release. Other low-hanging fruit are the case generation, which sometimes seems to get off-track, and the handling of Named Entities, which we have already introduced in the system but not connected for this release.
- On the positive side, we observe that NMT drastically improves fluency, slightly reduces meaning selection errors, and handles morphology better although it does not yet have any specific access to morphology (like sub-word embeddings). Regarding meaning selection errors, we will focus on teaching more expressions to the system, which is still a major structural weakness compared to PBMT engines.
7 Practical Issues
Translation results from an NMT system are, at first glance, so incredibly fluent that you are diverted from their downsides. Over the course of the training and during our internal evaluation, we found multiple practical issues worth sharing:
1. translating very long sentences
2. translating user input such as a short word or the title of a news article
3. cleaning the corpus
4. alignment
NMT is greatly impacted by the training data, from which it learns how to generate accurate and fluent translations holistically. Because the maximal length of a training instance was limited to a certain length during the training of our models, NMT models are puzzled by sentences that exceed this length, not having encountered such training data. Hard splitting of longer sentences has some side effects, since the model considers both parts as full sentences. As a consequence, whatever the limit we set for sentence length, we also need to teach the neural network how to handle longer sentences. For that, we are exploring several options, including using a separate model based on source/target alignments to find the optimal breaking point, and introducing special tokens.
Likewise, very short phrases and incomplete or fragmentary sentences were not included in our training data, and consequently NMT systems fail to correctly translate such input texts (e.g. Figure 10). Here also, we simply need to teach the model to handle such input by injecting additional synthetic data.
Also, our training corpus includes a number of resources that are known to contain much noisy data. While NMT seems more robust than other technologies in handling noise, we can still perceive the effect of noise in translations, in particular for recurring noise. An example is English-to-Korean, where we see the model trying to systematically convert currency amounts in addition to translating. As demonstrated in Section 5.2, preparing the right kind of input for NMT seems to result in more efficient and accurate systems, and such a procedure should also be applied more aggressively directly to the training data.
Finally, let us note that source/target alignment is a must for our users, but this information is missing from NMT output due to soft alignment. To hide this issue from the end users, multiple alignment heuristics are used to present a traditional target-source alignment.
8 Further Work
In this section we outline further experiments currently being conducted. First, we extend NMT decoding with the ability to make use of multiple models: both external models, particularly an n-gram language model, and multiple networks decoded together (ensemble). We also work on using external word embeddings, and on modelling unknown words within the network.
8.1 Extending Decoding
8.1.1 Additional LM
As proposed by (Gulcehre et al., 2015), we conducted experiments to integrate an n-gram language model estimated over a large dataset into our Neural MT system. We followed a shallow fusion integration, similar to how language models are used in a standard phrase-based MT decoder.
In the context of beam search decoding in NMT, at each time step t, candidate words x are hypothesized and assigned a score according to the neural network, p_{NMT}(x). Sorted according to their respective scores, the K-best candidates are then reranked using the score assigned by the language model, p_{LM}(x). The resulting probability of each candidate is obtained by a weighted sum of the log-probabilities:

\log p(x) = \log p_{LM}(x) + \beta \log p_{NMT}(x)

where \beta is a hyper-parameter that needs to be tuned.
This technique is especially useful for handling out-of-vocabulary (OOV) words. Deep networks are technically constrained to work with limited vocabulary sets (in our experiments we use target vocabularies of 60k words), hence suffering from important OOV problems. In contrast, n-gram language models can be learned for very large vocabularies.
Initial results show the suitability of the shallow integration technique to select the appropriate OOV candidate out of a dictionary list (external resource). The probability obtained from the language model is the unique modeling alternative for those word candidates for which the neural network produces no score.
8.1.2 Ensemble Decoding
Ensemble decoding has been verified as a practical technique to further improve performance compared to a single encoder-decoder model (Sennrich et al., 2016b; Wu et al., 2016; Zhou et al., 2016). The improvement comes from the diversity of predictions from different neural network models, which are obtained through random initialization seeds and shuffling of examples during training, or through different optimization methods towards the development set (Cho et al., 2015). As a consequence, 3-8 isolated models are trained and ensembled together, considering the cost of memory and training speed. Also, (Junczys-Dowmunt et al., 2016) provides some methods to accelerate the training by choosing different checkpoints as the final models.
We implement ensemble decoding by averaging the output probabilities for each estimation of target word x with the formula:

p_ens(x) = (1/M) * sum_{m=1}^{M} p_m(x)

where p_m(x) is the probability assigned to x by model m, and M is the number of neural models.
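The following is a minimal sketch of this averaging step, assuming each model exposes its per-step output distribution as a probability vector over the shared target vocabulary; names are illustrative only.

import numpy as np

def ensemble_probs(model_probs):
    """Average the per-word output distributions of M models at one decoding
    step: p_ens(x) = (1/M) * sum_m p_m(x).

    model_probs: array of shape (M, vocab_size); each row sums to 1.
    """
    return np.mean(model_probs, axis=0)

# Two toy models over a 3-word vocabulary.
p = np.array([[0.6, 0.3, 0.1],
              [0.4, 0.4, 0.2]])
print(ensemble_probs(p))   # [0.5, 0.35, 0.15]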
8.2 Extending word embeddings
Although NMT technology has recently accomplished a major breakthrough in the Machine Translation field, it still remains constrained by the limited vocabulary size and by the use of bilingual training data. In order to reduce the negative impact of both phenomena, experiments are currently being conducted on using external word embedding weights.
Those external word embeddings are not learned by the NMT network from bilingual data only, but by an external model (e.g. word2vec (Mikolov et al., 2013)). They can therefore be estimated from larger monolingual corpora, incorporating data from different domains.
Another advantage lies in the fact that, since external word embedding weights are not modified during NMT training, it is possible to use a different vocabulary for this fixed part of the input during the application or re-training of the model (provided that the weights for the words in the new vocabulary come from the same embedding space as the original ones). This may allow a more efficient adaptation to data coming from a different domain with a different vocabulary.
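As an illustration, here is a minimal PyTorch sketch of such a frozen external embedding layer; our production system is Torch/Lua-based, so this is only an analogy, and the tensor shapes and variable names are assumptions.

import torch
import torch.nn as nn

# Hypothetical pretrained word2vec-style vectors, one row per vocabulary word;
# in practice these would be loaded from an externally trained model.
pretrained = torch.randn(60000, 500)

# Frozen lookup: the weights are not updated during NMT training, so the
# vocabulary (and its vectors) can later be swapped for another one taken
# from the same embedding space.
fixed_embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

token_ids = torch.tensor([[12, 431, 7]])        # a toy source sentence
source_vectors = fixed_embedding(token_ids)     # shape: (1, 3, 500)
print(source_vectors.shape)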
Human-pro / Human-casual / Bing / Google / Naver / Systran V8
English→French: -64.2, -18.5, +48.7, +10.4, +17.3
French→English: -56.8, +5.5, -23.1, -8.4, +5
English→Korean: -15.4, +35.5, +33.7, +31.4, +13.2
English→Korean (IT): +30.3
Table 10: This table shows the relative preference for SYSTRAN NMT compared to other outputs, calculated as follows: for each triplet where outputs A and B were compared, we note pref_{A>B} the number of times where A was strictly preferred to B, and E_A the total number of triplets including output A. For each output E, the number in the table is compar(SNMT, E) = (pref_{SNMT>E} - pref_{E>SNMT}) / E_SNMT. compar(SNMT, E) is a percentage value in the range [-1; 1].
Figure 10: Effect of translating a single word with a model not trained for such input.
Category | NMT | RB | SMT | Example
Entity - Major | 7 | 5 | 0 | Galaxy Note 7 → note 7 de Galaxy vs. Galaxy Note 7
Format | 3 | 1 | 1 | (number localization): $2.66 → 2.66 $ vs. 2,66 $
Morphology - Minor, Local | 3 | 2 | 3 | (tense choice): accused → accusait vs. a accusé
Morphology - Minor, Sentence Level | 3 | 3 | 5 | the president [...], she emphasized → la président [...], elle a souligné vs. la [...], elle
Morphology - Major | 3 | 4 | 6 | he scanned → il scanne vs. il scannait
Meaning Selection - Minor | 9 | 17 | 7 | game → jeu vs. match
Meaning Selection - Major, Prep Choice | 4 | 9 | 10 | [... facing war crimes charges] over [its bombardment of ...] → contre vs. pour
Meaning Selection - Major, Expression | 3 | 7 | 1 | [two] counts of murder → chefs de meurtre vs. chefs d'accusation de meurtre
Meaning Selection - Major, Not Translated | 5 | 1 | 4 | he scanned → il scanned vs. il a scanné
Meaning Selection - Major, Contextual Meaning | 14 | 39 | 14 | 33 senior Republicans → 33 républicains supérieurs vs. 33 ténors républicains
Word Ordering and Fluency - Minor | 2 | 28 | 15 | (determiner): [without] a [specific destination in mind] → sans une destination [...] vs. sans destination [...]
Word Ordering and Fluency - Major | 3 | 16 | 15 | (word ordering): in the Sept. 26 deaths → dans les morts septembre de 26 vs. dans les morts du 26 septembre
Missing or Duplicated - Missing Minor | 7 | 3 | 1 | a week after the hurricane struck → une semaine après l'ouragan vs. une semaine après que l'ouragan ait frappé
Missing or Duplicated - Missing Major | 6 | 1 | 3 | As a working oncologist, Giordano knew [...] → Giordano savait vs. En tant qu'oncologue en fonction, Giordano savait
Missing or Duplicated - Duplicated Major | 2 | 2 | 1 | for the Republican presidential nominee → au candidat républicain républicain vs. au candidat républicain
Misc. (Minor) - Quotes, Punctuation | 2 | 0 | 0 | (misplaced quotes)
Misc. (Minor) - Case | 6 | 0 | 2 | [...] will be affected by Brexit → [...] Sera touchée Par le brexit vs. [...] sera touchée par le Brexit
Total - Major | 47 | 84 | 54 |
Total - Minor | 36 | 55 | 35 |
Total - Minor & Major | 83 | 139 | 89 |
Table 11: Human error analysis done for 50 sentences of the corpus defined in Section 6.2 for English-French on NMT, SMT (Google) and RBMT outputs. Error categories are: issue with entity handling (Entity); issue with Morphology, either local or reflecting sentence-level missing agreements; issue with Meaning Selection, split into different sub-categories; issue with Word Ordering or Fluency (wrong or missing determiner, for instance); missing or duplicated words. Errors are Minor when the reader could still understand the sentence without access to the source, otherwise they are considered Major. Erroneous words are counted in only one category even if several problems add up, for instance ordering and meaning selection.
8.3 Unknown word handling
When an unknown word is generated in the target output sentence, a general encoder-decoder with attentional mechanism relies on heuristics based on attention weights: the source word with the most attention is either directly copied as-is or looked up in a dictionary.
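A minimal sketch of this post-processing heuristic, assuming the decoder returns its attention weights along with the hypothesis; the helper name and dictionary format are assumptions, not the exact implementation used in our systems.

def replace_unknowns(target_tokens, source_tokens, attention, dictionary,
                     unk="<unk>"):
    """Post-process a translation: each <unk> produced by the decoder is
    replaced by the source word receiving the most attention at that step,
    either looked up in a bilingual dictionary or copied as-is.

    attention: list of per-target-step weight lists over the source tokens.
    """
    output = []
    for t, word in enumerate(target_tokens):
        if word == unk:
            weights = attention[t]
            src = source_tokens[weights.index(max(weights))]
            word = dictionary.get(src, src)   # dictionary lookup, else copy
        output.append(word)
    return output

# Toy usage.
src = ["Giordano", "knew"]
tgt = ["<unk>", "savait"]
attn = [[0.9, 0.1], [0.2, 0.8]]
print(replace_unknowns(tgt, src, attn, dictionary={}))  # ['Giordano', 'savait']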
In the recent literature (Gu et al., 2016; Gulcehre et al., 2016), researchers have attempted to directly model unknown word handling within the attention and decoder networks. Having the model learn to take control of both decoding and unknown word handling should be the most effective way to address the single unknown-word replacement problem, and we are implementing and evaluating these approaches within our framework.
9 Conclusion
Neural MT has progressed at a very impressive rate, and it has proven itself to be competitive against online systems trained on training data whose size is several orders of magnitude larger. There is no doubt that Neural MT is a technology that will continue to have a great impact on academia and industry. However, at its current status, it is not without limitations: on language pairs that have an abundant amount of monolingual and bilingual training data, phrase-based MT still performs better than Neural MT, because Neural MT is still limited in its vocabulary size and in its utilization of monolingual data.
Neural MT is not a one-size-fits-all technology: no single general configuration of the model universally works on all language pairs. For example, subword tokenization such as BPE provides an easy way out of the limited vocabulary problem, but we have discovered that it is not always the best choice for all language pairs. The attention mechanism is still not at a satisfactory status, and it needs to become more accurate for better controlling the translation output and for better user interactions.
For upcoming releases, we have begun conducting further experiments with the injection of various kinds of linguistic knowledge, in which SYSTRAN possesses foremost expertise. We will also apply our engineering know-how to conquer the practical issues of NMT one by one.
Acknowledgments
We would like to thank Yoon Kim, Prof. Alexander Rush and the rest of the members of the Harvard NLP group for their support with the open-source code, their proactive advice and their valuable insights on the extensions.
We are also thankful to CrossLang and Homin Kwon for their thorough and meaningful definition of the evaluation protocol, and to their evaluation team as well as Inah Hong and SunHyung Lee for their work.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473. Demoed at NIPS 2014: http://lisa.iro.umontreal.ca/mt-demo/.
Hanna Bechara, Yanjun Ma, and Josef van Genabith. 2011. Statistical post-editing for a statistical MT system. In MT Summit, volume 13, pages 308-315.
Frédéric Blain, Jean Senellart, Holger Schwenk, Mirko Plitt, and Johann Roturier. 2011. Qualitative analysis of post-editing for high quality machine translation. MT Summit XIII: the Thirteenth Machine Translation Summit [organized by the] Asia-Pacific Association for Machine Translation (AAMT), pages 164-171.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1-46, Lisbon, Portugal, September.
Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, August.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Citeseer.
Mauro Cettolo, Nicola Bertoldi, Marcello Federico, Holger Schwenk, Loïc Barrault, and Christophe Servan. 2014. Translation project adaptation for MT-enhanced computer assisted translation. Machine Translation, 28(2):127-150, October.
Yu Chen and Andreas Eisele. 2012. MultiUN v2: UN documents with multilingual alignments. In LREC, pages 2500-2504.
Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. CoRR, abs/1607.01628v1.
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147.
Loïc Dugast, Jean Senellart, and Philipp Koehn. 2007. Statistical Post-Editing on SYSTRAN's Rule-Based Translation System. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 220-223, Prague, Czech Republic, June. Association for Computational Linguistics.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644-648, Atlanta, Georgia, June.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631-1640, Berlin, Germany, August. Association for Computational Linguistics.
Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140-149, Berlin, Germany, August. Association for Computational Linguistics.
Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. CoRR, abs/1605.04800.
Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions. CoRR, abs/1610.01108, October.
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. CoRR, abs/1606.07947.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86.
Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. CoRR, abs/1604.00788.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421, Lisbon, Portugal, September. Association for Computational Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Daniel Ortiz-Martínez, Ismael García-Varea, and Francisco Casacuberta. 2010. Online learning for interactive statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 546-554. Association for Computational Linguistics.
Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of neural machine translation models via pruning. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), Berlin, Germany, August.
Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. CoRR, abs/1606.02892.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35-40, San Diego, California, USA, June. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany, August. Association for Computational Linguistics.
Michel Simard, Cyril Goutte, and Pierre Isabelle. 2007. Statistical phrase-based post-editing. In Proceedings of NAACL.
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214-2218.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. Technical report, Google.
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371-383.
A Remarkable Results
In this section, we highlight a series of remarkable translations (positively remarkable, and also a few negatively outstanding ones) that we found during evaluation for a variety of languages.
en→fr
Source: The report also said that although three-quarters of Chinese say their country is playing a larger role in the world than it did a decade ago, most want their government to focus on domestic issues rather than helping other nations.
NMT: Le rapport dit aussi que, bien que les trois quarts des Chinois disent que leur pays joue un rôle plus important dans le monde qu'il ne l'a fait il y a dix ans, la plupart veulent que leur gouvernement se concentre sur les questions nationales plutôt que d'aider d'autres nations.
Google Translate (2016/10/15): Le rapport indique également que, bien que les trois quarts des Chinois disent que leur pays joue un rôle plus important dans le monde qu'il y a dix ans, la plupart veulent que leur gouvernement de se concentrer sur les questions intérieures plutôt que d'aider les autres nations.

fr→en
Source: Fenêtre à rue se fermait mal, très très bruyant et mal dormi. Petit déjeuner banal.
NMT: The room was very clean and the room was very clean.
Google Translate (2016/10/15): street window will not close, very very noisy and slept badly. mundane breakfast.

en→ko
Source: Forty Republican senators and congressmen have revoked their support for the Republican presidential nominee with nearly 30 of them calling on him to quit the race altogether in recent days.
NMT: 40 , 30 .
(From Naver NMT): 4030, 30 .

en→it
Source: all this leads to feelings of insecurity and conflict thus reinforcing negative spirals.
NMT: tutto ciò porta a sentimenti di insicurezza e di conflitto, rafforzando così la spirale negativa.
Google Translate (2016/10/15): tutto ciò genera elementi di insicurezza e di conflitto rinforzando quindi spirali negative.

it→fr
Source: Mi sembra che ciò non abbia nulla a che vedere con una penuria alimentare: si tratta di una vera carestia, di una vera e propria catastrofe che non è naturale ma strutturale, provocata da un regime folle, criminale, psicopatico, peggiore dei peggiori regimi esistiti nell'Unione Sovietica e nella Romania di Ceaucescu.
NMT: Il me semble que cela n'a rien à voir avec une pénurie alimentaire: il s'agit d'une véritable famine, d'une véritable catastrophe qui n'est pas naturelle mais structurelle, provoquée par un régime insensé, criminel, psychopathe, pire que les pires régimes de l'Union soviétique et de la Roumanie de Ceaucescu.
Google Translate (2016/10/15): Il me semble que cela n'a rien à voir avec une pénurie alimentaire: il est une véritable famine, une catastrophe qui est pas naturel, mais une structure, provoquée par un régime criminel fou, psychopathe, le pire des pires régimes existait l'Union soviétique et en Roumanie de Ceaucescu.

it→fr
Source: Ebbene io sono in totale disaccordo: lo Stato non deve passare in secondo piano; deve invece intervenire di più e garantire una maggiore vigilanza in ambito economico.
NMT: je suis tout à fait d'accord: l'état ne doit pas passer sous silence; il doit plutôt agir et assurer une plus grande vigilance dans le domaine économique.
Google Translate (2016/10/15): Eh bien, je suis en désaccord total: l'État ne doit pas prendre un siège arrière; Il doit plutôt agir plus et assurer une plus grande supervision dans l'économie.

es→en
Source: Es posible un Bangladesh más amable con los niños?
NMT: Is Bangladesh more friendly with children?
Google Translate (2016/10/15): Can a kinder Bangladesh with children?

en→de
Source: On the other hand, businesses are constantly complaining about the shortage of specialist staff, using this supposed shortage as an excuse to use cheap labour.
NMT: Andererseits beschweren sich Unternehmen ständig über den Mangel an Fachpersonal, indem sie diese angebliche Knappheit als Vorwand benutzen, billige Arbeitskräfte einzusetzen.
Google Translate (2016/10/15): Auf der anderen Seite beschweren sich Unternehmen ständig über den Mangel an Fachkräften, wobei diese angebliche Mangelhaftigkeit als Ausrede für billige Arbeitskräfte verwendet wird.

zh→en
Source: ,.
NMT: Morning News, now it's our interactive time.
(From Google GNMT): North Korea heard the world, and now to our interaction time friends.

br→en
Source: Face às decisões de Nice, tínhamos de ter em conta esta situação e de adotar uma metodologia que contemplasse as transformações necessárias.
NMT: Faced with the Nice decisions, we had to take this situation into account and adopt a methodology that would address the necessary changes.
Google Translate (2016/10/15): Nice view of the decisions we had to take account of this situation and adopt a methodology that took into consideration the necessary changes.

fr→br
Source: Une information présentée au moment opportun signifie la transparence, laquelle crée la confiance et évite à l'entreprise de subir des pertes.
NMT: A informação apresentada no momento oportuno significa transparência, que cria confiança e evita que a empresa sofra perdas.
Google Translate (2016/10/15): Informação apresentada em uma transparência meio oportuna, que cria confiança e evita a empresa a sofrer perdas.
B Online System Parameters
All systems were trained with 4 LSTM layers, the size of word embedding vectors was 500, dropout was set to 0.3 and we used a bidirectional RNN (BRNN). The column Guided alignment indicates whether the network was trained with guided alignments and at which epoch the feature was stopped.
Language pair | Tokenization | RNN size | Optimal epoch | Guided alignment | NER aware | Special
zh→en | word boundary-generic | 800 | 12 | epoch 4 | yes |
en→it | generic | 800 | 16 | epoch 4 | yes |
it→en | generic | 800 | 16 | epoch 4 | yes |
en→ar | generic-crf | 800 | 15 | epoch 4 | no |
ar→en | crf-generic | 800 | 15 | epoch 4 | no |
en→es | generic | 800 | 18 | epoch 4 | yes |
es→en | generic | 800 | 18 | epoch 4 | yes |
en→de | generic-compound splitting | 800 | 17 | epoch 4 | yes |
de→en | compound splitting-generic | 800 | 18 | epoch 4 | yes |
en→nl | generic | 800 | 17 | epoch 4 | no |
nl→en | generic | 800 | 14 | epoch 4 | no |
en→fr | generic | 800 | 18 | epoch 4 | yes | double corpora (2 x 5M)
fr→en | generic | 800 | 17 | epoch 4 | yes |
ja→en | bpe | 800 | 11 | no | no |
fr→br | generic | 800 | 18 | epoch 4 | yes |
br→fr | generic | 800 | 18 | epoch 4 | yes |
en→pt | generic | 800 | 18 | epoch 4 | yes |
br→en | generic | 800 | 18 | epoch 4 | yes |
fr→it | generic | 800 | 18 | epoch 4 | yes |
it→fr | generic | 800 | 18 | epoch 4 | yes |
fr→ar | generic-crf | 800 | 10 | epoch 4 | no |
ar→fr | crf-generic | 800 | 15 | epoch 4 | no |
fr→es | generic | 800 | 18 | epoch 4 | yes |
es→fr | generic | 800 | 18 | epoch 4 | yes |
fr→de | generic-compound splitting | 800 | 17 | epoch 4 | no |
de→fr | compound splitting-generic | 800 | 16 | epoch 4 | no |
nl→fr | generic | 800 | 16 | epoch 4 | no |
fr→zh | generic-word segmentation | 800 | 18 | epoch 4 | no |
ja→ko | bpe | 1000 | 18 | no | no |
ko→ja | bpe | 1000 | 18 | no | no |
en→ko | bpe | 1000 | 17 | no | no | politeness
fa→en | basic | 800 | 18 | yes | no |
Table 12: Parameters for online systems.