SYSTRAN's Pure Neural Machine Translation Systems

Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart
Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou

[email protected]
Abstract
Since the first online demonstration of Neural Machine Translation (NMT) by LISA (Bahdanau et al., 2014), NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing the roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations and the training process of such systems is usually very long, often a few weeks, so the role of experimentation is critical and important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss the evaluation methodology, present our first findings, and finally outline further work.
Our ultimate goal is to share our expertise to build competitive production systems for generic translation. We aim at contributing to set up a collaborative framework to speed up adoption of the technology, foster further research efforts, and enable the delivery to and adoption by industry of use-case-specific engines integrated in real production workflows. Mastering the technology would allow us to build translation engines suited for particular needs, outperforming current simplest/uniform systems.
1 Introduction
Neural MT has recently achieved state-of-the-art performance in several large-scale translation tasks. As a result, the deep learning approach to MT has received exponential attention, not only from the MT research community but also from a growing number of private entities that have begun to include NMT engines in their production systems.
In the last decade, several open-source MT toolkits have emerged. Moses (Koehn et al., 2007) is probably the best-known out-of-the-box MT system, coexisting with commercial alternatives, lowering the entry barriers and bringing new opportunities in both research and business areas. Following this direction, our NMT system is based on the open-source project seq2seq-attn1 initiated by the Harvard NLP group2 with the main contributor Yoon Kim. We are contributing to the project by sharing several features described in this technical report, which are available to the MT community.
Neural MT systems have the ability to directly model, in an end-to-end fashion, the association from an input text (in a source language) to its translation counterpart (in a target language). A major strength of Neural MT lies in the fact that all the necessary knowledge, such as syntactic and semantic information, is learned by taking the global sentence context into consideration when modeling translation. However, Neural MT engines are known to be computationally very expensive, sometimes needing several weeks to accomplish the training phase, even making use of cutting-edge hardware to accelerate computations. Since our interest is in a large variety of languages, and since, based on our long experience with machine translation, we do not believe that a one-fits-all approach would work optimally for languages as different as Korean, Arabic, Spanish or Russian, we did run hundreds of experiments, and particularly explored language-specific behaviors. One of our goals would indeed be to be able to inject existing language knowledge in the training process.

1 https://github.com/harvardnlp/seq2seq-attn
2 http://nlp.seas.harvard.edu
In this work we share our recipes and experience to build our first generation of production-ready systems for generic translation, setting a starting point to build specialized systems. We also report on extending the baseline NMT engine with several features that in some cases increase performance accuracy and/or efficiency while others boost the learning curve and/or model speed. As a machine translation company, in addition to decoding accuracy for the generic domain, we also pay special attention to features such as:
- Training time
- Customization possibility: user terminology, domain adaptation
- Preserving and leveraging internal format tags and miscellaneous placeholders
- Practical integration in business applications: for instance online translation box, but also translation batch utilities, post-editing environment...
- Multiple deployment environments: cloud-based, customer-hosted environment or embedded for mobile applications
- etc.
More important than unique and uniform translation options, or reaching state-of-the-art research systems, our focus is to reveal language-specific settings and practical tricks to deliver this technology to the largest number of users.
The remainder of this report is organized as follows: Section 2 covers basic details of the NMT system employed in this work. A description of the translation resources is given in section 3. We report on the different experiments for trying to improve the system by guiding the training process in section 4, and in section 5 we discuss performance. In sections 6 and 7, we report on the evaluation of the models and on practical findings. And we finish by describing work in progress for the next release.
2 System Description
We base our NMT system on the encoder-decoder framework made available by the open-source project seq2seq-attn. With its roots in a number of established open-source projects such as Andrej Karpathy's char-rnn,3 Wojciech Zaremba's standard long short-term memory (LSTM)4 and the rnn library from Element-Research,5 the framework provides a solid NMT basis consisting of LSTM as the recurrent module and faithful reimplementations of the global-general-attention model and input-feeding at each time step of the RNN decoder as described by Luong et al. (2015).
It also comes with a variety of features such as the ability to train with bidirectional encoders and pre-trained word embeddings, the ability to handle unknown words during decoding by substituting them either by copying the source word with the most attention or by looking up the source word in an external dictionary, and the ability to switch between CPU and GPU for both training and decoding. The project is actively maintained by the Harvard NLP group6.
Over the course of the development of our own NMT system, we have implemented additional features as described in Section 4, and contributed back to the open-source community by making many of them available in the seq2seq-attn repository.
seq2seq-attn is implemented on top of the popular scientific computing library Torch.7 Torch uses Lua, a powerful and lightweight scripting language, as its front-end and uses the C language where efficient implementations are needed. The combination results in a fast and efficient system both at development and at run time. As an extension, to fully benefit from multi-threading, optimize CPU and GPU interactions, and to have finer control on the objects at runtime (sparse matrix, quantized tensor, ...), we developed a C-based decoder using the C APIs of Torch, called C-torch, explained in detail in section 5.4.
The number of parameters within an NMT model can grow to hundreds of millions, but there are also a handful of meta-parameters that need to be manually determined.
3 https://github.com/karpathy/char-rnn
4 https://github.com/wojzaremba/lstm
5 https://github.com/Element-Research/rnn
6 http://nlp.seas.harvard.edu
7 http://torch.ch
Model         Embedding dimension: 400-1000
              Hidden layer dimension: 300-1000
              Number of layers: 2-4
              Uni-/bi-directional encoder

Training      Optimization method
              Learning rate
              Decay rate
              Epoch to start decay
              Number of epochs
              Dropout: 0.2-0.3

Text unit     Vocabulary selection
(Section 4.1) Word vs. subword (e.g. BPE)

Train data    Size (quantity vs. quality)
(Section 3)   Max sentence length
              Selection and mixture of domains

Table 1: There are a large number of meta-parameters to be considered during training. The optimal set of configurations differs from language pair to language pair.
For some of the meta-parameters, many previous works present clear choices regarding their effectiveness, such as using the attention mechanism or feeding the previous prediction as input to the current time step in the decoder. However, there are still many more meta-parameters that have different optimal values across datasets, language pairs, and the configurations of the rest of the meta-parameters. In table 1, we list the meta-parameter space that we explored during the training of our NMT systems.
In appendix B, we detail the parameters used for the online systems of this first release.
3 Training Resources
Training generic engines is a challenge, because there is no such notion as a generic translation, which is what online translation service users are expecting from these services. Indeed, online translation covers a very large variety of use cases, genres and domains. Also, available open-source corpora are domain specific: Europarl (Koehn, 2005), JRC (Steinberger et al., 2006) or MultiUN (Chen and Eisele, 2012) are legal texts, TED talks are scientific presentations, OpenSubtitles (Tiedemann, 2012) is colloquial, etc. As a result, the training corpora we used for this release were built by doing a weighted mix of all of the available sources. For languages with large resources, we did reduce the ratio of the institutional (Europarl, UN-type) and colloquial types, giving preference to news-type texts and mixes of webpages (like Gigaword).
Our strategy, in order to enable more experiments, was to define 3 sizes of corpora for each language pair: a baseline corpus (1 million sentences) for quick experiments (day-scale), a medium corpus (2-5M) for real-scale systems (week-scale) and a very large corpus with more than 10M segments.
The amount of data used to train the online systems is reported in table 2, while most of the individual experimental results reported in this report are obtained with baseline corpora.
Note that the size of the corpus needs to be considered together with the number of training periods, since the neural network is continuously fed by sequences of sentence batches until the network is considered trained. In Junczys-Dowmunt et al. (2016), the authors mention using a corpus of 5M sentences and training on 1.2M batches, each having 40 sentences, meaning basically that each sentence of the full corpus is presented 10 times to the training. In Wu et al. (2016), the authors mention 2M steps of 128 examples for English-French, for a corpus of 36M sentences, meaning about 7 iterations on the complete corpus. In our framework, for this release, we systematically extended the training up to 18 epochs and for some languages up to 22 epochs.
Selection of the optimal system is made after the complete training by calculating scores on independent test sets. As an outcome, we have seen different behaviours for different language pairs with similar training corpus sizes, apparently connected to the language pair complexity. For instance, English-Korean training perplexity still decreases significantly between epochs 13 and 19, while Italian-English perplexity decreases marginally after epoch 10. For most languages, in our set-up, optimal systems are achieved around epoch 15.
We also did some experiments on the corpus size. Intuitively, since NMT systems do not have the memorizing capacity of PBMT engines, the fact that the training uses 10 times a 10M-sentence corpus, or 20 times a 5M corpus, should not make a huge difference.
Pair  | Training                                      | Testing                                       | OOV
      | #Sents  #Tokens (src/tgt)   Vocab (src/tgt)   | #Sents  #Tokens (src/tgt)   Vocab (src/tgt)   | src/tgt
enbr  | 2.7M    74.0M / 76.6M       150k / 213k       | 2k      51k / 53k           6.7k / 8.1k       | 47 / 64
enit  | 3.6M    98.3M / 100M        225k / 312k       | 2k      52k / 53k           7.3k / 8.8k       | 66 / 85
enar  | 5.0M    126M / 155M         295k / 357k       | 2k      51k / 62k           7.5k / 8.7k       | 43 / 47
enes  | 3.5M    98.8M / 105M        375k / 487k       | 2k      53k / 56k           8.5k / 9.8k       | 110 / 119
ende  | 2.6M    72.0M / 69.1M       150k / 279k       | 2k      53k / 51k           7.0k / 9.6k       | 30 / 77
ennl  | 2.1M    57.3M / 57.4M       145k / 325k       | 2k      52k / 53k           6.7k / 7.9k       | 50 / 141
enko  | 3.5M    57.5M / 46.4M       98.9k / 58.4k     | 2k      30k / 26k           7.1k / 11k        | 0 / -
enfr  | 9.3M    220M / 250M         558k / 633k       | 2k      48k / 55k           8.2k / 8.6k       | 77 / 63
frbr  | 1.6M    53.1M / 47.9M       112k / 135k       | 2k      62k / 56k           7.4k / 8.1k       | 55 / 59
frit  | 3.1M    108M / 96.5M        202k / 249k       | 2k      69k / 61k           8.2k / 8.8k       | 47 / 57
frar  | 5.0M    152M / 152M         290k / 320k       | 2k      60k / 60k           8.5k / 8.6k       | 42 / 61
fres  | 2.8M    99.0M / 91.7M       170k / 212k       | 2k      69k / 64k           8.0k / 8.6k       | 37 / 55
frde  | 2.4M    73.4M / 62.3M       172k / 253k       | 2k      57k / 48k           7.5k / 9.0k       | 59 / 104
frzh  | 3.0M    98.5M / 76.3M       199k / 168k       | 2k      67k / 51k           8.0k / 5.9k       | 51 / -
jako  | 1.4M    14.0M / 13.9M       61.9k / 55.6k     | 2k      19k / 19k           9.3k / 8.5k       | 0 / 0
nlfr  | 3.0M    74.8M / 84.7M       446k / 260k       | 2k      49k / 55k           7.9k / 7.5k       | 150 / -
faen  | 795k    21.7M / 20.2M       166k / 147k       | 2k      54k / 51k           7.7k / 8.7k       | 197 / -
jaen  | 1.3M    28.0M / 22.0M       24k / 87k         | 2k      41k / 32k           6.2k / 7.3k       | 3 / -
zhen  | 5.8M    145M / 154M         246k / 225k       | 2k      48k / 51k           5.5k / 6.9k       | 34 / -

Table 2: Corpora statistics for each language pair (ISO 639-1 2-letter codes, except for Brazilian Portuguese noted as br). All language pairs are bidirectional except nlfr, frzh, jaen, faen, enko, zhen. Columns 2-6 indicate the number of sentences, running words and vocabularies of the training datasets, while columns 7-11 indicate the number of sentences, running words and vocabularies of the test datasets. Columns 12 and 13 indicate respectively the OOV vocabulary of the source and target test sets (M stands for millions, k for thousands). Since jako and enko are trained using BPE tokenization (see section 4.1), there is no OOV.
In one of the experiments, we compared training on a 5M corpus over 20 epochs for English to/from French with training on the same 5M corpus for only 10 epochs, followed by 10 additional epochs on an additional 5M corpus, the full 10M corpus being completely homogeneous. In both directions, we observe that the 5Mx10 + 5Mx10 training completes with a score improvement of 0.8-1.2 compared to the 5Mx20 training, showing that the additional corpus manages to bring a meaningful improvement. This observation leads to a more general question about how much corpus is needed to actually build a high-quality NMT engine (learn the language), the role and timing of diversity in the training, and whether the incremental gain could not be substituted by terminology feeding (learn the lexicon).
4 Technology
In this section we account for several experiments that improved different aspects of our translation engines. Experiments range from preprocessing techniques to extending the network with the ability to handle named entities, to using multiple word features and to enforcing the attention module to be more like word alignments. We also report on different levels of translation customization.
4.1 Tokenization
All corpora are preprocessed with an in-house toolkit. We use standard token separators (spaces, tabs, etc.) as well as a set of language-dependent linguistic rules. Several kinds of entities are recognized (URLs and numbers), replacing their content by the appropriate placeholder. A postprocess is used to detokenize translation hypotheses, where the original raw text format is regenerated following equivalent techniques.
For each language, we have access to language-specific tokenization and normalization rules. However, our preliminary experiments showed that there was no obvious gain from using these language-specific tokenization patterns, and that some of the hardcoded rules were actually degrading the performance. This would need more investigation, but for the release of our first batch of systems, we used a generic tokenization model for most of the languages except Arabic, Chinese and German. In our past experience with Arabic, separating the segmentation of clitics was beneficial, and we retained the same procedure. For German and Chinese, we used in-house compound splitting and word segmentation models, respectively.
In our current NMT approach, vocabulary size is an important factor that determines the efficiency and the quality of the translation system; a larger vocabulary size correlates directly with greater computational cost during decoding, whereas low vocabulary coverage leads to severe out-of-vocabulary (OOV) problems, hence lowering translation quality.
In most language pairs, our strategy combines a vocabulary shortlist and a placeholder mechanism, as described in Sections 4.2 and 4.3. This approach, in general, is a practical and linguistically robust option for addressing the fixed-vocabulary issue, since we can take full advantage of internal manually crafted dictionaries and customized user dictionaries (UDs).
A number of previous works such as character-level (Chung et al., 2016), hybrid word-character-based (Luong and Manning, 2016) and subword-level (Sennrich et al., 2016b) approaches address issues that arise with morphologically rich languages such as German, Korean and Chinese. These approaches either build accurate open-vocabulary word representations on the source side or improve the translation model's generative capacity on the target side. Among those approaches, subword tokenization yields competitive results, achieving excellent vocabulary coverage and good efficiency at the same time.
For two language pairs, enko and jaen, we used source and target sub-word tokenization (BPE, see (Sennrich et al., 2016b)) to reduce the vocabulary size but also to deal with the rich morphology and spacing flexibility that can be observed in Korean. Although this approach is very seducing by its simplicity, and is also used systematically in (Wu et al., 2016) and (Junczys-Dowmunt et al., 2016), it does have side effects (for instance the generation of impossible words) and is not optimal for dealing with actual word morphology, since the same suffix (josa in Korean), depending on the frequency of the word ending it is integrated with, will be split into multiple representations. Also, in Korean, these josa agree with the previous syllable based on its final ending: however, such simple information is not explicitly or implicitly reachable by the neural network.
The sub-word encoding algorithm Byte Pair Encoding (BPE) described by Sennrich et al. (2016b) was re-implemented in C++ for further speed optimization.
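To make the BPE procedure concrete, the following minimal Python sketch (not SYSTRAN's C++ reimplementation; the toy word frequencies and number of merges are placeholders) shows the merge-learning loop of Sennrich et al. (2016b): the most frequent adjacent symbol pair is repeatedly merged into a new subword unit.

import collections

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict (illustrative sketch)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = {tuple(word) + ('</w>',): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = collections.Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the selected merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy usage: frequencies would normally come from the tokenized training corpus.
print(learn_bpe({'lower': 5, 'newest': 6, 'widest': 3}, num_merges=10))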
4.2 Word Features
Sennrich and Haddow (2016) showed that using additional input features improves translation quality. Similarly to this work, we introduced in the framework the support for an arbitrary number of discrete word features as additional inputs to the encoder. Because we do not constrain the number of values these features can take at the same time, we represent them with continuous and normalized vectors. For example, the representation of a feature f at time-step t is:
x_i^{(t)} = \begin{cases} \frac{1}{n_f} & \text{if } f \text{ takes the } i\text{-th value} \\ 0 & \text{otherwise} \end{cases}    (1)

where n_f is the number of possible values the feature f can take, and with x^{(t)} \in \mathbb{R}^{n_f}.
These representations are then concatenated with the word embedding to form the new input to the encoder.
We extended this work by also supporting additional features on the target side, which are predicted by the decoder. We used the same input representation as on the encoder side but shifted the feature sequence compared to the word sequence, so that the prediction of the features at time-step t depends on the word at time-step t that they annotate. Practically, we generate the features at time t+1 for the word generated at time t.
To learn these target features, we added a linear layer to the decoder followed by the softmax function and used the mean square error criterion to learn the correct representation.
For this release, we only used case information as an additional feature. It allows us to work with a lowercased vocabulary and to treat recasing as a separate problem. We observed that the use of this simple case feature in source and target does improve the translation quality, as illustrated in Figure 1. Also, we compared the accuracy of the induced recasing with other recasing frameworks (SRI disamb, and an in-house recasing tool based on n-gram language models) and observed that the prediction of case by the NN was more accurate than using an external recaser, which was expected since the NN has access to the source sentence context in addition to the target sentence history.
Figure 1: Comparison of training progress (perplexity/BLEU) with/without source (src) and target (tgt) case features, with/without feature embedding (embed) on the WMT2013 test corpus for English-French. The score is calculated on lowercased output. The perplexity increases when the target features are introduced because of the additional classification problem. We also notice a noticeable increase in the score when introducing the features, in particular the target features. So these features do not simply help to reduce the vocabulary, but also by themselves help to structure the NN decoding model.
4.3 Named Entities (NE)
SYSTRAN's RBMT and SMT translation engines utilize a number and named entity (NE) module to recognize, protect, and translate such entities. Similarly, we used the same internal NE module to recognize numbers and named entities in the source sentence and temporarily replaced them with their corresponding placeholders (Table 3).
Both the source and the target side of the training dataset need to be processed for NE placeholders. To ensure correct entity recognition, we cross-validate the recognized entities across the parallel dataset, that is: a valid entity recognition in a sentence should have the same type of entity in its parallel pair, and the words or phrases covered by the entities need to be aligned to each other. We used fast_align (Dyer et al., 2013) to automatically align source words to target words.
In our datasets, generally about one fifth of the training instances contained one or more NE placeholders. Our training dataset consists of sentences with NE placeholders as well as sentences without them, to be able to handle both instantiated and recognized entity types.
NE type             Placeholder
Number              ent_numeric
Measurement         ent_numex_measurement
Money               ent_numex_money
Person (any)        ent_person
Title               ent_person_title
First name          ent_person_firstname
Initial             ent_person_initials
Last name           ent_person_lastname
Middle name         ent_person_middlename
Location            ent_location
Organization        ent_organization
Product             ent_product
Suffix              ent_suffix
Time expressions    ent_timex_expression
Date                ent_date
Date (day)          ent_date_day
Date (month)        ent_date_month
Date (year)         ent_date_year
Hour                ent_hour

Table 3: Named entity placeholder types
Per source sentence, a list of all entities, along with their translations in the target language, if available, is returned by our internal NE recognition module. The entities in the source sentence are then replaced with their corresponding NE placeholders. During beam search, we make sure that an entity placeholder is translated by itself in the target sentence. When the entire target sentence is produced along with the attention weights that provide soft alignments back to the original source tokens, placeholders in the target sentence are replaced with either the original source string or its translation.
The substitution of the NE placeholders with their correct values needs language-pair-specific considerations. In Table 4, we show that even the handling of Arabic numerals cannot be as straightforward as copying the original value from the source text.
4.4 Guided Alignments
We re-implemented the guided alignment strategy described in Chen et al. (2016). Guided alignment enforces the attention weights to be more like alignments in the traditional sense (e.g. IBM-4 Viterbi alignments), where the word alignments explicitly indicate that source words aligned to a target word are translation equivalents.
Similarly to the previous work, we created an additional criterion on the attention weights, Lga, such that the difference between the attention weights and the reference alignments is treated as an error used to directly and additionally optimize the output of the
                         En                      Ko
Train
  train data             25 billion              250억
  entity-replaced        ent_numeric billion     ent_numeric억
Decode
  input data             1.4 billion             -
  entity-replaced        ent_numeric billion     -
  translated             -                       ent_numeric억
  naive substitution     -                       1.4억
  expected result        -                       14억

Table 4: Examples of English and Korean number expressions where naive recognition and substitution fails. Even if the model produces the correct placeholders, simply copying the original value will result in an incorrect translation. These kinds of structural entities need language-pair-specific treatment.
attention module:

L_{ga}(A, \alpha) = \frac{1}{T} \sum_{t} \sum_{s} (A_{st} - \alpha_{st})^2

The final loss function for the optimization is then:

L_{total} = w_{ga} \cdot L_{ga}(A, \alpha) + (1 - w_{ga}) \cdot L_{dec}(y, x)

where A is the alignment matrix, \alpha the attention weights, s and t the indices in the source and target sentences, and w_{ga} the linear combination weight for the guided alignment loss.
Chen et al. (2016) also report that decaying wga, thereby gradually reducing the influence of guided alignment over the course of training, was found to be helpful on certain datasets. When guided alignment decay is enabled, wga is gradually reduced at this rate after each epoch from the beginning of the training.
Without searching for the optimal parameter values, we simply took the following configurations from the literature: mean square error (MSE) as the loss function, 0.5 as the linear combination weight between the guided alignment loss and the cross-entropy loss of the decoder, and 0.9 as the decay factor for the guided alignment loss.
For the alignment, we again utilized the fast_align tool. We stored alignments in a sparse format8 to save memory, and for each minibatch a dense matrix is created for faster computation.
8 Compressed Column Storage (CCS)
Figure 2: Effect of guided alignment. This particular learning curve is from training an attention-based model with 4 layers of bidirectional LSTMs with 800 dimensions on a 5-million-sentence French to English dataset.
Applying such a constraint on the attention weights can help locate the original source word more accurately, which we hope benefits NE placeholder substitution and especially unknown word handling.
In Figure 2, we see the effects of guided alignment and its decay rate on our English-to-French generic model. Unfortunately, this full comparative run was disrupted by a power outage and we did not have time to relaunch it; however, we can still clearly observe that up to the initial 5 epochs, guided alignment, with or without decay, provides rather big boosts over the baseline training. After 5 epochs, the training with decay slows down compared to the training without, which is rather intuitive: the guided alignment is indeed in conflict with the attention learning. What would remain to be seen is whether, by the end of the training, the baseline and the guided alignment with decay are converging.
4.5 Politeness Mode
Many languages have ways to express politeness and deference towards the people being referred to in sentences. In Indo-European languages, there are two pronouns corresponding to the English You; this is called the T-V distinction between the informal Latin pronoun tu (T) and the polite Latin pronoun Vos (V). Asian languages, such as Japanese and Korean, make extensive use of honorifics (respectful words), words that are usually appended to the ends of names or pronouns to indicate the relative ages and social positions of the speakers. Expressing politeness can also impact the vocabulary of verbs, adjectives, and nouns used, as well as sentence structures.
Following the work of Sennrich et al. (2016a), we implemented a politeness feature in our NMT engine: a special token is added to each source sentence during training, where the token indicates the politeness mode observed in the target sentence. Having such an ability to specify the politeness mode is very useful especially when translating from a language where politeness is not expressed, e.g. English, into one where such expressions are abundant, e.g. Korean, because it provides a way of customizing the politeness mode of the translation output.
Table 5 presents examples from our English-to-Korean NMT model trained with politeness mode, and it is clear that the proper verb endings are generated according to the user selection. After a preliminary evaluation on a small test set from English to Korean, we observed 70 to 80% accuracy of the politeness generation (Table 6). We also noticed that 86% of sentences (43 out of 50) have exactly the same meaning preserved across different politeness modes.
This simple approach, however, comes at a small price: sometimes the unknown-word replacement scheme tries to copy the special token into the target generation. A more appropriate approach, which we plan to switch to in our future trainings, is to directly feed the politeness mode into the sentential representation of the decoder.
4.6 Customization
Domain adaptation is a key feature for our customers; it generally encompasses terminology, domain and style adaptation, but can also be seen as an extension of translation memory for human post-editing workflows.
SYSTRAN engines integrate multiple techniques for domain adaptation: training full new in-domain engines, automatically post-editing an existing translation model using translation memories, and extracting and re-using terminology. With Neural Machine Translation, a new notion of specialization comes close to the concept of incremental translation as developed for statistical machine translation, e.g. (Ortiz-Martínez et al., 2010).
4.6.1 Generic Specialization
Domain adaptation techniques have successfully been used in Statistical Machine Translation. It is well known that a system optimized on a specific text genre obtains higher accuracy results than a generic system. The adaptation process can be done before, during or after the training process. Our preliminary experiments follow the latter approach. We incrementally adapt a Neural MT generic system to a specific domain by running additional training epochs over newly available in-domain data.
Adaptation proceeds incrementally when new in-domain data becomes available, generated by human translators while post-editing, which is similar to the Computer Aided Translation framework described in (Cettolo et al., 2014).
We experiment on an English-to-French translation task. The generic model is trained on a subsample of the corpora made available for the WMT15 translation task (Bojar et al., 2015). The source and target NMT vocabularies are the 60k most frequent words of the source and target training datasets. The in-domain data is extracted from the European Medicines Agency (EMEA) corpus. Table 7 shows some statistics of the corpora used in this experiment.
Our preliminary results show that incremental adaptation is effective even for limited amounts of in-domain data (nearly 50k additional words). Constrained to use the original generic vocabulary, adaptation of the models can be run in a few seconds, showing clear quality improvements on in-domain test sets.
Figure 3: Adaptation with In-Domain data.
Figure 3 compares the accuracy (BLEU) of two systems: full is trained after concatenation of the generic and in-domain data; adapt is initially trained over generic data (showing a BLEU score of 29.01 at epoch 0) and adapted after running several training epochs over only the in-domain
En: A senior U.S. Treasury official is urging China to move faster on making its currency more flexible.
Ko with Formal mode: [...]
Ko with Informal mode: [...]

Table 5: A translation example from an En-Ko system where the choice of different politeness modes affects the output.
Mode       Correct   Incorrect   Accuracy
Formal     30        14          68.2%
Informal   35        9           79.5%

Table 6: Accuracy of generating the correct politeness mode with an English-to-Korean NMT system. The evaluation was carried out on a set of only 50 sentences; 6 sentences were excluded from the evaluation because neither the originals nor their translations contained any verbs.
Type    Corpus    # lines   # src tok (EN)   # tgt tok (FR)
Train   Generic   1M        24M              26M
Train   EMEA      4,393     48k              56k
Test    EMEA      2,787     23k              27k

Table 7: Data used to train and adapt the generic model to a specific domain. The test corpus also belongs to the specific domain.
training data. Both systems share the same generic NMT vocabularies. As can be seen, the adapt system improves its accuracy drastically after a single additional training epoch, obtaining a BLEU score similar to the full system (separated by 0.91 BLEU). Note also that each additional epoch using the in-domain training data takes less than 50 seconds to be processed, while training the full system needs more than 17 hours.
Results validate the utility of the adaptation approach. A human post-editor would take advantage of using new training data as soon as it becomes available, without needing to wait for a long full training process. However, the comparison is not entirely fair, since a full training would allow including the in-domain vocabulary in the new full model, which would surely result in an additional accuracy improvement.
4.6.2 Post-editing Engine
The recent success of Pure Neural Machine Translation has led to the application of this technology to various related tasks and in particular to Automatic Post-Editing (APE). The goal of this task
Figure 4: Accuracy results of RBMT, NMT, NPE and NPE multi-source.
is to simulate the behavior of a human post-editor, correcting translation errors made by an MT system.
Until recently, most APE approaches have been based on phrase-based SMT systems, either monolingual (MT output to human post-edition) (Simard et al., 2007) or source-aware (Bechara et al., 2011). For many years now, SYSTRAN has been offering a hybrid Statistical Post-Editing (SPE) solution to enhance the translation provided by its rule-based MT system (RBMT) (Dugast et al., 2007).
Following the success of Neural Post-Editing (NPE) in the APE task of WMT16 (Junczys-Dowmunt and Grundkiewicz, 2016), we have run a series of experiments applying the neural approach in order to improve the RBMT system output. As a first experiment, we compared the performance of our English-to-Korean SPE system trained on technical (IT) domain data to two NPE systems trained on the same data: monolingual NPE and multi-source NPE, where the input-language and MT-hypothesis sequences have been concatenated together into one input sequence (separated by a special token).
Figure 4 illustrates the accuracy (BLEU)
results of four different systems at different training epochs. The RBMT system performs poorly, confirming the importance of post-editing. Both NPE systems clearly outperform SPE. It can also be observed that adding source information, even in the simplest way possible (NPE multi-source), without any source-target alignment, considerably improves NPE translation results.
The system performing NPE multi-source obtains accuracy results similar to pure NMT. What can be seen is that NPE multi-source essentially employs the information from the original sentence to produce translations. However, notice that the benefit of utilizing multiple inputs from different sources is clear at earlier epochs, while once the model parameters converge, the difference in performance between the NMT and NPE multi-source models becomes negligible.
Further experiments are currently being conducted, aiming at finding more sophisticated ways of combining the original source and the MT translation in the context of NPE.
5 Performance
As previously outlined, one of the major drawbacks of NMT engines is the need for cutting-edge hardware to face the enormous computational requirements at training and runtime.
Regarding training, there are two major issues: the full training time and the required computational power, i.e. the server investment. For this release, most of our trainings have been running on a single GeForce GTX 1080 GPU (about $2.5k), while in (Wu et al., 2016) the authors mention using 96 K80 GPUs for a full week to train one single language pair (about $250k). On our hardware, a full training on 2x5M sentences (see section 3) took a bit less than one month.
A reasonable target is to keep the training time for any language pair under one week with a reasonable investment, so that the full research community can run competitive trainings, but also, indirectly, so that all of our customers can benefit from the training technology. To do so, we need to better leverage multiple GPUs on a single server, which is ongoing engineering work. We also need to continue exploring how to learn more with less data. For instance, we are convinced that injecting terminology as part of the training data should be competitive with continuing to add full sentences.
Model                      BLEU
baseline                   49.24
40% pruned                 48.56
50% pruned                 47.96
60% pruned                 45.90
70% pruned                 38.38
60% pruned and retrained   49.26

Table 8: BLEU scores of pruned models on an internal test set.
The training time can also be shortened by better control of the training cycle. We have shown that multiple features boost the training pace, and that going to bigger networks clearly improves performance. For instance, we are using a bidirectional 4-layer RNN in addition to our regular RNN, while in Wu et al. (2016) the authors mention using a bidirectional RNN only for the first layer. We need to understand these phenomena better and restrict the network to the minimum needed, in order to reduce the model size.
Finally, the work on specialization described in sections 4.6.1 and 5.2 is promising for long-term maintenance: we could reach a point where we do not need to retrain from scratch but continuously improve an existing model, and use teacher models to boost initial trainings.
Regarding runtime performance, we have been exploring the following areas and are reaching today throughputs compatible with production requirements, not only using GPUs but also using CPUs, and we report our different strategies in the following sub-sections.
5.1 Pruning
Pruning the parameters of a neural network is a common technique to reduce its memory footprint. This approach has been proven efficient for NMT tasks in See et al. (2016). Inspired by this work, we introduced similar pruning techniques in seq2seq-attn. We reproduced the finding that model parameters can be pruned by up to 60% without any performance loss after retraining, as shown in Table 8.
With a large pruning factor, the neural network weights can also be represented with sparse matrices. This implementation can lead to lower computation time but, more importantly, to a smaller memory footprint that allows us to target more environments. Figures 5 and 6 show experiments
Figure 5: Processing time to perform a 1000 x 1000 matrix multiplication on a single thread.

Figure 6: Memory needed to perform a 1000 x 1000 matrix multiplication.
involving sparse matrices using Eigen9. For example, when using float precision, a multiplication with a sparse matrix already begins to take less memory when 35% of its parameters are pruned.
Related to this work, we present in Section 5.4 our alternative Eigen-based decoder that allows us to support sparse matrices.
5.2 Distillation
Despite being surprisingly accurate, NMT systems need deep networks in order to perform well. Typically, a 4-layer LSTM with 1000 hidden units per layer (4 x 1000) is used to obtain state-of-the-art results. Such models require cutting-edge hardware for training in reasonable time, while inference also becomes challenging on standard setups or on small devices such as mobile phones. Hence, compressing deep models into smaller
9 http://eigen.tuxfamily.org
networks has been an active area of research.
Following the work in (Kim and Rush, 2016), we experimented with sequence-level knowledge distillation in the context of an English-to-French NMT task. Knowledge distillation relies on training a smaller student network to perform better by learning the relevant parts of a larger teacher network, hence not wasting parameters on trying to model the entire space of translations. Sequence-level is the knowledge distillation variant where the student model mimics the teacher's actions at the sequence level.
The experiment is summarized in 3 steps:
- train a teacher model on a source/reference training set,
- use the teacher model to produce translations for the source training set,
- train a student model on the new source/translation training set.
For our initial experiments, we produced 35-best translations for each sentence of the source training set, and used a normalized n-gram matching score, computed at the sentence level, to select the closest translation to each reference sentence. The original training source sentences and their translated hypotheses were used as training data to learn a 2 x 300 LSTM network.
Results showed slightly higher accuracy for a 70% reduction in the number of parameters and a 30% increase in decoding speed. In a second experiment, we learned a student model with the same structure as the teacher model. Surprisingly, the student clearly outperformed the teacher model by nearly 1.5 BLEU.
We hypothesize that the translation performed over the target side of the training set produces a sort of normalization of the language, which is by construction very heterogeneous. Such normalization eases the translation task, which can then be learned by not-so-deep networks with similar accuracy levels.
5.3 Batch Translation
To increase the translation speed for large texts, we support batch translation that works in addition to the beam search. It means that for a beam of size K and a batch of size B, we forward K x B sequences into the model. Then, the decoder output is split across each batch and the beam path for each sentence is updated sequentially.
Figure 7: Token throughput when decoding from a student model (see section 5.2) with a beam of size 2 and using float precision. The experiment was run using a standard Torch + OpenBLAS install and 4 threads on a desktop Intel i7 CPU.
As sentences within a batch can have large variations in size, extra care is needed to mask accordingly the output of the encoder and the attention softmax over the source sequences.
Figure 7 shows the speedup obtained using batch decoding in a typical setup.
5.4 C++ Decoder
While Torch is a powerful and easy-to-use framework, we chose to develop an alternative C++ implementation for decoding on CPU. It increases our control over the decoding process and opens the path to further memory and speed improvements, while making deployment easier.
Our implementation is graph-based and uses Eigen for efficient matrix computation. It can load and infer from Torch models.
For this release, experiments show that the decoding speed is on par with or faster than the Torch-based implementation, especially in a multi-threaded context. Figure 8 shows the better use of parallelization of the Eigen-based implementation.
6 Evaluation
Evaluation of machine translation has always been a challenge and the subject of many papers and dedicated workshops (Bojar et al., 2016). While automatic metrics are now used as a standard in the research world and have shown good correlation with human evaluation, ad-hoc human evaluation or productivity analysis metrics are rather used in the industry (Blain et al., 2011).
Figure 8: Token throughput with a batch of size 1 in the same conditions as Figure 7.
As a translation solution company, even if automatic metrics are used throughout the training process (and we give scores in section 6.1), we care about human evaluation of the results. Wu et al. (2016) mention human evaluation but simultaneously cast a doubt on the humans referenced to translate or evaluate. In this context, the claim "almost indistinguishable from human translation" is at the same time strong but also very vague. On our side, we have observed during all our experiments and the preparation of specialized models an unprecedented level of quality, and contexts where we could claim super-human translation quality.
However, we need to define very carefully the tasks, the humans being compared to, and the nature of the evaluation. For evaluating technical translation, the nature of the evaluation is somewhat easy and really depends on the user expectation: is the meaning properly conveyed, or is the sentence faster to post-edit than to translate? Also, to avoid doubts about the integrity or competency of the evaluation, we sub-contracted the task to CrossLang, a company specialized in machine translation evaluation. The test protocol was defined collaboratively and, for this first release, we decided to perform a ranking of the different systems; we present in section 6.2 the results obtained on two very different language pairs: English to/from French, and English to Korean.
Finally, in section 6.3, we also present some qualitative evaluation results showing specificities of Neural Machine Translation.
6.1 Automatic Scoring and System Comparison
Figure 9 plots automatic accuracy results (BLEU) and perplexities for all language pairs. The high correlation between perplexity and BLEU scores is remarkable, showing that language pairs with lower perplexity yield higher BLEU scores. Note also that different amounts of training data were used for each system (see Table 2). BLEU scores were calculated over an internal test set.
From the beginning of this report we have used internal validation and test sets, which makes it difficult to compare the performance of our systems to other research engines. However, we must keep in mind that our goal is to account for improvements in our production systems. We focus on human evaluations rather than on any automatic evaluation score.
6.2 Human Ranking Evaluation
To evaluate translation outputs and compare them with human translation, we have defined the following protocol.
1. For each language pair, 100 in-domain sentences (*) are collected.
2. These sentences are sent to human translation (**) and translated with the candidate model and with available online translation services (***).
3. Without notification of the mix of human and machine translation, a team of 3 professional translators or linguists fluent in both source and target languages is then asked to rank 3 random outputs for each sentence based on their preference as translations. Preference includes accuracy as a priority, but also fluency of the generated translation. They have the choice to give the 3 outputs different ranks, or can also decide to give 2 or all 3 of them the same rank if they cannot decide.
(*) For the Generic domain, sentences from recent news articles were selected online; for Technical (IT), sentences from the translation memory defined in section 4.6 were kept apart from the training.
(**) For human translation, we used a translation agency (human-pro) and an online collaborative translation platform (human-casual).
(***) We used Naver Translator10 (Naver), Google Translate11 (Google) and Bing Translator12 (Bing).

10 http://translate.naver.com
For this first release, we experimented with the evaluation protocol on 2 extremely different categories of language pairs. On one hand, English-French, which is probably the most studied language pair for MT and for which resources are very large: (Wu et al., 2016) mention about 36M sentence pairs used in their NMT training, and the equivalent PBMT is completed by a web-scale target-side language model13. Also, as English is a lowly inflected language, the current phrase-based technology with English as target language is more competitive, due to the relative simplicity of the generation and the weight of gigantic language models.
On the other hand, English-Korean is one of the toughest language pairs due to the large distance between the English and Korean languages, the small availability of training corpora, and the rich agglutinative morphology of Korean. For a real comparison, we ran the evaluation against the Naver translation service from English into Korean, Naver being the main South Korean search engine.
Tables 9 and 10 describe the different evaluations and their results.
Several interesting outcomes:
- Our vanilla English→French model outperforms existing online engines and our best-of-breed technology.
- For French→English, while the model slightly outperforms (human-casual) and our best-of-breed technology, it stays behind Google Translate, and more significantly behind Microsoft Translator.
- The generic English→Korean model shows closer results to human translation and clearly outperforms existing online engines.
- The in-domain specialized model surprisingly outperforms the reference human translation.
We are aware that far more evaluations are necessary and we will be launching a larger evaluation plan for our next release.

11 http://translate.google.com
12 http://translator.bing
13 In 2007, Google already mentioned using 2 trillion words in their language models for machine translation (Brants et al., 2007).
Figure 9: Perplexity and BLEU scores obtained by NMT engines for all language pairs. Perplexity and BLEU scores were calculated respectively on validation and test sets. Note that the English-to-Korean and Farsi-to-English systems are not shown in this plot, achieving respectively (18.94, 16.50) and (3.84, 34.19) for perplexity and BLEU.
Language Pair      Domain (*)       Human Translation (**)      Online Translation (***)
English→French     Generic          human-pro, human-casual     Bing, Google
French→English     Generic          human-pro, human-casual     Bing, Google
English→Korean     News             human-pro                   Naver, Google, Bing
English→Korean     Technical (IT)   human-pro                   N/A

Table 9: Performed evaluations.
Informally, we do observe that the biggest performance jumps are observed on complicated language pairs like English-Korean or Japanese-English, showing that NMT engines are better able to handle major structural differences, but also on languages with lower resources like Farsi-English, demonstrating that NMT is able to learn better with less data, and we will explore this even more.
Finally, we are also launching, in parallel, a real-life beta-testing program with our customers so that we can also obtain formal feedback from their use cases related to specialized models.
6.3 Qualitative Evaluation
In Table 11, we report the results of an error analysis of NMT, SMT and RBMT for the English-French language pair. This evaluation confirms the translation ranking performed previously, but also exhibits some interesting facts:
- The most salient error category comes from missing words or parts of sentences. It is interesting to see, though, that half of these omissions were considered okay by the reviewers and most of the time not counted as errors; it indeed shows the ability of the system not only to translate but to summarize and get to the point, as we would expect from human translation. Of course, we need to fix the cases where the omissions are not okay.
- Another finding is that the engine manages quotes badly, and we will make sure to specifically teach that in our next release. Other low-hanging fruit are the case generation, which sometimes seems to get off-track, and the handling of Named Entities, which we have already introduced in the system but not connected for this release.
- On the positive side, we observe that NMT drastically improves fluency, slightly reduces meaning selection errors, and handles morphology better although it does not yet have any specific access to morphology (like sub-word embeddings). Regarding meaning selection errors, we will focus on teaching more expressions to the system, which is still a major structural weakness compared to PBMT engines.
7 Practical Issues
Translation results from an NMT system are, at first glance, so incredibly fluent that you are diverted from their downsides. Over the course of the training and during our internal evaluation, we found multiple practical issues worth sharing:
1. translating very long sentences
2. translating user input such as a short word or the title of a news article
3. cleaning the corpus
4. alignment
NMT is greatly impacted by the training data, from which it learns how to generate accurate and fluent translations holistically. Because the maximal length of a training instance was limited to a certain length during the training of our models, NMT models are puzzled by sentences that exceed this length, not having encountered such training data. Hard splitting of longer sentences has some side effects, since the model considers both parts as full sentences. As a consequence, whatever the limit we set for sentence length, we also need to teach the neural network how to handle longer sentences. For that, we are exploring several options, including using a separate model based on source/target alignments to find the optimal breaking point, and introducing special tokens.
Likewise, very short phrases and incomplete or fragmentary sentences were not included in our training data, and consequently NMT systems fail to correctly translate such input texts (e.g. Figure 10). Here also, we simply need to teach the model to handle such input by injecting additional synthetic data.
Also, our training corpus includes a number of resources that are known to contain much noisy data. While NMT seems more robust than other technologies in handling noise, we can still perceive the effect of noise in translations, in particular for recurring noise. An example is English-to-Korean, where we see the model trying to systematically convert currency amounts in addition to translating. As demonstrated in Section 5.2, preparing the right kind of input for NMT seems to result in more efficient and accurate systems, and such a procedure should also be applied more aggressively directly to the training data.
Finally, let us note that source/target alignment is a must for our users, but this information is missing from NMT output due to soft alignment. To hide this issue from the end users, multiple alignment heuristics are used to present a traditional target-source alignment.
8 Further Work
In this section we outline further experiments currently being conducted. First, we extend NMT decoding with the ability to make use of multiple models: both external models, particularly an n-gram language model, and multiple networks decoded together (ensemble). We also work on using external word embeddings, and on modelling unknown words within the network.
8.1 Extending Decoding
8.1.1 Additional LM
As proposed by (Gulcehre et al., 2015), we conducted experiments to integrate an n-gram language model estimated over a large dataset into our Neural MT system. We followed a shallow fusion integration, similar to how language models are used in a standard phrase-based MT decoder.
In the context of beam search decoding in NMT, at each time step t, candidate words x are hypothesized and assigned a score according to the neural network, p_{NMT}(x). Sorted according to their respective scores, the K-best candidates are then reranked using the score assigned by the language model, p_{LM}(x). The resulting probability of each candidate is obtained by a weighted sum of the log-probabilities:

\log p(x) = \log p_{LM}(x) + \beta \log p_{NMT}(x)

where \beta is a hyper-parameter that needs to be tuned.
This technique is especially useful for handling out-of-vocabulary (OOV) words. Deep networks are technically constrained to work with limited vocabulary sets (in our experiments we use target vocabularies of 60k words), hence suffering from important OOV problems. In contrast, n-gram language models can be learned for very large vocabularies.
Initial results show the suitability of the shallow integration technique to select the appropriate OOV candidate out of a dictionary list (external resource). The probability obtained from the language model is the unique modeling alternative for those word candidates for which the neural network produces no score.
8.1.2 Ensemble Decoding
Ensemble decoding has been verified as a practical technique to further improve performance compared to a single encoder-decoder model (Sennrich et al., 2016b; Wu et al., 2016; Zhou et al., 2016). The improvement comes from the diversity of predictions from different neural network models, which are obtained through random initialization seeds and shuffling of examples during training, or through different optimization methods towards the development set (Cho et al., 2015). As a consequence, 3-8 isolated models are trained and ensembled together, considering the cost of memory and training speed. Also, (Junczys-Dowmunt et al., 2016) provides some methods to accelerate the training by choosing different checkpoints as the final models.
We implement ensemble decoding by averaging the output probabilities for each estimation of target word x with the formula:

p_ens(x) = (1/M) * sum_{m=1}^{M} p_m(x)

where p_m(x) is the probability assigned to x by model m, and M is the number of neural models.
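The following is a minimal sketch of this averaging step, assuming each model exposes its per-step output distribution as a probability vector over the shared target vocabulary; names are illustrative only.

import numpy as np

def ensemble_probs(model_probs):
    """Average the per-word output distributions of M models at one decoding
    step: p_ens(x) = (1/M) * sum_m p_m(x).

    model_probs: array of shape (M, vocab_size); each row sums to 1.
    """
    return np.mean(model_probs, axis=0)

# Two toy models over a 3-word vocabulary.
p = np.array([[0.6, 0.3, 0.1],
              [0.4, 0.4, 0.2]])
print(ensemble_probs(p))   # [0.5, 0.35, 0.15]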
8.2 Extending word embeddings
Although NMT technology has recently accomplished a major breakthrough in the Machine Translation field, it still remains constrained by the limited vocabulary size and by the use of bilingual training data. In order to reduce the negative impact of both phenomena, experiments are currently being conducted on using external word embedding weights.
Those external word embeddings are not learned by the NMT network from bilingual data only, but by an external model (e.g. word2vec (Mikolov et al., 2013)). They can therefore be estimated from larger monolingual corpora, incorporating data from different domains.
Another advantage lies in the fact that, since external word embedding weights are not modified during NMT training, it is possible to use a different vocabulary for this fixed part of the input during the application or re-training of the model (provided that the weights for the words in the new vocabulary come from the same embedding space as the original ones). This may allow a more efficient adaptation to data coming from a different domain with a different vocabulary.
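As an illustration, here is a minimal PyTorch sketch of such a frozen external embedding layer; our production system is Torch/Lua-based, so this is only an analogy, and the tensor shapes and variable names are assumptions.

import torch
import torch.nn as nn

# Hypothetical pretrained word2vec-style vectors, one row per vocabulary word;
# in practice these would be loaded from an externally trained model.
pretrained = torch.randn(60000, 500)

# Frozen lookup: the weights are not updated during NMT training, so the
# vocabulary (and its vectors) can later be swapped for another one taken
# from the same embedding space.
fixed_embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

token_ids = torch.tensor([[12, 431, 7]])        # a toy source sentence
source_vectors = fixed_embedding(token_ids)     # shape: (1, 3, 500)
print(source_vectors.shape)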
Human-pro / Human-casual / Bing / Google / Naver / Systran V8
English→French: -64.2, -18.5, +48.7, +10.4, +17.3
French→English: -56.8, +5.5, -23.1, -8.4, +5
English→Korean: -15.4, +35.5, +33.7, +31.4, +13.2
English→Korean (IT): +30.3
Table 10: This table shows the relative preference for SYSTRAN NMT compared to other outputs, calculated as follows: for each triplet where outputs A and B were compared, we note pref_{A>B} the number of times where A was strictly preferred to B, and E_A the total number of triplets including output A. For each output E, the number in the table is compar(SNMT, E) = (pref_{SNMT>E} - pref_{E>SNMT}) / E_SNMT. compar(SNMT, E) is a percentage value in the range [-1; 1].
Figure 10: Effect of translating a single word with a model not trained for such input.
Category | NMT | RB | SMT | Example
Entity - Major | 7 | 5 | 0 | Galaxy Note 7 → note 7 de Galaxy vs. Galaxy Note 7
Format | 3 | 1 | 1 | (number localization): $2.66 → 2.66 $ vs. 2,66 $
Morphology - Minor, Local | 3 | 2 | 3 | (tense choice): accused → accusait vs. a accusé
Morphology - Minor, Sentence Level | 3 | 3 | 5 | the president [...], she emphasized → la président [...], elle a souligné vs. la [...], elle
Morphology - Major | 3 | 4 | 6 | he scanned → il scanne vs. il scannait
Meaning Selection - Minor | 9 | 17 | 7 | game → jeu vs. match
Meaning Selection - Major, Prep Choice | 4 | 9 | 10 | [... facing war crimes charges] over [its bombardment of ...] → contre vs. pour
Meaning Selection - Major, Expression | 3 | 7 | 1 | [two] counts of murder → chefs de meurtre vs. chefs d'accusation de meurtre
Meaning Selection - Major, Not Translated | 5 | 1 | 4 | he scanned → il scanned vs. il a scanné
Meaning Selection - Major, Contextual Meaning | 14 | 39 | 14 | 33 senior Republicans → 33 républicains supérieurs vs. 33 ténors républicains
Word Ordering and Fluency - Minor | 2 | 28 | 15 | (determiner): [without] a [specific destination in mind] → sans une destination [...] vs. sans destination [...]
Word Ordering and Fluency - Major | 3 | 16 | 15 | (word ordering): in the Sept. 26 deaths → dans les morts septembre de 26 vs. dans les morts du 26 septembre
Missing or Duplicated - Missing Minor | 7 | 3 | 1 | a week after the hurricane struck → une semaine après l'ouragan vs. une semaine après que l'ouragan ait frappé
Missing or Duplicated - Missing Major | 6 | 1 | 3 | As a working oncologist, Giordano knew [...] → Giordano savait vs. En tant qu'oncologue en fonction, Giordano savait
Missing or Duplicated - Duplicated Major | 2 | 2 | 1 | for the Republican presidential nominee → au candidat républicain républicain vs. au candidat républicain
Misc. (Minor) - Quotes, Punctuation | 2 | 0 | 0 | (misplaced quotes)
Misc. (Minor) - Case | 6 | 0 | 2 | [...] will be affected by Brexit → [...] Sera touchée Par le brexit vs. [...] sera touchée par le Brexit
Total - Major | 47 | 84 | 54 |
Total - Minor | 36 | 55 | 35 |
Total - Minor & Major | 83 | 139 | 89 |
Table 11: Human error analysis done for 50 sentences of the corpus defined in Section 6.2 for English-French on NMT, SMT (Google) and RBMT outputs. Error categories are: issue with entity handling (Entity); issue with Morphology, either local or reflecting sentence-level missing agreements; issue with Meaning Selection, split into different sub-categories; issue with Word Ordering or Fluency (wrong or missing determiner, for instance); missing or duplicated words. Errors are Minor when the reader could still understand the sentence without access to the source, otherwise they are considered Major. Erroneous words are counted in only one category even if several problems add up, for instance ordering and meaning selection.
8.3 Unknown word handling
When an unknown word is generated in the target output sentence, a general encoder-decoder with attentional mechanism relies on heuristics based on attention weights: the source word with the most attention is either directly copied as-is or looked up in a dictionary.
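A minimal sketch of this post-processing heuristic, assuming the decoder returns its attention weights along with the hypothesis; the helper name and dictionary format are assumptions, not the exact implementation used in our systems.

def replace_unknowns(target_tokens, source_tokens, attention, dictionary,
                     unk="<unk>"):
    """Post-process a translation: each <unk> produced by the decoder is
    replaced by the source word receiving the most attention at that step,
    either looked up in a bilingual dictionary or copied as-is.

    attention: list of per-target-step weight lists over the source tokens.
    """
    output = []
    for t, word in enumerate(target_tokens):
        if word == unk:
            weights = attention[t]
            src = source_tokens[weights.index(max(weights))]
            word = dictionary.get(src, src)   # dictionary lookup, else copy
        output.append(word)
    return output

# Toy usage.
src = ["Giordano", "knew"]
tgt = ["<unk>", "savait"]
attn = [[0.9, 0.1], [0.2, 0.8]]
print(replace_unknowns(tgt, src, attn, dictionary={}))  # ['Giordano', 'savait']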
In the recent literature (Gu et al., 2016; Gulcehre et al., 2016), researchers have attempted to directly model unknown word handling within the attention and decoder networks. Having the model learn to take control of both decoding and unknown word handling should be the most effective way to address the single unknown-word replacement problem, and we are implementing and evaluating these approaches within our framework.
9 Conclusion
Neural MT has progressed at a very impressive rate, and it has proven itself to be competitive against online systems trained on training data whose size is several orders of magnitude larger. There is no doubt that Neural MT is a technology that will continue to have a great impact on academia and industry. However, at its current status, it is not without limitations: on language pairs that have an abundant amount of monolingual and bilingual training data, phrase-based MT still performs better than Neural MT, because Neural MT is still limited in its vocabulary size and in its utilization of monolingual data.
Neural MT is not a one-size-fits-all technology: no single general configuration of the model universally works on all language pairs. For example, subword tokenization such as BPE provides an easy way out of the limited vocabulary problem, but we have discovered that it is not always the best choice for all language pairs. The attention mechanism is still not at a satisfactory status, and it needs to become more accurate for better controlling the translation output and for better user interactions.
For upcoming releases, we have begun conducting further experiments with the injection of various kinds of linguistic knowledge, in which SYSTRAN possesses foremost expertise. We will also apply our engineering know-how to conquer the practical issues of NMT one by one.
Acknowledgments
We would like to thank Yoon Kim, Prof. Alexander Rush and the rest of the members of the Harvard NLP group for their support with the open-source code, their proactive advice and their valuable insights on the extensions.
We are also thankful to CrossLang and Homin Kwon for their thorough and meaningful definition of the evaluation protocol, and to their evaluation team as well as Inah Hong and SunHyung Lee for their work.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473. Demoed at NIPS 2014: http://lisa.iro.umontreal.ca/mt-demo/.
Hanna Bechara, Yanjun Ma, and Josef van Genabith. 2011. Statistical post-editing for a statistical MT system. In MT Summit, volume 13, pages 308-315.
Frédéric Blain, Jean Senellart, Holger Schwenk, Mirko Plitt, and Johann Roturier. 2011. Qualitative analysis of post-editing for high quality machine translation. MT Summit XIII: the Thirteenth Machine Translation Summit [organized by the] Asia-Pacific Association for Machine Translation (AAMT), pages 164-171.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1-46, Lisbon, Portugal, September.
Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, August.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Citeseer.
Mauro Cettolo, Nicola Bertoldi, Marcello Federico, Holger Schwenk, Loïc Barrault, and Christophe Servan. 2014. Translation project adaptation for MT-enhanced computer assisted translation. Machine Translation, 28(2):127-150, October.
Yu Chen and Andreas Eisele. 2012. MultiUN v2: UN documents with multilingual alignments. In LREC, pages 2500-2504.
Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. CoRR, abs/1607.01628v1.
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147.
Loïc Dugast, Jean Senellart, and Philipp Koehn. 2007. Statistical Post-Editing on SYSTRAN's Rule-Based Translation System. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 220-223, Prague, Czech Republic, June. Association for Computational Linguistics.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644-648, Atlanta, Georgia, June.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631-1640, Berlin, Germany, August. Association for Computational Linguistics.
Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140-149, Berlin, Germany, August. Association for Computational Linguistics.
Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. CoRR, abs/1605.04800.
Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions. CoRR, abs/1610.01108, October.
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. CoRR, abs/1606.07947.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86.
Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. CoRR, abs/1604.00788.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421, Lisbon, Portugal, September. Association for Computational Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Daniel Ortiz-Martínez, Ismael García-Varea, and Francisco Casacuberta. 2010. Online learning for interactive statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 546-554. Association for Computational Linguistics.
Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of neural machine translation models via pruning. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), Berlin, Germany, August.
Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. CoRR, abs/1606.02892.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35-40, San Diego, California, USA, June. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany, August. Association for Computational Linguistics.
Michel Simard, Cyril Goutte, and Pierre Isabelle. 2007. Statistical phrase-based post-editing. In Proceedings of NAACL.
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214-2218.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. Technical report, Google.
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371-383.
A Remarkable Results
In this section, we highlight a series of remarkable translations (positively remarkable, and also a few negatively outstanding ones) that we found during evaluation for a variety of languages.
en→fr
Source: The report also said that although three-quarters of Chinese say their country is playing a larger role in the world than it did a decade ago, most want their government to focus on domestic issues rather than helping other nations.
NMT: Le rapport dit aussi que, bien que les trois quarts des Chinois disent que leur pays joue un rôle plus important dans le monde qu'il ne l'a fait il y a dix ans, la plupart veulent que leur gouvernement se concentre sur les questions nationales plutôt que d'aider d'autres nations.
Google Translate (2016/10/15): Le rapport indique également que, bien que les trois quarts des Chinois disent que leur pays joue un rôle plus important dans le monde qu'il y a dix ans, la plupart veulent que leur gouvernement de se concentrer sur les questions intérieures plutôt que d'aider les autres nations.

fr→en
Source: Fenêtre à rue se fermait mal, très très bruyant et mal dormi. Petit déjeuner banal.
NMT: The room was very clean and the room was very clean.
Google Translate (2016/10/15): street window will not close, very very noisy and slept badly. mundane breakfast.

en→ko
Source: Forty Republican senators and congressmen have revoked their support for the Republican presidential nominee with nearly 30 of them calling on him to quit the race altogether in recent days.
NMT: 40 , 30 .
(From Naver NMT): 4030, 30 .

en→it
Source: all this leads to feelings of insecurity and conflict thus reinforcing negative spirals.
NMT: tutto ciò porta a sentimenti di insicurezza e di conflitto, rafforzando così la spirale negativa.
Google Translate (2016/10/15): tutto ciò genera elementi di insicurezza e di conflitto rinforzando quindi spirali negative.

it→fr
Source: Mi sembra che ciò non abbia nulla a che vedere con una penuria alimentare: si tratta di una vera carestia, di una vera e propria catastrofe che non è naturale ma strutturale, provocata da un regime folle, criminale, psicopatico, peggiore dei peggiori regimi esistiti nell'Unione Sovietica e nella Romania di Ceaucescu.
NMT: Il me semble que cela n'a rien à voir avec une pénurie alimentaire: il s'agit d'une véritable famine, d'une véritable catastrophe qui n'est pas naturelle mais structurelle, provoquée par un régime insensé, criminel, psychopathe, pire que les pires régimes de l'Union soviétique et de la Roumanie de Ceaucescu.
Google Translate (2016/10/15): Il me semble que cela n'a rien à voir avec une pénurie alimentaire: il est une véritable famine, une catastrophe qui est pas naturel, mais une structure, provoquée par un régime criminel fou, psychopathe, le pire des pires régimes existait l'Union soviétique et en Roumanie de Ceaucescu.

it→fr
Source: Ebbene io sono in totale disaccordo: lo Stato non deve passare in secondo piano; deve invece intervenire di più e garantire una maggiore vigilanza in ambito economico.
NMT: je suis tout à fait d'accord: l'état ne doit pas passer sous silence; il doit plutôt agir et assurer une plus grande vigilance dans le domaine économique.
Google Translate (2016/10/15): Eh bien, je suis en désaccord total: l'État ne doit pas prendre un siège arrière; Il doit plutôt agir plus et assurer une plus grande supervision dans l'économie.

es→en
Source: Es posible un Bangladesh más amable con los niños?
NMT: Is Bangladesh more friendly with children?
Google Translate (2016/10/15): Can a kinder Bangladesh with children?

en→de
Source: On the other hand, businesses are constantly complaining about the shortage of specialist staff, using this supposed shortage as an excuse to use cheap labour.
NMT: Andererseits beschweren sich Unternehmen ständig über den Mangel an Fachpersonal, indem sie diese angebliche Knappheit als Vorwand benutzen, billige Arbeitskräfte einzusetzen.
Google Translate (2016/10/15): Auf der anderen Seite beschweren sich Unternehmen ständig über den Mangel an Fachkräften, wobei diese angebliche Mangelhaftigkeit als Ausrede für billige Arbeitskräfte verwendet wird.

zh→en
Source: ,.
NMT: Morning News, now it's our interactive time.
(From Google GNMT): North Korea heard the world, and now to our interaction time friends.

br→en
Source: Face às decisões de Nice, tínhamos de ter em conta esta situação e de adotar uma metodologia que contemplasse as transformações necessárias.
NMT: Faced with the Nice decisions, we had to take this situation into account and adopt a methodology that would address the necessary changes.
Google Translate (2016/10/15): Nice view of the decisions we had to take account of this situation and adopt a methodology that took into consideration the necessary changes.

fr→br
Source: Une information présentée au moment opportun signifie la transparence, laquelle crée la confiance et évite à l'entreprise de subir des pertes.
NMT: A informação apresentada no momento oportuno significa transparência, que cria confiança e evita que a empresa sofra perdas.
Google Translate (2016/10/15): Informação apresentada em uma transparência meio oportuna, que cria confiança e evita a empresa a sofrer perdas.
B Online System Parameters
All systems were trained with 4 LSTM layers, the size of word embedding vectors was 500, dropout was set to 0.3 and we used a bidirectional RNN (BRNN). The column Guided alignment indicates whether the network was trained with guided alignments and at which epoch the feature was stopped.
Language pair | Tokenization | RNN size | Optimal epoch | Guided alignment | NER aware | Special
zh→en | word boundary-generic | 800 | 12 | epoch 4 | yes |
en→it | generic | 800 | 16 | epoch 4 | yes |
it→en | generic | 800 | 16 | epoch 4 | yes |
en→ar | generic-crf | 800 | 15 | epoch 4 | no |
ar→en | crf-generic | 800 | 15 | epoch 4 | no |
en→es | generic | 800 | 18 | epoch 4 | yes |
es→en | generic | 800 | 18 | epoch 4 | yes |
en→de | generic-compound splitting | 800 | 17 | epoch 4 | yes |
de→en | compound splitting-generic | 800 | 18 | epoch 4 | yes |
en→nl | generic | 800 | 17 | epoch 4 | no |
nl→en | generic | 800 | 14 | epoch 4 | no |
en→fr | generic | 800 | 18 | epoch 4 | yes | double corpora (2 x 5M)
fr→en | generic | 800 | 17 | epoch 4 | yes |
ja→en | bpe | 800 | 11 | no | no |
fr→br | generic | 800 | 18 | epoch 4 | yes |
br→fr | generic | 800 | 18 | epoch 4 | yes |
en→pt | generic | 800 | 18 | epoch 4 | yes |
br→en | generic | 800 | 18 | epoch 4 | yes |
fr→it | generic | 800 | 18 | epoch 4 | yes |
it→fr | generic | 800 | 18 | epoch 4 | yes |
fr→ar | generic-crf | 800 | 10 | epoch 4 | no |
ar→fr | crf-generic | 800 | 15 | epoch 4 | no |
fr→es | generic | 800 | 18 | epoch 4 | yes |
es→fr | generic | 800 | 18 | epoch 4 | yes |
fr→de | generic-compound splitting | 800 | 17 | epoch 4 | no |
de→fr | compound splitting-generic | 800 | 16 | epoch 4 | no |
nl→fr | generic | 800 | 16 | epoch 4 | no |
fr→zh | generic-word segmentation | 800 | 18 | epoch 4 | no |
ja→ko | bpe | 1000 | 18 | no | no |
ko→ja | bpe | 1000 | 18 | no | no |
en→ko | bpe | 1000 | 17 | no | no | politeness
fa→en | basic | 800 | 18 | yes | no |
Table 12: Parameters for online systems.