Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
[email protected]
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained with the first method is high and the errors are relatively few.
1 https://mymemory.translated.net/
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumption behind these validations seems to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s. Muirfield C.C." is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments we use a modified Church-Gale length difference (Tiedemann, 2011) presented in equation 1:
$$CG = \frac{l_s - l_d}{\sqrt{3.4\,(l_s + l_d)}} \qquad (1)$$

where $l_s$ and $l_d$ are the lengths of the source and destination segments.
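The score is straightforward to compute. A minimal sketch, assuming (as in the original Gale-Church work) that lengths are measured in characters; the paper does not state the unit explicitly:

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference of equation (1).

    Lengths are measured in characters; this is an assumption,
    the text does not state the unit explicitly.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```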
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the gaussian curve. For the Matecat Set the Church-Gale score varies in the interval −4.18 ... 4.26.
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval 0 ... 0.1.
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set (x axis: Church-Gale score; y axis: probability)
Figure 3: The distribution of Church-Gale scores in the Collaborative Set (x axis: Church-Gale score; y axis: probability)
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set
For the Collaborative Set the Church-Gale score varies in the interval −131.51 ... 60.15.
To see how close the distributions of Church-Gale scores are to a normal distribution, we have plotted them against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples we created a training set and a test set distributed as follows:
• Training Set. It contains 1243 bi-segments, of which 373 are negative examples.

• Test Set. It contains 309 bi-segments, of which 87 are negative examples.
The proportion of negative examples in both sets is approximately 30%.
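The tail-biased sampling described above can be implemented in several ways. A sketch that weights each bi-segment by the absolute value of its score; this particular weighting is an illustrative choice, since the text only says the sampling favours the tails:

```python
import numpy as np

def sample_biased_to_tails(bisegments, scores, n, seed=0):
    """Draw n bi-segments, favouring those far from the center.

    Weighting by |score| is an illustrative choice; the description
    above only says the sampling is biased towards the tails.
    """
    rng = np.random.default_rng(seed)
    weights = np.abs(np.asarray(scores)) + 1e-6  # avoid zero-probability items
    probs = weights / weights.sum()
    idx = rng.choice(len(bisegments), size=n, replace=False, p=probs)
    return [bisegments[i] for i in idx]
```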
Figure 5: The Q-Q plot for the Matecat Set (x axis: theoretical quantiles; y axis: sample quantiles)
Figure 6: The Q-Q plot for the Collaborative Set (x axis: theoretical quantiles; y axis: sample quantiles)
4 Machine Learning
In this section we discuss the features computedfor the training
and the test sets. Moreover, webriefly present the algorithms used
for classifica-tion and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following (a code sketch of some of them is given after the list):
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, especially in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist and, if it exists, is present/is not present in both the source and the target segments).
• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the total number of capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source segment and the target segment, and the total number of only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr" then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr) then the value is 2.
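A minimal sketch of a few of the features above; all function names are ours, and the exact tokenization and vector definitions used for the paper are assumptions:

```python
import math
import re
from collections import Counter

PUNCTUATION = set(".,;:!?()[]{}\"'")
URL_RE = re.compile(r"https?://\S+")

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def lang_dif(declared: tuple, detected: tuple) -> int:
    """Number of mismatches between declared and detected language codes."""
    return sum(int(a != b) for a, b in zip(declared, detected))

def features(source: str, target: str) -> dict:
    return {
        "same": float(source == target),
        "has_url": float(bool(URL_RE.search(source) or URL_RE.search(target))),
        "has_number": float(bool(re.search(r"\d", source + " " + target))),
        "punctuation_similarity": cosine(
            Counter(c for c in source if c in PUNCTUATION),
            Counter(c for c in target if c in PUNCTUATION),
        ),
    }
```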
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment into the target language lacks punctuation.
The translation of the source English segments into Italian is performed with the Bing API. The computation of the language codes for the bi-segments is done with the highly accurate language detector Cybozu5.
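For readers who want to replicate the language check in Python, the langdetect package is a port of the same Cybozu detector; a sketch (our choice of package, not necessarily what was used for the paper):

```python
from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # make the probabilistic detector deterministic

declared = ("en", "it")
detected = (detect("How are you doing today?"), detect("Va tutto bene, grazie."))
mismatches = sum(int(a != b) for a, b in zip(declared, detected))  # the lang dif value
```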
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011); a training sketch is given after the list:
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than the Decision Tree.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among these neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
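A minimal training sketch with scikit-learn, using synthetic stand-in data shaped like our training set (1243 examples, roughly 30% negative); the hyperparameters are library defaults, not tuned values from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix of section 4.1.
X, y = make_classification(n_samples=1243, n_features=16,
                           weights=[0.3, 0.7], random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Three-fold stratified cross-validation, as in the first evaluation below.
for name, clf in classifiers.items():
    f1 = cross_val_score(clf, X, y, cv=3, scoring="f1")
    print(f"{name}: mean F1 = {f1.mean():.2f}")
```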
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform and generates predictions randomly. The second baseline is called Baseline Stratified and generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1:
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for the Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above and the results are given in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm (SVM) and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Collaborative Sets. Obviously this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case we inspect the confusion matrix for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased even if the recall drops. We make some suggestions in the next section.
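The ~10% figure follows directly from these counts; a quick check:

```python
tp, fp, fn, tn = 175, 42, 32, 60      # SVM confusion counts on the test set
total = tp + fp + fn + tn             # 309 test examples
print(fn / total)                     # ~0.104: good bi-segments wrongly flagged
print(tp / (tp + fp), tp / (tp + fn)) # precision and recall implied by the counts
```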
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is higher than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
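A sketch of this thresholding idea, again on synthetic stand-in data; the 0.9 cut-off is illustrative, not a value we have validated:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X holds the features of section 4.1
# and label 0 marks a false translation.
X, y = make_classification(n_samples=1552, n_features=16,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
p_false = clf.predict_proba(X_test)[:, 0]  # probability of the "false" class (label 0)
to_delete = p_false > 0.9                  # act only on high-confidence negatives
```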
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments. For example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.