Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

    Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

    [email protected]

    Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

    1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

    1https://mymemory.translated.net/

Figure 1: The distribution of automatically acquired memories in MyMemory

However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms on this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and outline further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumption behind these validations seems to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in that setting are different from those in ours.

    2http://www.xbench.net

    3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text “How are you?” translates in Italian as “Come stai?”. Instead of providing the translation, the contributor answers “Bene” (“Fine”).

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment “Early 1980s. Muirfield C.C.” is translated into Italian only partially: “Primi anni 1980” (“Early 1980s”).

The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the vast majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = \frac{l_s - l_d}{\sqrt{3.4\,(l_s + l_d)}}    (1)

where l_s and l_d are the lengths of the source and destination segments.
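As an illustration, the score in equation 1 can be computed with a few lines of Python. Measuring segment length in characters is our assumption here; the unit is not stated explicitly in the text.

    import math

    def church_gale_score(source: str, target: str) -> float:
        """Modified Church-Gale length difference (equation 1).

        Minimal sketch: segment length is measured in characters,
        which is an assumption rather than a detail given in the paper.
        """
        ls, ld = len(source), len(target)
        return (ls - ld) / math.sqrt(3.4 * (ls + ld))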

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval −4.18 ... 4.26.

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval 0 ... 0.1.

3This simple idea is implemented in many sentence aligners.

4Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com


Figure 2: The distribution of Church-Gale scores in the Matecat Set


Figure 3: The distribution of Church-Gale scores in the Collaborative Set



Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set

For the Collaborative Set the Church-Gale score varies in the interval −131.51 ... 60.15.

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-Quantile plots in figures 5 and 6.
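A Q-Q plot of this kind can be produced, for example, with SciPy and matplotlib; the sketch below uses a small made-up sample in place of the real Church-Gale scores.

    import matplotlib.pyplot as plt
    from scipy import stats

    # `scores` stands in for the Church-Gale scores of one bi-segment set;
    # a small made-up sample keeps the snippet self-contained.
    scores = [-0.4, 0.1, 0.3, -1.2, 0.0, 2.1, -0.7, 0.5, 1.4, -0.2]

    stats.probplot(scores, dist="norm", plot=plt)
    plt.title("Q-Q plot against the normal distribution")
    plt.show()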

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples we created a training set and a test set distributed as follows:

• Training Set. It contains 1243 bi-segments and has 373 negative examples.

• Test Set. It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
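The biased sampling from the Collaborative Set can be sketched as follows; the weighting by the absolute Church-Gale score is an illustrative choice, not the exact scheme used to build the sets.

    import random

    def sample_biased_to_tails(bisegments, scores, k, seed=0):
        """Draw k bi-segments, favouring those whose Church-Gale score lies
        far from the center of the distribution (illustrative weighting)."""
        rng = random.Random(seed)
        weights = [abs(s) + 1e-6 for s in scores]  # larger |score| => higher chance
        return rng.choices(bisegments, weights=weights, k=k)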


    Figure 5: The Q-Q plot for the Matecat set


    Figure 6: The Q-Q plot for the Collaborative set


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist and, if it exists, it is present/not present in both the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and in the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and in the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be ”en” and the target segment language code to be ”it”, and the language detector detects ”en” and ”it”, then the value of the feature is 0 (en-en, it-it). If instead the language detector detects ”en” and ”fr” then the value of the feature is 1 (en-en, it-fr), and if it detects ”de” and ”fr” (en-de, it-fr) then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
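To make the feature definitions concrete, the sketch below implements two of them, punctuation similarity and lang dif. The punctuation inventory and the mismatch count are illustrative assumptions, not the exact implementation used for MyMemory.

    import math
    import string
    from collections import Counter

    PUNCT = set(string.punctuation)

    def cosine(u: Counter, v: Counter) -> float:
        """Cosine similarity between two sparse count vectors."""
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = math.sqrt(sum(c * c for c in u.values())) * \
               math.sqrt(sum(c * c for c in v.values()))
        return dot / norm if norm else 0.0

    def punctuation_similarity(source: str, target: str) -> float:
        """Cosine similarity of the punctuation count vectors of the two segments."""
        return cosine(Counter(c for c in source if c in PUNCT),
                      Counter(c for c in target if c in PUNCT))

    def lang_dif(expected, detected) -> int:
        """Number of mismatches between declared and detected language codes,
        e.g. lang_dif(("en", "it"), ("en", "fr")) == 1."""
        return sum(1 for exp, det in zip(expected, detected) if exp != det)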

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than the Decision Tree.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than for logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.
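A minimal sketch of how these classifiers can be instantiated and run with scikit-learn (current API). The default hyper-parameters and the placeholder data standing in for the 17 features of section 4.1 are assumptions; the paper does not report any tuning.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    classifiers = {
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVM (linear kernel)": SVC(kernel="linear"),
        "Gaussian Naive Bayes": GaussianNB(),
        "K-Nearest Neighbors": KNeighborsClassifier(),
    }

    # Placeholder data standing in for the feature vectors of section 4.1
    # (17 features, labels: 1 = true translation, 0 = false translation).
    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((1243, 17)), rng.integers(0, 2, 1243)
    X_test = rng.random((309, 17))

    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)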

    5https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md



    5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform and generates predictions randomly. The second baseline is called Baseline Stratified and generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1:

Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for the Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
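The paper does not include the evaluation script; a rough sketch of the three-fold stratified evaluation and the two random baselines, using scikit-learn's DummyClassifier and placeholder data, could look like this.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Placeholder data standing in for the 1243 training bi-segments
    # (about 30% negative examples, as in the real training set).
    rng = np.random.default_rng(0)
    X = rng.random((1243, 17))
    y = (rng.random(1243) > 0.3).astype(int)

    models = {
        "SVM (linear kernel)": SVC(kernel="linear"),
        "Baseline Uniform": DummyClassifier(strategy="uniform"),
        "Baseline Stratified": DummyClassifier(strategy="stratified"),
    }

    cv = StratifiedKFold(n_splits=3)
    for name, model in models.items():
        f1 = cross_val_score(model, X, y, cv=cv, scoring="f1")
        print(f"{name}: mean F1 = {f1.mean():.2f}")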

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above and the results are given in table 2.

The results of the second evaluation are worse than the results of the first evaluation. For example, the difference between the F1-score of the best performing algorithm, SVM, and that of the stratified baseline is 10 points: about half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Collaborative Sets. Obviously this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased even if the recall drops. We make some suggestions in the next section.
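A quick arithmetic check on the reported confusion counts shows where the 10% figure comes from.

    # Confusion counts reported for the SVM on the test set.
    tp, fp, fn, tn = 175, 42, 32, 60
    total = tp + fp + fn + tn      # 309 test examples
    print(fn / total)              # ~0.104: share of good bi-segments that
                                   # would be wrongly discarded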

    6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples.


In this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that differently behaving classifiers can compensate for each other's errors. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is higher than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
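The high-precision subset idea can be sketched with scikit-learn's predict_proba; the 0.9 threshold and the placeholder data are assumptions made for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder data standing in for bi-segment features and labels
    # (label 0 assumed to mean "false translation").
    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((1243, 17)), rng.integers(0, 2, 1243)
    X_new = rng.random((1000, 17))

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba_false = clf.predict_proba(X_new)[:, 0]  # probability of class 0

    # Act only on bi-segments the model is very confident about.
    confident_false = np.where(proba_false > 0.9)[0]
    print(f"{len(confident_false)} bi-segments flagged with high confidence")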

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40000 bi-segments per day. However, this system is not scalable, it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

    Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


