
Translationese and Swedish-English Statistical Machine Translation

Jakob Joelsson

Uppsala University
Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)
Bachelor’s Thesis in Language Technology

12th October 2016

Supervisor: Sara Stymne, Uppsala Universitet


Abstract

This thesis investigates how well machine-learned classifiers can identify translated text, and the effect translationese may have in Statistical Machine Translation, all in a Swedish-to-English, and reverse, context. Translationese is a term used to describe the dialect of a target language that is produced when a source text is translated. The systems trained for this thesis are SVM-based classifiers for identifying translationese, as well as translation and language models for Statistical Machine Translation. The classifiers successfully identified translationese in relation to non-translated text and, to some extent, also which source language the texts were translated from.

In the SMT experiments, variation of the translation model was what affected the results the most in the BLEU evaluation. Systems configured with non-translated source text and translationese target text performed better than their reversed counterparts. The language model experiments showed that models trained on known translationese and classified translationese performed better than those trained on known non-translated text, though classified translationese did not perform as well as known translationese. Ultimately, the thesis shows that translationese can be identified by machine-learned classifiers and may affect the results of SMT systems.

Page 3: Translationese and Swedish-English Statistical …uu.diva-portal.org/smash/get/diva2:1034684/FULLTEXT01.pdfTranslationese and Swedish-English Statistical Machine Translation Jakob

Contents

Acknowledgements

1 Introduction
  1.1 Purpose
  1.2 Outline

2 Background and Related Work
  2.1 Translationese
  2.2 Classifying Translationese
  2.3 Translationese in Machine Translation

3 Resources and Tools
  3.1 Europarl Corpus
  3.2 Tools and metrics

4 Methodology
  4.1 Classification
    4.1.1 Three-class Classification
    4.1.2 Multi-class classifier
    4.1.3 Testing the classifiers
  4.2 SMT Systems
    4.2.1 Direction of Translation Models
    4.2.2 Language Models

5 Results
  5.1 Classification
    5.1.1 Three-class Classifier
    5.1.2 Multi-class Classifier
  5.2 SMT Systems
    5.2.1 Translation Models
    5.2.2 Language Model Experiments

6 Conclusion

Bibliography


Acknowledgements

First, I would like to thank my supervisor Sara Stymne for her critique and help throughout my work on this thesis.

Second, I would also like to thank everyone who has asked me about and shown interest in my thesis project, forcing me to further reflect on, and put into words, what it is I have actually been working on.


1 Introduction

This thesis focuses on the differences between translated and non-translated texts, and how awareness of these differences affects statistical machine translation. Translated texts have previously been shown to be distinguishable from original texts, forming a dialect, or sub-language, known as translationese (Gellerstam, 1986). Texts in translationese can differ from original texts in several ways, which can be described as “footprints” of the original language. Recognizing this phenomenon can be used to improve the results of statistical machine translation.

The work presented in this thesis uses the Europarl Corpus, in which a significant number of statements are annotated with the language they were originally made in. This data is used with machine learning methods to classify texts of unknown origin as either translationese or original. Volansky et al. (2015) compiled a list of features that can be used when training classifiers for this purpose. Their work inspired the classification methods used in this thesis.

Twitto-Shmuel et al. (2015) further showed that SMT systems constructed from text classified as translationese performed almost as well as systems constructed from known translationese texts. This means that in cases or domains where known translationese texts are sparse, at least half of the data for the language and translation models can be predicted translationese while the system still performs at a state-of-the-art level. The SMT section of this thesis adopts the approach of Twitto-Shmuel et al. (2015) in a Swedish-to-English and English-to-Swedish setting.

1.1 Purpose

This thesis addresses problems of Statistical Machine Translation and Machine Learning, specifically Text Classification.

The purpose of this thesis is to examine how translationese can be classified using Machine Learning methods, how translation models differ depending on the translation direction of the source and target texts, and how translationese can affect the outcome of an SMT system when used to train its language models.

The work of this thesis addresses topics targeted in earlier research, but focuses on a Swedish-to-English and English-to-Swedish context. Three main questions are posed, aimed to be answered in the work presented. These questions are as follows:

• Can a classifier be trained to identify translationese versus text originally written in the language? And can a similar classifier identify the source language of the translated text?


• Does the translation direction of the parallel training data in a translation model affect the results of the SMT systems?

• Finally, is it favorable to use known translated texts as opposed to original text when training language models, and can text identified as translationese be used in the same manner?

1.2 Outline

Chapter 2 covers the background and related work of this thesis. In chapter 3 the resources, tools, and software used in the work are presented, an overview of the Europarl corpus is given, and the different data sets are described. Chapter 4 shows the methods used for classification and for building the SMT systems, as well as explains the choices made in the process of this work. Chapter 5 presents the results of the classification and of the SMT systems. Finally, chapter 6 summarizes the thesis with conclusions drawn from the results, a discussion of the work, and suggestions for future research.


2 Background and Related Work

This chapter gives a background on the various aspects of translationese that this thesis explores. The first section provides an introduction to the term translationese, and a look at why it is used and what it actually means. The second section reviews the current state of research on distinguishing translationese from original text through machine learning and classification tools, mainly focusing on the work on the features of translationese presented by Volansky et al. (2015). The third section presents a background on how awareness of translationese can affect the results in statistical machine translation.

2.1 Translationese

Translationese is a term used to describe the characteristics of translated language, in relation to original language. Gellerstam (1986) notes that this is a statistical phenomenon rather than the result of poor translation.

According to Toury (2012), there are two laws concerning the features of translationese: ‘the law of interference’ and ‘the law of growing standardization’. The first applies to the characteristics of the source text that affect the target text, while the second covers the effort to accommodate current cultural and linguistic conventions of the target language. These laws can further be expanded into four wider categories, as done by Volansky et al. (2015). The four categories are as follows:

Simplification – the idea that translated texts are made ‘simpler’ than their source. This is confirmed by Laviosa (1998, 2002) in a series of corpus-based studies.

Explicitation – the tendency of translations to have concepts and names more explicitly explained. Koppel and Ordan (2011) show this by occurrences of cohesive markers in the translated texts where there are none in the original.

Normalization – the notion that translators try to standardize the target text as much as possible (Toury, 2012).

Interference – the discrepancy between the first and second language(s) of the translator (Toury, 1979). This intermediate state of so-called interlanguage (Selinker, 1972) may thus affect the target text on a structural level, causing the distribution of elements that forms translationese (Gellerstam, 1986).


In practice, this can take many forms. One of the easiest for the human reader to perceive is Simplification, in the form of sentence length. In the example below, one Swedish sentence (1) is translated into two separate English sentences (2).

(1) Swedish: Jag måste dock säga att betydelsen av kampen mot narkotika ändå är så stor att man inte kan säga att resurser som satsas i detta arbete skulle vara för stora, utan jag tror att vi måste försöka arbeta på denna linje.

(2) English: I must say, however, that the fight against drugs is, for all that, so important that it cannot be said that resources invested in this work are too great. Instead, I believe that we must try to work along these lines.

2.2 Classifying Translationese

Text classification is a well-studied natural language processing field, used to identify the genre or topic of a text (Sebastiani, 2002), but also for author attribution: recognizing the age, gender, and provenance of an author from machine learning features in the texts (Koppel et al., 2009).

Using the knowledge and methodology of text classification, Volansky et al. (2015) examined a list of 32 feature sets over five categories suggested to be beneficial when distinguishing translationese from original text. These five categories are the four listed in section 2.1, with the addition of a Miscellaneous category covering feature sets that did not fit into the others.

Of the five aforementioned categories, Interference stood out the most in accuracy in the work of Volansky et al. (2015). It contained feature sets such as Part-of-Speech (POS) uni-, bi-, and trigrams, as well as character uni-, bi-, and trigrams. All of these yielded results of above 90% accuracy in their respective classifiers, with the exception of character unigrams, which reached 85%.

Outside of the Interference category, only one of the 32 feature sets could produce a classification result above 90% accuracy. This was “Function words” in the Miscellaneous category. In the three remaining categories, “Mean word rank (2)” topped Simplification with 77% accuracy, “Cohesive markers” topped Explicitation with 81% accuracy, and “Threshold PMI” topped the Normalization category with 66%.

Ilisei et al. (2010) looked closer at the Simplification features for Spanish classification of translationese, and concluded that they too can be useful, with accuracy results of up to about 97% on in-domain texts, and 87% on a general test set.

When evaluating the 300 most frequent features per feature set, Volansky et al. (2015) identified a group of five feature sets with high accuracy: POS bigrams and trigrams, each at 96% accuracy; character bigrams and trigrams, at 95% and 96% accuracy respectively; and lastly positional token frequency, at 93% accuracy. The classification part of this experiment uses the first four of these feature types due to their high accuracy.

Volansky et al. (2015), backed by the research of Koppel and Ordan (2011), challenge the idea of translation universals: the idea that translations from different languages carry similar translationese traits, as suggested in earlier research (Baker et al., 1993).


Koppel and Ordan (2011) concluded that a classifier can be trained to determine the source language of a translated text, in addition to distinguishing translations from original texts. Their results have also shown that if languages L1 and L2 are typologically closer, the accuracy when classifying translationese in L1 using a classifier trained on data from L2 is higher, for instance for two Romance languages such as French and Spanish.

2.3 Translationese in Machine Translation

Calling attention to the phenomenon of translationese has proven favorable in Statistical Machine Translation (SMT). The problem is visible on several levels, in both bilingual and monolingual domains. Kurokawa et al. (2009) reported a significant difference in output quality when translation models for French-to-English SMT systems were built on corpora translated from French to English, rather than on corpora translated in the other direction.

Taking the matter further, Lembersky et al. (2012) involved the language model, showing that a language model derived from translated text yielded better-quality output than one derived from original text. Their translation models were built using a large amount of data, ranging from approximately 250 000 sentences to 1.5 million sentences. A study by Twitto-Shmuel et al. (2015) showed that SMT systems with language models derived from text classified to be translationese by a machine learning system performed almost as well as ones with texts known to be translations. Their work also concluded that the translation direction of the training data in the translation model matters: systems using original non-translated text as the source language and translationese as the target perform considerably better than systems with a reverse model.


3 Resources and Tools

This chapter presents the resources and tools used in the work behind this thesis. It gives a brief explanation of the different resources, and also describes how the corpus is partitioned.

3.1 Europarl Corpus

The Europarl Corpus compiles the proceedings of the European Parliament in 21 languages. The Swedish-English parallel corpus contains about 45.7 million English words, and roughly 41.5 million Swedish words (Koehn, 2005). The version used for this work is version 7.

When training the classifiers, the data used was split into text chunks representing documents. These chunks followed the frame set out by Volansky et al. (2015): each chunk contained only utterances from a single speaker and was split on the nearest sentence boundary following the 2000th token.
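As a rough illustration, the chunking step above can be sketched in Python. The function name and toy threshold are illustrative only; the actual thesis code is not published with the text.

```python
def chunk_sentences(sentences, min_tokens=2000):
    """Group sentences into chunks, closing each chunk at the first
    sentence boundary after the token threshold has been passed."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= min_tokens:   # nearest boundary following the threshold
            chunks.append(current)
            current, count = [], 0
    if current:                   # keep any trailing partial chunk
        chunks.append(current)
    return chunks

# toy corpus: six 3-token sentences, threshold of 5 tokens per chunk
demo = ["jag ser dig"] * 6
print([len(c) for c in chunk_sentences(demo, min_tokens=5)])  # [2, 2, 2]
```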

The SMT systems created in this work use the Swedish and English parts of the Europarl corpus. The corpus was tokenized and aligned with the accompanying tools. To some extent, the Europarl corpus carries information on the original language of the utterances, which allows them to be categorized. Hence, the corpus was split into four categories, each representing the original language of the utterances: SV for Swedish, EN for English, Other for other languages, and Unknown for the untagged statements.

The number of sentences of each source language in the corpus can be seen in Table 3.1. About a third of the sentences are of unknown origin. To allow equal testing of both directions, 32 000 sentences were chosen for both Swedish and English to be used for the translation model, so that a reasonable number would still be available for tuning and testing. This makes a small parallel data set, especially compared to the large translation models of Lembersky et al. (2012), whose smallest was approximately 250 000 sentences.

After removing empty lines and discarding sentences longer than 100 tokens with the cleaning tools in Moses, there were 31 645 sentences in the originally English part of the corpus, and 31 672 in the originally Swedish part, as seen in Table 3.2 alongside the sizes of the sets for tuning and testing.

The test sets and tuning sets are disjoint from each other and from the training data, and consist of 2 000 parallel sentences each, for each translation direction. In both testing and tuning of the SMT systems, the non-translated sentences were translated using the system, and their parallel translationese sentences were used as reference during BLEU evaluation of the translation performance.

Language Category    Total
Swedish               37 327
English              137 732
Unknown              542 155
Other                893 007

Table 3.1: Number of parallel sentences for each source language category acquired from the corpus.

Parallel Data        Training   Tuning   Test
Swedish-to-English   31 672     2 000    2 000
English-to-Swedish   31 645     2 000    2 000

Table 3.2: Number of parallel sentences used for the different sets from the corpus.

The Swedish and English partitions are used for language and translation modeling. Along with the Other partition, the Swedish and English parts are also used to train a three-class classifier, used to identify which languages the Unknown partitions are originally written in.

It is important to note that this work utilizes far smaller data sets than the previous research by Twitto-Shmuel et al. (2015) and Lembersky et al. (2012). While previous work in the area has dealt with languages well represented in the European Parliament, Swedish is spoken by a minority of its members. Therefore, the number of presentations in the Parliament given in Swedish is lower than the number given in German, English, or French. This ultimately affects this thesis in that there is less of both original and translationese Swedish to work with than for work done with better-represented languages. From an SMT standpoint, the amount of data may thus seem sparse, but for text classification it is sufficient.

3.2 Tools and metrics

The following software tools were used in the work presented in this thesis.

• BLEU is a standardized, language-independent, automatic evaluation metric for machine translation. BLEU scoring utilizes the geometric mean of the 1-4 gram precision of the candidate translation in relation to the reference, and multiplies in a brevity penalty to make up for sentence length differences. It results in a score that can be compared to scores gained from other evaluated translations (Papineni et al., 2002).

• The Granska Tagger tool is a Part-of-Speech tagger for Swedish that uses hidden Markov models (Carlberger and Kann, 1999).

• HunPos is an open-source reimplementation of the widely used TnT tagger. It uses hidden Markov models for Part-of-Speech tagging.


• LibSVM is a library for Support Vector Machines (SVMs), which are useful machine learning methods for classification, regression, and other learning tasks (Chang and Lin, 2011).

• MOSES is an open-source toolkit for statistical machine translation. Not only does it provide an SMT decoder, it also includes several tools related to training translation systems (Koehn et al., 2007).

• SRILM is an open-source toolkit used for statistical language modeling and related tasks (Stolcke, 2002).

• WEKA is a working environment for Machine Learning (ML), outfitted with basic ML tools along with interactive tools for result visualization, cross-validation, etc. (Holmes et al., 1994). In this thesis, 10-fold cross-validation is used to test the classifiers. It splits the training data into 10 folds, using each fold as test data for a classifier trained on the other 9 folds, and outputs the average of the 10 test results as a combined result.
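To make the BLEU description above concrete, here is a simplified sentence-level sketch. Real BLEU as defined by Papineni et al. (2002) is computed at the corpus level; the function below is a toy illustration of the clipped n-gram precisions and the brevity penalty, not the scorer used in this thesis.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped 1..4-gram
    precisions, multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        prec = overlap / max(sum(cand_ngrams.values()), 1)
        if prec == 0:            # any zero precision makes the score 0
            return 0.0
        log_prec_sum += math.log(prec)
    # brevity penalty compensates for overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```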


4 Methodology

This chapter explains the methods behind the work of this thesis. The first section covers the three-class classification process and the multi-class classifier testing process. The following section focuses on how the SMT systems are trained in terms of translation and language models.

4.1 Classification

This section explains how the classifiers for both languages are built. Classifiers of translationese have been built before, and one is adapted in this work. The features were chosen because they performed very well, in terms of accuracy, in the work of Volansky et al. (2015).

4.1.1 Three-class Classification

The preparation of the data started with splitting the corpus based on the original language of the statements. This produced four different corpus partitions, as seen in Table 3.1.

The corpus texts were POS-tagged with HunPos and Granska, and afterwards split into chunks of approximately 2000 tokens, on the nearest following sentence boundary, to avoid length being a factor in the classification. The use of chunks, as well as their size, follows the work of Volansky et al. (2015). The 32 000 sentences from each of the three source language categories, English, Swedish, and Other, were split into 1114 chunks for the Swedish-language text and 1249 chunks for the English-language text. The same number of sentences was chosen for both Swedish and English in all categories. The numbers of sentences and corresponding chunks of the corpus partitions are shown in Table 4.1. The Other category is made up of 32 000 random sentences from data that was not tagged with the language it was originally uttered in.

Two three-class classifiers were trained with the 300 most frequent features from each of four feature sets: part-of-speech bi- and trigrams, and character bi- and trigrams (see Table 4.2). The features were extracted from a total of 96 000 sentences, 32 000 per source language category, and ranked based on their occurrences. The 300 most frequently occurring features of each feature set, adding up to 1200 features in total, were counted in each chunk, and the numbers of occurrences were placed in a feature vector representing the chunk.

                     Training   Unknown set   Training (Multi-class)
No. of sentences     96 000     500 000       192 000
Swedish chunks       1114       5831          1696
English chunks       1249       6480          1894

Table 4.1: Number of sentences and corresponding chunks in each corpus partition.

Feature Set          Swedish    English
POS bigram           NN_PP      IN_DT
POS trigram          DT_JJ_NN   DT_JJ_NN
Character bigram     ._</s>     t_h
Character trigram    a_t_t      ¿_o_f

Table 4.2: Examples of some of the common features included in the different feature sets. </s> represents the end of a sentence, and the ’¿’ character represents the space character.
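The feature-vector construction can be sketched for one feature set, character bigrams. The function names and the toy two-chunk corpus are illustrative assumptions, not the thesis code.

```python
from collections import Counter

def char_ngrams(text, n):
    """List the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_feature_vectors(chunks, n=2, top_k=300):
    """Rank character n-grams by corpus frequency, keep the top_k,
    and count their occurrences in every chunk."""
    corpus_counts = Counter()
    for chunk in chunks:
        corpus_counts.update(char_ngrams(chunk, n))
    features = [g for g, _ in corpus_counts.most_common(top_k)]
    vectors = [[Counter(char_ngrams(chunk, n))[f] for f in features]
               for chunk in chunks]
    return features, vectors

features, vectors = build_feature_vectors(["att det", "att att"], n=2, top_k=3)
print(features[0])   # 'at': the most frequent bigram in this toy corpus
print(vectors[0][0]) # its count in the first chunk: 1
```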

First, the chunks were lower-cased to further neutralize them; then the features were extracted, ranked on occurrences, and the 300 most frequent of each feature set selected. The feature vectors were used to train a classifier with a LibSVM model in Weka. The LibSVM model was used with default settings, except for normalization, which is done by Weka and scales the numeric feature values to the range 0 to 1.
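The 0-to-1 scaling that Weka performs can be illustrated with a minimal min-max normalization sketch (my own stand-in, not Weka code):

```python
def min_max_normalize(vectors):
    """Scale each feature column to the range 0..1, as Weka's
    normalization does before the SVM is trained."""
    columns = list(zip(*vectors))
    scaled = []
    for row in vectors:
        scaled.append([
            (v - min(col)) / (max(col) - min(col)) if max(col) > min(col) else 0.0
            for v, col in zip(row, columns)
        ])
    return scaled

print(min_max_normalize([[0, 10], [5, 20], [10, 30]]))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```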

The Unknown text to be classified consisted of 500 000 sentences each in Swedish and English, which were also POS-tagged, lower-cased, split into 5831 and 6480 chunks respectively, and had their features extracted. The three-class classifiers for Swedish, English, and Other are then used to identify the source language of the text chunks of Unknown origin.

After classification, the sentences were de-chunked at sentence-final punctuation, and trimmed down to 32 000, before building a language model. Additionally, the Moses truecaser was trained on the corresponding target language’s corpus partition, and was then used along with the Moses detruecaser to restore the case of the sentences.

4.1.2 Multi-class classifier

Two multi-class classifiers were trained, extracting features from six languages, to test the claim that not only whether a text is translated can be classified, but also which source language the text has been translated from. Besides Swedish (sv) and English (en), the languages used were German (de), Greek (el), Finnish (fi), and French (fr). These languages were chosen because of their different language families. On the one hand, there is German, a Germanic language which thus shares traits with both Swedish and English. It was chosen to see if the learned classifier would prefer to classify either of the two primarily investigated languages in another category. The three remaining languages were chosen because they are of other, separate language families: Hellenic, Finno-Ugric, and Romance, respectively. For this, 32 000 sentences of each source language, making a total of 192 000 sentences for both target languages, were POS-tagged, lower-cased, and split into 1696 and 1894 chunks for the Swedish and English texts respectively.

Since the number of classes was doubled compared to the three-class classifier experiments, the top 500 features of each of the four feature sets were used, making the total number of features 2000 for the classifier.

4.1.3 Testing the classifiers

The classifiers were tested using 10-fold cross-validation and with a test set consisting of 2000 sentences from each of the three categories. In 10-fold cross-validation, the training data is split into 10 “folds”, of which one acts as testing data and the remaining nine as training data. In Weka, the process of training the model on nine folds and testing on one is repeated ten times, until all folds have been used for testing. Weka then reports the average of the classification results. Among the test sets used, the Swedish test set comprised 65 chunks of sentences, and the English set comprised 72.
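The fold construction just described can be sketched as follows. This is an illustrative stand-in; Weka additionally stratifies folds by class, which this sketch omits.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each fold serves as test data once,
    and the k per-fold results are then averaged into one combined result."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# every item appears in exactly one test fold
splits = list(k_fold_indices(20, k=10))
print(len(splits))                                                       # 10
print(sorted(i for _, test in splits for i in test) == list(range(20)))  # True
```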

4.2 SMT Systems

This section describes the method behind the training of the different SMT systems. First, the direction of the translation data in the translation models is described, followed by the setup of the language models. The systems were created using Moses, and the evaluation of the SMT systems is done with BLEU scoring, both described in section 3.2.

4.2.1 Direction of Translation Models

Two sets of translation models per original language were trained using Moses: one set of translation models with original text as the source language and its translation as the target (O → T), and one set with a translation as the source language and its corresponding original text as the target (T → O).

4.2.2 Language Models

Six types of language models were trained, ultimately adding up to 24 different SMT systems, since each of the six was varied both in translation model direction and between 3- and 4-gram models. The language models are trained using the SRILM toolkit with Witten-Bell smoothing. The six types were derived from different types of texts, varied as follows:

• 32 000 sentences of text known to be translations from Swedish or English, depending on text language. (T)

• 32 000 sentences of text known to be originally written in its current language. (O)

• 32 000 sentences classified to be translations using the three-class classifier from section 4.1.1. (C)

• 16 000 sentences of (T) and 16 000 sentences of (O). A smaller hypothetical “real-world” language model for cases where classification has not been done. (T & O - S)


• 32 000 sentences of (T) and 32 000 sentences of (C). A hypothetical “real-world” language model for cases where classification has been done. (T & C)

• 32 000 sentences of (T) and 32 000 sentences of (O). A hypothetical “real-world” language model for cases where classification has not been done. This model is larger than the other (T & O) model to match the size of the (T & C) model. (T & O - L)
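Witten-Bell smoothing, used when training the SRILM language models above, reserves probability mass for unseen continuations in proportion to the number of distinct words seen after a history. The bigram sketch below is my own simplified interpolated form, not SRILM's exact implementation, which also handles backoff weights and higher orders.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Return an interpolated Witten-Bell bigram probability function:
    P(w|h) = (c(h,w) + T(h) * P_uni(w)) / (c(h) + T(h)),
    where T(h) is the number of distinct words observed after h."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)
    for h, w in zip(tokens, tokens[1:]):
        followers[h].add(w)
    n = len(tokens)

    def prob(w, h):
        t = len(followers[h])                 # distinct continuations of h
        c_h = sum(c for (a, _), c in bigrams.items() if a == h)
        p_uni = unigrams[w] / n               # unsmoothed unigram backoff
        if c_h + t == 0:
            return p_uni
        return (bigrams[(h, w)] + t * p_uni) / (c_h + t)

    return prob

prob = witten_bell_bigram("a b a b a c".split())
# a frequent seen bigram outscores an unseen one, which only gets backoff mass
print(prob("b", "a") > prob("c", "b"))  # True
```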

The classified data in the (C) systems comes from the three-class classifier. It is used rather than classified data from the multi-class classifier, or both, because the results of the multi-class classification were unstable in comparison.

Evaluation of the built SMT systems is done with BLEU scoring. The test set consists of 2000 known original sentences, and the reference set used to compare the translations consists of their known translationese counterparts.


5 Results

In this chapter, the results from the experiments described in the previous chapter are presented. First, the results from the classification experiments are given, which partly feed the SMT experiments, covered in the second section of the chapter.

5.1 Classification

This section presents the results from the classifiers built in this work. It is divided into two parts: the first presents the results from the three-class classifiers, and the second the results from the multi-class classifiers.
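The classifiers whose results follow were trained with libSVM through Weka. As a rough, hypothetical sketch of such a setup, scikit-learn's SVC (which wraps the same libSVM library) can train a comparable three-class model; the feature vectors below are random stand-ins for the thesis' actual feature sets:

```python
# Hedged sketch only: toy Gaussian features replace the real translationese
# features (e.g. function-word frequencies per chunk) used in the thesis.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_orig = rng.normal(0.0, 1.0, size=(60, 5))   # "non-translated" chunks
X_en   = rng.normal(1.5, 1.0, size=(60, 5))   # "translated from English"
X_oth  = rng.normal(-1.5, 1.0, size=(60, 5))  # "translated from other"
X = np.vstack([X_orig, X_en, X_oth])
y = ["Original"] * 60 + ["English"] * 60 + ["Other"] * 60

clf = SVC(kernel="linear")                    # a common choice for text features
scores = cross_val_score(clf, X, y, cv=10)    # 10-fold CV, as in the thesis
print(f"mean 10-fold accuracy: {scores.mean():.2%}")
```

On well-separated toy data such as this, accuracy is near-perfect; the interesting question in the thesis is how well real translationese features separate the classes.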

5.1.1 Three-class Classifier

The classifier for Swedish correctly classified 94.08% of the chunks during a 10-fold cross-validation of the training data. The non-translated category performs almost flawlessly, with only a few false negatives and even fewer false positives. The English and Other categories show somewhat worse results, the weakest performance being the confusion between the two. The result is shown in a confusion matrix in Table 5.1. Though this is a slightly lower accuracy than that of the classifiers used by Volansky et al. (2015), this classification combined several feature sets and used three classes rather than two.

10-fold CV, 1114 total chunks; 94.08% correctly classified

Classified as →   Original   English   Other
Original               308         9       9
English                  1       358      17
Other                    4        26     382

Test set, 65 total chunks; 80% correctly classified

Classified as →   Original   English   Other
Original                10         1       9
English                  0        20       3
Other                    0         0      22

Table 5.1: Confusion matrices of chunk distribution from testing the classifier on both training and testing data for Swedish. The upper table shows the average result from a ten-fold cross-validation.


10-fold CV, 1249 total chunks; 92.95% correctly classified

Classified as →   Original   Swedish   Other
Original               378         2      20
Swedish                 19       342      25
Other                   12        10     441

Test set, 72 total chunks; 93.06% correctly classified

Classified as →   Original   Swedish   Other
Original                22         0       2
Swedish                  1        22       0
Other                    2         0      23

Table 5.2: Confusion matrices of chunk distribution from testing the classifier on both training and testing data for English. The upper table shows the average result from a ten-fold cross-validation.

The classifier correctly classified 80% of the test data, a much lower accuracy than both for the training data and for the similar experiments by Volansky et al. (2015). This low overall accuracy is mainly caused by misclassified original Swedish: only half of the chunks of non-translated Swedish were correctly classified.

Classifying the translationese English and Other categories was successful from a recall standpoint. Almost half of the Original chunks, as well as some of the English chunks, were classified as Other. This caused more than a third of the chunks classified as Other to be misclassified, lowering its precision to 64.7%.
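The per-class figures quoted in this section follow directly from the confusion matrices; a small helper makes the computation explicit, shown here on the Swedish test-set matrix from Table 5.1:

```python
def precision_recall(matrix, labels):
    """Per-class (precision, recall) from a confusion matrix given as a
    dict: matrix[true_label][predicted_label] -> count."""
    stats = {}
    for lab in labels:
        tp = matrix[lab][lab]
        predicted = sum(matrix[t][lab] for t in labels)   # column sum
        actual = sum(matrix[lab][p] for p in labels)      # row sum
        stats[lab] = (tp / predicted if predicted else 0.0,
                      tp / actual if actual else 0.0)
    return stats

# Test-set confusion matrix for the Swedish classifier (Table 5.1)
table_5_1 = {
    "Original": {"Original": 10, "English": 1, "Other": 9},
    "English":  {"Original": 0,  "English": 20, "Other": 3},
    "Other":    {"Original": 0,  "English": 0,  "Other": 22},
}
stats = precision_recall(table_5_1, ["Original", "English", "Other"])
# Other precision = 22 / (9 + 3 + 22) ≈ 64.7%
print(f"Other: precision {stats['Other'][0]:.1%}, recall {stats['Other'][1]:.1%}")
```

The same helper reproduces the Original-class recall of 50% (10 of 20 chunks) discussed above.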

However, since the misclassification of test data primarily affected non-translated Swedish and translationese of the Other category, leaving the translationese English relatively unaffected, this should not influence the language models trained in the following experiments to any major extent.

The libSVM-based three-class classifier for English successfully classified 92.95% of the training data in a 10-fold cross-validation, as seen in Table 5.2, only slightly lower than the Swedish classifier. This classifier did not follow the same error pattern as the Swedish one, even though its total accuracy was almost as high. It had a harder time classifying the original text, but had fewer false positives in the Swedish (translationese) class, suggesting a more accurate classification of the data that would actually be used in the language model experiment.

Classification of the English test data corresponded well to that of the training data in the 10-fold cross-validation, and the total accuracy was in the same range. The recall for all classes was above 90%, and the precision ranged from 88% to 100%.

As with the Swedish classifier, the translationese category was not notably affected; in this case, neither were the two other categories.

For both Swedish and English, 500 000 sentences were grouped into chunks that were classified in Weka, using the three-class classification systems trained with the data. The distribution of the classified chunks can be seen in Table 5.3. The translationese chunks of known English and Swedish origin were converted


                 English   Swedish
English             3008      2818
Swedish              412       273
Other               3060      2740

Total               6480      5831

Table 5.3: Distribution of text chunks after classification. Bold indicates translationese used for language models in the SMT experiments.

back to sentences, of which 32 000 each were used to create language models for the Classified Translationese category in the language model experiments.

5.1.2 Multi-class Classifier

The Swedish multi-class classification experiment strengthened the idea that translated text can not only be identified in relation to non-translated text, but that its source language can also be determined through classification. It reached a total accuracy of 79.89%. In these results, however, a majority of the sentences of French origin were classified as Greek translationese. The Swedish and English partitions of the test performed with minimal error: the precision of Swedish and the recall of English were both 100%. Additionally, the system classified Finnish very accurately, as well as German, though not quite as well. Table 5.4 contains a confusion matrix of these results.

The English experiment had a very different result, with a total accuracy of 66.16% – much lower than the Swedish one. This system was much more prone to misclassifying chunks as original English and Swedish translationese, which gave both categories a precision below 60%. The German and Finnish partitions were mainly classified as Swedish and English, whereas the French and Greek were closer to an acceptable performance. Swedish, English, and Greek all have a recall of over 90%, and the precision of Greek is just above 80%.

As these classifiers have proven less stable than the three-class classifier, they are not used to classify unknown data for the SMT experiments. The multi-class classifier does, however, show some potential for identifying the source language of translationese text.

Classified as →    de    el    fi    fr    sv    en
de                180    24     3    27     0     0
el                  1   288     0     0     0     0
fi                  5     5   195     1     0     0
fr                  9   236     0    21     0     0
sv                  0     0     0     0   295    30
en                  0     0     0     0     0   376

Table 5.4: Results of the multi-class classification on Swedish text. The result displayed is an average of a tenfold cross-validation. The average accuracy for this experiment was 79.89%.


Classified as →    de    el    fi    fr    sv    en
de                  1    15     0    13    60   173
el                  0   294     0     4     2    11
fi                  0     2     5     0   176    61
fr                  0    52     0   205     9    30
sv                  0     1     0     0   355    28
en                  0     3     0     0     1   393

Table 5.5: Confusion matrix of the average tenfold cross-validation of the multi-class classification on English text. The average accuracy for this experiment was 66.16%.

5.2 SMT Systems

5.2.1 Translation Models

One type of translation model (TM) for the Swedish-to-English systems was built with original texts (O) as the source language and translated texts (T) as the target language. The results of these are presented under O → T in Tables 5.6 and 5.7. The other TM type was built with translated texts (T) as the source language and original texts (O) as the target language; these are denoted T → O. “Original” as used here refers to non-translated text.

The results in Table 5.6 show that the two columns under O → T, representing the former type of system, receive higher BLEU scores, regardless of the language model variation.

As shown in Table 5.7, the O → T systems outperformed the T → O systems across the board, the lowest-scoring O → T system even outscoring the highest-scoring T → O system.

5.2.2 Language Model Experiments

In Table 5.6, the results of the Swedish language model experiments are shown. Overall, the larger language models perform somewhat better than the smaller ones. The only exception is among the (T → O) models, where the known-translationese (T) models performed better than the (T & O) models, and on par with the (T & C) models.

TM                      O → T               T → O
LM                   3-gram   4-gram    3-gram   4-gram

Translated (T)        24.93    25.03     22.73    22.82
Original (O)          24.09    24.17     20.59    20.57
Classified (C)        24.18    24.09     21.39    21.52
Both (T & O - S)      24.91    24.72     21.91    21.85

Both (T & C)          25.19    25.31     22.78    22.88
Both (T & O - L)      25.31    25.46     22.57    22.69

Table 5.6: BLEU scores from testing the SMT systems for English to Swedish. The two bottom rows are the scores of the systems created with the larger language models.


The best performance in both directions is given by the 4-gram version of the larger combined (T & O - L) language model, with the (O → T) translation model.

One of the distinct patterns in the results is that the language models built from known translationese (T) perform better than their counterparts trained on original text (O). (T) is ahead by about 0.8 points for both 3- and 4-gram models in the English-to-Swedish context, and by slightly over 1.2 points for the Swedish-to-English systems.

Another important point regards the Classified Translationese (C) systems. All of these receive lower BLEU scores than the (T) systems, and higher scores than the (O) systems in (T → O) contexts, whereas the corresponding (O → T) systems still performed better than (C). There is an exception to this in the English-to-Swedish (O → T) context, where the 3-gram (C) system performed better than its corresponding (O) system, and even scored 0.01 points higher than the 4-gram (O) system.

For both English and Swedish, the (T & C) model of the (T → O) systems performed better than the larger (T & O - L) model, whereas among the (O → T) systems the performance of this pair was reversed.

TM                      O → T               T → O
LM                   3-gram   4-gram    3-gram   4-gram

Translated (T)        32.82    33.01     29.28    29.61
Original (O)          31.60    31.80     26.91    26.96
Classified (C)        31.26    31.37     27.09    26.83
Both (T & O - S)      32.69    32.91     28.73    28.91

Both (T & C)          32.76    33.01     29.32    29.62
Both (T & O - L)      32.96    33.26     29.00    29.20

Table 5.7: BLEU scores from testing the SMT systems for Swedish to English. The two bottom rows are the scores of the systems created with the larger language models.


6 Conclusion

This chapter concludes the work presented in this thesis. It answers the questions posed in the introduction using the results of the experiments.

• Can a classifier be trained to identify translationese versus text originally written in the language? And can a similar classifier identify the source language of the translated text?

Yes, and to some extent yes. Using machine learning methods, a classifier can be trained both to distinguish translationese from non-translated text and, to some extent, to identify its source language.

The results presented in the previous chapter contain some cases where the source language is unmistakably identified: the English, Finnish, and to some extent German translationese in Swedish. In the other cases there are slight trends towards correct identification, such as French translationese in English, though a significant portion of it was classified as Greek. An interesting continuation for future work would be to investigate the causes of this type of misclassification, to see whether a certain group of features drives it.

• Does the translation direction of the parallel training data in a translation model affect the results of the SMT systems?

Yes. Using original text as the source language and translationese as the target language will most likely improve the resulting output of the system. As shown in Tables 5.6 and 5.7, the (O → T) systems receive higher BLEU scores than the reverse systems. Similar results have been presented in earlier research, such as by Twitto-Shmuel et al. (2015). Using this knowledge could influence the results of future projects: if data tagged with original language is available, it should be used, as ignoring this factor can impair the system's translations.

• Finally, is it favorable to use known translated texts as opposed to original text when training language models? And can text identified as translationese be used in the same manner?

Yes, and perhaps. This study has shown that language models trained on translationese texts produce better translations than those trained on original text, without exception. However, the systems with models trained on classified translationese did not perform as well.


Systems trained with language models made from classified translationese did not perform as well as those using known translationese. This contradicts previous research by Twitto-Shmuel et al. (2015), in which the two configurations performed on a similar level.

Furthermore, regarding the evaluation of the SMT systems, BLEU is not perfect. Human translations can look very different depending on the translator, an aspect that this type of comparison against a single reference translation does not take into account. Manual evaluation of the translation output is perhaps the most trustworthy, though very time-consuming – especially when comparing 24, or even 48, different systems. Additionally, the SMT systems presented in this thesis have been trained on a smaller corpus than those of earlier research, simply because no more known original Swedish was tagged in the corpus that was used.


Bibliography

Mona Baker et al. Corpus linguistics and translation studies: Implications and applications. Text and Technology: In Honour of John Sinclair, pages 233–250, 1993.

Johan Carlberger and Viggo Kann. Implementing an efficient part-of-speech tagger. Software – Practice and Experience, 29(9):815–832, 1999.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Martin Gellerstam. Translationese in Swedish novels translated from English.Translation studies in Scandinavia, pages 88–95, 1986.

Geoffrey Holmes, Andrew Donkin, and Ian H. Witten. Weka: A machine learning workbench. In Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, pages 357–361. IEEE, 1994.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Computational Linguistics and Intelligent Text Processing, pages 503–511. Springer, 2010.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation.In MT Summit X, volume 5, pages 79–86, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics, 2007.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, pages 1318–1326. Association for Computational Linguistics, 2011.

Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9–26, 2009.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. MT Summit XII, 2009.


Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta: Journal des traducteurs / Meta: Translators' Journal, 43(4):557–570, 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications,volume 17. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, 2012.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

Fabrizio Sebastiani. Machine learning in automated text categorization. ACMcomputing surveys (CSUR), 34(1):1–47, 2002.

Larry Selinker. Interlanguage. IRAL – International Review of Applied Linguistics in Language Teaching, 10(1–4):209–232, 1972.

Andreas Stolcke. SRILM – an extensible language modeling toolkit. In INTERSPEECH, 2002.

Gideon Toury. Interlanguage and its manifestations in translation. Meta: Journal des traducteurs / Meta: Translators' Journal, 24(2):223–231, 1979.

Gideon Toury. Descriptive Translation Studies and beyond: Revised edition,volume 100. John Benjamins Publishing, 2012.

Naama Twitto-Shmuel, Noam Ordan, and Shuly Wintner. Statistical machine translation with automatic identification of translationese. EMNLP 2015, pages 47–57, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, 2015.
