WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Holger Schwenk (Facebook AI), Vishrav Chaudhary (Facebook AI), Shuo Sun (Johns Hopkins University), Hongyu Gong (University of Illinois at Urbana-Champaign), Francisco Guzmán (Facebook AI)

arXiv:1907.05791v2 [cs.CL] 16 Jul 2019

Abstract

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available.1 To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting for training MT systems between distant languages, without the need to pivot through English.

1 Introduction

Most of the current approaches in Natural Language Processing (NLP) are data-driven. The size of the resources used for training is often the primary concern, but the quality and a large variety of topics may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, typically sentences in two languages which are mutual translations, are more limited, in particular when the two languages do not involve English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). These are professional human translations, but they are in a more formal language and tend to be limited to political topics. There are several projects relying on volunteers to provide translations for public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).

Wikipedia is probably the largest free multilingual resource on the Internet. The content of Wikipedia is very diverse and covers many topics. Articles exist in more than 300 languages. Some content on Wikipedia was translated by humans from an existing article into another language, not necessarily from or into English. Some of these translated articles have later been edited independently and are no longer parallel. Wikipedia strongly discourages the use of unedited machine translation,2 but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs. We hope that this resource will be useful for several research areas and enable the development of NLP applications for more languages.

1 https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
2 https://en.wikipedia.org/wiki/Wikipedia:Translation
3 https://github.com/facebookresearch/LASER

In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a). For this, we use the freely available LASER toolkit3 which provides a language-agnostic sentence encoder which was trained on 93 languages
(Artetxe and Schwenk, 2018b). We approach the computational challenge of mining in almost six hundred million sentences by using fast indexing and similarity search algorithms.
The paper is organized as follows. In the next section, we first discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. To assess the quality of the extracted bitexts, we train NMT systems for a subset of language pairs and evaluate them on the TED corpus (Qi et al., 2018) for 45 languages. These results are presented in Section 5. The paper concludes with a discussion of future research directions.
2 Related work
There is a large body of research on mining parallel sentences in collections of monolingual texts, usually named “comparable corpora”. Initial approaches to bitext mining have relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005) or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web-crawls such as ParaCrawl4 for filtering good quality sentence pairs has been explored in the shared tasks for high-resource (Koehn et al., 2018) and low-resource (Koehn et al., 2019) languages.
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in (Schwenk, 2018; Artetxe and Schwenk, 2018a,b). This approach has also proven to perform best in a low-resource scenario (Chaudhary et al., 2019; Koehn et al., 2019). Closest to this approach is the research described in Espana-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Finally, related ideas have also been proposed in Bouamor and Sajjad (2018) or Gregoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings; the embeddings are part of a larger system. To the best of our knowledge, this work is the first one that applies the same mining approach to all combinations of many different languages, written in more than twenty different scripts.

4 http://www.paracrawl.eu/
Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006). An MT system was used to translate Dutch sentences into English and to compare them with the English texts. This method yielded several hundred Dutch/English parallel sentences. Later, a similar technique was applied to the Persian/English pair (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora (Otero and Lopez, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by the translation equivalents on three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features such as Wikipedia entities to recognize parallel documents, but their approach was limited to a bilingual setting. Tufiş et al. (2013) proposed an approach to mine parallel sentences from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages.
Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposes an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has been extracted for twenty-three languages.5 Since Wikipedia titles are rarely entire sentences with a subject, verb and object, it seems that only modest improvements were observed when adding this resource to the training material of NMT systems.
We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.
3 Distance-based mining approach
The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding, i.e. an embedding space in which semantically similar sentences are close, independently of the language they are written in. This means that the distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. (Guo et al., 2018). The difficulty of selecting one global threshold is emphasized in our setting, since we are mining parallel sentences for many different language pairs.
3.1 Margin criterion
The alignment quality can be substantially improved by using a margin criterion instead of an absolute threshold (Artetxe and Schwenk, 2018a). In that work, the margin between two candidate sentences x and y is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of their nearest neighbors in both directions:

margin(x, y) = cos(x, y) / ( Σ_{z∈NNk(x)} cos(x, z)/2k + Σ_{z∈NNk(y)} cos(y, z)/2k )

where NNk(x) denotes the k unique nearest neighbors of x in the other language, and analogously for NNk(y). We used k = 4 in all experiments.
We follow the “max” strategy as described in (Artetxe and Schwenk, 2018a): the margin is first calculated in both directions, for all sentences in languages L1 and L2. We then create the union of these forward and backward candidates. Candidates are sorted, and pairs with source or target sentences which were already used are omitted. We then apply a threshold on the margin score to decide whether two sentences are mutual translations or not. Note that with this technique, we always get the same aligned sentences, independently of the mining direction, e.g. searching for translations of French sentences in a German corpus, or in the opposite direction. The reader is referred to Artetxe and Schwenk (2018a) for a detailed discussion of related work.
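As a sketch, the margin computation and the “max” strategy above can be written as follows, assuming the sentence embeddings of each language are rows of an L2-normalized NumPy matrix. The function names and the greedy deduplication details are our illustration, not code from the LASER toolkit:

```python
import numpy as np

def margin_scores(x_emb, y_emb, k=4):
    """Ratio margin: cos(x, y) divided by the mean of the average
    cosine similarity of the k nearest neighbors of x and of y."""
    sim = x_emb @ y_emb.T                              # cosine sim for unit vectors
    k = min(k, sim.shape[0], sim.shape[1])
    avg_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # k-NN average for each x
    avg_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # k-NN average for each y
    return sim / ((avg_x[:, None] + avg_y[None, :]) / 2)

def mine(x_emb, y_emb, threshold=1.04, k=4):
    """'Max' strategy: union of forward and backward best candidates,
    sorted by margin; pairs reusing a sentence are omitted."""
    m = margin_scores(x_emb, y_emb, k)
    cands = {(float(m[i, j]), i, int(j)) for i, j in enumerate(m.argmax(axis=1))}
    cands |= {(float(m[i, j]), int(i), j) for j, i in enumerate(m.argmax(axis=0))}
    used_x, used_y, pairs = set(), set(), []
    for score, i, j in sorted(cands, reverse=True):
        if score >= threshold and i not in used_x and j not in used_y:
            pairs.append((i, j, score))
            used_x.add(i)
            used_y.add(j)
    return pairs
```

In the real pipeline, `x_emb` and `y_emb` would hold the 1024-dimensional LASER embeddings of the deduplicated sentences of the two Wikipedias, and the neighbor search would go through an index rather than a full similarity matrix.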
The complexity of a distance-based mining approach is O(N × M), where N and M are the number of sentences in each monolingual corpus. This makes a brute-force approach with exhaustive distance calculations intractable for large corpora. Margin-based mining was shown to significantly outperform the state of the art on the shared task of the workshop on Building and Using Comparable Corpora (BUCC) (Artetxe and Schwenk, 2018a). The corpora in the BUCC corpus are rather small: at most 567k sentences.
The languages with the largest Wikipedia are English and German, with 134M and 51M sentences, respectively, after pre-processing (see Section 4.1 for details). This would require 6.8×10^15 distance calculations.6 We show in Section 3.3 how to tackle this computational challenge.
3.2 Multilingual sentence embeddings

Distance-based bitext mining requires a joint sentence embedding for all the considered languages.
6 Strictly speaking, Cebuano and Swedish are larger than German, yet mostly consist of template/machine-translated text: https://en.wikipedia.org/wiki/List_of_Wikipedias
Figure 1: Architecture of the system used to train massively multilingual sentence embeddings. See Artetxe and Schwenk (2018b) for details.
One may be tempted to train a bilingual embedding for each language pair, e.g. (Espana-Bonet et al., 2017; Hassan et al., 2018; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the one provided by the open-source LASER toolkit (Artetxe and Schwenk, 2018b). Training one joint multilingual embedding on many languages at once also has the advantage that low-resource languages can benefit from their similarity to other languages in the same language family. For example, we were able to mine parallel data for several (minority) Romance languages like Aragonese, Lombard, Mirandese or Sicilian, although data in those languages was not used to train the multilingual LASER embeddings.
The underlying idea of LASER is to train a sequence-to-sequence system on many language pairs at once, using a shared BPE vocabulary and a shared encoder for all languages. The sentence representation is obtained by max-pooling over all encoder output states. Figure 1 illustrates this approach. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description.
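The pooling step can be sketched in a few lines; the encoder states below are random stand-ins for the BiLSTM outputs of the real system:

```python
import numpy as np

def sentence_embedding(encoder_states):
    """Max-pool over the encoder output states of one sentence:
    (seq_len, dim) -> (dim,). Sentences of any length map to the
    same fixed-size space, which is what makes distance-based
    comparison across languages possible."""
    return encoder_states.max(axis=0)

# Two "sentences" of different lengths end up with same-sized embeddings.
states_short = np.random.randn(7, 1024)   # 7 BPE tokens
states_long = np.random.randn(12, 1024)   # 12 BPE tokens
assert sentence_embedding(states_short).shape == (1024,)
assert sentence_embedding(states_long).shape == (1024,)
```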
3.3 Fast similarity search
Fast large-scale similarity search is an area with a large body of research. Traditionally, the application domain is image search, but the algorithms are generic and can be applied to any type of vector. In this work, we use the open-source FAISS library7 which implements highly efficient algorithms to perform similarity search on billions of vectors (Johnson et al., 2017). An additional advantage is that FAISS has support to run on multiple GPUs. Our sentence representations are 1024-dimensional. This means that the embeddings of all English sentences require 153·10^6 × 1024 × 4 = 513 GB of memory. Therefore, dimensionality reduction and data compression are needed for efficient search. In this work, we chose a rather aggressive compression based on a 64-bit product quantizer (Jegou et al., 2011), partitioning the search space into 32k cells. This corresponds to the index type “OPQ64,IVF32768,PQ64” in FAISS terms.8 Another interesting compression method is scalar quantization. A detailed comparison is left for future research. We build and train one FAISS index for each language.

7 https://github.com/facebookresearch/faiss
The compressed FAISS index for English requires only 9.2 GB, i.e. more than fifty times smaller than the original sentence embeddings. This makes it possible to load the whole index on a standard GPU and to run the search in a very efficient way on multiple GPUs in parallel, without the need to shard the index. The overall mining process for German/English requires less than 3.5 hours on 8 GPUs, including the nearest-neighbor search in both directions and the scoring of all candidates.
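The storage arithmetic can be illustrated with a toy product quantizer: each 1024-dimensional float32 vector (4096 bytes) is split into 64 sub-vectors of dimension 16, and each sub-vector is replaced by the index of its nearest centroid in a 256-entry codebook, i.e. one byte each. The codebooks below are random rather than trained, and the OPQ rotation and IVF cell structure of the actual index are omitted; this only sketches the compression side of "OPQ64,IVF32768,PQ64":

```python
import numpy as np

D, M, K = 1024, 64, 256       # vector dim, sub-quantizers, centroids per codebook
SUB = D // M                  # each sub-vector has 16 dimensions

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, SUB)).astype(np.float32)  # toy, untrained

def pq_encode(vecs):
    """Replace every 16-dim sub-vector by the id (one byte) of its
    nearest codebook centroid: (n, 1024) float32 -> (n, 64) uint8."""
    n = vecs.shape[0]
    sub = vecs.reshape(n, M, SUB)
    codes = np.empty((n, M), dtype=np.uint8)
    for m in range(M):
        dist = ((sub[:, m, None, :] - codebooks[m][None]) ** 2).sum(axis=-1)
        codes[:, m] = dist.argmin(axis=1)
    return codes

x = rng.standard_normal((100, D)).astype(np.float32)
codes = pq_encode(x)
print(x[0].nbytes, codes[0].nbytes)   # 4096 64
```

The 64x per-vector reduction is in line with the index being "more than fifty times smaller" once the inverted-list overhead of the real index is added.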
4 Bitext mining in Wikipedia
For each Wikipedia article, it is possible to get the link to the corresponding article in other languages. This could be used to mine sentences limited to the respective articles. On the one hand, this local mining has several advantages: 1) mining is very fast since each article usually has only a few hundred sentences; 2) it seems reasonable to assume that a translation of a sentence is more likely to be found in the same article than anywhere else in the whole Wikipedia. On the other hand,
we hypothesize that the margin criterion will be less efficient, since one article usually has few sentences which are similar. This may lead to many sentences in the overall mined corpus of the type “NAME was born on DATE in CITY”, “BUILDING is a monument in CITY built on DATE”, etc. Although those alignments may be correct, we hypothesize that they are of limited use for training an NMT system, in particular when they are too frequent. In general, there is a risk that we will get sentences which are close in structure and content.
The other option is to consider the whole Wikipedia for each language: for each sentence in the source language, we mine in all target sentences. This global mining has several potential advantages: 1) we can try to align two languages even though there are only few articles in common; 2) many short sentences which only differ by the named entities are likely to be excluded by the margin criterion. A drawback of this global mining is a potentially increased risk of misalignment and a lower recall.
In this work, we chose the global mining option. This will allow us to scale the same approach to other, potentially huge, corpora for which document-level alignments are not easily available, e.g. Common Crawl. An in-depth comparison of local and global mining (on Wikipedia) is left for future research.
4.1 Corpus preparation
Extracting the textual content of Wikipedia articles in all languages is a rather challenging task, i.e. removing all tables, pictures, citations, footnotes and formatting markup. There are several ways to download Wikipedia content. In this study, we use the so-called CirrusSearch dumps, since they directly provide the textual content without any meta information.9 We downloaded this dump in March 2019. A total of about 300 languages are available, but the size obviously varies a lot between languages. We applied the following processing:
• extract the textual content;
• split the paragraphs into sentences;
• remove duplicate sentences;
9 https://dumps.wikimedia.org/other/cirrussearch/
L1 (French):      Ceci est une très grande maison

L2 (German):      Das ist ein sehr großes Haus
   (English):     This is a very big house
   (Hungarian):   Ez egy nagyon nagy ház
   (Indonesian):  Ini rumah yang sangat besar

Table 1: Illustration of how sentences in the wrong language can hurt the alignment process with a margin criterion. See text for a detailed discussion.
• perform language identification and remove sentences which are not in the expected language (usually citations or references to texts in another language).
It should be pointed out that sentence segmentation is not a trivial task, with many exceptions and specific rules for the various languages. For instance, it is rather difficult to make an exhaustive list of common abbreviations for all languages. In German, points are used after numbers in enumerations, but numbers may also appear at the end of sentences. Some languages, such as Thai, do not use specific symbols to mark the end of a sentence. We are not aware of a reliable and freely available sentence segmenter for Thai and we had to exclude that language. We used the freely available Python tool SegTok10 which has specific rules for 24 languages. Regular expressions were used for most of the Asian languages, falling back to the English rules for the remaining languages. This gives us 879 million sentences in 300 languages. The margin criterion to mine for parallel data requires that the texts do not contain duplicates. Deduplication removes about 25% of the sentences.11
LASER’s sentence embeddings are totally language agnostic. This has the side effect that sentences in other languages (e.g. citations or quotes) may be considered closer in the embedding space than a potential translation in the target language. Table 1 illustrates this problem. The algorithm would not select the German sentence although it is a perfect translation. The sentences in the other languages are also valid translations, which would yield a very small margin. To avoid this problem, we perform language identification (LID) on all sentences and remove those which are not in the expected language. LID
10 https://pypi.org/project/segtok/
11 The Cebuano and Waray Wikipedia were largely created by a bot.
Figure 2: BLEU scores (continuous lines) for several NMT systems trained on bitexts extracted from Wikipedia for different margin thresholds. The size of the mined bitexts is depicted as dashed lines.
is performed with fasttext12 (Joulin et al., 2016). Fasttext does not support all the 300 languages present in Wikipedia, and we disregarded the missing ones (which typically have only few sentences anyway). After deduplication and LID, we are left with 595M sentences in 182 languages. English accounts for 134M sentences, and German, with 51M sentences, is the second largest language. The sizes for all languages are given in Tables 3 and 5.
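The deduplication and LID filtering described above can be sketched as a small pipeline. To keep the sketch self-contained, the language predictor is passed in as a function; the toy predictor below stands in for the fasttext model used in practice and is purely illustrative:

```python
def clean_corpus(sentences, expected_lang, predict_lang):
    """Drop duplicates, then drop sentences whose predicted language
    differs from the language of the Wikipedia they came from
    (typically citations or quotes in a foreign language)."""
    seen, kept = set(), []
    for sent in sentences:
        sent = sent.strip()
        if not sent or sent in seen:
            continue                      # duplicate or empty: skip
        seen.add(sent)
        if predict_lang(sent) == expected_lang:
            kept.append(sent)
    return kept

def toy_lid(sentence):
    """Stand-in for a real LID model such as fasttext's."""
    return "en" if sentence.split()[0] in {"This", "The"} else "de"

corpus = ["Das ist ein Haus.", "Das ist ein Haus.", "This is a house."]
print(clean_corpus(corpus, "de", toy_lid))   # ['Das ist ein Haus.']
```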
4.2 Threshold optimization
Artetxe and Schwenk (2018a) optimized their mining approach for each language pair on a provided corpus of gold alignments. This is not possible when mining Wikipedia, in particular when considering many language pairs. In this work, we use an evaluation protocol inspired by the WMT shared task on parallel corpus filtering for low-resource conditions (Koehn et al., 2019): an NMT system is trained on the extracted bitexts – for different thresholds – and the resulting BLEU scores are compared. We chose newstest2014 of the WMT evaluations, since it provides an N-way parallel test set for English, French, German and Czech. We favoured translation between two morphologically rich languages from different families and considered the following language pairs: German/English, German/French, Czech/German and Czech/French. The size of the mined bitexts is in the range of 100k to more than 2M sentences (see Table 2 and Figure 2). We did not try to optimize the architecture of the NMT system to the size of the bitexts and used the same architecture for all systems: the encoder and decoder are 5-layer transformer models as implemented in fairseq (Ott et al., 2019). The goal of this study
is not to develop the best-performing NMT system for the considered language pairs, but to compare different mining parameters.
The evolution of the BLEU score as a function of the margin threshold is given in Figure 2. Decreasing the threshold naturally leads to more mined data – we observe an exponential increase of the data size. The performance of the NMT systems trained on the mined data changes as expected, in a surprisingly smooth way. The BLEU score first improves with increasing amounts of available training data, reaches a maximum, and then decreases, since the additional data gets more and more noisy, i.e. contains wrong translations. It is also not surprising that a careful choice of the margin threshold is more important in a low-resource setting, where every additional parallel sentence counts. According to Figure 2, the optimal value of the margin threshold seems to be 1.05 when many sentences can be extracted, in our case German/English and German/French. When less parallel data is available, i.e. Czech/German and Czech/French, a value in the range of 1.03–1.04 seems to be a better choice. Aiming at one threshold for all language pairs, we chose a value of 1.04. It seems to be a good compromise for most language pairs. However, for the open release of this corpus, we provide all mined sentences with a margin of 1.02 or better. This enables end users to choose an optimal threshold for their particular applications. It should be emphasized, however, that we do not expect many sentence pairs with a margin as low as 1.02 to be good translations.
For comparison, we also trained NMT systems on the Europarl corpus V7 (Koehn, 2005), i.e. professional human translations, first on all available data, and then on the same number of sentences
Europarl      3.0M   2.3M   768k   846k
+ Wikipedia   25.5   25.6   17.7   24.0

Table 2: Comparison of NMT systems trained on the Europarl corpus and on bitexts automatically mined from Wikipedia by our approach at a threshold of 1.04. We give the number of sentences (first line) and the BLEU score (second line of each block) on newstest2014.
than the mined ones (see Table 2). With the exception of Czech/French, we were able to achieve better BLEU scores with the bitexts automatically mined from Wikipedia than with Europarl data of the same size. Adding the mined text to the full Europarl corpus also leads to further improvements of 1.1 to 3.1 BLEU. We argue that this is a good indicator of the quality of the automatically extracted parallel sentences.
5 Result analysis
We ran the alignment process for all possible combinations of languages in Wikipedia. This yielded 1620 language pairs for which we were able to mine at least ten thousand sentences. Remember that mining L1 → L2 is identical to L2 → L1, and is counted only once. We propose to analyze and evaluate the extracted bitexts in two ways. First, we discuss the amount of extracted sentences (Section 5.1). We then turn to a qualitative assessment by training NMT systems for all language pairs with more than twenty-five thousand mined sentences (Section 5.2).
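Because an L1/L2 bitext is direction-independent, the mining jobs are simply the unordered language pairs; for the 85 languages retained in the end this gives 85·84/2 = 3570 candidate pairs, of which 1620 yielded at least ten thousand sentences. A short helper (the function name is ours) makes the bookkeeping explicit:

```python
from itertools import combinations

def mining_jobs(languages):
    """Unordered language pairs: mining de->fr and fr->de produce the
    same aligned sentences, so each pair is processed only once."""
    return list(combinations(sorted(languages), 2))

assert mining_jobs(["fr", "de", "en"]) == [("de", "en"), ("de", "fr"), ("en", "fr")]
assert len(mining_jobs([f"l{i:02d}" for i in range(85)])) == 85 * 84 // 2   # 3570
```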
5.1 Quantitative analysis

Due to space limits, Table 3 summarizes the number of extracted parallel sentences only for languages which have a total of at least five hundred thousand parallel sentences (with all other languages, at a margin threshold of 1.04). Additional results are given in Table 5 in the Appendix.
There are many reasons which can influence the number of mined sentences. Obviously, the larger the monolingual texts, the more likely it is to mine many parallel sentences. Not surprisingly, we observe that more sentences could be mined when English is one of the two languages. Let us point out some languages for which it is usually not easy to find parallel data with English, namely Indonesian (1M), Hebrew (545k), Farsi (303k) or Marathi (124k sentences). The largest mined texts not involving English are Russian/Ukrainian (2.5M), Catalan/Spanish (1.6M), between the Romance languages French, Spanish, Italian and Portuguese (480k–923k), and German/French (626k).
It is striking to see that we were able to mine more sentences when Galician and Catalan are paired with Spanish than with English. On the one hand, this could be explained by the fact that LASER’s multilingual sentence embeddings may be better, since the involved languages are linguistically very similar. On the other hand, it could be that the Wikipedia articles in both languages share a lot of content, or are obtained by mutual translation.
Services of the European Commission provide human translations of (legal) texts in all the 24 official languages of the European Union. This N-way parallel corpus enables the training of MT systems which translate directly between these languages, without the need to pivot through English. This is usually not the case when translating between other major languages, for example in Asia. Let us list some interesting language pairs for which we were able to mine more than one hundred thousand sentences: Korean/Japanese (222k), Russian/Japanese (196k), Indonesian/Vietnamese (146k), or Hebrew/Romance languages (120–150k sentences).
Overall, we were able to extract at least ten thousand parallel sentences for 85 different languages.13 For several low-resource languages, we were able to extract more parallel sentences with languages other than English. These include, among others, Aragonese with Spanish, Lombard with Italian, Breton with several Romance languages, Western Frisian with Dutch, Luxembourgish with German, and Egyptian Arabic and Wu Chinese with the respective major language.
Finally, Cebuano (ceb) falls clearly apart: it has a rather huge Wikipedia (17.9M filtered sentences), but most of it was generated by a bot, as for the Waray language.14 This certainly explains why only a very small number of parallel sentences could be extracted. Although the same bot was also used to generate articles in the Swedish Wikipedia, our alignments seem to be better for that language.

13 99 languages have more than 5,000 parallel sentences.
14 https://en.wikipedia.org/wiki/Lsjbot

Table 4: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitexts mined in Wikipedia only (with at least twenty-five thousand parallel sentences). No other resources were used.
5.2 Qualitative evaluation
Aiming to perform a large-scale assessment of the quality of the extracted parallel sentences, we trained NMT systems on them. We identified a publicly available data set which provides test sets for many language pairs: translations of TED talks, as proposed in the context of a study on pretrained word embeddings for NMT15 (Qi et al., 2018). We would like to emphasize that we did not use the training data provided by TED – we only trained on the sentences mined from Wikipedia. The goal of this study is not to build state-of-the-art NMT systems for the TED task, but to get an estimate of the quality of our extracted data for many language pairs. In particular, there may be a mismatch in topic and language style between Wikipedia texts and the transcribed and translated TED talks.
For training NMT systems, we used a transformer model from fairseq (Ott et al., 2019) with the parameter settings shown in Figure 3 in the appendix. For preprocessing, the text was tokenized using the Moses tokenizer (without true casing) and a 5000 subword vocabulary was learnt using SentencePiece (Kudo and Richardson, 2018). Decoding was done with beam size 5 and length normalization 1.2.
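To make the setup concrete, the values explicitly stated in the text (subword vocabulary size, beam size, length normalization, transformer architecture, Moses tokenization) can be collected in a small configuration sketch. All other entries below are illustrative placeholders, not the paper's actual hyperparameters (those are given in its appendix):

```python
# Hedged sketch: only the values marked "stated" appear in the text;
# everything else is an assumption for illustration only.
nmt_config = {
    "arch": "transformer",      # stated: fairseq transformer model
    "spm_vocab_size": 5000,     # stated: SentencePiece subword vocabulary
    "tokenizer": "moses",       # stated: Moses tokenizer, no true-casing
    "beam": 5,                  # stated: beam search width at decoding
    "lenpen": 1.2,              # stated: length normalization
    # assumed placeholders below
    "optimizer": "adam",
    "lr": 5e-4,
}

def decoding_args(cfg):
    """Render the decoding settings in fairseq's CLI flag style
    (illustrative rendering, not an exact command from the paper)."""
    return ["--beam", str(cfg["beam"]), "--lenpen", str(cfg["lenpen"])]
```

This mirrors how the same decoding settings would be passed to `fairseq-generate` in practice.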
We evaluate the trained translation systems on the TED dataset (Qi et al., 2018). The TED data consists of parallel TED talk transcripts in multiple languages, and it provides development and test sets for 50 languages. Since the development and test sets were already tokenized, we first detokenize them using Moses. We trained NMT systems for all possible language pairs with more than twenty-five thousand mined sentences. This gives us in total 1886 language pairs in 45 languages. We train L1 → L2 and L2 → L1 with the same mined bitexts L1/L2. Scores on the test sets were computed with SacreBLEU (Post, 2018). Table 4 summarizes all the results. Due to space constraints, we are unable to report BLEU scores for all language combinations in that table. Some additional results are reported in Table 6 in the annex. 23 NMT systems achieve BLEU scores over 30, the best one being 37.3 for Brazilian Portuguese to English. Several results are worth mentioning,
like Farsi/English: 16.7, Hebrew/English: 25.7, Indonesian/English: 24.9 or English/Hindi: 25.7. We also achieve interesting results for translation between various non-English language pairs for which it is usually not easy to find parallel data, e.g. Norwegian ↔ Danish ≈33, Norwegian ↔ Swedish ≈25, Indonesian ↔ Vietnamese ≈16 or Japanese ↔ Korean ≈17.
Our results on the TED set give an indication of the quality of the mined parallel sentences. These BLEU scores should of course be appreciated in the context of the sizes of the mined corpora given in Table 3. Obviously, we cannot exclude that the provided data contains some wrong alignments even though the margin is large. Finally, we would like to point out that we ran our approach on all available languages in Wikipedia, independently of the quality of LASER's sentence embeddings for each one.
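For readers unfamiliar with the metric behind these scores, the following is a minimal stdlib sketch of corpus-level BLEU: clipped n-gram precision up to 4-grams combined with a brevity penalty. It is illustrative only; the reported numbers were computed with SacreBLEU, which additionally standardizes tokenization and smoothing, so this sketch will not reproduce its exact values:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Minimal corpus BLEU: clipped n-gram precision + brevity penalty.
    Inputs are lists of whitespace-tokenized sentences, one reference per
    hypothesis. No smoothing -- a single unmatched n-gram order yields 0."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection implements clipping against the reference
            match[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

A perfect hypothesis scores 100; partially overlapping output lands strictly between 0 and 100.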
6 Conclusion
We have presented an approach to systematically mine for parallel sentences in the textual content of Wikipedia, for all possible language pairs. We use a recently proposed mining approach based on massively multilingual sentence embeddings (Artetxe and Schwenk, 2018b) and a margin criterion (Artetxe and Schwenk, 2018a). The same approach is used for all language pairs without the need for language-specific optimization. In total, we make available 135M parallel sentences in 85 languages, out of which only 34M sentences are aligned with English. We were able to mine more than ten thousand sentences for 1620 different language pairs. This corpus of parallel sentences is freely available.16 We also performed a large-scale evaluation of the quality of the mined sentences by training 1886 NMT systems and evaluating them on the 45 languages of the TED corpus (Qi et al., 2018).
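The margin criterion at the core of the mining can be sketched in a few lines of pure Python: the cosine similarity of a candidate pair is divided by the average cosine to the k nearest neighbors of each sentence, in both directions. This is a toy illustration over small embedding lists; in the actual pipeline the embeddings come from LASER and the k-NN search is done with FAISS over millions of sentences:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def margin_score(x, y, xs, ys, k=4):
    """Ratio-margin score of a candidate pair (x, y):
    cos(x, y) divided by the average cosine of x to its k nearest
    neighbors in ys and of y to its k nearest neighbors in xs.
    Scores well above 1 indicate that x and y are much closer to
    each other than to their respective neighborhoods."""
    def avg_knn(v, cands):
        sims = sorted((cosine(v, c) for c in cands), reverse=True)[:k]
        return sum(sims) / len(sims)
    denom = (avg_knn(x, ys) + avg_knn(y, xs)) / 2.0
    return cosine(x, y) / denom
```

A mutual-translation pair thus stands out from merely similar sentences, which is why mining thresholds on this ratio rather than on raw cosine similarity.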
This work opens several directions for future research. The mined texts could be used to first retrain LASER's multilingual sentence embeddings, with the hope of improving the performance on low-resource languages, and then to rerun mining in Wikipedia. This process could be iteratively repeated. We also plan to apply the same methodology to other large multilingual collections. The monolingual texts made available by ParaCrawl or CommonCrawl17 are good candidates.

We expect that the WikiMatrix corpus has mostly well-formed sentences and should not contain social media language. The mined parallel sentences are not limited to specific topics like many of the currently available resources (parliament proceedings, subtitles, software documentation, . . .), but are expected to cover the many topics of Wikipedia. The fraction of unedited machine-translated text is also expected to be low. We hope that this resource will be useful to support research in multilinguality, in particular machine translation.
7 Acknowledgments
We would like to thank Edouard Grave for help with handling the Wikipedia corpus and Matthijs Douze for support with the use of FAISS.
References

Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the Use of Comparable Corpora to Improve SMT performance. In EACL, pages 16–23.

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources.

Ahmad Aghaebrahimian. 2018. Deep neural networks at the service of multilingual parallel sentence extraction. In Coling.

Mikel Artetxe and Holger Schwenk. 2018a. Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings. https://arxiv.org/abs/1811.01136.

Mikel Artetxe and Holger Schwenk. 2018b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. https://arxiv.org/abs/1812.10464.

Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2017. Weighted Set-Theoretic Alignment of Comparable Sentences. In BUCC, pages 41–45.

Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2018. Extracting Parallel Sentences from Comparable Corpora with STACC Variants. In BUCC.

Houda Bouamor and Hassan Sajjad. 2018. H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings. In BUCC.

17 http://commoncrawl.org/
Christian Buck and Philipp Koehn. 2016. Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation, pages 554–563, Berlin, Germany. Association for Computational Linguistics.

Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. 2019. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (WMT).

Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, and Josef van Genabith. 2017. An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification. IEEE Journal of Selected Topics in Signal Processing, pages 1340–1348.

Thierry Etchegoyhen and Andoni Azpeitia. 2016. Set-Theoretic Alignment for Comparable Corpora. In ACL, pages 2009–2018.

Simon Gottschalk and Elena Demidova. 2017. MultiWiki: Interlingual text passage alignment in Wikipedia. ACM Transactions on the Web (TWEB), 11(1):6.

Francis Grégoire and Philippe Langlais. 2017. BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora. In BUCC, pages 46–50.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective Parallel Corpus Mining using Bilingual Sentence Embeddings. arXiv:1807.11906.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. arXiv:1803.05567.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. https://arxiv.org/abs/1607.01759.

H. Jégou, M. Douze, and C. Schmid. 2011. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 33(1):117–128.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Philipp Koehn, Francisco Guzmán, Vishrav Chaudhary, and Juan M. Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation, Volume 2: Shared Task Papers, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Brussels, Belgium. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

P. Lison and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In LREC.

Mehdi Zadeh Mohammadi and Nasser GhasemAghaee. 2010. Building bilingual parallel corpora based on Wikipedia. In 2010 Second International Conference on Computer Engineering and Applications, pages 264–268.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.

P. Otero, I. Lopez, S. Cilenis, and Santiago de Compostela. 2011. Measuring comparability of multilingual corpora extracted from Wikipedia. Iberian Cross-Language Natural Language Processing Tasks (ICL), page 8.

Pablo Gamallo Otero and Isaac González López. 2010. Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC, pages 21–25.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 87–95. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Philip Resnik. 1999. Mining the Web for Bilingual Text. In ACL.

Philip Resnik and Noah A. Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29(3):349–380.

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In ACL, pages 228–234.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In NAACL, pages 403–411.

J. Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.

Dan Tufiş, Radu Ion, Ştefan Daniel Dumitrescu, and Dan Ştefănescu. 2013. Wikipedia as an SMT training corpus. In RANLP, pages 702–709.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable Measures for Aligning Japanese-English News Articles and Sentences. In ACL.

Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. https://arxiv.org/abs/1902.08564.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In LREC.
Table 5 provides the amounts of mined parallel sentences for languages which have a rather small Wikipedia. Aligning those languages obviously yields a very small number of parallel sentences. Therefore, we only provide these results for alignment with high-resource languages. It is also likely that several of these alignments are of low quality since the LASER embeddings were not directly trained on most of these languages, but we still hope to achieve reasonable results since other languages of the same family may be covered.
Table 5: WikiMatrix (part 2): number of extracted sentences (in thousands) for languages with a rather small Wikipedia. Alignments with other languages yield less than 5k sentences and are omitted for clarity.
Table 3 gives the detailed configuration which was used to train NMT models on the mined data in Section 5.
Table 6: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitexts mined in Wikipedia only. No other resources were used.