arXiv:1907.05791v2 [cs.CL] 16 Jul 2019 · WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia Holger Schwenk Facebook AI [email protected] Vishrav Chaudhary

WikiMatrix: Mining 135M Parallel Sentencesin 1620 Language Pairs from Wikipedia

Holger SchwenkFacebook AI

[email protected]

Vishrav ChaudharyFacebook AI

[email protected]

Shuo SunJohns Hopkins

[email protected]

Hongyu GongUniversity of Illinois

at [email protected]

Francisco GuzmanFacebook AI

[email protected]

Abstract

We present an approach based on multilin-gual sentence embeddings to automatically ex-tract parallel sentences from the content ofWikipedia articles in 85 languages, includingseveral dialects or low-resource languages. Wedo not limit the the extraction process to align-ments with English, but systematically con-sider all possible language pairs. In total, weare able to extract 135M parallel sentences for1620 different language pairs, out of whichonly 34M are aligned with English. This cor-pus of parallel sentences is freely available.1

To get an indication on the quality of the ex-tracted bitexts, we train neural MT baselinesystems on the mined data only for 1886 lan-guages pairs, and evaluate them on the TEDcorpus, achieving strong BLEU scores formany language pairs. The WikiMatrix bitextsseem to be particularly interesting to train MTsystems between distant languages without theneed to pivot through English.

1 Introduction

Most of the current approaches in Natural Lan-guage Processing (NLP) are data-driven. The sizeof the resources used for training is often the pri-mary concern, but the quality and a large vari-ety of topics may be equally important. Mono-lingual texts are usually available in huge amountsfor many topics and languages. However, multi-lingual resources, typically sentences in two lan-guages which are mutual translations, are morelimited, in particular when the two languages donot involve English. An important source of par-allel texts are international organizations like theEuropean Parliament (Koehn, 2005) or the UnitedNations (Ziemski et al., 2016). These are profes-sional human translations, but they are in a more

1https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix

formal language and tend to be limited to politicaltopics. There are several projects relying on vol-unteers to provide translations for public texts, e.g.news commentary (Tiedemann, 2012), Opensub-Titles (Lison and Tiedemann, 2016) or the TEDcorpus (Qi et al., 2018)

Wikipedia is probably the largest free multi-lingual resource on the Internet. The content ofWikipedia is very diverse and covers many top-ics. Articles exist in more than 300 languages.Some content on Wikipedia was human translatedfrom an existing article into another language, notnecessarily from or into English. Eventually, thetranslated articles have been later independentlyedited and are not parallel any more. Wikipediastrongly discourages the use of unedited machinetranslation,2 but the existence of such articles cannot be totally excluded. Many articles have beenwritten independently, but may nevertheless con-tain sentences which are mutual translations. Thismakes Wikipedia a very appropriate resource tomine for parallel texts for a large number of lan-guage pairs. To the best of our knowledge, thisis the first work to process the entire Wikipediaand systematically mine for parallel sentences inall language pairs. We hope that this resource willbe useful for several research areas and enable thedevelopment of NLP applications for more lan-guages.

In this work, we build on a recent approach tomine parallel texts based on a distance measurein a joint multilingual sentence embedding space(Schwenk, 2018; Artetxe and Schwenk, 2018a).For this, we use the freely available LASERtoolkit3 which provides a language agnostic sen-tence encoder which was trained on 93 languages

2https://en.wikipedia.org/wiki/Wikipedia:Translation

3https://github.com/facebookresearch/LASER

arX

iv:1

907.

0579

1v2

[cs

.CL

] 1

6 Ju

l 201

9

https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix


https://en.wikipedia.org/wiki/Wikipedia:Translation

https://en.wikipedia.org/wiki/Wikipedia:Translation

https://github.com/facebookresearch/LASER

https://github.com/facebookresearch/LASER

(Artetxe and Schwenk, 2018b). We approach thecomputational challenge to mine in almost sixhundred million sentences by using fast indexingand similarity search algorithms.

The paper is organized as follows. In thenext section, we first discuss related work. Wethen summarize the underlying mining approach.Section 4 describes in detail how we appliedthis approach to extract parallel sentences fromWikipedia in 1620 language pairs. To asses thequality of the extracted bitexts, we train NMT sys-tems for a subset of language pairs and evaluatethem on the TED corpus (Qi et al., 2018) for 45languages. These results are presented in sec-tion 5. The paper concludes with a discussion offuture research directions.

2 Related work

There is a large body of research on miningparallel sentences in collections of monolingualtexts, usually named “comparable coprora”. Ini-tial approaches to bitext mining have relied onheavily engineered systems often based on meta-data information, e.g. (Resnik, 1999; Resnikand Smith, 2003). More recent methods explorethe textual content of the comparable documents.For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isa-hara, 2003; Munteanu and Marcu, 2005) or ma-chine translation, e.g. (Abdul-Rauf and Schwenk,2009; Bouamor and Sajjad, 2018), typically toobtain an initial alignment that is then furtherfiltered. In the shared task for bilingual docu-ment alignment (Buck and Koehn, 2016), manyparticipants used techniques based on n-gram orneural language models, neural translation mod-els and bag-of-words lexical translation probabil-ities for scoring candidate document pairs. TheSTACC method uses seed lexical translations in-duced from IBM alignments, which are combinedwith set expansion operations to score translationcandidates through the Jaccard similarity coeffi-cient (Etchegoyhen and Azpeitia, 2016; Azpeitiaet al., 2017, 2018). Using multilingual noisy web-crawls such as ParaCrawl4 for filtering good qual-ity sentence pairs has been explored in the sharedtasks for high resource (Koehn et al., 2018) andlow resource (Koehn et al., 2019) languages.

In this work, we rely on massively multilin-gual sentence embeddings and margin-based min-

4http://www.paracrawl.eu/

ing in the joint embedding space, as described in(Schwenk, 2018; Artetxe and Schwenk, 2018a,b).This approach has also proven to perform best ina low resource scenario (Chaudhary et al., 2019;Koehn et al., 2019). Closest to this approach is theresearch described in Espana-Bonet et al. (2017);Hassan et al. (2018); Guo et al. (2018); Yang et al.(2019). However, in all these works, only bilin-gual sentence representations have been trained.Such an approach does not scale to many lan-guages, in particular when considering all possiblelanguage pairs in Wikipedia. Finally, related ideashave been also proposed in Bouamor and Sajjad(2018) or Gregoire and Langlais (2017). However,in those works, mining is not solely based on mul-tilingual sentence embeddings, but they are partof a larger system. To the best of our knowledge,this work is the first one that applies the same min-ing approach to all combinations of many differentlanguages, written in more than twenty differentscripts.

Wikipedia is arguably the largest comparablecorpus. One of the first attempts to exploit thisresource was performed by Adafre and de Ri-jke (2006). An MT system was used to trans-late Dutch sentences into English and to com-pare them with the English texts. This methodyielded several hundreds of Dutch/English par-allel sentences. Later, a similar technique wasapplied to the Persian/English pair (Mohammadiand GhasemAghaee, 2010). Structural informa-tion in Wikipedia such as the topic categories ofdocuments was used in the alignment of multi-lingual corpora (Otero and Lopez, 2010). In an-other work, the mining approach of Munteanu andMarcu (2005) was applied to extract large corporafrom Wikipedia in sixteen languages (Smith et al.,2010). Otero et al. (2011) measured the compa-rability of Wikipedia corpora by the translationequivalents on three languages Portuguese, Span-ish, and English. Patry and Langlais (2011) cameup with a set of features such as Wikipedia enti-ties to recognize parallel documents, and their ap-proach was limited to a bilingual setting. Tufiset al. (2013) proposed an approach to mine parallelsentences from Wikipedia textual content, but theyonly considered high-resource languages, namelyGerman, Spanish and Romanian paired with En-glish. Tsai and Roth (2016) grounded multilin-gual mentions to English wikipedia by trainingcross-lingual embeddings on twelve languages.

2

http://www.paracrawl.eu/

Gottschalk and Demidova (2017) searched forparallel text passages in Wikipedia by compar-ing their named entities and time expressions.Finally, Aghaebrahimian (2018) propose an ap-proach based on bilingual BiLSTM sentence en-coders to mine German, French and Persian par-allel texts with English. Parallel data consistingof aligned Wikipedia titles have been extracted fortwenty-three languages5. Since Wikipedia titlesare rarely entire sentences with a subject, verb andobject, it seems that only modest improvementswere observed when adding this resource to thetraining material of NMT systems.

We are not aware of other attempts to system-atically mine for parallel sentences in the textualcontent of Wikipedia for a large number of lan-guages.

3 Distance-based mining approach

The underling idea of the mining approach usedin this work is to first learn a multilingual sen-tence embedding, i.e. an embedding space inwhich semantically similar sentences are close in-dependently of the language they are written in.This means that the distance in that space can beused as an indicator whether two sentences aremutual translations or not. Using a simple abso-lute threshold on the cosine distance was shownto achieve competitive results (Schwenk, 2018).However, it has been observed that an absolutethreshold on the cosine distance is globally notconsistent, e.g. (Guo et al., 2018). The difficultyto select one global threshold is emphasized in oursetting since we are mining parallel sentences formany different language pairs.

3.1 Margin criterion

The alignment quality can be substantially im-proved by using a margin criterion instead of anabsolute threshold (Artetxe and Schwenk, 2018a).In that work, the margin between two candidatesentences x and y is defined as the ratio betweenthe cosine distance between the two sentence em-beddings, and the average cosine similarity of its

5https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

nearest neighbors in both directions:

margin(x, y)

=cos(x, y)∑

z∈NNk(x)

cos(x, z)

2k+

∑z∈NNk(y)

cos(y, z)

2k

(1)

where NNk(x) denotes the k unique nearestneighbors of x in the other language, and analo-gously for NNk(y). We used k = 4 in all experi-ments.

We follow the “max” strategy as described in(Artetxe and Schwenk, 2018a): the margin is firstcalculated in both directions for all sentences inlanguage L1 and L2. We then create the unionof these forward and backward candidates. Candi-dates are sorted and pairs with source or target sen-tences which were already used are omitted. Wethen apply a threshold on the margin score to de-cide whether two sentences are mutual translationsor not. Note that with this technique, we alwaysget the same aligned sentences, independently ofthe mining direction, e.g. searching translationsof French sentences in a German corpus, or inthe opposite direction. The reader is referred toArtetxe and Schwenk (2018a) for a detailed dis-cussion with related work.

The complexity of a distance-based mining ap-proach is O(N ×M), where N and M are thenumber of sentences in each monolingual corpus.This makes a brute-force approach with exhaus-tive distance calculations intractable for large cor-pora. Margin-based mining was shown to sig-nificantly outperform the state-of-the-art on theshared-task of the workshop on Building and Us-ing Comparable Corpora (BUCC) (Artetxe andSchwenk, 2018a). The corpora in the BUCC cor-pus are rather small: at most 567k sentences.

The languages with the largest Wikipedia areEnglish and German with 134M and 51M sen-tences, respectively, after pre-processing (see Sec-tion 4.1 for details). This would require 6.8×1015

distance calculations.6 We show in Section 3.3how to tackle this computational challenge.

3.2 Multilingual sentence embeddingsDistance-based bitext mining requires a joint sen-tence embedding for all the considered languages.

6Strictly speaking, Cebuano and Swedish are largerthan German, yet mostly consist of template/machinetranslated text https://en.wikipedia.org/wiki/List_of_Wikipedias

3

https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

https://en.wikipedia.org/wiki/List_of_Wikipedias

https://en.wikipedia.org/wiki/List_of_Wikipedias

…

…

sent LidBPE

LSTM

<s>

sent LidBPE

LSTM

y1

sent LidBPE

LSTM

yn

y1

softmax

y2

softmax

</s>

softmax…

…

…

BPE emb

BiLSTM

x2

BPE emb

BiLSTM

</s>

…

sent embmax pooling

W

BPE emb

x1

BiLSTM

BiLSTM

…

BiLSTM

…

BiLSTM

…

ENCODER DECODER

Figure 1: Architecture of the system used to train massively multilingual sentence embeddings. See Artetxe andSchwenk (2018b) for details.

One may be tempted to train a bi-lingual em-bedding for each language pair, e.g. (Espana-Bonet et al., 2017; Hassan et al., 2018; Guo et al.,2018; Yang et al., 2019), but this is difficult toscale to thousands of language pairs present inWikipedia. Instead, we chose to use one singlemassively multilingual sentence embedding for alllanguages, namely the one proposed by the open-source LASER toolkit (Artetxe and Schwenk,2018b). Training one joint multilingual embed-ding on many languages at once also has the ad-vantage that low-resource languages can benefitfrom the similarity to other language in the samelanguage family. For example, we were able tomine parallel data for several Romance (minority)languages like Aragonese, Lombard, Mirandese orSicilian although data in those languages was notused to train the multilingual LASER embeddings.

The underlying idea of LASER is to train asequence-to-sequence system on many languagepairs at once using a shared BPE vocabulary anda shared encoder for all languages. The sentencerepresentation is obtained by max-pooling over allencoder output states. Figure 1 illustrates thisapproach. The reader is referred to Artetxe andSchwenk (2018b) for a detailed description.

3.3 Fast similarity search

Fast large-scale similarity search is an area witha large body of research. Traditionally, the appli-cation domain is image search, but the algorithmsare generic and can be applied to any type of vec-tors. In this work, we use the open-source FAISSlibrary7 which implements highly efficient algo-rithms to perform similarity search on billions ofvectors (Johnson et al., 2017). An additional ad-vantage is that FAISS has support to run on multi-

7https://github.com/facebookresearch/faiss

ple GPUs. Our sentence representations are 1024-dimensional. This means that the embeddings ofall English sentences require 153·106×1024×4 =513 GB of memory. Therefore, dimensionality re-duction and data compression are needed for effi-cient search. In this work, we chose a rather ag-gressive compression based on a 64-bit product-quantizer (Jegou et al., 2011), and portioning thesearch space in 32k cells. This corresponds tothe index type “OPQ64,IVF32768,PQ64” inFAISS terms.8 Another interesting compressionmethod is scalar quantization. A detailed compar-ison is left for future research. We build and trainone FAISS index for each language.

The compressed FAISS index for English re-quires only 9.2GB, i.e. more than fifty timessmaller than the original sentences embeddings.This makes it possible to load the whole index ona standard GPU and to run the search in a very ef-ficient way on multiple GPUs in parallel, withoutthe need to shard the index. The overall miningprocess for German/English requires less than 3.5hours on 8 GPUs, including the nearest neighborsearch in both direction and scoring all candidates

4 Bitext mining in Wikipedia

For each Wikipedia article, it is possible to getthe link to the corresponding article in other lan-guages. This could be used to mine sentences lim-ited to the respective articles. One one hand, thislocal mining has several advantages: 1) miningis very fast since each article usually has a fewhundreds of sentences only; 2) it seems reasonableto assume that a translation of a sentence is morelikely to be found in the same article than any-where in the whole Wikipedia. On the other hand,

8https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

4

https://github.com/facebookresearch/faiss

https://github.com/facebookresearch/faiss

https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

we hypothesize that the margin criterion will beless efficient since one article has usually few sen-tences which are similar. This may lead to manysentences in the overall mined corpus of the type“NAME was born on DATE in CITY”, “BUILD-ING is a monument in CITY built on DATE”, etc.Although those alignments may be correct, we hy-pothesize that they are of limited use to train anNMT system, in particular when they are too fre-quent. In general, there is a risk that we will getsentences which are close in structure and content.

The other option is to consider the wholeWikipedia for each language: for each sentencein the source language, we mine in all target sen-tences. This global mining has several potentialadvantages: 1) we can try to align two languageseven though there are only few articles in com-mon; 2) many short sentences which only differ bythe name entities are likely to be excluded by themargin criterion. A drawback of this global min-ing is a potentially increased risk of misalignmentand a lower recall.

In this work, we chose the global mining op-tion. This will allow us to scale the same ap-proach to other, potentially huge, corpora forwhich document-level alignments are not easilyavailable, e.g. Common Crawl. An in depth com-parison of local and global mining (on Wikipedia)is left for future research.

4.1 Corpus preparation

Extracting the textual content of Wikipedia arti-cles in all languages is a rather challenging task,i.e. removing all tables, pictures, citations, foot-notes or formatting markup. There are severalways to download Wikipedia content. In thisstudy, we use the so-called CirrusSearch dumpssince they directly provide the textual contentwithout any meta information.9 We downloadedthis dump in March 2019. A total of about 300 lan-guages are available, but the size obviously variesa lot between languages. We applied the followingprocessing:

• extract the textual content;

• split the paragraphs into sentences;

• remove duplicate sentences;

9https://dumps.wikimedia.org/other/cirrussearch/

L1 (French) Ceci est une tres grande maison

L2 (German) Das ist ein sehr großes HausThis is a very big houseEz egy nagyon nagy hazIni rumah yang sangat besar

Table 1: Illustration how sentences in the wrong lan-guage can hurt the alignment process with a margincriterion. See text for a detailed discussion.

• perform language identification and removesentences which are not in the expected lan-guage (usually, citations or references to textsin another language).

It should be pointed out that sentence segmenta-tion is not a trivial task, with many exceptions andspecific rules for the various languages. For in-stance, it is rather difficult to make an exhaustivelist of common abbreviations for all languages. InGerman, points are used after numbers in enumer-ations, but numbers may also appear at the end ofsentences. Other languages do not use specificsymbols to mark the end of a sentence, namelyThai. We are not aware of a reliable and freelyavailable sentence segmenter for Thai and we hadto exclude that language. We used the freely avail-able Python tool SegTok10 which has specific rulesfor 24 languages. Regular expressions were usedfor most of the Asian languages, falling back toEnglish for the remaining languages. This givesus 879 million sentences in 300 languages. Themargin criterion to mine for parallel data requiresthat the texts do not contain duplicates. This re-moves about 25% of the sentences.11

LASER’s sentence embeddings are totally lan-guage agnostic. This has the side effect that thesentences in other languages (e.g. citations orquotes) may be considered closer in the embed-ding space than a potential translation in the tar-get language. Table 1 illustrates this problem.The algorithm would not select the German sen-tence although it is a perfect translation. The sen-tences in the other languages are also valid trans-lations which would yield a very small margin. Toavoid this problem, we perform language identi-fication (LID) on all sentences and remove thosewhich are not in the expected language. LID

10https://pypi.org/project/segtok/11The Cebuano and Waray Wikipedia were largely created

by a bot and contain more than 65% of duplicates.

5

https://dumps.wikimedia.org/other/cirrussearch/

https://dumps.wikimedia.org/other/cirrussearch/

https://pypi.org/project/segtok/

21

21.5

22

22.5

23

23.5

24

24.5

25

1.03 1.035 1.04 1.045 1.05 1.055 1.06 0

200

400

600

800

1000

1200

1400

1600

1800

2000

2200

2400

2600

2800

3000

BLEU

Bit

ext

siz

e in k

Margin threshold

de-ende-fr

10

11

12

13

14

15

16

17

1.03 1.035 1.04 1.045 1.05 1.055 1.06 0

50

100

150

200

250

300

350

400

450

500

550

600

BLEU

Bit

ext

siz

e in k

Margin threshold

cs-decs-fr

Figure 2: BLEU scores (continuous lines) for several NMT systems trained on bitexts extracted from Wikipediafor different margin thresholds. The size of the mined bitexts are depicted as dashed lines.

is performed with fasttext12 (Joulin et al., 2016).Fasttext does not support all the 300 languagespresent in Wikipedia and we disregarded the miss-ing ones (which typically have only few sentencesanyway). After deduplication and LID, we disposeof 595M sentences in 182 languages. English ac-counts for 134M sentences, and German with 51Msentences is the second largest language. The sizesfor all languages are given in Tables 3 and 5.

4.2 Threshold optimization

Artetxe and Schwenk (2018a) optimized theirmining approach for each language pair on a pro-vided corpus of gold alignments. This is not pos-sible when mining Wikipedia, in particular whenconsidering many language pairs. In this work, weuse an evaluation protocol inspired by the WMTshared task on parallel corpus filtering for low-resource conditions (Koehn et al., 2019): an NMTsystem is trained on the extracted bitexts – for dif-ferent thresholds – and the resulting BLEU scoresare compared. We choose newstest2014 ofthe WMT evaluations since it provides an N -wayparallel test sets for English, French, Germanand Czech. We favoured the translation betweentwo morphologically rich languages from differ-ent families and considered the following lan-guage pairs: German/English, German/French,Czech/German and Czech/French. The size ofmined bitexts is in the range of 100k to more than2M (see Table 2 and Figure 2). We did not tryto optimize the architecture of the NMT systemto the size of the bitexts and used the same archi-tecture for all systems: the encoder and decoderare 5-layer transformer models as implemented infairseq (Ott et al., 2019). The goal of this study

12https://fasttext.cc/docs/en/language-identification.html

is not to develop the best performing NMT systemfor the considered languages pairs, but to comparedifferent mining parameters.

The evolution of the BLEU score in function ofthe margin threshold is given in Figure 2. Decreas-ing the threshold naturally leads to more mineddata – we observe an exponential increase of thedata size. The performance of the NMT systemstrained on the mined data seems to change as ex-pected, in a surprisingly smooth way. The BLEUscore first improves with increasing amounts ofavailable training data, reaches a maximum andthan decreases since the additional data gets moreand more noisy, i.e. contains wrong translations.It is also not surprising that a careful choice ofthe margin threshold is more important in a low-resource setting. Every additional parallel sen-tence is important. According to Figure 2, the op-timal value of the margin threshold seems to be1.05 when many sentences can be extracted, in ourcase German/English and German/French. Whenless parallel data is available, i.e. Czech/Germanand Czech/French, a value in the range of 1.03–1.04 seems to be a better choice. Aiming at onethreshold for all language pairs, we chose a valueof 1.04. It seems to be a good compromise formost language pairs. However, for the open re-lease of this corpus, we provide all mined sentencewith a margin of 1.02 or better. This would enableend users to choose an optimal threshold for theirparticular applications. However, it should be em-phasized that we do not expect that many sentencepairs with a margin as low as 1.02 are good trans-lations.

For comparison, we also trained NMT systemson the Europarl corpus V7 (Koehn, 2005), i.e. pro-fessional human translations, first on all availabledata, and then on the same number of sentences

6

https://fasttext.cc/docs/en/language-identification.html

https://fasttext.cc/docs/en/language-identification.html

Bitexts de-en de-fr cs-de cs-fr

Europarl

1.9M 1.9M 568k 627k21.5 23.6 14.9 21.5

1.0M 370k 200k 220k21.2 21.1 12.6 19.2

Mined 1.0M 372k 201k 219kWikipedia 24.4 22.7 13.1 16.3

Europarl 3.0M 2.3M 768k 846k+ Wikipedia 25.5 25.6 17.7 24.0

Table 2: Comparison of NMT systems trained on theEuroparl corpus and on bitexts automatically mined inWikipedia by our approach at a threshold of 1.04. Wegive the number of sentences (first line) and the BLEUscore (second line of each bloc) on newstest2014.

than the mined ones (see Table 2). With the excep-tion of Czech/French, we were able to achieve bet-ter BLEU scores with the automatically mined bi-texts in Wikipedia than with Europarl of the samesize. Adding the mined text to the full Europarlcorpus, also leads to further improvements of 1.1to 3.1 BLEU. We argue that this is a good indi-cator of the quality of the automatically extractedparallel sentences.

5 Result analysis

We run the alignment process for all possible com-binations of languages in Wikipedia. This yielded1620 language pairs for which we were able tomine at least ten thousand sentences. Remem-ber that mining L1 → L2 is identical to L2 → L1,and is counted only once. We propose to ana-lyze and evaluate the extracted bitexts in two ways.First, we discuss the amount of extracted sen-tences (Section 5.1). We then turn to a qualitativeassessment by training NMT systems for all lan-guage pairs with more than twenty-five thousandmined sentences (Section 5.2).

5.1 Quantitative analysisDue to space limits, Table 3 summarizes the num-ber of extracted parallel sentences only for lan-guages which have a total of at least five hun-dred thousand parallel sentences (with all otherlanguages at a margin threshold of 1.04). Addi-tional results are given in Table 5 in the Appendix.

There are many reasons which can influence thenumber of mined sentences. Obviously, the largerthe monolingual texts, the more likely it is to mine

many parallel sentences. Not surprisingly, we ob-serve that more sentences could be mined whenEnglish is one of the two languages. Let us pointout some languages for which it is usually not ob-vious to find parallel data with English, namelyIndonesian (1M), Hebrew (545k), Farsi (303k)or Marathi (124k sentences). The largest minedtexts not involving English are Russian/Ukrainian(2.5M), Catalan/Spanish (1.6M), between the Ro-mance languages French, Spanish, Italian and Por-tuguese (480k–923k), and German/French (626k).

It is striking to see that we were able to minemore sentences when Galician and Catalan arepaired with Spanish than with English. On onehand, this could be explained by the fact thatLASER’s multilingual sentence embeddings maybe better since the involved languages are linguis-tically very similar. On the other, it could be thatthe Wikipedia articles in both languages share a lotof content, or are obtained by mutual translation.

Services from the European Commission pro-vide human translations of (legal) texts in all the24 official languages of the European Union. ThisN-way parallel corpus enables training of MTsystem to directly translate between these lan-guages, without the need to pivot through En-glish. This is usually not the case when translat-ing between other major languages, for example inAsia. Let us list some interesting language pairsfor which we were able to mine more than hun-dred thousand sentences: Korean/Japanese (222k),Russian/Japanese (196k), Indonesian/Vietnamese(146k), or Hebrew/Romance languages (120–150k sentences).

Overall, we were able to extract at least tenthousand parallel sentences for 85 different lan-guages.13 For several low-resource languages,we were able to extract more parallel sentenceswith other languages than English. These include,among others, Aragonse with Spanish, Lombardwith Italian, Breton with several Romance lan-guages, Western Frisian with Dutch, Luxembour-gish with German or Egyptian Arabic and Wu Chi-nese with the respective major language.

Finally, Cebuano (ceb) falls clearly apart: ithas a rather huge Wikipedia (17.9M filtered sen-tence), but most of it was generated by a bot, asfor the Waray language14. This certainly explainsthat only a very small number of parallel sen-

1399 languages have more than 5,000 parallel sentences.14https://en.wikipedia.org/wiki/Lsjbot

7

https://en.wikipedia.org/wiki/Lsjbot

ISO

Nam

eL

angu

age

Fam

ilysi

zeaz

babe

bgbn

bsca

csda

deel

eneo

eset

eufa

fifr

glhe

hihr

huid

isit

jakk

kolt

mk

ml

mr

nenl

nooc

plpt

roru

shsi

sksl

sqsr

svsw

tate

tltr

ttuk

vizh

tota

l

arA

rabi

cA

rabi

c65

1617

1511

5440

3494

6753

9966

999

3717

440

2458

5316

350

6838

3860

9018

123

8311

4833

5232

3212

7358

974

157

7112

535

3232

3930

4958

1327

2718

6914

7093

8644

35az

Aze

rbai

jani

Turk

ic18

737

314

89

1517

1134

1071

931

107

1614

2910

129

1015

174

2523

611

98

54

222

132

2225

1447

93

98

910

195

115

242

819

1019

885

baB

ashk

irTu

rkic

536

214

49

1716

1327

1028

928

86

912

2912

94

1012

123

2613

45

79

11

121

153

1924

1542

101

1010

710

204

31

111

815

710

697

beB

elar

usia

nSl

avic

1690

164

716

149

208

339

287

55

1024

910

39

119

324

122

57

82

21

1410

219

2313

161

82

87

510

162

32

18

280

88

803

bgB

ulga

rian

Slav

ic33

2738

3476

7953

132

6235

740

122

4325

3761

117

4358

3047

6860

1710

271

1138

4286

2935

1384

589

9611

469

270

4131

4346

2665

6312

2123

2156

1212

660

6037

70bn

Ben

gali

Indo

-Ary

an14

1221

4147

3370

3628

027

8126

1420

3768

2734

2123

4136

864

384

2021

237

83

5035

452

7646

6220

825

2617

2554

612

84

3337

3131

1991

bsB

osni

anSl

avic

1060

4443

3271

3321

024

7025

1620

3660

3230

1616

439

3811

5236

622

2339

1920

845

366

4862

3759

178

1625

3419

130

387

1512

1333

839

3831

2348

caC

atal

anR

oman

ce83

3210

086

180

9012

0581

1580

5477

4483

490

268

8437

5792

107

2331

610

312

5245

6145

5617

144

102

5712

135

811

016

952

5250

5734

6710

214

3035

3177

1698

106

9080

47cs

Cze

chSl

avic

7634

7523

370

519

7518

162

3645

9518

554

7238

6310

578

2316

110

515

5355

5136

4115

139

8611

176

153

8218

650

3747

464

3063

9715

3134

2575

1410

474

8053

25da

Dan

ish

Ger

man

ic28

8118

054

436

3914

045

2629

7514

244

5525

4369

6320

115

769

3735

3930

3512

110

303

889

123

7010

937

3239

4023

4316

811

2021

2355

1162

6857

4006

deG

erm

anG

erm

anic

5094

495

1573

186

418

106

5366

163

626

8010

957

8719

210

734

388

217

2382

7864

5158

2147

220

717

285

294

129

368

6850

9410

651

8121

620

5857

3212

723

165

107

134

9692

elG

reek

Hel

leni

c32

1162

039

145

4123

3555

137

4856

2643

6473

1511

969

635

3452

2732

976

608

7714

478

114

3831

3546

2852

6211

1618

2056

968

7562

3728

enE

nglis

hG

erm

anic

1344

3129

833

7724

311

930

337

527

5744

654

523

125

948

810

1985

2126

851

2030

615

739

571

124

1579

663

637

668

2461

631

1661

224

115

178

318

180

395

546

5195

9175

477

3268

110

7378

634

321

eoE

sper

anto

cons

truc

ted

2371

149

3125

2346

134

4639

2229

5746

1510

148

726

2830

2828

981

479

7791

4381

2628

4132

1936

538

1619

1737

850

4239

2817

esSp

anis

hR

oman

ce25

202

8915

483

155

905

610

153

7194

167

198

4267

121

926

108

7692

6598

2527

218

135

235

923

183

393

8184

8193

5310

718

121

5771

4814

726

187

206

174

1498

0et

Est

onia

nU

ralic

2303

2224

7085

3239

2033

5641

1475

576

2935

3220

218

7249

773

7648

9627

1934

3518

3458

916

1615

437

5640

4425

11eu

Bas

que

Isol

ate

2259

1433

6543

2513

2135

2710

5433

518

1919

1110

544

296

4358

3047

199

2020

1221

386

1311

829

530

2523

1699

faFa

rsi

Iran

ian

3954

3471

2536

2024

3946

964

468

2620

2711

106

4932

450

7740

7221

921

2417

3042

821

127

429

4138

4221

22fi

Finn

ish

Ura

lic74

2815

647

6428

4890

6322

131

879

4347

4029

3010

126

8610

119

131

6913

939

2750

4625

4612

612

2324

2172

1076

6064

3998

frFr

ench

Rom

ance

3449

415

413

660

8516

416

138

744

214

2489

7183

6283

2533

116

612

425

555

820

641

072

7483

8648

9218

619

5665

4213

026

170

165

157

1244

1gl

Gal

icia

nR

oman

ce22

2141

2133

5056

1412

050

928

2735

2939

1166

5217

6522

756

8430

3629

3320

3954

815

1722

4312

5158

4639

69he

Heb

rew

Sem

itic

6962

2841

6563

1712

182

743

3542

2625

986

648

8413

367

131

3521

3638

2345

6710

2125

1354

873

6662

3666

hiH

indi

Indo

-Ary

an13

5321

3331

756

353

1816

246

1112

4027

444

6335

5617

618

2114

2240

713

183

294

3326

3017

81hr

Cro

atia

nSl

avic

2229

5847

1480

488

2731

5224

2410

6548

771

8551

8566

619

3553

2120

556

916

1617

429

5546

4236

29hu

Hun

gari

anU

ralic

7702

7020

146

9911

4948

4727

2810

121

7510

126

148

8714

946

2656

5527

5388

1329

3020

759

8374

7543

57id

Indo

nesi

anM

alay

o-Po

lyne

sian

3899

1614

677

845

3355

2523

1010

183

893

204

9412

743

2337

4632

5679

1324

1921

7911

7314

683

4743

isIc

elan

dic

Ger

man

ic48

831

182

912

126

53

2722

327

3520

3013

413

139

1328

47

55

163

1816

1499

5it

Ital

ian

Rom

ance

2102

517

924

8362

7358

7824

240

150

2021

948

016

130

367

6872

8148

8315

319

4458

4111

223

144

143

137

1008

4ja

Japa

nese

Japo

nic

3161

414

222

4748

2123

812

381

912

817

579

196

4019

4850

2851

9612

3731

1284

1292

7526

753

82kk

Kaz

akh

Turk

ic16

846

56

22

118

111

1722

1232

81

87

68

163

31

110

414

69

581

koK

orea

nK

orea

nic

5400

2326

1010

456

415

6393

4789

239

2526

1729

517

1713

547

548

4957

2539

ltL

ithua

nian

Bal

tic20

3128

1616

657

396

7064

3910

725

1530

3016

2946

813

1112

366

5733

3520

91m

kM

aced

onia

nSl

avic

1487

2122

953

466

5693

5688

5219

2939

2510

648

914

1516

438

5757

4527

61m

lM

alay

alam

Dra

vidi

an27

23

141

327

4158

3546

212

2221

1320

446

43

116

226

1517

1388

mr

Mar

athi

Indo

-Ary

an50

31

4635

547

8647

5022

325

2513

1956

45

31

171

2614

1716

48ne

Nep

ali

Indo

-Ary

an23

517

134

1721

1419

101

1010

69

173

11

71

116

658

4nl

Dut

chG

erm

anic

1409

113

313

177

218

9619

953

4266

6434

6115

116

3735

2990

1810

484

8863

64no

Nor

weg

ian

Ger

man

ic13

649

103

161

7412

142

2843

5126

4727

012

2423

2461

1369

7963

4771

ocO

ccita

nR

oman

ce47

213

2411

146

56

64

712

23

33

73

98

770

6pl

Polis

hSl

avic

1627

020

097

285

5640

8168

3569

121

1639

4228

9216

172

8492

6035

ptPo

rtug

uese

Rom

ance

1335

417

731

274

7671

8547

101

155

2042

5445

140

2315

621

316

510

932

roR

oman

ian

Rom

ance

3504

136

4443

4249

3058

7515

2327

2972

1382

9674

4515

ruR

ussi

anSl

avic

3253

770

4285

7844

114

140

1754

5529

119

2524

8612

214

811

311

shSe

rbo-

Cro

atia

nla

ngua

ges

Sout

hSl

avic

2069

1727

4619

373

468

1413

1735

945

4236

3337

siSi

nhal

aIn

do-A

ryan

320

2222

1015

484

44

215

121

1416

1464

skSl

ovak

Slav

ic19

0436

1734

518

1415

1637

951

3838

2726

slSl

oven

ian

Slav

ic18

3619

4750

915

1617

398

5048

4226

23sq

Alb

ania

nA

lban

ian

740

2531

613

1010

2729

3024

1515

srSe

rbia

nSl

avic

4226

519

1918

1443

871

5645

3510

svSw

edis

hG

erm

anic

1790

616

3339

3572

1782

7473

5107

swSw

ahili

Nig

er-

Con

go23

56

34

1212

1011

652

taTa

mil

Dra

vidi

an16

296

229

530

1927

1315

teTe

lugu

Dra

vidi

an15

091

211

3016

2013

15tl

Taga

log

Mal

ayo-

Poly

nesi

an31

212

116

1710

1015

trTu

rkis

hTu

rkic

4067

1067

7769

3698

ttTa

tar

Turk

ic44

911

710

603

ukU

krai

nian

Slav

ic12

871

7372

7043

viV

ietn

ames

eV

ietic

6727

8946

60zh

Chi

nese

Chi

nese

1230

644

15

Tabl

e3:

Wik

iMat

rix:

num

ber

ofex

trac

ted

sent

ence

sfo

rea

chla

ngua

gepa

ir(i

nth

ousa

nds)

,e.g

.E

s-Ja

=219

corr

espo

nds

to21

9,26

0se

nten

ces

(rou

nded

).T

heco

lum

n“s

ize”

give

sth

enu

mbe

rofl

ines

inth

em

onol

ingu

alte

xts

afte

rded

uplic

atio

nan

dL

ID.

8

Src/Trgar bg bs cs da de el en eo es frfr-ca gl he hr hu id it ja ko mk nb nl pl ptpt-br ro ru sk sl sr sv tr uk vizh-cnzh-tw

ar 4.9 1.8 3.0 4.5 3.8 6.7 20.3 4.1 13.2 12.2 9.0 5.6 3.5 2.2 2.7 9.2 9.9 4.2 5.3 5.5 4.9 4.4 3.0 12.0 12.2 5.6 5.6 1.5 2.7 1.2 4.0 2.4 4.5 12.3 8.2 4.9bg 3.0 4.2 6.7 8.7 8.5 10.0 25.3 7.7 16.3 14.7 11.9 8.4 3.2 6.4 4.7 10.3 12.2 4.0 5.4 16.2 8.1 7.7 6.3 14.7 15.4 9.8 12.4 4.0 6.4 2.7 7.1 2.4 10.4 11.7 6.4 4.7bs 1.2 6.1 4.1 3.7 5.8 4.5 21.7 9.9 6.7 4.6 5.7 1.0 30.9 2.4 5.3 6.5 1.4 12.9 3.8 2.9 9.9 10.9 5.7 4.8 3.1 10.4 10.6 4.4 1.5 5.4 5.8 2.8 1.8cs 1.9 7.8 3.7 7.1 8.3 6.4 20.0 10.4 12.6 11.4 8.6 5.0 2.5 6.5 4.9 7.8 9.6 4.1 5.5 5.0 6.3 7.6 8.1 10.8 12.1 6.3 9.4 28.1 6.7 1.6 7.0 2.6 7.8 9.0 6.6 4.7da 2.0 8.9 4.0 5.2 14.0 9.0 32.9 6.7 16.7 16.7 12.8 7.3 3.5 4.4 4.7 10.8 13.4 4.7 6.0 6.2 33.1 12.4 4.8 14.0 16.2 8.5 7.8 3.1 5.2 1.4 25.8 2.7 6.3 11.2 7.3 4.9de 2.4 9.7 4.9 8.1 16.9 7.8 24.5 15.9 17.4 18.3 14.7 6.8 4.3 5.5 7.2 8.6 13.5 6.4 6.6 5.6 11.5 17.6 6.8 14.2 15.2 8.7 9.2 5.4 8.6 1.5 12.7 3.6 7.8 11.3 9.2 4.3el 4.1 11.2 4.8 5.5 9.7 6.7 27.9 8.0 18.8 16.3 13.5 10.1 4.0 5.6 5.3 13.1 15.3 5.5 6.1 10.2 9.3 8.7 5.1 18.0 18.3 10.4 8.4 3.1 6.4 2.0 7.2 3.4 7.2 14.4 8.7 5.3en 11.9 23.9 14.7 15.5 30.9 20.4 27.1 22.6 35.8 32.6 25.1 24.3 17.3 18.8 13.5 28.8 29.5 10.2 18.6 21.8 31.8 25.1 12.0 31.4 37.0 20.4 17.4 13.8 16.5 5.5 29.1 10.3 17.6 26.9 18.0 10.7eo 1.8 6.4 7.3 7.4 13.5 8.1 23.1 16.1 17.6 12.7 10.5 1.9 3.0 4.8 9.2 13.9 2.5 4.8 6.2 7.2 12.1 6.7 12.4 16.0 6.9 8.6 7.4 5.6 0.6 8.8 1.8 5.4 7.7 5.2 3.6es 6.2 14.3 6.8 8.0 15.7 12.9 16.4 33.2 13.5 25.6 19.9 30.1 8.1 8.7 7.7 16.1 23.8 7.9 9.9 11.9 13.3 14.1 8.1 27.6 27.8 14.7 11.6 6.0 8.1 2.6 13.7 5.2 10.0 17.8 12.3 6.6fr 6.0 12.7 5.0 8.3 15.6 14.5 15.4 31.6 18.6 26.4 17.2 6.8 6.9 7.7 15.0 24.6 7.3 8.7 10.7 13.1 15.9 8.1 24.2 26.0 16.5 12.7 5.8 7.1 2.1 13.8 4.2 9.8 17.0 11.1 6.6fr-ca 4.9 12.5 2.8 7.8 14.3 12.9 15.4 27.8 14.0 23.7 18.1 6.8 6.8 7.9 13.7 23.4 7.5 8.8 10.3 15.1 7.2 18.6 23.2 15.3 11.3 6.1 6.6 3.3 12.5 5.0 9.7 15.4 6.2gl 2.6 7.3 4.7 2.9 7.4 5.3 8.5 23.4 9.2 34.4 16.0 15.2 2.5 3.5 2.8 9.8 19.3 3.6 4.2 5.6 6.9 6.5 4.3 22.4 23.7 7.7 5.9 1.7 3.1 0.2 4.3 2.3 3.9 9.4 5.8 4.2he 3.4 5.7 1.8 3.7 6.1 5.4 6.3 25.7 4.2 15.3 13.6 10.3 5.1 2.5 3.2 8.9 11.2 4.1 4.5 4.6 5.4 6.5 3.7 13.7 15.0 6.3 8.0 1.9 2.6 1.0 5.2 1.8 5.1 10.7 6.9 4.6hr 1.6 8.7 29.9 7.0 6.5 5.8 6.6 24.4 4.4 12.6 9.9 7.8 5.7 1.5 4.3 8.2 10.4 1.8 4.0 14.1 5.3 6.1 5.4 12.1 12.6 7.3 8.3 4.6 12.6 11.7 6.0 1.9 6.9 9.4 4.8 2.9hu 1.6 5.6 2.6 5.9 7.0 5.5 16.7 6.5 10.8 10.9 9.3 3.9 2.1 3.6 6.6 8.2 4.4 6.2 3.9 4.5 6.0 4.3 9.7 9.9 7.1 5.8 3.4 4.2 1.2 5.3 3.0 4.4 8.5 6.5 4.2id 4.1 9.1 4.2 5.2 10.1 6.7 11.1 24.9 8.2 16.4 15.1 11.1 9.9 5.1 5.6 5.2 12.7 5.8 9.1 7.0 10.0 9.4 5.5 14.6 16.7 9.8 8.1 3.6 5.6 1.8 9.3 4.6 7.3 18.5 11.0 6.2it 5.3 11.7 5.0 6.9 13.1 11.5 14.5 30.0 13.9 26.4 24.9 20.0 19.3 6.2 7.0 6.7 14.0 7.3 9.0 9.9 13.3 12.8 7.3 22.8 24.9 13.3 10.2 5.4 7.1 2.3 11.9 4.6 8.8 15.3 10.7 5.8ja 1.4 1.9 0.7 1.8 3.1 2.7 2.5 7.9 2.2 6.0 6.0 4.7 2.3 1.4 1.3 2.0 3.5 4.6 16.9 1.6 2.6 3.1 1.8 5.1 4.9 2.8 2.7 1.2 1.8 0.5 2.6 1.9 2.2 5.9ko 0.9 1.7 1.3 2.0 1.7 1.7 8.7 1.3 4.7 4.4 3.4 1.5 0.9 0.7 1.5 3.1 3.2 9.2 1.2 1.3 1.9 1.4 4.3 4.6 2.0 2.1 1.0 1.5 0.4 1.5 1.4 1.5 4.3mk 2.4 18.2 12.0 5.4 7.2 4.3 10.3 23.4 8.9 15.2 11.5 10.0 5.4 4.0 12.6 3.7 8.9 11.1 3.5 4.7 7.5 6.7 4.5 13.9 15.6 8.3 7.5 3.5 7.1 3.7 5.6 2.0 6.4 10.9 6.3 4.3nb 2.7 8.5 5.4 32.7 9.8 8.9 35.1 9.5 17.0 14.6 8.0 3.1 3.9 4.5 9.9 15.0 5.5 5.7 6.7 4.7 14.2 9.5 7.5 3.2 3.9 0.6 26.3 1.9 6.2 7.8nl 2.4 8.2 2.9 5.9 14.2 16.1 8.4 26.5 13.4 16.8 16.7 13.5 7.3 3.9 4.8 5.3 11.4 13.3 5.2 5.9 5.9 5.3 13.8 15.4 7.8 7.6 4.1 5.1 1.6 11.1 3.2 6.1 10.6 8.0 5.1pl 1.8 7.4 2.9 8.2 6.6 6.5 5.4 15.1 7.5 11.4 11.3 8.6 5.2 2.3 4.8 4.0 7.5 8.6 4.2 5.7 5.2 4.1 6.2 9.6 9.9 6.2 9.5 6.2 5.3 1.5 5.6 2.4 8.9 7.7 6.3 3.6pt 7.4 15.2 7.0 8.0 15.5 13.2 18.7 35.0 12.0 32.4 26.7 19.0 23.0 8.8 10.3 8.4 17.5 24.4 7.8 10.6 13.1 14.3 14.9 8.9 15.3 11.7 6.6 8.1 2.0 14.3 6.2 9.3 18.0 12.9 5.8pt-br 6.5 14.7 7.4 8.6 16.8 12.9 17.6 37.3 16.0 31.0 26.6 20.3 23.0 8.7 9.8 8.1 18.6 24.8 7.8 10.7 12.5 14.8 8.5 15.1 11.8 6.4 8.9 2.8 14.6 5.3 10.8 18.8 13.2 6.7ro 3.2 9.7 3.7 5.1 9.4 7.5 10.4 25.0 6.7 18.8 19.3 14.6 10.0 4.0 5.7 5.9 11.0 15.5 4.3 6.4 7.3 8.0 8.1 5.2 15.4 17.7 8.0 3.6 5.0 1.9 7.0 3.3 6.6 12.7 7.8 4.9ru 3.3 12.6 4.2 7.7 8.5 8.8 8.3 18.7 9.9 14.3 14.5 11.0 6.0 4.9 6.8 5.6 9.5 11.7 6.1 7.7 7.4 8.0 8.1 8.9 12.4 13.9 8.2 5.8 5.4 2.7 8.2 2.9 22.5 11.5 9.1 5.2sk 0.7 5.1 2.7 27.0 4.3 5.7 3.2 16.9 9.3 9.4 8.5 6.7 2.7 1.0 5.1 3.7 4.9 6.9 2.2 3.9 3.5 4.9 5.0 7.1 7.8 8.5 3.5 6.6 5.0 1.5 4.3 1.6 5.4 5.4 2.3 2.5sl 1.2 6.2 7.6 5.5 4.7 7.6 5.8 17.3 5.9 11.4 8.5 6.4 3.2 1.2 11.2 3.9 6.5 7.8 2.7 4.2 6.3 4.3 5.5 4.8 9.9 4.8 5.9 3.6 1.9 4.1 2.2 4.3 7.4 3.8 2.6sr 1.8 7.6 33.2 5.1 3.7 3.8 5.8 22.8 3.2 11.9 9.1 7.5 3.5 1.1 30.4 2.7 6.1 8.2 1.2 2.9 13.7 3.3 3.3 3.9 10.8 11.3 5.6 7.2 2.8 9.5 3.0 1.1 5.7 6.8 4.3 3.1sv 2.2 7.4 4.8 5.9 26.5 12.6 8.1 31.8 11.0 16.9 15.7 10.7 7.1 3.3 5.5 5.4 11.6 13.3 4.8 6.2 5.2 25.4 11.5 5.8 15.0 17.4 7.9 8.1 3.8 4.7 1.0 3.2 6.9 12.9 7.8 4.8tr 2.2 3.5 2.0 2.6 3.9 4.1 4.7 15.9 2.9 9.4 7.7 6.7 3.6 1.6 2.1 3.4 6.7 6.4 4.3 7.0 3.5 3.1 4.2 2.5 9.0 8.4 4.6 4.0 1.8 2.3 0.8 3.5 3.3 8.2 6.7 4.4uk 2.9 12.3 5.3 7.4 7.5 7.5 8.4 20.7 6.5 14.2 14.1 11.2 5.5 3.5 6.6 4.7 9.5 11.2 4.9 5.8 7.2 6.3 6.9 9.6 12.9 7.2 23.5 4.9 5.7 2.6 6.9 2.6 11.4 7.9 4.9vi 4.2 7.5 4.0 4.7 8.5 6.0 8.8 20.2 7.3 13.7 13.2 9.9 6.5 4.6 4.9 4.7 14.7 10.7 5.6 9.3 6.9 5.7 7.3 4.5 13.0 14.1 8.5 7.2 3.4 4.6 1.7 8.2 4.0 6.7 9.9 6.7zh-cn 2.1 3.2 1.0 2.2 3.8 3.2 4.5 11.8 3.8 8.2 7.6 3.2 1.7 1.9 3.0 6.6 6.0 3.4 3.8 2.2 7.1 7.9 4.1 4.1 1.6 2.4 0.9 3.1 2.3 3.0 10.8zh-tw 2.2 3.1 1.1 2.1 3.7 2.8 3.9 10.7 3.4 7.5 7.2 6.1 2.8 1.8 1.6 3.0 6.2 5.4 2.8 3.5 2.3 6.3 6.9 3.5 3.9 1.4 2.1 0.9 3.0 2.4 2.9 10.0

Table 4: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitextsmined in Wikipedia only (with at least twenty-five thousand parallel sentences). No other resources were used.

tences could be extracted. Although the same botwas also used to generate articles in the SwedishWikipedia, our alignments seem to be better forthat language.

5.2 Qualitative evaluation

Aiming to perform a large-scale assessment ofthe quality of the extracted parallel sentences, wetrained NMT systems on the extracted parallelsentences. We identified a publicly available dataset which provide test sets for many languagepairs: translations of TED talks as proposed in thecontext of a study on pretrained word embeddingsfor NMT15 (Qi et al., 2018). We would like to em-phasize that we did not use the training data pro-vided by TED – we only trained on the mined sen-tences from Wikipedia. The goal of this study isnot to build state-of-the-art NMT system for forthe TED task, but to get an estimate of the qual-ity of our extracted data, for many language pairs.In particular, there may be a mismatch in the topicand language style between Wikipedia texts andthe transcribed and translated TED talks.

For training NMT systems, we used a trans-former model from fairseq (Ott et al., 2019)

15https://github.com/neulab/word-embeddings-for-nmt

with the parameter settings shown in Figure 3in the appendix. For preprocessing, the textwas tokenized using the Moses tokenizer (withouttrue casing) and a 5000 subword vocabulary waslearnt using SentencePiece (Kudo and Richardson,2018). Decoding was done with beam size 5 andlength normalization 1.2.

We evaluate the trained translation systems onthe TED dataset (Qi et al., 2018). The TED dataconsists of parallel TED talk transcripts in mul-tiple languages, and it provides development andtest sets for 50 languages. Since the developmentand test sets were already tokenized, we first deto-kenize them using Moses. We trained NMT sys-tems for all possible language pairs with more thantwenty-five thousand mined sentences. This givesus in total 1886 language pairs in 45 languages.We train L1 → L2 and L2 → L1 with the samemined bitexts L1/L2. Scores on the test sets werecomputed with SacreBLEU (Post, 2018). Table 4summarizes all the results. Due to space con-straints, we are unable to report BLEU score for alllanguage combinations in that table. Some addi-tional results are reported in Table 6 in the annex.23 NMT systems achieve BLEU scores over 30,the best one being 37.3 for Brazilian Portugueseto English. Several results are worth mentioning,

9

https://github.com/neulab/word-embeddings-for-nmt

https://github.com/neulab/word-embeddings-for-nmt

like Farsi/English: 16.7, Hebrew/English: 25.7,Indonesian/English: 24.9 or English/Hindi: 25.7We also achieve interesting results for translationbetween various non English language pairs forwhich it is usually not easy to find parallel data,e.g. Norwegian ↔ Danish ≈33, Norwegian ↔Swedish ≈25, Indonesian ↔ Vietnamese ≈16 orJapanese / Korean ≈17.

Our results on the TED set give an indication onthe quality of the mined parallel sentences. TheseBLEU scores should be of course appreciated incontext of the sizes of the mined corpora as givenin Table 3. Obviously, we can not exclude thatthe provided data contains some wrong alignmentseven though the margin is large. Finally, we wouldlike to point out that we run our approach on allavailable languages in Wikipedia, independentlyof the quality of LASER’s sentence embeddingsfor each one.

6 Conclusion

We have presented an approach to systematicallymine for parallel sentences in the textual contentof Wikipedia, for all possible language pairs. Weuse a recently proposed mining approach basedon massively multilingual sentence embeddings(Artetxe and Schwenk, 2018b) and a margin cri-terion (Artetxe and Schwenk, 2018a). The sameapproach is used for all language pairs without theneed of a language specific optimization. In total,we make available 135M parallel sentences in 85languages, out of which only 34M sentences arealigned with English. We were able to mine morethan ten thousands sentences for 1620 differentlanguage pairs. This corpus of parallel sentencesis freely available.16 We also performed a largescale evaluation of the quality of the mined sen-tences by training 1886 NMT systems and evalu-ating them on the 45 languages of the TED corpus(Qi et al., 2018).

This work opens several directions for future re-search. The mined texts could be used to first re-train LASER’s multilingual sentence embeddingswith the hope to improve the performance on low-resource languages, and then to rerun mining inWikipedia. This process could be iteratively re-peated. We also plan to apply the same method-ology to other large multilingual collections. Themonolingual texts made available by ParaCrawl or

16https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix

CommonCrawl17 are good candidates.We expect that the WikiMatrix corpus has

mostly well-formed sentences and it should notcontain social media language. The mined paral-lel sentences are not limited to specific topics likemany of the currently available resources (parlia-ment proceedings, subtitles, software documenta-tion, . . .), but are expected to cover many topicsof Wikipedia. The fraction of unedited machinetranslated text is also expected to be low. We hopethat this resource will be useful to support researchin multilinguality, in particular machine transla-tion.

7 Acknowledgments

We would like to thank Edoaurd Grave for helpwith handling the Wikipedia corpus and MatthijsDouze for support with the use of FAISS.

ReferencesSadaf Abdul-Rauf and Holger Schwenk. 2009. On the

Use of Comparable Corpora to Improve SMT per-formance. In EACL, pages 16–23.

Sisay Fissaha Adafre and Maarten de Rijke. 2006.Finding similar sentences across multiple languagesin Wikipedia. In Proceedings of the Workshop onNEW TEXT Wikis and blogs and other dynamic textsources.

Ahmad Aghaebrahimian. 2018. Deep neural networksat the service of multilingual parallel sentence ex-traction. In Coling.

Mikel Artetxe and Holger Schwenk. 2018a. Margin-based Parallel Corpus Mining with MultilingualSentence Embeddings. https://arxiv.org/abs/1811.01136.

Mikel Artetxe and Holger Schwenk. 2018b. Mas-sively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. In https://arxiv.org/abs/1812.10464.

Andoni Azpeitia, Thierry Etchegoyhen, and EvaMartınez Garcia. 2017. Weighted Set-TheoreticAlignment of Comparable Sentences. In BUCC,pages 41–45.

Andoni Azpeitia, Thierry Etchegoyhen, and EvaMartınez Garcia. 2018. Extracting Parallel Sen-tences from Comparable Corpora with STACC Vari-ants. In BUCC.

Houda Bouamor and Hassan Sajjad. 2018.H2@BUCC18: Parallel Sentence Extractionfrom Comparable Corpora Using MultilingualSentence Embeddings. In BUCC.17http://commoncrawl.org/

10



http://www.aclweb.org/anthology/E09-1003



https://arxiv.org/abs/1811.01136




http://aclweb.org/anthology/W17-2508


http://commoncrawl.org/

Christian Buck and Philipp Koehn. 2016. Findings ofthe wmt 2016 bilingual document alignment sharedtask. In Proceedings of the First Conference on Ma-chine Translation, pages 554–563, Berlin, Germany.Association for Computational Linguistics.

Vishrav Chaudhary, Yuqing Tang, Francisco Guzman,Holger Schwenk, and Philipp Koehn. 2019. Low-resource corpus filtering using multilingual sentenceembeddings. In Proceedings of the Fourth Confer-ence on Machine Translation (WMT).

Cristina Espana-Bonet, Adam Csaba Varga, AlbertoBarron-Cedeno, and Josef van Genabith. 2017. AnEmpirical Analysis of NMT-Derived InterlingualEmbeddings and their Use in Parallel Sentence Iden-tification. IEEE Journal of Selected Topics in SignalProcessing, pages 1340–1348.

Thierry Etchegoyhen and Andoni Azpeitia. 2016. Set-Theoretic Alignment for Comparable Corpora. InACL, pages 2009–2018.

Simon Gottschalk and Elena Demidova. 2017. Mul-tiwiki: Interlingual text passage alignment inWikipedia. ACM Transactions on the Web (TWEB),11(1):6.

Francis Gregoire and Philippe Langlais. 2017. BUCC2017 Shared Task: a First Attempt Toward a DeepLearning Framework for Identifying Parallel Sen-tences in Comparable Corpora. In BUCC, pages 46–50.

Mandy Guo, Qinlan Shen, Yinfei Yang, HemingGe, Daniel Cer, Gustavo Hernandez Abrego, KeithStevens, Noah Constant, Yun-Hsuan Sung, BrianStrope, and Ray Kurzweil. 2018. Effective Paral-lel Corpus Mining using Bilingual Sentence Embed-dings. arXiv:1807.11906.

Hany Hassan, Anthony Aue, Chang Chen, VishalChowdhary, Jonathan Clark, Christian Feder-mann, Xuedong Huang, Marcin Junczys-Dowmunt,William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu,Renqian Luo, Arul Menezes, Tao Qin, Frank Seide,Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu,Yingce Xia, Dongdong Zhang, Zhirui Zhang, andMing Zhou. 2018. Achieving Human Parity onAutomatic Chinese to English News Translation.arXiv:1803.05567.

Jeff Johnson, Matthijs Douze, and Herve Jegou. 2017.Billion-scale similarity search with GPUs. arXivpreprint arXiv:1702.08734.

Armand Joulin, Edouard Grave, Piotr Bojanowski, andTomas Mikolov. 2016. Bag of tricks for efficienttext classification. https://arxiv.org/abs/1607.01759.

H. Jegou, M. Douze, and C. Schmid. 2011. Prod-uct quantization for nearest neighbor search. IEEETrans. PAMI, 33(1):117–128.

Philipp Koehn. 2005. Europarl: A parallel corpus forstatistical machine translation. In MT summit.

Philipp Koehn, Francisco Guzman, Vishrav Chaud-hary, and Juan M. Pino. 2019. Findings of thewmt 2019 shared task on parallel corpus filteringfor low-resource conditions. In Proceedings of theFourth Conference on Machine Translation, Volume2: Shared Task Papers, Florence, Italy. Associationfor Computational Linguistics.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield,and Mikel L. Forcada. 2018. Findings of the wmt2018 shared task on parallel corpus filtering. In Pro-ceedings of the Third Conference on Machine Trans-lation: Shared Task Papers, pages 726–739, Bel-gium, Brussels. Association for Computational Lin-guistics.

Taku Kudo and John Richardson. 2018. Sentencepiece:A simple and language independent subword tok-enizer and detokenizer for neural text processing. InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing: SystemDemonstrations, pages 66–71, Brussels, Belgium.Association for Computational Linguistics.

P. Lison and J. Tiedemann. 2016. Opensubtitles2016:Extracting large parallel corpora from movie and tvsubtitles. In LREC.

Mehdi Zadeh Mohammadi and NasserGhasemAghaee. 2010. Building bilingual par-allel corpora based on Wikipedia. In 2010 SecondInternational Conference on Computer Engineeringand Applications, pages 264–268.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Im-proving Machine Translation Performance by Ex-ploiting Non-Parallel Corpora. Computational Lin-guistics, 31(4):477–504.

P Otero, I Lopez, S Cilenis, and Santiago de Com-postela. 2011. Measuring comparability of multi-lingual corpora extracted from Wikipedia. IberianCross-Language Natural Language ProcessingsTasks (ICL), page 8.

Pablo Gamallo Otero and Isaac Gonzalez Lopez. 2010.Wikipedia as multilingual source of comparable cor-pora. In Proceedings of the 3rd Workshop on Build-ing and Using Comparable Corpora, LREC, pages21–25.

Myle Ott, Sergey Edunov, Alexei Baevski, AngelaFan, Sam Gross, Nathan Ng, David Grangier, andMichael Auli. 2019. fairseq: A fast, extensibletoolkit for sequence modeling. In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational Linguistics(Demonstrations), pages 48–53, Minneapolis, Min-nesota. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identi-fying parallel documents from a large bilingual col-lection of texts: Application to parallel article ex-traction in Wikipedia. In Proceedings of the 4th

11

http://www.aclweb.org/anthology/W/W16/W16-2347



https://doi.org/10.18653/v1/P16-1189

https://doi.org/10.18653/v1/P16-1189







https://www.aclweb.org/anthology/W18-6453


https://www.aclweb.org/anthology/D18-2012



http://www.aclweb.org/anthology/J05-4003



https://www.aclweb.org/anthology/N19-4009

https://www.aclweb.org/anthology/N19-4009

Workshop on Building and Using Comparable Cor-pora: Comparable Corpora and the Web, pages 87–95. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting bleuscores. In Proceedings of the Third Conference onMachine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computa-tional Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Pad-manabhan, and Graham Neubig. 2018. When andwhy are pre-trained word embeddings useful forneural machine translation? In Proceedings of the2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Hu-man Language Technologies, Volume 2 (Short Pa-pers), pages 529–535, New Orleans, Louisiana. As-sociation for Computational Linguistics.

Philip Resnik. 1999. Mining the Web for BilingualText. In ACL.

Philip Resnik and Noah A. Smith. 2003. The Webas a Parallel Corpus. Computational Linguistics,29(3):349–380.

Holger Schwenk. 2018. Filtering and mining paralleldata in a joint multilingual space. In ACL, pages228–234.

Jason R. Smith, Chris Quirk, and Kristina Toutanova.2010. Extracting parallel sentences from compa-rable corpora using document level alignment. InNAACL, pages 403–411.

J. Tiedemann. 2012. Parallel data, tools and interfacesin OPUS. In LREC.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wik-ification using multilingual embeddings. In Pro-ceedings of the 2016 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages589–598.

Dan Tufis, Radu Ion, S, tefan Daniel, Dumitrescu, andDan S, tefanescu. 2013. Wikipedia as an smt trainingcorpus. In RANLP, pages 702–709.

Masao Utiyama and Hitoshi Isahara. 2003. ReliableMeasures for Aligning Japanese-English News Arti-cles and Sentences. In ACL.

Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan,Mandy Guo, Qinlan Shen, Daniel Cer, Yun-HsuanSung, Brian Strope, and Ray Kurzweil. 2019. Im-proving multilingual sentence embedding using bi-directional dual encoder with additive margin soft-max. In https://arxiv.org/abs/1902.08564.

Michał Ziemski, Marcin Junczys-Dowmunt, and BrunoPouliquen. 2016. The United Nations Parallel Cor-pus v1.0. In LREC.

12



http://www.aclweb.org/anthology/N18-2084



http://www.aclweb.org/anthology/P99-1068









A Appendix

Table 5 provides the amounts of mined parallelsentences for languages which have a rather smallWikipedia. Aligning those languages obviouslyyields to a very small amount of parallel sentences.Therefore, we only provide these results for align-ment with high resource languages. It is also likelythat several of these alignments are of low qual-ity since the LASER embeddings were not directlytrained on most these languages, but we still hopeto achieve reasonable results since other languagesof the same family may be covered.

ISO Name LanguageFamily

size ca da de en es fr it nl pl pt sv ru zh total

an Aragonese Romance 222 24 7 12 23 33 16 13 9 10 14 9 11 6 324arz Egyptian

ArabicArabic 120 7 6 11 18 12 12 10 8 9 10 8 12 7 278

as Assamese Indo-Aryan 124 8 6 11 7 11 12 10 9 9 8 8 9 3 216azb South Azer-

baijaniTurkic 398 6 4 9 8 9 10 9 7 6 8 6 7 3 172

bar Bavarian Germanic 214 7 6 41 16 12 12 10 8 9 10 8 10 5 261bpy Bishnupriya Indo-Aryan 128 2 1 4 4 3 4 2 2 3 2 2 3 1 71br Breton Celtic 413 20 16 22 23 22 19 16 6 200ce Chechen Northeast

Caucasian315 2 1 2 2 2 2 2 2 2 2 2 2 1 56

ceb Cebuano Malayo-Polynesian

17919 14 9 22 29 27 24 24 15 17 20 55 21 9 594

ckb Central Kur-dish

Iranian 127 2 2 6 8 5 5 4 4 4 4 3 6 4 113

cv Chuvash Turkic 198 4 3 5 4 6 6 7 5 4 6 5 8 2 129dv Maldivian Indo-Aryan 52 2 2 5 6 4 4 3 3 3 3 3 5 3 96fo Faroese Germanic 114 13 12 14 32 21 18 15 11 11 17 12 13 6 335fy Western

FrisianGermanic 493 13 8 16 32 21 18 17 38 12 18 13 14 5 453

gd Gaelic Celtic 66 1 1 1 1 1 1 1 1 1 1 1 1 1 41ga Irish Irish 216 2 3 4 3 3 3 2 2 3 2 3 1 70gom Goan

KonkamiIndo-Aryan 69 9 7 10 8 13 13 13 9 9 11 9 10 4 240

ht Haitian Cre-ole

Creole 60 2 1 3 4 3 4 3 2 3 2 2 3 1 72

ilo Iloko Philippine 63 3 2 4 5 4 4 4 3 3 4 3 4 2 96io Ido constructed 153 5 3 6 11 7 7 5 5 5 6 5 5 3 143jv Javanese Malayo-

Polynesian220 8 5 8 13 12 10 11 8 7 11 8 8 3 219

ka Georgian Kartvelian 480 11 7 15 12 16 17 16 12 11 14 12 13 5 288ku Kurdish Iranian 165 5 4 8 5 8 7 8 7 6 7 6 6 3 222la Latin Romance 558 12 9 17 32 20 18 17 12 13 18 13 14 6 478lb LuxembourgishGermanic 372 12 7 26 22 19 18 15 11 11 16 12 11 4 305lmo Lombard Romance 147 6 3 7 10 7 7 11 6 5 7 5 5 3 144mg Malagasy Malayo-

Polynesian263 6 5 9 13 9 12 8 7 7 7 8 7 4 199

mhr EasternMari

Uralic 61 3 2 4 3 4 4 5 3 3 4 3 4 2 96

min MinangkabauMalayo-Polynesian

255 4 2 6 7 5 5 5 4 4 4 5 5 2 121

mn Mongolian Mongolic 255 4 3 7 5 6 6 7 6 5 5 5 5 3 197mwl Mirandese Romance 64 6 3 4 10 8 6 5 3 4 34 3 4 2 154nds nl Low Ger-

man/SaxonGermanic 65 5 4 6 10 7 7 6 15 5 6 5 5 3 151

ps Pashto Iranian 89 2 3 2 3 3 3 3 3 3 3 3 1 73rm Romansh Italic 57 2 2 10 5 4 4 3 2 3 3 3 3 1 86sah Yakut Turkic/Sib 134 4 3 7 5 6 6 6 5 5 5 5 6 3 134scn Sicilian Romance 81 5 3 6 9 7 7 11 5 5 6 5 5 2 143sd Sindhi Iranian 115 3 9 8 8 7 7 6 7 5 8 5 152su Sundanese Malayo-

Polynesian120 4 3 5 7 6 5 6 4 4 5 4 4 2 117

tk Turkmen Turkic 56 2 2 3 3 4 3 4 2 2 4 2 3 1 76tg Tajik Iranian 248 5 4 11 15 9 9 8 8 7 8 6 10 6 192ug Uighur Turkic 83 4 3 9 10 7 8 6 6 5 6 5 9 6 168ur Urdu Indo-Aryan 150 2 2 3 5 3 3 3 3 3 3 3 3 2 123wa Walloon Romance 56 3 2 4 5 5 4 4 3 3 4 3 3 2 93wuu Wu Chinese Chinese 75 8 6 11 17 12 11 10 8 9 11 9 10 43 283yi Yiddish Germanic 131 3 2 4 3 4 4 5 3 3 4 3 4 1 92

Table 5: WikiMatrix (part 2): number of extracted sen-tences (in thousands) for languages with a rather smallWikipedia. Alignments with other languages yield lessthan 5k sentences and are omitted for clairty.

Table 3 gives the detailed configuration whichwas used to train NMT models on the mined datain Section 5.

--arch transformer--share-all-embeddings--encoder-layers 5--decoder-layers 5--encoder-embed-dim 512--decoder-embed-dim 512--encoder-ffn-embed-dim 2048--decoder-ffn-embed-dim 2048--encoder-attention-heads 2--decoder-attention-heads 2--encoder-normalize-before--decoder-normalize-before--dropout 0.4--attention-dropout 0.2--relu-dropout 0.2--weight-decay 0.0001--label-smoothing 0.2--criterion label smoothed cross entropy--optimizer adam--adam-betas ’(0.9, 0.98)’--clip-norm 0--lr-scheduler inverse sqrt--warmup-update 4000--warmup-init-lr 1e-7--lr 1e-3 --min-lr 1e-9--max-tokens 4000--update-freq 4--max-epoch 100--save-interval 10

Figure 3: Model settings for NMT training withfairseq

Finally, Table 6 gives the BLEU scores on theTED corpus when translating into and from En-glish for some additional languages.

Lang xx→ en en→ xx

et 15.9 14.3eu 10.1 7.6fa 16.7 8.8fi 10.9 10.9lt 13.7 10.0hi 17.8 21.9mr 2.6 3.5

Table 6: BLEU scores on the TED test set as proposedin (Qi et al., 2018). NMT systems were trained on bi-texts mined in Wikipedia only. No other resources wereused.

13

arXiv:1907.05791v2 [cs.CL] 16 Jul 2019 · WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia Holger Schwenk Facebook AI [email protected] Vishrav Chaudhary

Documents