
Language Identification Related Work Open Issues Task Description Dataset Baseline Results Conclusion References

ALTW 2010 Shared Task:

Multilingual Language Identification

Marco Lui & Tim Baldwin

NICTA VRL

Department of Computer Science and Software Engineering

University of Melbourne, VIC 3010, Australia


University of Melbourne

10 December 2010



What is Language Identification?

Source(s): Wikipedia


Can you LangID?

Source(s): Wikipedia


Basic Assumptions

Monolingual

Homogeneous

Closed World

Narrow Scope



Cavnar & Trenkle - Dataset

• 3478 samples from the soc.culture newsgroup hierarchy

• 8 languages:

Language    Samples
English     1208
Spanish     697
German      481
Italian     316
French      273
Dutch       235
Portuguese  151
Polish      117

Reference(s): Cavnar and Trenkle, 1994


Cavnar & Trenkle - Techniques

Data Representation

• keep only letters, apostrophes, whitespace

• union over byte-level N-grams (N = 1 . . . 5)

Examples

language identification

1-gram l, a, n, g, u . . .

2-gram la, an, ng, gu, ua, ag . . .

3-gram lan, ang, ngu, gua, uag, age . . .

4-gram lang, angu, ngua, guag, uage, age_ . . .

5-gram langu, angua, nguag, guage, uage_, age_i . . .

(_ marks a space)
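The data representation above can be sketched in a few lines of Python. This is only an illustration, not the original implementation: the function name `byte_ngram_counts` and the choice of UTF-8 as the byte encoding are assumptions.

```python
from collections import Counter

def byte_ngram_counts(text, max_n=5):
    """Count byte-level n-grams for n = 1..max_n (hypothetical helper)."""
    # Keep only letters, apostrophes and whitespace, as listed on the slide.
    cleaned = "".join(
        ch for ch in text if ch.isalpha() or ch == "'" or ch.isspace()
    )
    # Assumption: the text is encoded as UTF-8 before taking byte n-grams.
    data = cleaned.encode("utf-8")
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

counts = byte_ngram_counts("language identification")
```

Here `counts` holds the union of all 1-grams to 5-grams in one table, so `b"lang"` and `b"age i"` sit next to single bytes such as `b"a"`.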



Cavnar & Trenkle - Techniques

Feature Selection

• N-Gram Frequency Profile

• Top X (X = 100 . . . 400)

Examples

X = 3

from a:20 b:15 c:10 ab:12 ac:8 . . .

select a, b, ab
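The X = 3 example can be reproduced with a short sketch. The helper name `top_x` and the alphabetical tie-break are assumptions made for illustration; the slides do not say how ties are resolved.

```python
from collections import Counter

def top_x(profile, x):
    """Return the x most frequent n-grams, most frequent first."""
    # Sort by descending count; ties are broken alphabetically (assumption)
    # so that the result is deterministic.
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:x]]

profile = Counter({"a": 20, "b": 15, "c": 10, "ab": 12, "ac": 8})
selected = top_x(profile, 3)
```

With the counts from the slide, `selected` is `["a", "b", "ab"]`.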



Cavnar & Trenkle - Techniques

Classification Algorithm

• nearest prototype

• 1 prototype per language

• sum of term frequencies across all instances

• out-of-place distance metric

Examples

doc1 a:10 b:15 c:2

doc2 a:2 b:3 c:1

doc3 a:25 b:20 c:15

prototype a:37 b:38 c:18
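A minimal sketch of how one language prototype is built by summing term frequencies over its documents. The function name `build_prototype` is hypothetical.

```python
from collections import Counter

def build_prototype(docs):
    """Sum n-gram counts across all documents of one language."""
    total = Counter()
    for doc in docs:
        total.update(doc)  # Counter.update adds counts element-wise
    return total

docs = [
    Counter(a=10, b=15, c=2),   # doc1
    Counter(a=2, b=3, c=1),     # doc2
    Counter(a=25, b=20, c=15),  # doc3
]
prototype = build_prototype(docs)
```

The result matches the slide: a:37, b:38, c:18.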



Cavnar & Trenkle - Techniques

Out-of-Place distance metric
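The out-of-place metric compares the rank of each n-gram in the document profile with its rank in a language profile and sums the differences. The sketch below is one common reading of it, not the original code: the function names, the language codes `xx` and `yy`, the alphabetical tie-break, and the use of a fixed maximum penalty for n-grams missing from the language profile are all assumptions.

```python
def rank_profile(counts, top_x):
    """Map each of the top_x most frequent n-grams to its rank (0 = most frequent)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_x])}

def out_of_place(doc_ranks, lang_ranks, max_penalty):
    """Sum of rank differences; unseen n-grams cost max_penalty (assumption)."""
    distance = 0
    for gram, doc_rank in doc_ranks.items():
        if gram in lang_ranks:
            distance += abs(doc_rank - lang_ranks[gram])
        else:
            distance += max_penalty
    return distance

def classify(doc_counts, prototypes, top_x=300):
    """Return the language whose prototype is nearest to the document."""
    doc_ranks = rank_profile(doc_counts, top_x)
    return min(
        prototypes,
        key=lambda lang: out_of_place(
            doc_ranks, rank_profile(prototypes[lang], top_x), top_x
        ),
    )

# Hypothetical prototypes for two made-up languages.
prototypes = {
    "xx": {"a": 37, "b": 38, "c": 18},
    "yy": {"c": 40, "d": 30, "a": 5},
}
doc = {"a": 3, "b": 4, "c": 1}
predicted = classify(doc, prototypes)
```

The document ranks its n-grams in the same order as prototype `xx`, so its distance to `xx` is zero and `predicted` is `"xx"`.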



Cavnar & Trenkle - Results

• 98.6% accuracy for articles ≤300 bytes

• 99.8% accuracy for articles > 300 bytes

• A solved problem?



Baldwin & Lui - Task Description

Corpus     Docs  Langs  Encs  Document Length (bytes)
EuroGOV    1500  10     1     17460.5 ± 39353.4
TCL        3174  60     12    2623.2 ± 3751.9
Wikipedia  4963  67     1     1480.8 ± 4063.9

Reference(s): Baldwin and Lui, 2010


Baldwin & Lui - Method

• 10-fold cross-validation on each dataset

• 42 distinct classifiers

model (×7):
  nearest-neighbour (Cos1NN, Skew1NN, OOP1NN)
  nearest-prototype (CosAM, SkewAM)
  Naive Bayes
  SVM

tokenisation (×2): byte, codepoint

n-gram (×3): 1-gram, 2-gram, 3-gram
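The count of 42 classifiers is the product 7 × 2 × 3 of the three design choices. A small sketch that enumerates the grid, with the identifier spellings `NaiveBayes` and `SVM` chosen here for illustration:

```python
from itertools import product

models = ["Cos1NN", "Skew1NN", "OOP1NN", "CosAM", "SkewAM", "NaiveBayes", "SVM"]
tokenisations = ["byte", "codepoint"]
ngram_orders = [1, 2, 3]

# Every combination of model, tokenisation and n-gram order.
configurations = list(product(models, tokenisations, ngram_orders))
```

`len(configurations)` is 42, one entry per distinct classifier.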



Baldwin & Lui - Techniques

Skew Divergence

\[ D(x \,\|\, y) = \sum_i x_i \left( \log_2 x_i - \log_2 y_i \right) \]

\[ \mathrm{skew}_\alpha(x, y) = D\left( x \,\|\, \alpha y + (1 - \alpha) x \right) \]

• variant of Kullback-Leibler divergence

• linear interpolation between x and y with smoothing factor α

• α typically 0.99

Reference(s): Lee, 2001
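The two formulas translate directly into code. The sketch below assumes x and y are probability distributions given as equal-length lists; the function names are illustrative. Terms with x_i = 0 are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def kl_divergence(x, y):
    """D(x || y) in bits; terms with x_i = 0 contribute nothing."""
    return sum(
        xi * (math.log2(xi) - math.log2(yi))
        for xi, yi in zip(x, y)
        if xi > 0
    )

def skew_divergence(x, y, alpha=0.99):
    """D(x || alpha*y + (1 - alpha)*x), which stays finite even where y_i = 0."""
    mixed = [alpha * yi + (1 - alpha) * xi for xi, yi in zip(x, y)]
    return kl_divergence(x, mixed)

x = [0.5, 0.5, 0.0]
y = [0.5, 0.0, 0.5]
value = skew_divergence(x, y)
```

Plain Kullback-Leibler divergence is undefined for this pair because y assigns zero probability to an event that x considers possible. Mixing a small share of x into y keeps every needed logarithm finite, which is the smoothing role of α.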


Baldwin & Lui - Results

Tokenization: Choice of n-gram order (Wikipedia)



Baldwin & Lui - Results

Tokenization: Bytes vs Codepoints (2-gram)



Baldwin & Lui - Results

Performance vs Time Taken



Baldwin & Lui - Results

The Long Tail

• Wikipedia

• byte bigram

• Skew Divergence

• Nearest Prototype

Language   N     P      R      F
Tamil      6     1.000  1.000  1.000
Japanese   219   0.990  0.992  0.955
English    1629  0.972  0.899  0.934
. . .
Italian    202   0.735  0.906  0.812
Danish     37    0.710  0.595  0.647
Icelandic  10    0.188  0.300  0.231



Baldwin & Lui - Results

Confusion Pairs

• Wikipedia

• byte bigram

• Skew Divergence

• Nearest Prototype

From        To          Proportion
Indonesian  Malay       0.405
Malay       Indonesian  0.214
Danish      Norwegian   0.270
Norwegian   Danish      0.043
Russian     Ukrainian   0.090
Ukrainian   Russian     0.043



Open Issues

Supporting Minority Languages

Open Class Language Identification

Sparse or Impoverished Training Data

Multilingual Documents

Standard Evaluation Corpora

Performance Evaluation Criteria

Reference(s): Hughes et al., 2006


ALTW 2010 Shared Task

• multiclass text categorization task

• select 2 languages from a closed set of 74

• addresses a number of open issues:
  • Sparse or Impoverished Training Data
  • Multilingual Documents
  • Standard Evaluation Corpora
  • Performance Evaluation Criteria


ALTW 2010 Shared Task Dataset

• 10000 synthetic bilingual documents in 74 languages

• randomly partitioned into
  • 8000 training documents
  • 1000 development documents
  • 1000 test documents

• compiled from static dumps of language-specific Wikipedias

• downloaded between 9 June and 1 August 2008

• selected languages with > 1000 articles



Generating a synthetic bilingual document

• semantic linkage

• language-links: [[<language-prefix>:<page title>]]

1. select primary document

2. select secondary document via language-link

3. normalize: remove redirects, language-links and templates

4. chunk: split on two consecutive paragraphs

5. retain top 50% of paragraphs from primary, bottom 50% from secondary
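Steps 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the dataset: paragraphs are taken to be separated by blank lines, odd paragraph counts are rounded up, and the function names are hypothetical. Normalisation (step 3) is assumed to have been applied already.

```python
def split_paragraphs(text):
    """Chunk a document into paragraphs (assumption: blank-line separated)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def make_bilingual(primary, secondary):
    """Join the top half of the primary with the bottom half of the secondary."""
    p = split_paragraphs(primary)
    s = split_paragraphs(secondary)
    top = p[: (len(p) + 1) // 2]   # top 50%, rounded up (assumption)
    bottom = s[len(s) // 2:]       # bottom 50%, rounded up (assumption)
    return "\n\n".join(top + bottom)

primary = "A1\n\nA2\n\nA3\n\nA4"
secondary = "B1\n\nB2\n\nB3\n\nB4"
document = make_bilingual(primary, secondary)
```

For two four-paragraph inputs, `document` contains A1 and A2 followed by B3 and B4.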


Evaluation Metrics

Multi-class Text Categorization

• IR-style performance metrics:

\[ \text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} \qquad \text{f-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

• macroaveraging vs microaveraging

• competition metric: micro-averaged f-score

Reference(s): Sebastiani, 2002
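The difference between the two averaging schemes is easiest to see in code. The sketch below treats each document as a set of language labels. The function names, the toy data, and the choice to compute macro-averaged f-score as the mean of per-class f-scores over the classes seen in gold or predicted labels are assumptions.

```python
def prf(tp, fp, fn):
    """Precision, recall and f-score from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def count_per_class(gold, predicted):
    """Per-language (TP, FP, FN) counts over parallel lists of label sets."""
    counts = {}
    for g, p in zip(gold, predicted):
        for lang in g | p:
            tp, fp, fn = counts.get(lang, (0, 0, 0))
            if lang in g and lang in p:
                tp += 1
            elif lang in p:
                fp += 1
            else:
                fn += 1
            counts[lang] = (tp, fp, fn)
    return counts

def micro_average(gold, predicted):
    """Pool the counts of all classes, then compute the metrics once."""
    counts = count_per_class(gold, predicted).values()
    return prf(*(sum(c[i] for c in counts) for i in range(3)))

def macro_average(gold, predicted):
    """Compute the metrics per class, then take the unweighted mean."""
    scores = [prf(*c) for c in count_per_class(gold, predicted).values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

gold = [{"en", "de"}, {"en", "fr"}, {"ja", "it"}]
predicted = [{"en"}, {"en"}, {"en"}]
micro = micro_average(gold, predicted)
macro = macro_average(gold, predicted)
```

In this toy example, always predicting `en` gives a micro-averaged precision of 2/3 but a recall of only 1/3, because each document has two gold labels and one prediction. The macro averages are much lower, since four of the five classes are never predicted.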



Majority-class baseline

• most common classes
  • en (3330), de (747), fr (747), ja (442)

• most common pairs
  • en-de (1283), en-fr (1053), en-ja (606), en-it (479)

Baseline  PM    RM    FM    Pµ    Rµ    Fµ
en        .011  .015  .012  .701  .350  .467
en+de     .014  .030  .018  .458  .458  .458
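A majority-class baseline needs only label counts from the training data. The sketch below assumes each training document is labelled with a pair of language codes; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baselines(train_labels):
    """Return the most common single class and the most common pair."""
    class_counts = Counter(lang for pair in train_labels for lang in pair)
    # frozenset makes the pair order-insensitive: en-de equals de-en.
    pair_counts = Counter(frozenset(pair) for pair in train_labels)
    top_class = class_counts.most_common(1)[0][0]
    top_pair = pair_counts.most_common(1)[0][0]
    return {top_class}, set(top_pair)

train_labels = [("en", "de"), ("de", "en"), ("en", "fr"), ("ja", "it")]
single, pair = majority_baselines(train_labels)
```

The baseline then predicts `single` (or `pair`) for every test document, regardless of its content.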


Nearest-prototype benchmark

Skew Divergence with Arithmetic Mean Language Prototypes

N-Gram  Multiclass  PM    RM    FM    Pµ    Rµ    Fµ
1       single      .440  .274  .295  .264  .132  .176
2       single      .540  .376  .413  .583  .291  .389
3       single      .564  .412  .453  .814  .407  .543
1       stratified  .412  .458  .414  .629  .622  .625
2       stratified  .460  .448  .435  .775  .768  .771
3       stratified  .497  .467  .464  .833  .826  .829
1       binarised   .115  .786  .155  .057  .878  .107
2       binarised   .171  .705  .221  .114  .885  .202
3       binarised   .227  .686  .292  .259  .903  .402


Source(s): Google Translate


Questions?



References

Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA, 2010.

William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA, 1994.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy, 2006.

Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of Artificial Intelligence and Statistics 2001 (AISTATS 2001), pages 65–72, Key West, USA, 2001.

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.