Language Identification Related Work Open Issues Task Description Dataset Baseline Results Conclusion References
ALTW 2010 Shared Task: Multilingual Language Identification
Marco Lui & Tim Baldwin
NICTA VRL
Department of Computer Science and Software Engineering
University of Melbourne, VIC 3010, Australia
University of Melbourne, 10 December 2010
Baldwin & Lui - Results
Confusion Pairs
• Wikipedia
• byte bigram
• Skew Divergence
• Nearest Prototype
From        To          Proportion
Indonesian  Malay       0.405
Malay       Indonesian  0.214
Danish      Norwegian   0.270
Norwegian   Danish      0.043
Russian     Ukrainian   0.090
Ukrainian   Russian     0.043

Reference(s): Baldwin and Lui, 2010
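
The configuration listed on this slide (byte bigrams, skew divergence, nearest prototype) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the smoothing value alpha = 0.99, the choice of which argument is the document, and the toy training texts in the usage note are all assumptions. The skew divergence is written in the form s_alpha(q, r) = KL(r || alpha*q + (1 - alpha)*r), following Lee (2001).

```python
import math
from collections import Counter


def byte_bigram_counts(text):
    """Count overlapping byte bigrams in the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return Counter(zip(data, data[1:]))


def normalise(counts):
    """Turn raw counts into a probability distribution (empty if no counts)."""
    total = sum(counts.values())
    return {bigram: c / total for bigram, c in counts.items()}


def skew_divergence(q, r, alpha=0.99):
    """s_alpha(q, r) = KL(r || alpha*q + (1 - alpha)*r).

    The mixture is non-zero wherever r is non-zero, so the sum is finite
    even when q has never seen a bigram that occurs in r.
    alpha = 0.99 is an assumed value, not taken from the slides.
    """
    total = 0.0
    for bigram, r_p in r.items():
        mix = alpha * q.get(bigram, 0.0) + (1.0 - alpha) * r_p
        total += r_p * math.log(r_p / mix)
    return total


def build_prototypes(training):
    """One prototype per language: pooled bigram counts over its documents.

    training is a hypothetical dict mapping language -> list of texts.
    """
    prototypes = {}
    for lang, texts in training.items():
        pooled = Counter()
        for text in texts:
            pooled.update(byte_bigram_counts(text))
        prototypes[lang] = normalise(pooled)
    return prototypes


def classify(text, prototypes, alpha=0.99):
    """Return the language whose prototype is nearest to the document."""
    doc = normalise(byte_bigram_counts(text))
    return min(
        prototypes,
        key=lambda lang: skew_divergence(prototypes[lang], doc, alpha),
    )
```

As a usage example, with a toy training set such as `{"en": ["the quick brown fox jumps over the lazy dog"], "de": ["der schnelle braune fuchs springt über den faulen hund"]}`, calling `classify("the lazy dog", build_prototypes(training))` picks the language whose pooled bigram distribution diverges least from the document. Closely related pairs such as those in the table share many byte bigrams, which is one plausible reason they are confused.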
Open Issues
• Supporting Minority Languages
• Open Class Language Identification
• Sparse or Impoverished Training Data
• Multilingual Documents
• Standard Evaluation Corpora
• Performance Evaluation Criteria
Reference(s): Hughes et al., 2006
ALTW 2010 Shared Task
• multiclass text categorization task
• select 2 languages from a closed set of 74
• addresses a number of open issues:
  • Sparse or Impoverished Training Data
  • Multilingual Documents
  • Standard Evaluation Corpora
  • Performance Evaluation Criteria
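
Because each document carries two labels drawn from a closed set of 74, a system's output is naturally a set of two languages, and there are C(74, 2) = 2701 possible unordered pairs. The sketch below shows one way such set-valued predictions could be scored. The slides do not state the official metric at this point, so micro-averaged precision, recall and F-score are an assumption here, as are the function names and the example language codes.

```python
from math import comb

NUM_LANGUAGES = 74  # size of the closed language set stated on the slide


def num_language_pairs(n=NUM_LANGUAGES):
    """Number of unordered pairs of distinct languages."""
    return comb(n, 2)


def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over set-valued labels.

    gold and predicted are parallel lists of sets of language codes,
    one set per document. This metric is an assumed choice.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, with gold labels `[{"en", "de"}, {"id", "ms"}]` and predictions `[{"en", "de"}, {"id", "da"}]`, three of the four predicted labels are correct, so precision, recall and F1 all come out at 0.75.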
ALTW 2010 Shared Task Dataset
• 10000 synthetic bilingual documents in 74 languages
• randomly partitioned into
  • 8000 training documents
  • 1000 development documents
  • 1000 test documents
• compiled from static dumps of language-specific Wikipedias
• downloaded between 9 June and 1 August 2008
• selected languages with > 1000 articles
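
The random 8000 / 1000 / 1000 partition described above can be sketched as a seeded shuffle followed by slicing. This is only an illustration of the split sizes given on the slide: the function name, the use of integer document IDs, and the seed are hypothetical, and nothing here reproduces the actual partition used in the shared task.

```python
import random


def partition(doc_ids, n_train=8000, n_dev=1000, n_test=1000, seed=0):
    """Randomly split document IDs into training, development and test sets.

    The default sizes follow the slide; the seed is an arbitrary assumption
    that only makes the shuffle repeatable.
    """
    ids = list(doc_ids)
    if len(ids) != n_train + n_dev + n_test:
        raise ValueError("split sizes must add up to the number of documents")
    rng = random.Random(seed)
    rng.shuffle(ids)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test
```

Calling `partition(range(10000))` returns three disjoint lists of 8000, 1000 and 1000 IDs that together cover every document exactly once.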
Source(s): Google Translate
Questions?
Reference
Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA, 2010.
William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA, 1994.
Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy, 2006.
Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of Artificial Intelligence and Statistics 2001 (AISTATS 2001), pages 65–72, Key West, USA, 2001.
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.