Corpus Collection, Cleaning, Translation and Related Applications
Nepali English Parallel Corpus

Bal Krishna Bal
Project Manager
PAN Localization Project
Madan Puraskar Pustakalaya, Nepal
URL : www.madanpuraskar.org
Email: [email protected]

Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos



Contents

• Parallel Corpus and its significance in Natural Language Processing

• Nepali English Parallel Corpus, an overview

• Tools and procedures involved

• Resources involved

• Experiences collected while building the corpus

• Conclusion


Parallel corpus and its significance in Natural Language Processing

• Parallel corpus – a text and its translations in one or more other languages, placed alongside each other.

• Text alignment at different levels (word, chunk, sentence) – a challenging task in building a parallel corpus.

• Parallel corpus – a valuable resource for statistical machine translation.


Nepali English Parallel Corpus, an overview

• 100,000 words of English source text from the Penn Treebank corpus, available through the Linguistic Data Consortium (LDC);

• Support for the work is provided by the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC) of Canada, through the PAN Localization Project;

• The work is released under the Creative Commons License and is available at http://www.crulp.org/software/ling-resources/UrduNepaliEnglishParallelCorpus.htm


Tools and procedures involved

• OmegaT – a free translation memory application;

• Editor software – Gedit, OpenOffice.org Writer;

• Source files were broken down into several segments and translated;

• Translated files underwent verification and proofreading simultaneously.

Statistics of translation files

Source file name    Number of source words    Translation status    Verification status
00.txt              41056                     100%                  100% complete
01.txt              43419                     100%                  100% complete
02.txt              9612                      100%                  100% complete
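
As a quick check on the table, the three per-file word counts can be totalled. The sketch below only re-uses the numbers listed in the statistics table; the variable names are illustrative and not part of the project. The total comes to 94,087 source words, somewhat below the round 100,000-word figure quoted in the overview.

```python
# Word counts copied from the statistics table of translation files.
source_word_counts = {
    "00.txt": 41056,
    "01.txt": 43419,
    "02.txt": 9612,
}

# Total number of source words across the three files.
total_words = sum(source_word_counts.values())

# Share of the whole source text held by each file.
shares = {name: count / total_words for name, count in source_word_counts.items()}

print(f"Total source words: {total_words}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of the source text")
```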


Human resources involved

• 4 translators and 1 proofreader – 6 months


Experiences collected

• Sentence-to-sentence translation was not always practicable. A source sentence can sometimes correspond to more than one sentence in the translation.

• Mapping constituent order between the sentences was difficult while translating (English follows a Subject, Verb, Object sentence pattern whereas Nepali follows a Subject, Object, Verb sentence pattern).

• The source text could not be translated into the target language with its meaning 100% intact, owing to the different socio-cultural contexts of the source language.

• Some strings in the source text that have no Nepali counterpart, such as LRB, *-1 etc., have been left as they are in the target text.

• Proper names in the source text have been transliterated in the target text.
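
Two of the points above (one source sentence mapping to several target sentences, and treebank strings copied over unchanged) can be illustrated with a small data structure. This is a minimal sketch, not the project's actual tooling: the `AlignedUnit` class, the `PASS_THROUGH` pattern and the example sentence are all hypothetical, and the pattern assumes the untranslated strings look like Penn Treebank bracket tokens (LRB/RRB) or trace markers such as *-1.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumption: pass-through strings look like Penn Treebank bracket tokens
# (LRB/RRB, with or without surrounding hyphens) or trace markers such as *-1.
PASS_THROUGH = re.compile(r"^(?:-?(?:LRB|RRB)-?|\*(?:[A-Z]+\*)?-\d+)$")


@dataclass
class AlignedUnit:
    """One source sentence mapped to one or more target sentences."""
    source: str
    targets: List[str] = field(default_factory=list)

    def pass_through_tokens(self) -> List[str]:
        """Source tokens expected to appear unchanged in the target text."""
        return [tok for tok in self.source.split() if PASS_THROUGH.match(tok)]

    def missing_in_target(self) -> List[str]:
        """Pass-through tokens not found verbatim in any target sentence."""
        target_tokens = {tok for t in self.targets for tok in t.split()}
        return [tok for tok in self.pass_through_tokens() if tok not in target_tokens]


# Made-up example sentence; the targets are placeholders, not real translations.
example = AlignedUnit(
    source="The plan *-1 was approved -LRB- by vote -RRB- .",
    targets=["<target sentence 1> *-1", "<target sentence 2> -LRB- -RRB-"],
)
print(example.pass_through_tokens())
print(example.missing_in_target())
```

Keeping a list of target sentences per source sentence avoids forcing a strict one-to-one mapping, and the verbatim check gives a proofreader a simple way to spot a dropped marker.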


Experiences collected…

• OmegaT was certainly a useful tool; however, it had a few drawbacks that we constantly ran into while using it:

  - The first was memory handling. Often, while dealing with bigger files, the system crashed, leading to data loss.

  - Although the tool facilitates segmentation via segmentation rules, there are often ambiguous cases. For instance, the full stop symbol “.” can appear at the end of a sentence or within it, as in “Mr.”, “A.D.” etc. This causes segmentation to occur at undesired points.

  - Since the tool does not support version control systems like CVS or SVN, we had a hard time managing and configuring the files manually.
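
The full-stop ambiguity described above can be reproduced in a few lines. The sketch below is not OmegaT's segmentation mechanism; it is a hypothetical whitespace-token splitter with a small, hand-picked abbreviation list (`ABBREVIATIONS` is an assumption), shown next to a naive split on every full stop to make the difference visible.

```python
import re
from typing import List

# Assumption: a small, hand-maintained abbreviation list; a real project
# would need a much longer one tuned to its own source text.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "A.D.", "B.C."}


def naive_split(text: str) -> List[str]:
    """Break after every full stop that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)


def split_sentences(text: str) -> List[str]:
    """Break after '.', '?' or '!' unless the token is a known abbreviation."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing punctuation mark
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Smith arrived in 1950 A.D. and stayed. He left later."
print(naive_split(text))      # four fragments, broken after "Mr." and "A.D."
print(split_sentences(text))  # two sentences
```

An abbreviation list does not remove the ambiguity entirely: when an abbreviation such as “A.D.” genuinely ends a sentence, this sketch misses the break, so some manual review of segment boundaries remains necessary.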


Conclusion

• This work has great value, especially in the context of developing linguistic resources for Nepali.

• The translated text is to be further tagged with parts of speech, opening additional prospects for use and research from a Natural Language Processing perspective.


Acknowledgment

This work was carried out with the aid of a grant from the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Centre for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan.


Thank You!!
