December 13, 2008 FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA http://terpconnect.umd.edu/~oard Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky Ideas from: Just about all of “Team TIDES”
December 13, 2008 FIRE
Not So Surprising Anymore: Hindi from TIDES to FIRE
Douglas W. Oard and Tan Xu
University of Maryland, USA
http://terpconnect.umd.edu/~oard
Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky
Ideas from: Just about all of “Team TIDES”
A Very Brief History of NLP
• 1966: ALPAC
– Refocus investment on enabling technologies
Hindi Resources
• Much more data available than for Cebuano
• Data collected by all project participants
– Web pages, News, Handbooks, Manually created, …
– Dictionaries
• Major problems:
– Many non-standard encodings
– Often no converters available
– Available converters often did not work properly
• Huge effort: data conversion and cleaning
• Resulting bilingual corpus: 4.2 million words
Hindi Translation Elicitation Server - Johns Hopkins University (David Yarowsky)
People voluntarily translated large numbers of Hindi news sentences for nightly prizes at a novel Johns Hopkins University website
Performance is measured by Bleu score on 20% randomly interspersed test sentences. This gives an immediate way to rank and reward quality translations and exclude junk.
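The quality check described here can be sketched roughly as follows. This is a minimal illustration, not the Johns Hopkins implementation: clipped unigram precision is used as a simple stand-in for Bleu, and the function names (`build_batch`, `score_user`) and the data layout are assumptions made for the example.

```python
import random
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate words also found in the reference, with clipped counts.

    This is a stand-in for Bleu, which also uses longer n-grams and a brevity penalty.
    """
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    return matched / len(cand)

def build_batch(new_sentences, test_sentences, test_share=0.2, seed=0):
    """Mix test sentences into a batch so they make up about test_share of it."""
    rng = random.Random(seed)
    n_test = round(len(new_sentences) * test_share / (1 - test_share))
    n_test = min(n_test, len(test_sentences))
    batch = list(new_sentences) + rng.sample(list(test_sentences), n_test)
    rng.shuffle(batch)  # the user cannot tell which sentences are tests
    return batch

def score_user(submissions, references):
    """Average score over the submitted sentences that have a reference translation.

    submissions: {source sentence: user translation}
    references:  {test source sentence: trusted translation}
    """
    scores = [clipped_unigram_precision(text, references[src])
              for src, text in submissions.items() if src in references]
    return sum(scores) / len(scores) if scores else 0.0
```

Users can then be ranked by `score_user`, and a threshold on that score is one way to exclude junk before the untested translations are added to the bitext.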
Result: 300,000 words of perfectly sentence-aligned bitext (exactly on genre) for 1-2 cents/word within ~5 days
Much cheaper than 25 cents/word for translation services or 5 cents/word for a prior MT group’s recruitment of local students
Sample Interface:
user (English) translations typed here…
and here ….
User choice of 2-3 encoding alternatives
Observed exponential growth in usage (before prizes ended)
viral advertising via family, friends, newsgroups, …
$0 in recruitment, advertising, and administrative costs
Nightly incentive rewards given automatically via amazon.com gift certificates to email addresses (any $ amount, no fee)
No need for hiring overhead. Rewards only given for proven high quality work already performed (prizes not salary).
immediate positive feedback encourages continued use
Direct immediate access to worldwide labor market fluent in source language
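To make the cost comparison on this slide concrete, the per-word rates can be multiplied out over the 300,000 words collected. The rates and the word count come from the slide; the labels and the helper name `total_dollars` are only for this illustration.

```python
WORDS = 300_000  # size of the elicited bitext, from the slide

# Per-word rates in cents, as quoted on the slide.
cents_per_word = {
    "elicitation server (low)": 1,
    "elicitation server (high)": 2,
    "student recruitment": 5,
    "translation services": 25,
}

def total_dollars(words: int, cents: float) -> float:
    """Total cost in dollars for a given per-word rate in cents."""
    return words * cents / 100

costs = {name: total_dollars(WORDS, c) for name, c in cents_per_word.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
# elicitation server (low): $3,000
# elicitation server (high): $6,000
# student recruitment: $15,000
# translation services: $75,000
```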
MT Challenges
• Lexicon coverage
– Hindi morphology
– Transliteration of Names
• Hindi word order:
– SOV vs. SVO
• Training data inconsistencies, misalignments
• Incomplete tuning cycle
– Same data/same model would give better results from better tuning of model parameters
Example Translation
• Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …
MT Results Overview - Hindi
[Bar chart. Axis: percent of human cased NIST r3n4 score, 50 to 90. Bars: best competing, ISI public, ISI public+, ISI unrestricted, ISI late, Human 6, Human 5.]
Results in NIST evaluation: 7.43 Cased NIST (7.80 uncased)
Comparison to other languages
Language pair      Training data (words)     NIST score   Relative human NIST
Cebuano-English    1.3M (w/o Bible: 400K)    ?            ?
Hindi-English      4.2M                      7.4          73%
Chinese-English    150M                      9.0          80%
Arabic-English     120M                      10.1         89%
Note: different (news) test corpora, NIST scores incomparable
Hindi Week 1: Porting
• Monday
– 2,973 BBC documents (UTF-8)
– Batch CLIR (no stemming, 2/3 known items rank 1)
• Tuesday
– MIRACLE (“ITRANS”, gloss)
– Stemmer (implemented from a paper)
• Wednesday
– BBC CLIR collection (19 topics, known item)
• Friday
– Parallel text (Bible: 900k words, Web: 4k words)
– Devanagari OCR system
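The "2/3 known items rank 1" figure from Monday is a known-item retrieval measure: each topic has exactly one target document, and the question is where the system ranks it. A small sketch of how such a figure might be computed is below. The document IDs are hypothetical, and the mean reciprocal rank is added here as a common companion measure that the slide does not report.

```python
def known_item_rank(ranking, target):
    """1-based position of the target document in a ranked list, or None if absent."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id == target:
            return position
    return None

def summarize(runs):
    """Return (fraction of topics with the target at rank 1, mean reciprocal rank).

    runs: list of (ranked list of document IDs, target document ID), one per topic.
    """
    ranks = [known_item_rank(ranking, target) for ranking, target in runs]
    at_rank_one = sum(1 for r in ranks if r == 1) / len(ranks)
    mrr = sum(1 / r for r in ranks if r is not None) / len(ranks)
    return at_rank_one, mrr

# Hypothetical three-topic run: two targets at rank 1, one at rank 2.
runs = [
    (["d1", "d2"], "d1"),
    (["d3", "d4"], "d4"),
    (["d5"], "d5"),
]
at_rank_one, mrr = summarize(runs)
```

For this toy run, `at_rank_one` is 2/3, matching the shape of the figure quoted for the batch CLIR run.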
Hindi Weeks 2/3/4: Exploration
• N-grams (trigrams best for UTF-8)
• Relative Average Term Frequency (Kwok)
• Scanned bilingual dictionary (Oxford)
• More topics for test collection (29)
• Weighted structured queries (IBM lexicon)
• Alternative stemmers (U Mass, Berkeley)
• Blind relevance feedback
• Transliteration
• Noun phrase translation
• MIRACLE integration (ISI MT, BBN headlines)
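As an illustration of the n-gram indexing idea in the first bullet, the sketch below splits each word into overlapping character trigrams. It assumes n-grams are taken over Unicode code points within whitespace-separated words; the slide does not say whether the experiments used code points or bytes, and the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Overlapping character n-grams for each whitespace-separated word.

    Words shorter than n are kept whole so that they remain searchable.
    In Devanagari, vowel signs are separate code points, so they count
    as characters here.
    """
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

# "भारत" is four code points (भ, ा, र, त), so it yields two trigrams.
print(char_ngrams("भारत"))  # ['भार', 'ारत']
```

Indexing and querying on such n-grams is one way to get partial matching across inflected forms without a stemmer.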