Rapid Prototyping of a Transfer-based Hebrew-to-English
Machine Translation System
Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Joint work with:
Shuly Wintner, Danny Shacham, Nurit Melnik, Yuval Krymolowski – University of Haifa
Erik Peterson – Carnegie Mellon University
June 20, 2007 ISCOL/BISFAI-2007 2
Outline
• Context of this Work
• CMU Statistical Transfer MT Framework
• Hebrew and its Challenges for MT
• Hebrew-to-English System
• Morphological Analysis and Generation
• MT Resources: lexicon and grammar
• Translation Examples
• Performance Evaluation
• Conclusions, Current and Future Work
Current State-of-the-art in Machine Translation
• MT underwent a major paradigm shift over the past 15 years:
– From manually crafted rule-based systems with manually designed knowledge resources
– To search-based approaches founded on automatic extraction of translation models/units from large sentence-parallel corpora
• Current Dominant Approach: Phrase-based Statistical MT:
– Extract and statistically model large volumes of phrase-to-phrase correspondences from automatically word-aligned parallel corpora
– “Decode” new input by searching for the most likely sequence of phrase matches, using a statistical Language Model for the target language
Current State-of-the-art in Machine Translation
• Phrase-based MT State-of-the-art:
– Requires at least several million words of parallel text for adequate training
– Limited to language pairs for which such data exists: major European languages, Chinese, Japanese, a few others…
– Linguistically shallow and highly lexicalized models result in weak generalization
– Best performance levels (BLEU=~0.6) on Arabic-to-English provide understandable but often still somewhat disfluent translations
– Ill-suited for Hebrew and most of the world’s minor languages
CMU’s Statistical-Transfer (XFER) Approach
• Framework: Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
• XFER + Decoder:
– XFER engine produces a lattice of possible transferred structures at all levels
– Decoder searches and selects the best scoring combination
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• Main algorithm: chart-style bottom-up integrated parsing+transfer with beam pruning
– Seeded by word-to-word translations
– Driven by transfer rules
– Generates a lattice of transferred translation segments at all levels
• Some Unique Features:
– Works with either learned or manually-developed transfer grammars
– Handles rules with or without unification constraints
– Supports interfacing with servers for morphological analysis and generation
– Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures
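The bottom-up parsing+transfer loop described above can be sketched roughly as follows. This is a minimal illustration, not the XFER engine itself: the `Arc` class, the `LEXICON` and `RULES` tables, their scores, and the restriction to binary rules without unification constraints are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    start: int      # index of the first source word covered
    end: int        # index one past the last source word covered
    cat: str        # syntactic category, e.g. "N", "ADJ", "NP"
    target: tuple   # target-language words for this span
    score: float    # toy log-score, higher is better

# Hypothetical word-to-word lexicon: source word -> (category, translation, score)
LEXICON = {
    "IWM": [("N", "DAY", -1.0)],
    "RBI&I": [("ADJ", "FOURTH", -1.0)],
}

# Hypothetical binary transfer rules:
# (source categories, parent category, target order, rule score)
RULES = [
    (("N", "ADJ"), "NP", (1, 0), -0.5),  # noun-adjective -> adjective-noun
]

def transfer(words, beam=5):
    """Return every arc built bottom-up over `words` (a toy lattice)."""
    n = len(words)
    chart = {}
    # Seed the chart with word-to-word translations.
    for i, word in enumerate(words):
        chart[(i, i + 1)] = [
            Arc(i, i + 1, cat, (trans,), score)
            for cat, trans, score in LEXICON.get(word, [])
        ]
    # Combine adjacent spans, shortest spans first.
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            found = []
            for src_cats, parent, order, rule_score in RULES:
                for mid in range(start + 1, end):
                    for left in chart.get((start, mid), []):
                        for right in chart.get((mid, end), []):
                            if (left.cat, right.cat) != src_cats:
                                continue
                            parts = (left, right)
                            # Reorder the children as the rule prescribes.
                            target = sum((parts[k].target for k in order), ())
                            score = left.score + right.score + rule_score
                            found.append(Arc(start, end, parent, target, score))
            # Beam pruning: keep only the best-scoring arcs for this span.
            found.sort(key=lambda arc: arc.score, reverse=True)
            chart[(start, end)] = found[:beam]
    return [arc for arcs in chart.values() for arc in arcs]

lattice = transfer(["IWM", "RBI&I"])
```

Here `lattice` holds the two word-level arcs plus one `NP` arc whose target words are reordered by the rule. Keeping the word-level arcs alongside the larger constituents is what lets a later decoding step choose among segments at all levels.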
XFER Lattice Decoder
0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13
235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0
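To make the decoder's job concrete, here is a minimal sketch of selecting the best-scoring sequence of lattice arcs that covers the whole input. It assumes each arc is a `(start, end, score, text)` tuple with `start < end` and that arc scores simply add up; the arcs and scores below are invented for illustration. The output above lists further features (Prob, Rules, Frag, Length), and how the actual decoder combines them is not modeled here.

```python
def best_path(arcs, n):
    """Pick the highest-scoring arc sequence covering positions 0..n.

    arcs: list of (start, end, score, text) tuples with start < end.
    Returns (total_score, translation), or None if no full cover exists.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[j]: best score of a cover of 0..j
    back = [None] * (n + 1)      # back[j]: last arc on that best cover
    best[0] = 0.0
    for end in range(1, n + 1):
        for arc in arcs:
            start, arc_end, score, _text = arc
            if arc_end != end or best[start] == neg_inf:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = arc
    if best[n] == neg_inf:
        return None
    # Follow the back-pointers from the end to recover the translation.
    pieces = []
    pos = n
    while pos > 0:
        arc = back[pos]
        pieces.append(arc[3])
        pos = arc[0]
    return best[n], " ".join(reversed(pieces))

# Hypothetical arcs over a three-word input (scores are made up).
toy_arcs = [
    (0, 1, -1.5, "DAY"),
    (1, 2, -1.5, "FOURTH"),
    (0, 2, -2.5, "FOURTH DAY"),
    (2, 3, -1.2, "LION"),
]
result = best_path(toy_arcs, 3)
```

With these toy scores the two-word arc beats the word-by-word path, so the reordered segment ends up in the selected translation.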
• In progress or planned:
– Mapudungun-to-Spanish
– Quechua-to-Spanish
– Brazilian Portuguese-to-English
– Native-Brazilian languages to Brazilian Portuguese
– Hebrew-to-Arabic
Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
– No publicly available broad coverage morphological analyzer
– No publicly available bilingual lexicons or dictionaries
– No POS-tagged corpus or parse tree-bank corpus for Hebrew
– No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based MT framework for languages with limited resources
Modern Hebrew Spelling
• Two main spelling variants
– “KTIV XASER” (deficient): spelling that relies on vowel diacritics, leaving consonant-only words when the diacritics are removed
– “KTIV MALEH” (full): words with I/O/U vowels are written with long vowels which include a letter
• KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications, which results in inconsistent spelling
• Example:
– niqud (spelling): NIQWD, NQWD, NQD
– When written as NQD, could also be niqed, naqed, nuqad
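The ambiguity in the example above runs in both directions: one word has several written forms, and one written form has several readings. A small sketch of indexing that relation, using only the forms listed in the example (the `build_indexes` helper and its data layout are illustrative assumptions, not part of the system described):

```python
from collections import defaultdict

# (written form, reading) pairs taken from the example above
ATTESTED = [
    ("NIQWD", "niqud"),
    ("NQWD", "niqud"),
    ("NQD", "niqud"),
    ("NQD", "niqed"),
    ("NQD", "naqed"),
    ("NQD", "nuqad"),
]

def build_indexes(pairs):
    """Index written forms by reading and readings by written form."""
    readings_of = defaultdict(set)
    spellings_of = defaultdict(set)
    for written, reading in pairs:
        readings_of[written].add(reading)
        spellings_of[reading].add(written)
    return readings_of, spellings_of

readings_of, spellings_of = build_indexes(ATTESTED)
```

Looking up `readings_of["NQD"]` returns all four readings, while `spellings_of["niqud"]` returns the three spelling variants, which is why both a lexicon and an analyzer have to tolerate more than one written form per word.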
Morphological Analyzer
• We use a publicly available morphological analyzer distributed by the Technion’s Knowledge Center, adapted for our system
• Coverage is reasonable (for nouns, verbs and adjectives)
• Produces all analyses or a disambiguated analysis for each word
• Output format includes lexeme (base form), POS, morphological features
• Output was adapted to our representation needs (POS and feature mappings)
• Initial prototype developed within a two-month intensive effort
• Accomplished:
– Adapted available morphological analyzer
– Constructed a preliminary translation lexicon
– Translated and aligned Elicitation Corpus
– Learned XFER rules
– Developed (small) manual XFER grammar
– System debugging and development
– Evaluated performance on unseen test data using automatic evaluation metrics
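Adapting the morphological analyzer's output (lexeme, POS, morphological features) to another representation through POS and feature mappings could look roughly like the sketch below. The `Analysis` record, the tag names, and both mapping tables are hypothetical; the slides do not give the actual tag sets of either the analyzer or the transfer system.

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    lexeme: str                                   # base form of the word
    pos: str                                      # part-of-speech tag
    features: dict = field(default_factory=dict)  # morphological features

# Hypothetical mappings from analyzer tags to transfer-grammar tags
POS_MAP = {"noun": "N", "verb": "V", "adjective": "ADJ"}
FEATURE_MAP = {"gender": "GEN", "number": "NUM"}

def adapt(analysis):
    """Rename the POS tag and feature names; unknown names pass through."""
    return Analysis(
        lexeme=analysis.lexeme,
        pos=POS_MAP.get(analysis.pos, analysis.pos),
        features={
            FEATURE_MAP.get(name, name): value
            for name, value in analysis.features.items()
        },
    )

raw = Analysis("IWM", "noun", {"gender": "masculine", "number": "singular"})
adapted = adapt(raw)
```

Since the analyzer can return all analyses for a word, a caller would apply `adapt` to each one and pass the whole set on, leaving the choice among them to later stages.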
Example Translation
• Input:
– לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה
– Word-for-word gloss: After debates many decided the government to hold referendum in issue the withdrawal
• Output:
– AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL
Without transfer grammar: IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR
With transfer grammar: SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR
Evaluation Results
• Test set of 62 sentences from Haaretz newspaper, 2 reference translations
System       BLEU    NIST    Precision  Recall  METEOR
No Grammar   0.0616  3.4109  0.4090     0.4427  0.3298
Learned      0.0774  3.5451  0.4189     0.4488  0.3478
Manual       0.1026  3.7789  0.4334     0.4474  0.3617
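For a quick reading of the table, the relative BLEU gain of each grammar over the no-grammar baseline can be computed directly from the reported scores. The dictionary keys and the `relative_gain` helper are names chosen for this sketch; only the three BLEU values come from the table.

```python
# BLEU scores copied from the table above
BLEU = {"no_grammar": 0.0616, "learned": 0.0774, "manual": 0.1026}

def relative_gain(score, baseline):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (score - baseline) / baseline

learned_gain = relative_gain(BLEU["learned"], BLEU["no_grammar"])
manual_gain = relative_gain(BLEU["manual"], BLEU["no_grammar"])
```

This works out to roughly a 26% relative gain for the learned grammar and roughly 67% for the manual grammar, both measured against the no-grammar BLEU score.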
Current and Future Work
• Issues specific to the Hebrew-to-English system:
– Coverage: further improvements in the translation lexicon and morphological analyzer
– Manual Grammar development
– Acquiring/training of word-to-word translation probabilities
– Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
• General Issues related to XFER framework:
– Discriminative Language Modeling for MT
– Effective models for assigning scores to transfer rules
– Improved grammar learning
– Merging/integration of manual and acquired grammars
Conclusions
• Test case for the CMU XFER framework for rapid MT prototyping
• Preliminary system was a two-month, three-person effort – we were quite happy with the outcome
• Core concept of XFER + Decoding is very powerful and promising for MT
• We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...