AVENUE/LETRAS: Learning-based MT for Languages with Limited Resources
Faculty: Jaime Carbonell, Alon Lavie, Lori Levin, Ralf Brown, Robert Frederking
Students and Staff: Erik Peterson, Christian Monson, Ariadna Font Llitjós, Alison Alvarez, Roberto Aranovich, Rodolfo Vega
Mar 1, 2006 AVENUE/LETRAS 2
Outline
• Scientific Objectives
• Framework Overview
• Learning Morphology
• Elicitation
• Learning Transfer Rules
• Automatic Rule Refinement
• Language Prototypes
• New Directions
Why Machine Translation for Languages with Limited Resources?
• We are in the age of information explosion
– The internet + web + Google: anyone can get the information they want, anytime…
• But what about the text in all those other languages?
– How do they read all this English stuff?
– How do we read all the stuff that they put online?
• MT for these languages would enable:
– Better government access to native indigenous and minority communities
– Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages
– Civilian and military applications (disaster relief)
– Language preservation
The Roadmap to Learning-based MT
• Automatic acquisition of necessary language resources and knowledge using machine learning methodologies
• A framework for integrating the acquired MT resources into effective MT prototype systems
• Effective integration of acquired knowledge with statistical/distributional information
CMU’s AVENUE Approach
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
– Learn from major language to minor language
– Translate from minor language to major language
• XFER + Decoder:
– XFER engine produces a lattice of possible transferred structures at all levels
– Decoder searches and selects the best scoring combination
• Rule Refinement: automatically refine and correct the acquired transfer rules via a process of interaction with bilingual informants, who help the system identify translation errors
• Morphology Learning: unsupervised learning of morpheme structure of words based on their organization into paradigms and distributional information
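The XFER + Decoder step can be pictured with a small sketch. It assumes the lattice is a list of arcs over source word positions, each carrying a candidate translation and an additive score (for example a log-probability); the arc format, the scores, and the function name `decode` are illustrative assumptions, not the actual AVENUE decoder.

```python
def decode(lattice, n):
    """Pick the best-scoring sequence of arcs covering source positions 0..n.

    lattice: list of (start, end, translation, score) with start < end.
    Scores are assumed additive (e.g. log-probabilities); higher is better.
    Returns (total_score, [translations]) or None if no path covers the input.
    """
    best = {0: (0.0, [])}  # best[k] = best partial path ending at position k
    for end in range(1, n + 1):
        candidates = []
        for start, arc_end, text, score in lattice:
            if arc_end == end and start in best:
                prev_score, prev_texts = best[start]
                candidates.append((prev_score + score, prev_texts + [text]))
        if candidates:
            best[end] = max(candidates, key=lambda c: c[0])
    return best.get(n)


# Hypothetical lattice for a three-word input.
example_lattice = [
    (0, 1, "the", -1.0),
    (1, 2, "old", -1.0),
    (2, 3, "man", -1.0),
    (1, 3, "old man", -1.5),          # arc produced by a phrase-level rule
    (0, 3, "the elderly man", -4.0),  # arc produced by a sentence-level rule
]
best_path = decode(example_lattice, 3)
```

Because arcs at all levels sit in the same lattice, the search is free to mix word-level and phrase-level transfers in one output.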
AVENUE MT Approach
[Figure: MT pyramid between Source (e.g. Quechua) and Target (e.g. English). Levels, bottom to top: Direct (SMT, EBMT); Transfer Rules, between Syntactic Parsing and Text Generation; Interlingua, between Semantic Analysis and Sentence Planning. Label: "AVENUE: Automate Rule Learning".]
Avenue Architecture
[Figure: Avenue architecture diagram. Panels and components:]
• Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus
• Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules
• Morphology: Learning Module, Morphology Analyzer
• Run-Time System: INPUT TEXT, Lexical Resources, Run Time Transfer System, Decoder, OUTPUT TEXT
• Rule Refinement: Translation Correction Tool, Rule Refinement Module
Transfer Rule Formalism
• Type information
• Part-of-speech/constituent information
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
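To make the role of the alignments concrete, here is a minimal sketch of applying the reordering part of the NP rule to "the old man". The function name `apply_alignments` and the three-entry lexicon are illustrative assumptions (the lexicon is read off the slide's example); feature constraints are ignored, and this is not the XFER engine itself.

```python
def apply_alignments(alignments, y_length, source_words, lexicon):
    """Build the target word sequence dictated by a rule's alignments.

    alignments: list of 1-based (x, y) pairs, read as (Xx::Yy).
    Each target slot Yy is filled with the translation of the source
    word at Xx. Constraints (agreement etc.) are not modelled here.
    """
    y_to_x = {y: x for x, y in alignments}
    return [lexicon[source_words[y_to_x[y] - 1]] for y in range(1, y_length + 1)]


# NP::NP [DET ADJ N] -> [DET N DET ADJ]
np_alignments = [(1, 1), (1, 3), (2, 4), (3, 2)]
toy_lexicon = {"the": "ha", "old": "zaqen", "man": "ish"}  # assumed entries

target = apply_alignments(np_alignments, 4, ["the", "old", "man"], toy_lexicon)
```

The source determiner X1 is aligned to both Y1 and Y3, which is how the single English article surfaces twice on the Hebrew side.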
copula-role
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"

copula-identity
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. "Clark Kent is Superman" "Sam is the teacher"

copula-location
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.

copula-description
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. A description is an attribute. "The children are happy." "The books are long."
Feature Maps
• Some features interact in the grammar
– English –s reflects person and number of the subject and tense of the verb.
– In expressing the English present progressive tense, the auxiliary verb is in a different place in a question and a statement:
  • He is running.
  • Is he running?
• We need to check many, but not all, combinations of features and values.
• Using unlimited feature combinations leads to an unmanageable number of sentences.
Evidentiality Map
[Figure: feature map for evidentiality. Features and the values shown; Tense and Gram. Aspect are repeated per branch with different subsets of values:]
• Lexical Aspect: activity-accomplishment
• Assertiveness: asserted, neutral
• Polarity: positive, negative
• Source: hearsay, quotative, inferred, assumption, visual, auditory, non-visual-or-auditory
• Tense: past, present, future
• Gram. Aspect: perfective, progressive, habitual, neutral
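A quick count shows why a feature map has to restrict combinations. The sketch below multiplies out the values named in the evidentiality map; the dictionary is an assumption in that it treats activity-accomplishment as a single value and pools the Source, Tense, and Gram. Aspect values across all branches of the map.

```python
from itertools import product
from math import prod

# Values as listed on the evidentiality map (pooled across branches).
FEATURES = {
    "lexical_aspect": ["activity-accomplishment"],
    "assertiveness": ["asserted", "neutral"],
    "polarity": ["positive", "negative"],
    "source": ["hearsay", "quotative", "inferred", "assumption",
               "visual", "auditory", "non-visual-or-auditory"],
    "tense": ["past", "present", "future"],
    "gram_aspect": ["perfective", "progressive", "habitual", "neutral"],
}

# Size of the unrestricted cross product: one elicitation sentence each.
full_space = prod(len(values) for values in FEATURES.values())

# The combinations themselves, e.g. to filter against a feature map.
all_combinations = list(product(*FEATURES.values()))
```

Six features already give 1 x 2 x 2 x 7 x 3 x 4 = 336 sentences for this one map, before any other part of the feature specification is crossed in.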
Current Work
• Navigation
– Start: large search space of all possible feature combinations
– Finish: each feature has been eliminated as irrelevant or has been explored
– Goal: dynamically find the most efficient path through the search space for each language.
Current Work
• Feature Detection
– Which features have an effect on morphosyntax?
– What is the effect?
– Drives the Navigation process
Feature Detection: Spanish
The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo

A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo

I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo
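The first two examples form a minimal pair: the English sides differ only in the definiteness of the subject. A minimal sketch of feature detection on such a pair, assuming it is enough to compare the two elicited translations token by token (the function name and the equal-length restriction are illustrative assumptions, not the project's detector):

```python
def differing_positions(target_a, target_b):
    """Compare the translations of a minimal pair word by word.

    Returns a list of (position, word_a, word_b) for each 1-based position
    where the translations differ, or None when the lengths differ and a
    position-by-position comparison is not meaningful.
    """
    words_a, words_b = target_a.split(), target_b.split()
    if len(words_a) != len(words_b):
        return None
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(words_a, words_b), start=1)
        if a != b
    ]


# Source pair differs only in subject definiteness (the girl / a girl).
diff = differing_positions("La niña vió un libro rojo",
                           "Una niña vió un libro rojo")
```

A non-empty difference suggests the feature has a morphosyntactic effect in the target language and is worth pursuing; an empty one suggests it can be dropped from navigation.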
• Corpus Navigation: which minimal pairs to pursue next.
– Don’t pursue gender in Mapudungun
– Do pursue definiteness in Hebrew
• Morphology Learning:
– Morphological learner identifies the forms of the morphemes
– Feature detection identifies the functions
• Rule learning:
– Rule learner will have to learn a constraint for each morpho-syntactic marker that is discovered
  • E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.
Rule Learning
[Figure: Avenue Architecture diagram (repeated).]
Rule Learning - Overview
• Goal: Acquire Syntactic Transfer Rules
• Use available knowledge from the major-language side (grammatical structure)
• Three steps:
1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
2. Compositionality Learning: use previously learned rules to learn hierarchical structure
3. Constraint Learning: refine rules by learning appropriate feature constraints
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2))
Flat Seed Rule Generation
• Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
– Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
– Words that have complex alignments (or not the same POS) remain lexicalized
• One seed rule for each translation example
• No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
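The procedure above can be sketched as follows. The names `Word` and `flat_seed_rule` are hypothetical. One interpretive assumption is flagged in the code: a source word linked to several target words is still generalized when every linked target word shares its POS and is linked to nothing else, which is what the "the big apple" example (one ART aligned to two ARTs) appears to require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    pos: str


def flat_seed_rule(rule_type, src, tgt, alignments):
    """Generate a flat seed rule string from one word-aligned example.

    alignments: list of 1-based (source_index, target_index) pairs.
    """
    src_links, tgt_links = {}, {}
    for i, j in alignments:
        src_links.setdefault(i, set()).add(j)
        tgt_links.setdefault(j, set()).add(i)

    def src_generalizes(i):
        # ASSUMPTION: one-to-many is fine if all linked target words have
        # the same POS as the source word and no other source link.
        links = src_links.get(i, set())
        return bool(links) and all(
            tgt[j - 1].pos == src[i - 1].pos and tgt_links[j] == {i}
            for j in links
        )

    def tgt_generalizes(j):
        links = tgt_links.get(j, set())
        return len(links) == 1 and src_generalizes(next(iter(links)))

    # Generalized words become POS tags; the rest stay lexicalized (quoted).
    x_side = [w.pos if src_generalizes(i) else f'"{w.form}"'
              for i, w in enumerate(src, start=1)]
    y_side = [w.pos if tgt_generalizes(j) else f'"{w.form}"'
              for j, w in enumerate(tgt, start=1)]
    links = "".join(f"(X{i}::Y{j})" for i, j in sorted(alignments))
    return (f"{rule_type}::{rule_type} [{' '.join(x_side)}] -> "
            f"[{' '.join(y_side)}] ({links})")


english = [Word("the", "ART"), Word("big", "ADJ"), Word("apple", "N")]
hebrew = [Word("ha", "ART"), Word("tapuax", "N"),
          Word("ha", "ART"), Word("gadol", "ADJ")]
seed = flat_seed_rule("NP", english, hebrew, [(1, 1), (1, 3), (2, 4), (3, 2)])
```

On the slide's example this yields the same constituent sequences and alignments as the generated seed rule; a word pair with mismatched POS would instead appear as quoted word forms on both sides.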
Compositionality Learning
Initial Flat Rules: S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
• Goal: add appropriate feature constraints to the acquired rules
• Methodology:
– Preserve general structural transfer
– Learn specific feature constraints from example set
• Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
• Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
• The seed rules in a group form the specific boundary of a version space
• The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints
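A minimal sketch of how the clustering and the two boundaries could be represented. The `SeedRule` class, the string encoding of constraints, and the example constraint values are assumptions for illustration; the sketch further assumes each seed rule has already been annotated with the feature values observed in its example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedRule:
    rule_type: str
    x_side: tuple
    y_side: tuple
    alignments: frozenset
    constraints: frozenset = frozenset()  # hypothetical string encoding

    def structure(self):
        """Transfer structure shared by all rules in a cluster."""
        return (self.rule_type, self.x_side, self.y_side, self.alignments)


def build_version_spaces(seed_rules):
    """Group seed rules by transfer structure; one version space per cluster."""
    clusters = defaultdict(list)
    for rule in seed_rules:
        clusters[rule.structure()].append(rule)
    return {
        key: {
            "specific": rules,          # specific boundary: the seed rules
            "general": SeedRule(*key),  # general boundary: no constraints
        }
        for key, rules in clusters.items()
    }


np_structure = ("NP", ("ART", "ADJ", "N"), ("ART", "N", "ART", "ADJ"),
                frozenset({(1, 1), (1, 3), (2, 4), (3, 2)}))
rules = [
    SeedRule(*np_structure, constraints=frozenset({"((Y2 NUM) = SG)"})),
    SeedRule(*np_structure, constraints=frozenset({"((Y2 NUM) = PL)"})),
    SeedRule("NP", ("N",), ("N",), frozenset({(1, 1)})),
]
spaces = build_version_spaces(rules)
```

Constraint learning then searches between the two boundaries of each space for rules that are as general as the examples allow.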
Rule Refinement
[Figure: Avenue Architecture diagram (repeated).]
Interactive and Automatic Refinement of Translation Rules
• Problem: Improve Machine Translation quality.
• Proposed Solution: Put bilingual speakers back into the loop; use their corrections to detect the source of the error and automatically improve the lexicon and the grammar.
• Approach: Automate post-editing efforts by feeding them back into the MT system. Automatic refinement of translation rules that caused an error beyond post-editing.
• Goal: Improve MT coverage and overall quality.
Technical Challenges
• Elicit minimal MT information from non-expert users
• Automatically refine and expand translation rules minimally
– Manually written
– Automatically learned
• Automatic evaluation of the refinement process
Error Typology for Automatic Rule Refinement (simplified)
• Missing word
• Extra word
• Wrong word order
– Local vs Long distance
– Word vs. phrase
– + Word change
• Incorrect word
– Sense
– Form
– Selectional restrictions
– Idiom
• Wrong agreement
– Missing constraint
– Extra constraint
TCTool (Demo)
• Add a word
• Delete a word
• Modify a word
• Change word order

Given the initial and final Translation Lattices, the Rule Refinement module needs to take into account whether the following are present:
– Corrected Translation Sentence
– Original Translation Sentence (labelled as incorrect by the user)
un artista gran
un gran artista
un grande artista
*un artista grande
Evaluating Improvement
Automatic Rule Adaptation
Given the initial and final Translation Lattices, the Rule Refinement module needs to take into account whether the following are present:
– Corrected Translation Sentence
– Original Translation Sentence (labelled as incorrect by the user)
*un artista gran
un gran artista
*un grande artista
*un artista grande
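The presence checks described above can be written down directly. This sketch assumes a translation lattice can be flattened to the set of sentences it licenses; the function names, the lattice contents, and the improvement criterion are illustrative assumptions, not the project's evaluation procedure.

```python
def refinement_outcome(initial_lattice, final_lattice, corrected, original):
    """Record which of the two sentences each lattice contains.

    Lattices are modelled as plain sets of candidate translations.
    """
    return {
        "corrected_before": corrected in initial_lattice,
        "corrected_after": corrected in final_lattice,
        "original_before": original in initial_lattice,
        "original_after": original in final_lattice,
    }


def is_improvement(outcome):
    # ASSUMED criterion: after refinement the corrected sentence is
    # produced and the sentence the user rejected no longer is.
    return outcome["corrected_after"] and not outcome["original_after"]


# Hypothetical lattice contents built from the sentences on the slide.
initial = {"un artista grande", "un grande artista"}
final = {"un gran artista"}
outcome = refinement_outcome(initial, final,
                             corrected="un gran artista",
                             original="un artista grande")
```

Keeping all four flags, rather than a single pass/fail, separates a refinement that only adds the correct output from one that also removes the incorrect output.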
Challenges and future work
• Credit and Blame assignment from TCTool Log Files and Xfer engine’s trace
• Order of corrections matters ~ explore rule interactions
• Explore the space between batch mode and fully interactive system
• Online TCTool always running to collect corrections from bilingual speakers; make it into a game with rewards for the best users
AVENUE Prototypes
• General XFER framework under development for past three years
• Prototype systems so far:
– German-to-English, Dutch-to-English
– Chinese-to-English
– Hindi-to-English
– Hebrew-to-English
• In progress or planned:
– Mapudungun-to-Spanish
– Quechua-to-Spanish
– Native Alaskan languages (Inupiaq) to English
– Native-Bolivian languages (Aymara) to Spanish
– Native-Brazilian languages to Brazilian Portuguese
Mapudungun
• Indigenous Language of Chile and Argentina
• ~ 1 Million Mapuche Speakers
Collaboration
• Mapuche Language Experts
– Universidad de la Frontera (UFRO)
• Instituto de Estudios Indígenas (IEI)
– Institute for Indigenous Studies
• Chilean Funding
– Chilean Ministry of Education (Mineduc)
  • Bilingual and Multicultural Education Program
Eliseo Cañulef
Rosendo Huisca
Hugo Carrasco
Hector Painequeo
Flor Caniupil
Luis Caniupil Huaiquiñir
Marcela Collio Calfunao
Cristian Carrillan Anton
Salvador Cañulef
Carolina Huenchullan Arrúe
Claudio Millacura Salas
Accomplishments
• Corpora Collection
– Spoken Corpus
  • Collected: Luis Caniupil Huaiquiñir
  • Medical Domain
  • 3 of 4 Mapudungun Dialects
    – 120 hours of Nguluche
    – 30 hours of Lafkenche
    – 20 hours of Pwenche
  • Transcribed in Mapudungun
  • Translated into Spanish
– Written Corpus
  • ~ 200,000 words
  • Bilingual Mapudungun – Spanish
  • Historical and newspaper text
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués

nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
Accomplishments
• Developed at UFRO
– Bilingual Dictionary with Examples
  • 1,926 entries
– Spelling Corrected Mapudungun Word List
  • 117,003 fully-inflected word forms
– Segmented Word List
  • 15,120 forms
  • Stems translated into Spanish
Accomplishments
• Developed at LTI using Mapudungun language resources from UFRO
– Spelling Checker
  • Integrated into OpenOffice
– Hand-built Morphological Analyzer
– Prototype Machine Translation Systems
  • Rule-Based
  • Example-Based
– Website: LenguasAmerindias.org
Quechua-to-Spanish MT
• V-Unit: funded Summer project in Cusco (Peru) June-August 2005 [preparations and data collection started earlier]
• Intensive Quechua course in Centro Bartolome de las Casas (CBC)
• Worked together with two native Quechua speakers and one non-native speaker on developing infrastructure (correcting elicited translations, segmenting and translating a list of the most frequent words)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat
a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police
in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money
Future Research Directions
• Automatic Transfer Rule Learning:
– In the “large-data” scenario: from large volumes of uncontrolled parallel text automatically word-aligned
– In the absence of morphology or POS annotated lexica
– Learning mappings for non-compositional structures
– Effective models for rule scoring for
  • Decoding: using scores at runtime
  • Pruning the large collections of learned rules
– Learning Unification Constraints
• Integrated Xfer Engine and Decoder
– Improved models for scoring tree-to-tree mappings, integration with LM and other knowledge sources in the course of the search
Future Research Directions
• Automatic Rule Refinement
• Morphology Learning
• Feature Detection and Corpus Navigation
• Prototypes for New Languages