Administration
• Introduction/Signup sheet
• Course web site: http://www.cs.princeton.edu/courses/archive/spring09/cos401/
• Course location and time: Thursday, 1:30pm – 4:20pm, Robertson Hall 023
• TA: Juan Carlos Niebles
– Office: 215 Computer Science Bldg.
– Phone: (609) 258-8241
– Email: jniebles [at] princeton
– Office hour: TBD or by appointment
• Suggested Reading List:
– (NSW) Readings in Machine Translation, S. Nirenberg, H. Somers and Y. Wilks, MIT Press, 2002
– (AT) Translation Engines: Techniques for Machine Translation, Arturo Trujillo, Springer, 1999
– (JM) Speech and Language Processing, Jurafsky and Martin, Prentice Hall
– (HS) An Introduction to Machine Translation, W. John Hutchins and Harold L. Somers, London: Academic Press, 1992
• Assessment:
– Class participation and attendance: 15%
– Homework assignments: 20%
– Midterm exam: 30%
– Final exam/Term paper: 35%
– Writing from left-to-right vs right-to-left
– Character sets (alphabetic, logograms, pictograms)
– Segmentation into word/word-like units
• Morphology
• Lexical: Word senses
– bank: “river bank”, “financial institution”
• Syntactic: Word order
– Subject-verb-object vs. subject-object-verb
• Semantic: meaning
– “ate pasta with a spoon”, “ate pasta with marinara”, “ate pasta with John”
• Pragmatic: world knowledge
– “Can you pass me the salt?”
• Social: conversational norms
– pronoun usage depends on the conversational partner
• Cultural: idioms and phrases
– “out of the ballpark”, “came from leftfield”
• Contextual
• In addition, for Speech Translation
– Prosody: JOHN eats bananas; John EATS bananas; John eats BANANAS
– Pronunciation differences
– Speech recognition errors
• In a multilingual environment
– Code Switching: Use of the linguistic apparatus of one language to express ideas in another language.
Machine Translation: Why and what’s it good for?
• Understanding people across linguistic barriers
– Socio-Political
– Commercial: Globalization
• Limited availability of human expertise
• What is it good for?
– Tasks with limited vocabulary and syntax (technical manuals)
– Rough translations for web pages, emails
– Applications that use translation as one of the components
• What is it not good for?
– Hard and important domains (Literature, Legal, Medical)
• Machine Translation need not be fully automated!
– Human assisted machine translation
– Machine assisted human translation
– Machine Translation as a productivity enhancement tool.
Machine Translation: Past and Present
• 1947-1954: MT as code breaking; IBM-Georgetown Univ. demonstration
• 1954-1966: Large bilingual dictionaries; linguistic and formal-grammar-motivated syntactic reordering; lots of funding, little progress
• 1966: ALPAC report: “there is no immediate or predictable prospect of useful fully automatic machine translation”
• 1966-1980s: Translation continued in Canada, France and Germany. Beyond English-Russian translation. Meteo for translating weather reports. Systran in 1970
• 1980-1990: Emphasis on ‘indirect’ translation: semantic and knowledge-based. Advent of microcomputers. Translation companies: Systran, Logos, GlobalLink. Domain-specific machine-aided translation systems.
• 1990-present: Corpus-based methods: IBM’s Candide, Japanese ‘example-based’ translation. Speech-to-speech translation: Verbmobil, Janus. ‘Pure’ to practical MT for embedded applications: cross-lingual IR
MT Approaches: Different levels of meaning transfer
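[Figure: MT pyramid from Source to Target, with depth of analysis increasing upward. Direct MT at the base; Transfer-based MT between the source and target syntactic structures; Interlingua at the apex. Source side: Parsing, then Semantic Interpretation. Target side: Semantic Generation, then Syntactic Generation.]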
Direct Machine Translation
• Words are replaced using a dictionary
– Some amount of morphological processing
• Word reordering is limited
• Quality depends on the size of the dictionary and the closeness of the languages
Spanish : ajá quiero usar mi tarjeta de crédito
English : yeah I wanna use my credit card
Alignment : 1 3 4 5 7 0 6
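As a rough illustration of dictionary-based word replacement, the sketch below uses the Spanish-English example. It assumes the alignment row gives, for each Spanish word, the 1-based position of its English counterpart, with 0 meaning "unaligned"; the toy dictionary induced from it and the helper names are hypothetical, not part of any real system.

```python
# Spanish-English example from the slide.
spanish = "ajá quiero usar mi tarjeta de crédito".split()
english = "yeah I wanna use my credit card".split()
# Assumed reading: one entry per Spanish word, 1-based English position, 0 = unaligned.
alignment = [1, 3, 4, 5, 7, 0, 6]

def aligned_pairs(source, target, alignment):
    """Pair each source word with its aligned target word (None if unaligned)."""
    return [(word, target[pos - 1] if pos > 0 else None)
            for word, pos in zip(source, alignment)]

pairs = aligned_pairs(spanish, english, alignment)

# Toy bilingual dictionary induced from the alignment (hypothetical).
dictionary = {src: tgt for src, tgt in pairs if tgt is not None}

def direct_translate(words, dictionary):
    """Replace each word via dictionary lookup, keeping the source word order."""
    return [dictionary[w] for w in words if w in dictionary]

word_for_word = " ".join(direct_translate(spanish, dictionary))
print(word_for_word)  # yeah wanna use my card credit
```

The word-for-word output keeps the Spanish order ("card credit") and has no "I", which is why even closely related languages need some reordering and word insertion on top of the dictionary.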
English : I need to make a collect call
Japanese : 私は コレクト コールを かける 必要があります
Alignment : 1 5 0 3 0 2 4
Example-based MT
Translation-by-analogy:
a. A collection of source/target text pairs
b. A matching metric
c. A word- or phrase-level alignment
d. A method for recombination
ATR EBMT System (E. Sumita, H. Iida, 1991); CMU Pangloss EBMT (R. Brown, 1996)
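[Figure: EBMT pyramid from Source to Target: MATCHING (analysis) on the source side, ALIGNMENT (transfer) at the top, RECOMBINATION (generation) on the target side, and exact match (direct translation) along the base.]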
Example run of EBMT
English-Japanese Examples in the Corpus:
1. He buys a notebook → Kare wa noto o kau
2. I read a book on international politics → Watashi wa kokusai seiji nitsuite kakareta hon o yomu
Translation Input: He buys a book on international politics
Translation Output: Kare wa kokusai seiji nitsuite kakareta hon o kau
• Challenge: Finding a good matching metric
– He bought a notebook
– A book was bought
– I read a book on world politics
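To make the matching problem concrete, here is a minimal sketch that ranks the two corpus examples against an input by word overlap. The Jaccard score is an assumption chosen for simplicity; it is not the metric used by the ATR or Pangloss systems, and the function names are illustrative.

```python
# The two English-Japanese examples from the corpus above.
corpus = [
    ("He buys a notebook", "Kare wa noto o kau"),
    ("I read a book on international politics",
     "Watashi wa kokusai seiji nitsuite kakareta hon o yomu"),
]

def overlap(a, b):
    """Jaccard overlap between the word sets of two sentences (case-insensitive)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def rank_examples(query, corpus):
    """Return (score, source, target) triples, best match first."""
    scored = [(overlap(query, src), src, tgt) for src, tgt in corpus]
    return sorted(scored, reverse=True)

for query in ["He buys a book on international politics",
              "He bought a notebook",
              "A book was bought",
              "I read a book on world politics"]:
    score, src, _ = rank_examples(query, corpus)[0]
    print(f"{query!r}: best match {src!r} (score {score:.2f})")
```

A surface metric like this treats "bought" and "buys" as unrelated words and knows nothing about the passive in "A book was bought", which is the difficulty the three challenge sentences point at.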
NLP Pipeline: Beads on a String
Tokenization → Sentence Segmentation → Part-of-speech tagging → Named Entity Detection → Noun/Verb Chunking → Syntactic Parsing → Semantic Role Labeling → Word Sense Disambiguation → Co-reference resolution
NLP Pipeline: Sentence Segmentation
U.S. President lives in Washington D.C. He will travel to Florida this week.
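The example shows why splitting on periods is not enough: "U.S." and "D.C." both end in a period, but only one of them ends a sentence. The sketch below contrasts a naive splitter with a heuristic one; the abbreviation list is a hypothetical stand-in for whatever resource a real segmenter would use.

```python
import re

text = ("U.S. President lives in Washington D.C. "
        "He will travel to Florida this week.")

# Hypothetical list of abbreviations assumed never to end a sentence.
NON_FINAL_ABBREVIATIONS = {"U.S.", "Mr.", "Dr."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

def split_sentences(text):
    """Split at sentence-final punctuation, except after listed abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in NON_FINAL_ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive_split(text))      # breaks after "U.S."
print(split_sentences(text))  # two sentences
```

The heuristic only works here because "D.C." is not on the list; the same token can be sentence-final in one context and not in another, so a fixed list is at best a first approximation.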
President Bush will travel to Florida on February 20 2007 to meet with the CEO of AT&T
NLP Pipeline: Noun/Verb Chunking
President Bush will travel to Florida on February 20 2007 to meet with the CEO of AT&T
NLP Pipeline: Syntactic Parsing
$PERSON will travel to $PLACE on $DATE to meet with the $JOB of $ORG
[Figure: dependency tree]
will travel
  $PERSON
  to
    $PLACE
  on
    $DATE
  to meet
    with
      $JOB
        the
        of
          $ORG
NLP Pipeline: Semantic Role Labeling
[Figure: dependency tree annotated with semantic roles]
will travel
  $PERSON [ARG0]
  to $PLACE [ARGM-loc]
  on $DATE [ARGM-tmp]
NLP Pipeline: Word Sense Disambiguation
The man went to the bank to get some money
The man went to the bank to get some flowers
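One simple way to approach the two "bank" sentences is to compare the context words against a signature of typical words for each sense. The sketch below does that; the sense signatures are invented for illustration and the overlap count is a simplified, Lesk-style heuristic, not a full disambiguation method.

```python
# Hypothetical sense signatures for "bank" (words assumed typical of each sense).
SENSES = {
    "financial institution": {"money", "loan", "account", "deposit"},
    "river bank": {"river", "water", "flowers", "fishing"},
}

def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose signature shares the most words with the sentence."""
    context = set(sentence.lower().split())
    return max(senses, key=lambda sense: len(senses[sense] & context))

for sentence in ["The man went to the bank to get some money",
                 "The man went to the bank to get some flowers"]:
    print(sentence, "->", disambiguate(sentence))
```

With these signatures the first sentence resolves to "financial institution" and the second to "river bank"; a sentence sharing no words with either signature would fall back to the first sense listed, which a real system would need to handle more carefully.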
NLP Pipeline: Co-reference resolution
The U.S. President lives in Washington D.C.
He will return to the capital this week.
Syntactic Transfer-based Machine Translation
• Direct and Example-based approaches
– Two ends of a spectrum
– Recombination of fragments for better coverage.
• What if the matching/transfer is done at the syntactic parse level?
• Three Steps
– Parse: Syntactic parse of the source language sentence
  • Hierarchical representation of a sentence
– Transfer: Rules to transform the source parse tree into the target parse tree
  • Subject-Verb-Object → Subject-Object-Verb
– Generation: Regenerating the target language sentence from the parse tree
  • Morphology of the target language
• Tree structure provides better matching and longer-distance transformations than is possible in string-based EBMT.
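The three steps can be sketched on a single clause. The example below assumes the parse is already given as subject/verb/object roles (no real parser is involved), uses a toy lexicon taken from the earlier example "He buys a notebook" / "Kare wa noto o kau", and hard-codes the particles "wa" and "o" as a stand-in for target-language morphology. All names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    """A flat parse of a simple clause: one subject, one verb, one object."""
    subject: str
    verb: str
    obj: str

# Toy English-to-Japanese lexicon (romanized), taken from the earlier example.
LEXICON = {"He": "Kare", "buys": "kau", "a notebook": "noto"}

def transfer(clause):
    """Transfer step: map each constituent to the target language, keeping roles."""
    return Clause(LEXICON[clause.subject], LEXICON[clause.verb], LEXICON[clause.obj])

def generate_sov(clause):
    """Generation step: emit Subject-Object-Verb order with topic/object particles."""
    return f"{clause.subject} wa {clause.obj} o {clause.verb}"

source = Clause(subject="He", verb="buys", obj="a notebook")  # assumed parser output
print(generate_sov(transfer(source)))  # Kare wa noto o kau
```

Because the reordering is stated over roles rather than word positions, the same rule would apply however long the subject or object phrase is.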
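Examples of SynTran-MT
[Figure: parallel dependency trees, listed level by level. Spanish: quiero at the root; then ajá and usar; then mi and tarjeta; then de; then crédito. English: wanna at the root; then I, yeah and use; then my and card; then credit.]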
• Mostly parallel parse structures
• Might have to insert words: pronouns, morphological particles
Example of SynTran MT -2
• Pros:
– Allows for structure transfer
– Re-orderings are typically restricted to the parent-child nodes.
• Cons:
– Transfer rules are for each language pair (N² sets of rules)
– Hard to reuse rules when one of the languages is changed
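[Figure: parallel dependency trees. English, level by level: need at the root; then I and make; then to and call; then a and collect. Japanese: 必要があります (need), 私は (I), かける (make), コールを (call), コレクト (collect).]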
Interlingua-based Machine Translation
• Syntactic transfer-based MT
– Couples the syntax of the two languages
• What if we abstract away the syntax?
– All that remains is meaning
– Meaning is the same across languages
– Simplicity: Only N components needed to translate among N languages
• Two “small” problems:
– What is meaning?
– How do we represent meaning?
[Figure: English, Spanish and Japanese analyzers feed a shared interlingual representation, from which the English, Spanish and Japanese generators produce the output.]
Example of Interlingua Machine Translation
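[Figure: the English dependency tree (need; I, make; to, call; a, collect), a logical form over the predicates Need, Make and collect_call with arguments I, e1 and e2, the interlingua representation below, and the Japanese dependency tree: 必要があります (need), 私は (I), かける (make), コールを (call), コレクト (collect).]
Interlingua representation:
Event: Need
Tense: present
Agent: I
Theme:
  Event: Make
  Tense: Infinitive
  Agent: I
  Theme: call
    attributes: collect
    Definiteness: indef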
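The attribute-value frame for "I need to make a collect call" (a Need event whose theme is a Make event) can be written as a nested structure, with one small generator per target language reading from it. The sketch below is only illustrative: the key names, the tiny Japanese lexicon and both generators are assumptions that cover this one example, not a general interlingua.

```python
# Interlingua frame for "I need to make a collect call" (key names are assumed).
frame = {
    "event": "need", "tense": "present", "agent": "I",
    "theme": {
        "event": "make", "tense": "infinitive", "agent": "I",
        "theme": {"head": "call", "attributes": ["collect"], "definiteness": "indef"},
    },
}

def english(node, top=True):
    """English generator: subject-verb-object order; embedded agent left implicit."""
    if "event" in node:
        verb = node["event"] if node["tense"] == "present" else "to " + node["event"]
        subject = [node["agent"]] if top else []
        return " ".join(subject + [verb, english(node["theme"], top=False)])
    article = "a" if node["definiteness"] == "indef" else "the"
    return " ".join([article] + node["attributes"] + [node["head"]])

# Toy lexicon with particles already attached, as in the slide's Japanese example.
JA = {"I": "私は", "need": "必要があります", "make": "かける",
      "call": "コールを", "collect": "コレクト"}

def japanese(node, top=True):
    """Japanese generator: verb-final order, modifiers before the head noun."""
    if "event" in node:
        subject = [JA[node["agent"]]] if top else []
        return " ".join(subject + [japanese(node["theme"], top=False), JA[node["event"]]])
    return " ".join([JA[a] for a in node["attributes"]] + [JA[node["head"]]])

print(english(frame))   # I need to make a collect call
print(japanese(frame))  # 私は コレクト コールを かける 必要があります
```

Neither generator looks at the other language: adding a third language would mean adding one more generator (and analyzer) against the same frame.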
Probabilistic Direct Machine Translation
• Starting in the early 1990s, full circle back to the code-breaking paradigm of machine translation
– With a probabilistic twist
• What is it:
• If you want to translate from English to Japanese
– assume that the English text started out as a Japanese text
– but went through a noisy channel which changed it into English
• Goal is to recover the best (most probable) Japanese text
– J* = argmax_J P(J|E) = argmax_J P(E|J) · P(J)
– The second step is Bayes' rule: P(J|E) = P(E|J) · P(J) / P(E), and P(E) is the same for every candidate J, so it can be dropped.
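A minimal sketch of the argmax, assuming a short list of candidate Japanese sentences is already available together with a translation-model score P(E|J) and a language-model score P(J) for each. The probabilities below are made up for illustration; real systems estimate them from corpora and search a much larger candidate space.

```python
import math

def best_translation(candidates):
    """Return argmax_J P(E|J) * P(J), computed in log space for numerical stability.

    candidates maps each Japanese sentence J to a pair (P(E|J), P(J)).
    """
    return max(candidates,
               key=lambda j: math.log(candidates[j][0]) + math.log(candidates[j][1]))

# Made-up scores for the input E = "I need to make a collect call".
candidates = {
    # fluent word order: slightly lower channel score, much higher language-model score
    "私は コレクト コールを かける 必要があります": (0.20, 0.010),
    # scrambled word order: higher channel score, but unlikely as Japanese
    "私は コールを コレクト かける 必要があります": (0.25, 0.001),
}

print(best_translation(candidates))
```

With these numbers the fluent candidate wins (0.20 × 0.010 > 0.25 × 0.001): the language model P(J) rewards well-formed Japanese while P(E|J) rewards faithfulness to the English input.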