Computing & Information Sciences, Kansas State University
CIS 490 / 730: Artificial Intelligence
Lecture 40 of 42
Friday, 01 December 2006
William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course page: http://snipurl.com/v9v3
Course web site: http://www.kddresearch.org/Courses/Fall-2006/CIS730
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading for Next Class: Sections 22.1, 22.6-7, Russell & Norvig 2nd edition
NLP and Philosophical Issues
Discussion: Machine Translation (MT)
(Hidden) Markov Models: Review
Definition of Hidden Markov Models (HMMs)
• Stochastic state transition diagram (HMMs: states, aka nodes, are hidden)
• Compare: probabilistic finite state automaton (Mealy/Moore model)
• Annotated transitions (aka arcs, edges, links)
  – Output alphabet (the observable part)
  – Probability distribution over outputs
Forward Problem: One Step in ML Estimation
• Given: model h, observations (data) D
• Estimate: P(D | h)
Backward Problem: Prediction Step
• Given: model h, observations D
• Maximize: P(h(X) = x | h, D) for a new X
Forward-Backward (Learning) Problem
• Given: model space H, data D
• Find: h ∈ H such that P(h | D) is maximized (i.e., MAP hypothesis)
HMMs Also A Case of LSQ (f Values in [Roth, 1999])
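[Figure: HMM state transition diagram with states 1, 2, 3. Arcs carry transition probabilities (0.4, 0.5, 0.6, 0.8, 0.2, 0.5) and output distributions (A 0.4, B 0.6; A 0.5, G 0.3, H 0.2; E 0.1, F 0.9; E 0.3, F 0.7; C 0.8, D 0.2; A 0.1, G 0.9).]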
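The forward problem above, estimating P(D | h), can be sketched with the standard forward recursion. This is a minimal sketch, not the lecture's own code: the two-state model below is hypothetical (the arc layout of the slide's three-state diagram is not recoverable from the transcript), and it attaches outputs to states rather than to transitions for brevity. The names `forward_likelihood`, `INITIAL`, `TRANSITION`, and `EMISSION` are illustrative.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations | model) using the forward recursion."""
    if not observations:
        return 1.0  # the empty sequence has probability 1
    states = list(initial)
    # alpha[s] = P(o_1 .. o_t, state at time t = s)
    alpha = {s: initial[s] * emission[s].get(observations[0], 0.0) for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emission[s].get(symbol, 0.0)
            * sum(alpha[r] * transition[r].get(s, 0.0) for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model (assumption): probabilities chosen only for illustration.
INITIAL = {"s1": 1.0, "s2": 0.0}
TRANSITION = {"s1": {"s1": 0.4, "s2": 0.6}, "s2": {"s2": 1.0}}
EMISSION = {"s1": {"A": 0.4, "B": 0.6}, "s2": {"C": 0.8, "D": 0.2}}

likelihood = forward_likelihood(INITIAL, TRANSITION, EMISSION, ["A", "C"])
print(likelihood)  # approximately 0.192 = 0.4 * 0.6 * 0.8
```

The recursion sums over hidden state paths in time linear in the sequence length, instead of enumerating every path.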
NLP Hierarchy: Review
Problem Definition
• Given: m sentences containing untagged words
• Example: “The can will rust.”
• Label (one per word, out of ~30-150): vj s (art, n, aux, vi)
USC/Information Sciences Institute, USC/Computer Science Department
Spanish/English Parallel Corpora: Review
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Data for Statistical MT and data preparation
Ready-to-Use Online Bilingual Data
[Figure: chart of ready-to-use bilingual data by year. x-axis: year, 1994 to 2004; y-axis: millions of words (English side). Series: Chinese/English, Arabic/English, French/English.]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).
+ 1m-20m words for many language pairs
One Billion? ???
From No Data to Sentence Pairs
• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
  – Suppose one billion words of parallel data were sufficient
  – At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
  – De-formatting
  – Remove strange characters
  – Character code conversion
  – Document alignment
  – Sentence alignment
  – Tokenization (also called Segmentation)
Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Sentence Alignment
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
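The note above describes a post-processing step that can be sketched in a few lines. This sketch assumes an aligner has already produced "beads", each pairing a group of English sentence indices with a group of Spanish sentence indices. The bead list below is written by hand to match the slide; finding the beads automatically is the hard part and is not shown. The names `beads_to_pairs` and `BEADS` are hypothetical.

```python
def beads_to_pairs(source, target, beads):
    """Turn alignment beads into sentence pairs.

    Beads with an empty side are unaligned and are thrown out;
    n-to-m beads are merged by joining their sentences.
    """
    pairs = []
    for source_ids, target_ids in beads:
        if not source_ids or not target_ids:
            continue  # unaligned sentence: discard
        merged_source = " ".join(source[i] for i in source_ids)
        merged_target = " ".join(target[j] for j in target_ids)
        pairs.append((merged_source, merged_target))
    return pairs


ENGLISH = [
    "The old man is happy.",
    "He has fished many times.",
    "His wife talks to him.",
    "The fish are jumping.",
    "The sharks await.",
]
SPANISH = [
    "El viejo está feliz porque ha pescado muchas veces.",
    "Su mujer habla con él.",
    "Los tiburones esperan.",
]
# Hand-written beads (assumption): a 2-to-1, a 1-to-1, a 1-to-0, and a 1-to-1.
BEADS = [((0, 1), (0,)), ((2,), (1,)), ((3,), ()), ((4,), (2,))]

pairs = beads_to_pairs(ENGLISH, SPANISH, BEADS)
for english, spanish in pairs:
    print(english, "|||", spanish)
```

The result is three sentence pairs: "The fish are jumping." has no Spanish counterpart and is dropped, and the first two English sentences are merged to match the single Spanish sentence.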
• Testing in an application that uses MT as one sub-component
  – Question answering from foreign language documents
• Automatic:
  – WER (word error rate)
  – BLEU (Bilingual Evaluation Understudy)
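Word error rate is commonly computed as the word-level edit distance between the machine output and a reference, divided by the reference length. A minimal sketch under that assumption, with whitespace tokenization and a single reference (the function name `word_error_rate` is illustrative):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("I am so hungry", "I am hungry"))  # 0.25: one deleted word out of four
```

Because WER compares against one fixed word order, a fluent translation that is phrased differently from the reference can still score poorly.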
BLEU Evaluation Metric (Papineni et al., ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
• N-gram precision (score is between 0 & 1)
  – What percentage of machine n-grams can be found in the reference translation?
  – An n-gram is a sequence of n words
  – Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
• Brevity penalty
  – Can’t just type out single word “the” (precision 1.0!)
*** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)
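The two ingredients named above, clipped n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified sentence-level sketch, not a reference implementation: BLEU as published aggregates counts over a whole test corpus, and the tokenization, the lack of smoothing, and the function names here are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, with clipping."""
    candidate_counts = ngrams(candidate, n)
    if not candidate_counts:
        return 0.0
    # An n-gram may be credited at most as often as it occurs in any one reference.
    max_reference = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_reference[gram] = max(max_reference[gram], count)
    clipped = sum(min(count, max_reference[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(modified_precision("the the the the the".split(), [reference], 1))  # 0.4, clipped to 2 of 5
print(brevity_penalty(["the"], [reference]))  # exp(-5): the one-word answer is penalized
```

Clipping defeats the "the the the the the" trick, and the brevity penalty defeats the single-word "the" trick, which matches the two cheats the slide rules out.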
Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
BLEU Tends to Predict Human Judgments
[Figure: scatter plot with linear fits. x-axis: Human Judgments, -2.5 to 2.5; y-axis: NIST Score (variant of BLEU), -2.5 to 2.5. Series: Adequacy, Fluency, Linear (Adequacy), Linear (Fluency). Reported fits: R2 = 88.0% and R2 = 90.2%.]
slide from G. Doddington (NIST)
Word-Based Statistical MT
Statistical MT Systems
[Diagram: Spanish -> Broken English -> English. Statistical analysis of Spanish/English bilingual text supports the first step; statistical analysis of English text supports the second.]
Que hambre tengo yo -> What hunger have I, Hungry I am so, I am so hungry, Have I that hunger … -> I am so hungry
Statistical MT Systems
[Diagram: Spanish -> Broken English -> English, with each component named.]
• Translation Model P(s|e), from statistical analysis of Spanish/English bilingual text
• Language Model P(e), from statistical analysis of English text
• Decoding algorithm: argmax over e of P(e) * P(s|e)
Que hambre tengo yo -> I am so hungry
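The decoding rule follows from Bayes' rule. A short derivation in the slide's notation, where s is the Spanish input and e ranges over English strings; the hat notation for the chosen translation is added here for convenience:

```latex
\hat{e}
  = \arg\max_{e} P(e \mid s)
  = \arg\max_{e} \frac{P(e)\, P(s \mid e)}{P(s)}
  = \arg\max_{e} P(e)\, P(s \mid e)
```

The denominator P(s) does not depend on e, so it can be dropped without changing which e wins.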
Three Problems for Statistical MT
• Language model
  – Given an English string e, assigns P(e) by formula
  – good English string -> high P(e)
  – random word sequence -> low P(e)
• Translation model
  – Given a pair of strings <f,e>, assigns P(f | e) by formula
  – <f,e> look like translations -> high P(f | e)
  – <f,e> don’t look like translations -> low P(f | e)
• Decoding algorithm
  – Given a language model, a translation model, and a new sentence f, find translation e maximizing P(e) * P(f | e)
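How the three pieces fit together can be sketched by rescoring a fixed candidate list. Every number below is made up for illustration, and a real decoder searches a very large space instead of enumerating four strings. The candidates are the ones from the "Que hambre tengo yo" example; `LANGUAGE_MODEL`, `TRANSLATION_MODEL`, and `decode` are hypothetical names.

```python
import math

# Hypothetical P(e): fluent English gets a much higher probability.
LANGUAGE_MODEL = {
    "What hunger have I": 1e-9,
    "Hungry I am so": 1e-8,
    "I am so hungry": 1e-5,
    "Have I that hunger": 1e-9,
}
# Hypothetical P(f | e) for f = "Que hambre tengo yo": the literal
# word-for-word candidate scores highest under the translation model alone.
TRANSLATION_MODEL = {
    "What hunger have I": 0.02,
    "Hungry I am so": 0.01,
    "I am so hungry": 0.01,
    "Have I that hunger": 0.015,
}


def decode(candidates, language_model, translation_model):
    """Return the candidate e maximizing P(e) * P(f | e), computed in log space."""
    return max(
        candidates,
        key=lambda e: math.log(language_model[e]) + math.log(translation_model[e]),
    )


candidates = list(LANGUAGE_MODEL)
best = decode(candidates, LANGUAGE_MODEL, TRANSLATION_MODEL)
literal = max(candidates, key=TRANSLATION_MODEL.get)
print(best)     # I am so hungry
print(literal)  # What hunger have I
```

With these numbers the translation model alone prefers the literal rendering, and the language model pulls the decision toward the fluent one.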
The Classic Language Model: Word N-Grams
Goal of the language model -- choose among:
• He is on the soccer field / He is in the soccer field
• Is table the on cup the / The cup is on the table
• Rice shrine / American shrine / Rice company / American company
The Classic Language Model: Word N-Grams
Generative approach:
  w1 = START
  repeat until END is generated:
    produce word w2 according to a big table P(w2 | w1)
    w1 := w2
Possible English translations, to be rescored by language model
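The generative loop above, and the rescoring use of the same table, can be sketched with a tiny bigram model. The table `BIGRAMS` is hypothetical and far smaller than the "big table" a real system would estimate from English text; `START`, `END`, `generate`, and `probability` are illustrative names.

```python
import random

START, END = "<s>", "</s>"

# Hypothetical bigram table P(w2 | w1); each row sums to 1.
BIGRAMS = {
    START: {"the": 1.0},
    "the": {"cup": 0.5, "table": 0.5},
    "cup": {"is": 1.0},
    "is": {"on": 1.0},
    "on": {"the": 1.0},
    "table": {END: 1.0},
}


def generate(table, rng):
    """Sample words from P(w2 | w1) until END is generated."""
    words = []
    w1 = START
    while True:
        choices = table[w1]
        w2 = rng.choices(list(choices), weights=list(choices.values()))[0]
        if w2 == END:
            return words
        words.append(w2)
        w1 = w2


def probability(table, sentence):
    """Score a sentence as the product of its bigram probabilities."""
    p = 1.0
    w1 = START
    for w2 in sentence.split() + [END]:
        p *= table.get(w1, {}).get(w2, 0.0)  # unseen bigram: probability 0
        w1 = w2
    return p


print(probability(BIGRAMS, "the cup is on the table"))  # 0.25
print(probability(BIGRAMS, "is table the on cup the"))  # 0.0
print(" ".join(generate(BIGRAMS, random.Random(0))))
```

The zero score for the scrambled sentence comes from unseen bigrams. Practical n-gram models smooth the table so that unseen sequences receive a small nonzero probability.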
Decoding for “Classic” Models
• Of all conceivable English word strings, find the one maximizing P(e) x P(f | e)
• Decoding is an NP-complete challenge (Knight, 1999)
• Several search strategies are available
• Each potential English output is called a hypothesis.
The Classic Results
• la politique de la haine . (Foreign Original)
  politics of hate . (Reference Translation)
  the policy of the hatred . (IBM4+N-grams+Stack)
• nous avons signé le protocole . (Foreign Original)
  we did sign the memorandum of agreement . (Reference Translation)
  we have signed the protocol . (IBM4+N-grams+Stack)
• où était le plan solide ? (Foreign Original)
  but where was the solid plan ? (Reference Translation)
  where was the economic base ? (IBM4+N-grams+Stack)
the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and
Flaws of Word-Based MT
• Multiple English words for one French word
  – IBM models can do one-to-many (fertility) but not many-to-one
• Phrasal Translation
  – “real estate”, “note that”, “interest in”
• Syntactic Transformations
  – Verb at the beginning in Arabic
  – Translation model penalizes any proposed re-ordering
  – Language model not strong enough to force the verb to move to the right place
Phrase-Based Statistical MT
Summary
• Phrase-based models are state-of-the-art
  – Word alignments
  – Phrase pair extraction & probabilities
  – N-gram language models
  – Beam search decoding
  – Feature functions & learning weights
• But the output is not English
  – Fluency must be improved
  – Better translation of person names, organizations, locations
  – More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
  – Need good accuracy across a variety of domains/languages
Available Resources
• Bilingual corpora
  – 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
  – Lots of French/English, Spanish/French/English, LDC
  – European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
  – 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI