CS 479, section 1: Natural Language Processing
Lecture #35: Word Alignment Models (cont.)
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
Content by Eric Ringger, partially based on earlier slides from Dan Klein of U.C. Berkeley.
Reminder: No presentation, unless you really want to give one!
Check the schedule.
Plan enough time to succeed!
Don’t get or stay blocked.
Get your questions answered early.
Get the help you need to keep moving forward.
No late work accepted after the last day of instruction.
Announcements (2)
Project Report:
Early: Wednesday
Due: Friday
Homework 0.4:
Due: today
Reading Report #14 (Phrase-based MT paper):
Due: next Monday (online again)
EM Revisited
1. What are the four steps of the Expectation Maximization (EM) algorithm? Think of document clustering and/or training IBM Model 1!
2. What are the two primary purposes of EM?
Objectives
Observe problems with IBM Model 1
Model ordering issues, leading to IBM Model 2!
“Monotonic Translation”
Le Japon secoué par deux nouveaux séismes
Japan shaken by two new quakes
NULL (the empty English word to which otherwise-unaligned French words attach)
How would you implement a monotone decoder? (to translate the French)
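One possible answer, as a minimal sketch in Python. The layout of the translation table (`t_table[e][f]` holding $t(f \mid e)$) and the `lm` scoring callable are illustrative assumptions, not from the original slides:

```python
def greedy_monotone_decode(french, t_table, lm):
    """Translate a French sentence word by word, left to right.

    t_table[e][f] holds t(f | e), the Model 1 translation probability.
    lm(word, context) returns a language-model probability for `word`
    given the English words chosen so far.
    """
    english = []
    for f in french:
        best_e, best_score = None, 0.0
        for e, row in t_table.items():
            # Noisy-channel score: P(f | e) * P(e | context)
            score = row.get(f, 0.0) * lm(e, english)
            if score > best_score:
                best_e, best_score = e, score
        english.append(best_e if best_e is not None else f)  # pass unknowns through
    return english
```

A Viterbi decoder would instead keep the best-scoring partial translation for each language-model state at each position, rather than committing greedily word by word.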
MT System
You could now build a simple MT system using:
English language model
English-to-French alignment model (IBM Model 1)
Canadian Hansard data
Monotone decoder (greedy or Viterbi)
IBM Model 1
Source: English sentence $e = (e_1, \ldots, e_I)$, plus the NULL word $e_0$
Target: French sentence $f = (f_1, \ldots, f_J)$
Alignment vector: $a = (a_1, \ldots, a_J)$, where each $a_j \in \{0, 1, \ldots, I\}$ names the English position that generates $f_j$
$$\hat{P}(f, a \mid e) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} t(f_j \mid e_{a_j})$$
$$\hat{P}(f \mid e) = \sum_a \hat{P}(f, a \mid e)$$
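To make these formulas concrete, here is a sketch with a toy dictionary-of-dictionaries t-table (all names illustrative). It computes $\hat{P}(f, a \mid e)$ for one alignment vector and $\hat{P}(f \mid e)$ by brute-force summation over all $(I+1)^J$ alignments:

```python
from itertools import product

def p_f_a_given_e(french, alignment, english, t):
    """P(f, a | e) under Model 1.  english[0] is the NULL word;
    alignment[j] gives the English position a_j for French position j."""
    I, J = len(english) - 1, len(french)
    p = (1.0 / (I + 1)) ** J            # uniform prior over alignments
    for j, f in enumerate(french):
        p *= t.get(english[alignment[j]], {}).get(f, 0.0)
    return p

def p_f_given_e(french, english, t):
    """P(f | e): marginalize out the alignment by summing over all vectors."""
    I = len(english) - 1
    return sum(p_f_a_given_e(french, a, english, t)
               for a in product(range(I + 1), repeat=len(french)))
```

Because Model 1 scores each French position independently, the sum actually factorizes as $\prod_j \frac{1}{I+1} \sum_i t(f_j \mid e_i)$; the brute-force version above is only for checking the math on toy examples.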
One-to-Many Alignments
But there are other problems to consider, as the following examples show:
Problem: Many-to-One Alignments
Problem: Many-to-Many Alignments
Problem: Local Order Change
Le Japon est au confluent de quatre plaques tectoniques
Japan is at the junction of four tectonic plates
“Distortions”
Problem: More Distortions
Le tremblement de terre a fait 39 morts et 3,183 blessés.
The earthquake killed 39 and wounded 3,183.
Insights
How to include “distortion” in the model?
How to prefer nearby distortions over long-distance distortions?
IBM Model 2
Reminder: Model 1:
$$\hat{P}(f, a \mid e) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} t(f_j \mid e_{a_j})$$
Could model distortions, without any strong assumptions about where they occur, as a distribution over target language positions:
Could build a model as a distribution over distortion distances (see the formulas below):
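In standard Model 2 notation (the exact symbols on the original slide are not recoverable, so these are the usual textbook forms), the two options look like:

$$P(a_j = i \mid j, I, J) = q(i \mid j, I, J) \qquad \text{(a full table over positions)}$$

$$P(a_j = i \mid j, I, J) \propto c\!\left(i - \left\lfloor j \cdot \tfrac{I}{J} \right\rfloor\right) \qquad \text{(a table over distances from the diagonal)}$$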
Matrix View of an Alignment
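(The original slide showed a figure. As a rough text rendering of the idea, using one plausible alignment of the earlier example: one row per English word, one column per French word, and a mark at cell $(i, j)$ wherever $a_j = i$.)

```
         Le  Japon  secoué  par  deux  nouveaux  séismes
NULL      *
Japan          *
shaken                *
by                            *
two                                 *
new                                        *
quakes                                              *
```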
Preference for the Diagonal
But alignments for some language pairs tend toward the diagonal in general: can use a normal distribution for the distortion model.
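One common choice (the slide does not give the exact form; this is a hedged reconstruction) centers a Gaussian on the position a perfectly diagonal alignment would predict:

$$q(i \mid j, I, J) \propto \exp\!\left(-\frac{\left(i - j \cdot \frac{I}{J}\right)^2}{2\sigma^2}\right)$$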
EM for Model 2
Model 2 Parameters:
Translation probabilities: $t(f \mid e)$
Distortion parameters: $q(i \mid j, I, J)$
Initialize $t(f \mid e)$ with Model 1; initialize $q(i \mid j, I, J)$ as uniform.
E-step: For each pair of sentences $(f, e)$:
For each French position $j$:
1. Calculate posterior over English positions $i$:
$$P(a_j = i \mid f, e) = \frac{t(f_j \mid e_i)\, q(i \mid j, I, J)}{\sum_{i'=0}^{I} t(f_j \mid e_{i'})\, q(i' \mid j, I, J)}$$
2. Increment count of word $f_j$ with word $e_i$ by these amounts:
$$C(f_j, e_i) \mathrel{+}= P(a_j = i \mid f, e)$$
3. Similarly, for each English position $i$, update the distortion counts:
$$C(i \mid j, I, J; f, e) \mathrel{+}= P(a_j = i \mid f, e)$$
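A sketch of this E-step in Python, under illustrative assumptions (a `corpus` of `(french, english)` sentence pairs with `english[0]` as NULL; plain nested dictionaries for the parameter tables; a small probability floor standing in for proper smoothing):

```python
from collections import defaultdict

def e_step(corpus, t, q):
    """Accumulate Model 2 expected counts over one pass of the corpus.
    t[e][f] = t(f | e);  q[(j, I, J)][i] = q(i | j, I, J)."""
    count_t = defaultdict(lambda: defaultdict(float))   # C(f_j, e_i)
    count_q = defaultdict(lambda: defaultdict(float))   # C(i | j, I, J)
    for french, english in corpus:
        I, J = len(english) - 1, len(french)
        for j, f in enumerate(french, start=1):
            # 1. Posterior over English positions i for French position j;
            #    missing q entries default to uniform (the initialization).
            scores = [t.get(english[i], {}).get(f, 1e-9) *
                      q.get((j, I, J), {}).get(i, 1.0 / (I + 1))
                      for i in range(I + 1)]
            z = sum(scores)
            for i in range(I + 1):
                posterior = scores[i] / z
                # 2. Fractional count for the word pair (f_j, e_i)
                count_t[english[i]][f] += posterior
                # 3. Fractional count for the distortion event (i | j, I, J)
                count_q[(j, I, J)][i] += posterior
    return count_t, count_q
```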
EM for Model 2 (cont.)
M-step:
Re-estimate $q(i \mid j, I, J)$ by normalizing the distortion counts: one conditional distribution for each context $(j, I, J)$.
Re-estimate $t(f \mid e)$ by normalizing the earlier word-pair counts: one conditional distribution per English word $e$.
Iterate until convergence of the data likelihood, or for a handful of iterations.
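Continuing the same sketch, the M-step is just row-wise normalization of the two count tables:

```python
def normalize(table):
    """Normalize each row of a nested count table into a distribution."""
    out = {}
    for key, row in table.items():
        total = sum(row.values())
        out[key] = {k: v / total for k, v in row.items()}
    return out

def m_step(count_t, count_q):
    """One conditional distribution per English word e, and one per
    distortion context (j, I, J)."""
    return normalize(count_t), normalize(count_q)
```

Alternating `e_step` and `m_step` for a fixed number of iterations, or until the corpus log-likelihood stops improving, completes the training loop.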
See the directions for Project #5 on the course wiki for a more detailed version of this EM algorithm, including implementation tips.