Introduction to Natural Language Processing (600.465)
Statistical Machine Translation
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic
The Main Idea
• Treat translation as a noisy channel problem:
Input (source): E: English words... → The channel (adds “noise”) → Output (target): F: Les mots Anglais...
• The Model: P(E|F) = P(F|E) · P(E) / P(F)
• Interested in rediscovering E given F:
– after the usual simplification (P(F) is fixed):
argmax_E P(E|F) = argmax_E P(F|E) · P(E) !
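A minimal sketch of this decision rule in Python, with tiny made-up probability tables (the sentences and numbers are purely illustrative):

```python
# toy stand-ins for a real LM P(E) and TM P(F|E)
lm = {"the program": 0.4, "program the": 0.1}
tm = {("le programme", "the program"): 0.5,
      ("le programme", "program the"): 0.5}

def best_translation(f, candidates):
    # argmax_E P(F|E) * P(E); P(F) is constant w.r.t. E and can be dropped
    return max(candidates, key=lambda e: tm.get((f, e), 0.0) * lm.get(e, 0.0))

print(best_translation("le programme", ["the program", "program the"]))
# -> "the program" (the LM breaks the tie left by the TM)
```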
The Necessities
• Language Model (LM): P(E)
• Translation Model (TM): target given source, P(F|E)
• Search procedure:
– Given F, find the best E using the LM and TM distributions.
• Usual problem: sparse data
– We cannot create a “sentence dictionary” E ↔ F.
– Typically, we do not see a sentence even twice!
The Language Model
• Any LM will do:
– 3-gram LM
– 3-gram class-based LM
– decision tree LM with hierarchical classes
• Does not necessarily operate on word forms:
– cf. the “analysis” and “generation” procedures later
– for simplicity, imagine for now that it does operate on word forms
The Translation Models
• Do not care about correct strings of English words (that’s the task of the LM)
• Therefore, we can make more independence assumptions:
– to start, use the “tagging” approach:
• 1 English word (“tag”) ~ 1 French word (“word”)
– not realistic: rarely is even the number of words the same in both sentences (let alone a 1:1 correspondence!)
• use “Alignment” instead.
The Alignment
• e0 And the program has been implemented (English positions 0-6, e0 the empty word)
• f0 Le programme a été mis en application (French positions 0-7, f0 the empty word)
• Linear notation (each word annotated with the position(s) it links to):
• f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
• e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)
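The same alignment as plain data, in one possible (hypothetical) encoding; reproducing the French linear notation is then a one-liner:

```python
# a[j] = i: the j-th French word links to the i-th English word
# (position 0 is the empty word e0/f0)
E = ["e0", "And", "the", "program", "has", "been", "implemented"]
F = ["f0", "Le", "programme", "a", "été", "mis", "en", "application"]
a = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 6, 7: 6}

print(" ".join(f"{F[j]}({i})" for j, i in a.items()))
# Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
```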
Alignment Mapping
• In general:
– |F| = m, |E| = l (sentence lengths):
• l·m possible connections (each French word to any English word)
• 2^(l·m) different alignments for any pair (E,F) (any subset of connections)
• In practice:
– from English to French:
• each English word has 1-n connections (n: empirical maximum fertility?)
• each French word has exactly 1 connection
– therefore, “only” (l+1)^m alignments (≪ 2^(l·m))
• a_j = i (the link from the j-th French word goes to the i-th English word)
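A quick numeric check of the two counts for the example pair above (l = 6 English words, m = 7 French words):

```python
l, m = 6, 7
print(2 ** (l * m))     # 4398046511104: every subset of the l*m connections
print((l + 1) ** m)     # 823543: exactly one link per French word (incl. e0)
```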
Elements of Translation Model(s)
• Basic distribution:
– P(F,A,E): the joint distribution of the English sentence, the alignment, and the French sentence (of length m)
• Interested also in marginal distributions:
P(F,E) = Σ_A P(F,A,E)
P(F|E) = P(F,E) / P(E) = Σ_A P(F,A,E) / Σ_{A,F} P(F,A,E) = Σ_A P(F,A|E)
• Useful decomposition (one of several possible):
P(F,A|E) = P(m|E) · ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^{j}, f_1^{j-1}, m, E)
Decomposition
• Decomposition formula again:
P(F,A|E) = P(m|E) · ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^{j}, f_1^{j-1}, m, E)
– m: the length of the French sentence
– a_j: the alignment link (single connection) going from the j-th French word
– f_j: the j-th French word of F
– a_1^{j-1}: the sequence of links a_1..a_{j-1}, up to the word preceding f_j
– a_1^{j}: the sequence of links a_1..a_j, up to and including the word f_j
– f_1^{j-1}: the sequence of French words up to the word preceding f_j
Decomposition and the Generative Model
• ...and again:
P(F,A|E) = P(m|E) · ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^{j}, f_1^{j-1}, m, E)
• Generate (a sampling sketch follows this list):
– first, the length of the French sentence, given the English words E;
– then, the link from the first position in F (not knowing the actual French word yet); now we know the English word it points to;
– then, given the link (and thus the English word), the French word at the current position;
– then, move to the next position in F, until all m positions are filled.
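A toy sketch of this generative story. Since the slide does not fix the full conditionals, it uses Model 1's simplifications (uniform links) and takes m as given rather than sampling it; the p(f|e) table is made up:

```python
import random

def generate_french(E, p_f_given_e, m):
    # E[0] is the empty word e0; links are drawn uniformly (Model 1 style)
    f = []
    for _ in range(m):                      # fill French positions 1..m
        a_j = random.randrange(len(E))      # 1) draw the link a_j in 0..l
        e = E[a_j]                          # ...which fixes the English word
        words, probs = zip(*p_f_given_e[e].items())
        f.append(random.choices(words, probs)[0])   # 2) draw f_j ~ p(.|e)
    return f

# toy usage with a made-up table
table = {"e0": {"en": 1.0}, "the": {"le": 0.9, "la": 0.1}}
print(generate_french(["e0", "the"], table, m=2))
```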
Approximations
• Still too many parameters:
– a similar situation to an n-gram model with “unlimited” n
– impossible to estimate reliably.
• Use 5 models, from the simplest to the most complex (i.e., from heavy independence assumptions to light ones).
• Parameter estimation: estimate the parameters of Model 1; use them as the initial estimate for estimating Model 2's parameters; etc.
Model 1
• Approximations:
– the French length P(m|E) is constant (a small ε)
– the alignment link distribution P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) depends only on the English length l (= 1/(l+1))
– the French word distribution depends only on the English and French words connected by the link a_j.
• Model 1 distribution:
P(F,A|E) = ε / (l+1)^m · ∏_{j=1..m} p(f_j | e_{a_j})
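The Model 1 formula transcribed directly; ε and the p(f|e) table would come from training, and the function name and default ε here are assumptions:

```python
def model1_prob(F, E, a, p, epsilon=0.1):
    # E[0] is the empty word, so l excludes it; a[j] = i links f_j to e_i
    l, m = len(E) - 1, len(F)
    prob = epsilon / (l + 1) ** m           # P(m|E) and the uniform link terms
    for j, f in enumerate(F, start=1):
        prob *= p.get((f, E[a[j]]), 0.0)    # p(f_j | e_{a_j})
    return prob
```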
Models 2-5
• Model 2:
– adds more detail to P(a_j|...): more “vertical” links are preferred
• Model 3:
– adds “fertility”: the number of links for a given English word is explicitly modeled, P(n|e_i)
– “distortion” replaces the alignment probabilities from Model 2
• Model 4:
– the notion of “distortion” extended to chunks of words
• Model 5: Model 4, but not deficient (does not waste probability mass on non-strings)
The Search Procedure
• “Decoder”:
– given the “output” (French), discover the “input” (English)
• The translation model goes in the opposite direction: p(f|e) = ...
• Naive methods do not work.
• Possible solution (roughly; sketched below):
– generate English words one by one, keeping only an n-best list (variable n); also, account for the different lengths of the English sentence candidates!
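A very rough sketch of that n-best idea, not the actual IBM stack-search decoder: grow English hypotheses word by word, prune to the n best, and keep comparing hypotheses of different lengths. The score function, vocabulary, and all names here are hypothetical placeholders:

```python
import heapq

def decode(F, vocab, score, n=10, max_len=10):
    # score(E, F): any combination of LM and TM, e.g. log P(E) + log P(F|E)
    beams = [((), float("-inf"))]            # (partial English tuple, score)
    best = ((), float("-inf"))
    for _ in range(max_len):
        expanded = [(E + (w,), score(E + (w,), F))
                    for E, _ in beams for w in vocab]
        beams = heapq.nlargest(n, expanded, key=lambda h: h[1])
        best = max([best] + beams, key=lambda h: h[1])   # across lengths
    return best[0]
```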
Analysis - Translation - Generation (ATG)
• Word forms: too sparse.
• Use four basic analysis/generation steps:
– tagging
– lemmatization
– word-sense disambiguation
– noun-phrase “chunks” (non-compositional translations)
• Translation proper:
– use chunks as “words”
Training vs. Test with ATG
• Training:
– analyze both languages using all four analysis steps
– train the TM(s) on the result (i.e., on chunks, tags, etc.)
– train the LM on the analyzed source (English)
• Runtime/Test:
– analyze the given (French) sentence using the identical tools as in training
– translate using the trained Translation/Language model(s)
– generate the source (English), reversing the analysis process
Analysis: Tagging and Morphology
• Replace word forms by morphologically processed text:
– lemmas
– tags
• Original approach: mix them into the text, call them “words”
– e.g. She bought two books. → she buy VBP two book NNS .
• Tagging: yes
– but in reversed order: tag first, then lemmatize [NB: does not work for inflective languages]
– technically easy
• Hand-written deterministic rules for tag+form → lemma
Word Sense Disambiguation, Word Chunking
• Sets of senses for each E, F word:
– e.g., book-1, book-2, ..., book-n
– prepositions (de-1, de-2, de-3, ...), many others
• Senses derived automatically using the TM:
– translation probabilities measured on senses: p(de-3|from-5)
• Result:
– a statistical model for assigning senses monolingually based on context (a MaxEnt model is also used here, one per word)
• Chunks: group words for non-compositional translation
Generation
• Inverse of analysis.
• Much simpler:
– chunks → words (lemmas) with senses (trivial)
– words (lemmas) with senses → words (lemmas) (trivial)
– words (lemmas) + tags → word forms
• Additional step:
– source-language ambiguity:
• electric vs. electrical, hath vs. has, you vs. thou: treated as a single unit in translation proper, but must be disambiguated at the end of the generation phase, using an additional pure LM on word forms.
Introduction to Natural Language Processing (600.465)
Statistical Translation: Alignment and Parameter Estimation
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic
Alignment
• Available corpus assumed:
– parallel text (translation E ↔ F)
• No alignment present (day marks only)!
• Sentence alignment:
– sentence boundary detection
– sentence alignment
• Word alignment:
– tokenization
– word alignment (with restrictions)
Sentence Boundary Detection
• Rules, lists:
– sentence breaks:
• paragraphs (if marked)
• certain characters: ?, !, ; (...almost sure)
• The Problem: the period “.”
– could be the end of a sentence (... left yesterday. He was heading to...)
– decimal point: 3.6 (three point six)
– thousands separator: 3.200 (three thousand two hundred)
– abbreviations that never end a sentence: cf., e.g., Calif., Mt., Mr.
– ellipsis: ...
– other languages: ordinal number indication (2nd ~ 2.)
– initials: A. B. Smith
• Statistical methods: e.g., Maximum Entropy
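A rule-based sketch of the period problem above; the patterns and the abbreviation list are illustrative, not a complete solution (a statistical method would replace these rules):

```python
import re

ABBREV = {"cf.", "e.g.", "Calif.", "Mt.", "Mr."}   # tiny illustrative list

def split_sentences(text):
    out, start = [], 0
    for m in re.finditer(r"[.?!;]", text):
        i = m.end()
        token = text[:i].split()[-1]                 # word ending at the mark
        if token in ABBREV:                          # abbreviation: no break
            continue
        if re.match(r"\d", text[i:i+1]):             # 3.6 / 3.200: no break
            continue
        if re.fullmatch(r"[A-Z]\.", token):          # initial (A. B. Smith)
            continue
        out.append(text[start:i].strip())
        start = i
    if text[start:].strip():
        out.append(text[start:].strip())
    return out

print(split_sentences("He left for Calif. yesterday. It cost 3.6 million!"))
# -> ['He left for Calif. yesterday.', 'It cost 3.6 million!']
```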
Sentence Alignment
• The Problem: only sentence boundaries are detected, separately in E and F.
• Desired output: a segmentation with an equal number of segments, spanning the whole text continuously; the original sentence boundaries are kept.
• Alignments obtained (in the slide's example): 2-1, 1-1, 1-1, 2-2, 2-1, 0-1
• The new segments are called “sentences” from now on.
Alignment Methods
• Several methods (probabilistic and not):
– character-length based
– word-length based
– “cognates” (word identity used):
• using an existing dictionary (F: prendre ~ E: make, take)
• using word “distance” (similarity): names, numbers, borrowed words, Latin-origin words, ...
• Best performing:
– statistical, word- or character-length based (perhaps with some words as additional clues)
Length-based Alignment
• First, define the problem probabilistically:
argmax_A P(A|E,F) = argmax_A P(A,E,F) (E,F fixed)
• Define a “bead”: a group of mutually aligned sentences (the slide's diagram shows a 2:2 bead).
• Approximate:
P(A,E,F) ≅ ∏_{i=1..n} P(B_i),
where B_i is a bead; P(B_i) does not depend on the rest of E,F.
The Alignment Task
• Given the model definition
P(A,E,F) ≅ ∏_{i=1..n} P(B_i),
find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data.
• Define B_i = p:q_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2}
– describes the type of alignment
• Want to use some sort of dynamic programming:
• Define Pref(i,j) ... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j).
Recursive Definition
• Initialize: Pref(0,0) = 1.
• Pref(i,j) = max (
Pref(i,j-1) · P(0:1_k), Pref(i-1,j) · P(1:0_k), Pref(i-1,j-1) · P(1:1_k),
Pref(i-1,j-2) · P(1:2_k), Pref(i-2,j-1) · P(2:1_k), Pref(i-2,j-2) · P(2:2_k) )
• This is enough for a Viterbi-like search (sketched below).
[Diagram: the (i,j) grid over E (index i) and F (index j), showing the six predecessor cells of (i,j), one per bead type.]
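A sketch of the Pref recursion as a table fill. Backpointers for recovering the actual beads are omitted for brevity, and p_bead(p, q, i, j) stands in for the bead probability P(p:q_k) defined on the next slide:

```python
BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_prob(n_e, n_f, p_bead):
    # Pref[i, j]: best-alignment probability over the first i English
    # and j French sentences; cells are filled in increasing (i, j) order
    pref = {(0, 0): 1.0}
    for i in range(n_e + 1):
        for j in range(n_f + 1):
            if (i, j) == (0, 0):
                continue
            pref[i, j] = max(
                (pref[i - p, j - q] * p_bead(p, q, i, j)
                 for p, q in BEAD_TYPES if i >= p and j >= q),
                default=0.0)
    return pref[n_e, n_f]
```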
Probability of a Bead
• It remains to define P(p:q_k) (the bead factors in the recursion above):
– k refers to the “next” bead, with segments of p and q sentences, of lengths l_{k,e} and l_{k,f}.
• Use a normal distribution for the length variation:
P(p:q_k) = P(l_{k,e}, l_{k,f}, μ, σ², p:q) ≅ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) · P(p:q)
δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} − μ·l_{k,e}) / √(l_{k,e}·σ²)
• Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.
• Words etc. might be used as better clues in the definition of P(p:q_k).
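A sketch of this bead probability, pluggable into the recursion above. μ, σ², and the P(p:q) priors are placeholders to be estimated (or guessed and re-estimated, as the slide says); δ is the standardized length difference:

```python
import math

# guessed P(p:q) priors; illustrative values only
P_TYPE = {(1, 1): 0.90, (1, 0): 0.005, (0, 1): 0.005,
          (2, 1): 0.04, (1, 2): 0.04, (2, 2): 0.01}

def bead_prob(l_e, l_f, p, q, mu=1.0, sigma2=6.8):
    if l_e == 0:                    # 0:1 beads: no English length to compare
        return P_TYPE[(p, q)]
    delta = (l_f - mu * l_e) / math.sqrt(l_e * sigma2)   # standardized diff
    density = math.exp(-delta * delta / 2) / math.sqrt(2 * math.pi)
    return density * P_TYPE[(p, q)]
```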
Saving time
• For long texts (> 10^4 sentences), even Viterbi (in the version needed) is not effective (O(S²) time).
• Go paragraph by paragraph if they are aligned 1:1.
• What if not? Apply the same method first to paragraphs!
– identify paragraphs roughly in both languages
– run the algorithm to get aligned paragraph-like segments
– then run on the sentences within paragraphs.
• Performs well if there are not many consecutive 1:0 or 0:1 beads.
Word alignment
• Length alone does not help any more:
– mainly because words can be swapped, and mutual translations often have vastly different lengths.
• ...but at least we have “sentences” (sentence-like segments) aligned; that will be exploited heavily.
• Idea:
– assume some (simple) translation model (such as Model 1);
– find its parameters by considering virtually all alignments;
– after we have the parameters, find the best alignment given those parameters.
Word Alignment Algorithm
• Start with the sentence-aligned corpus.
• Let (E,F) be a pair of sentences (actually, a bead).
• Initialize p(f|e) randomly or, e.g., uniformly, for all f ∈ F, e ∈ E.
• Compute expected counts over the corpus (a sketch follows):
c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e)
– i.e., for every aligned pair (E,F): if e is in E and f is in F, add p(f|e).
• Re-estimate:
p(f|e) = c(f,e) / c(e) [where c(e) = Σ_f c(f,e)]
• Iterate until the change in p(f|e) is small.
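A compact sketch of this EM loop for Model 1 on a toy corpus. One detail the slide compresses: the count added for (f,e) is p(f|e) renormalized over the English words of the current sentence, so the counts form a proper expectation. The empty word is omitted for brevity:

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    # corpus: list of (E, F) pairs, each a list of words
    f_vocab = {f for _, F in corpus for f in F}
    p = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initial p(f|e)
    for _ in range(iterations):
        c = defaultdict(float)                    # expected counts c(f,e)
        c_e = defaultdict(float)                  # c(e) = sum_f c(f,e)
        for E, F in corpus:
            for f in F:
                z = sum(p[f, e] for e in E)       # normalizer over this E
                for e in E:
                    c[f, e] += p[f, e] / z
                    c_e[e] += p[f, e] / z
        p = defaultdict(float, {fe: c[fe] / c_e[fe[1]] for fe in c})
    return p

corpus = [(["the", "program"], ["le", "programme"]),
          (["the", "book"], ["le", "livre"])]
p = train_model1(corpus)
print(p["le", "the"])   # high probability: EM pairs "le" with "the"
```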
Best Alignment
• Select, for each (E,F):
A* = argmax_A P(A|F,E) = argmax_A P(F,A|E) / P(F) = argmax_A P(F,A|E)
= argmax_A ( ε / (l+1)^m · ∏_{j=1..m} p(f_j|e_{a_j}) )
= argmax_A ∏_{j=1..m} p(f_j|e_{a_j}) (IBM Model 1)
• Again, dynamic programming, a Viterbi-like algorithm (under Model 1, the maximization in fact decomposes per position; see the sketch below).
• Recompute p(f|e) based on the best alignment
– (only if you are so inclined; the “original” summed-over-all-alignments distribution might perform better).
• Note: we have also obtained all of Model 1's parameters.
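Since the Model 1 product decomposes over positions, the argmax also decomposes: each French word independently links to its most probable English word, so no full Viterbi search is needed in this special case. A minimal sketch (p is a trained table such as the one from the EM sketch above):

```python
def best_alignment(E, F, p):
    # E[0] may be the empty word; returns a_j for j = 1..m (0-based list)
    return [max(range(len(E)), key=lambda i: p.get((f, E[i]), 0.0)) for f in F]

# e.g. best_alignment(["e0", "the", "program"], ["le", "programme"], p)
# links each French word to its most probable English word
```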