Lecture8-statmt

Transcript
Slide 1: Statistical Machine Translation

Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 8
October 27, 2004

Slide 2: Overview

Why MT
Statistical vs. rule-based MT
Computing translation probabilities from a parallel corpus
IBM Models 1-3

Slide 3: A Brief History

Machine translation was one of the first applications envisioned for computers.
Warren Weaver (1949): "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text."
First demonstrated by IBM in 1954 with a basic word-for-word translation system.

Slide 4: Interest in MT

Commercial interest:
U.S. has invested in MT for intelligence purposes
MT is popular on the web: it is the most used of Google's special features
The EU spends more than $1 billion on translation costs each year
(Semi-)automated translation could lead to huge savings

Slide 5: Interest in MT

Academic interest:
One of the most challenging problems in NLP research
Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling, ...
Being able to establish links between two languages allows for transferring resources from one language to another

Slide 6: Rule-Based vs. Statistical MT

Rule-based MT:
Hand-written transfer rules
Rules can be based on lexical or structural transfer
Pro: firm grip on complex translation phenomena
Con: often very labor-intensive -> lack of robustness

Statistical MT:
Mainly word- or phrase-based translations
Translations are learned from actual data
Pro: translations are learned automatically
Con: difficult to model complex translation phenomena

Slide 7: Parallel Corpus

Example from DE-News (8/1/1996):

English: Diverging opinions about planned tax reform
German:  Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues .
German:  Die Diskussion um die vorgesehene grosse Steuerreform dauert an .

English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German:  Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .

Slide 8: Word-Level Alignments

Given a parallel sentence pair we can link (align) words or phrases that are translations of each other.

Slide 9: Parallel Resources

Newswire: DE-News (German-English), Hong Kong News (Chinese-English), Xinhua News (Chinese-English)
Government: Canadian Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, ...)
Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages)
Web pages: STRAND project (Philip Resnik)

Slide 10: Sentence Alignment

If document De is a translation of document Df, how do we find the translation for each sentence?
The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df.
In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments.
Approximately 90% of the sentence alignments are 1:1.

Slide 11: Sentence Alignment (cont'd)

There are several sentence alignment algorithms:
Align (Gale & Church): aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well (a minimal length-based sketch follows below).
Char-align (Church): aligns based on shared character sequences. Works fine for similar languages or technical domains.
K-Vec (Fung & Church): induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs.
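To make the length-based idea concrete, here is a minimal sketch in Python. It is not the Gale & Church algorithm itself: it aligns two documents using only sentence character lengths and dynamic programming over 1:1, 1:0, and 0:1 links, with an illustrative squared-difference cost and an assumed skip penalty, whereas the real Align method uses a probabilistic model of the length ratio and also handles 2:1 and 1:2 merges.

def length_align(src_lens, tgt_lens, skip_penalty=10.0):
    """Toy length-based sentence alignment by dynamic programming.

    src_lens / tgt_lens: character lengths of the sentences in the two
    documents.  Returns the cheapest monotone path through the grid;
    steps that advance both indices are 1:1 links, the others 1:0 / 0:1.
    """
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1:1 link, cheap when lengths are similar
                c = cost[i][j] + (src_lens[i] - tgt_lens[j]) ** 2 / 100.0
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:            # 1:0 link (source sentence left unaligned)
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:            # 0:1 link (target sentence left unaligned)
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    path, ij = [(n, m)], (n, m)  # trace the best path back to (0, 0)
    while ij != (0, 0):
        ij = back[ij[0]][ij[1]]
        path.append(ij)
    return list(reversed(path))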

Slide 12: Computing Translation Probabilities

Given a parallel corpus we can estimate P(e | f). The maximum likelihood estimate of P(e | f) is:

$P(e \mid f) = \mathrm{freq}(e, f) / \mathrm{freq}(f)$

Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts!
P(e | f) could be re-defined as:

$P(e \mid f) \approx \prod_{f_j} \max_{e_i} P(e_i \mid f_j)$

Problem: the English words maximizing P(e | f) might not result in a readable sentence.
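As a small illustration of the maximum likelihood estimate above, the following Python sketch computes P(e | f) = freq(e, f) / freq(f) from a hypothetical list of already word-aligned (e, f) pairs. In practice such pairs are not observed directly, which is exactly why the IBM models later in the lecture estimate them with EM.

from collections import Counter

def mle_translation_table(aligned_pairs):
    """MLE estimate P(e | f) = freq(e, f) / freq(f) from word-aligned
    (e, f) pairs (hypothetical input, for illustration only)."""
    pair_freq = Counter(aligned_pairs)
    f_freq = Counter(f for _, f in aligned_pairs)
    return {(e, f): count / f_freq[f] for (e, f), count in pair_freq.items()}

# Example: P(the | le) = 2/3, P(dog | le) = 1/3
print(mle_translation_table([("the", "le"), ("the", "le"), ("dog", "le")]))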

Slide 13

Slide 14: Decoding

The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation:

$\operatorname{argmax}_e P(e \mid f) = \operatorname{argmax}_e P(f \mid e)\, P(e)$

The choice of word e as a translation of f depends on the translation probability P(f | e) and on the context, i.e., the other English words preceding e.
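A minimal sketch of the noisy-channel decision rule above, assuming hypothetical scoring functions tm_logprob (translation model, log P(f | e)) and lm_logprob (language model, log P(e)) and a pre-generated candidate list; real decoders search this space incrementally rather than enumerating full candidates.

def rank_translations(f, candidates, tm_logprob, lm_logprob):
    """Return the candidate e maximizing log P(f | e) + log P(e),
    i.e. the argmax of P(f | e) P(e) from the slide."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))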

Slide 15: Noisy Channel Model for Translation

Slide 16: Language Modeling

Determines the probability of some English sequence $e_1^l$ of length l.
P(e) is hard to estimate directly, unless l is very small.
P(e) is normally approximated as:

$P(e_1^l) = P(e_1) \prod_{i=2}^{l} P(e_i \mid e_1^{i-1}) \approx P(e_1)\, P(e_2 \mid e_1) \prod_{i=3}^{l} P(e_i \mid e_{i-m}^{i-1})$

where m is the size of the context, i.e., the number of previous words that are considered; normally m = 2 (a trigram language model).
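A small Python sketch of the trigram approximation (m = 2). The tables p1[w] = P(w), p2[(u, w)] = P(w | u), and p3[(u, v, w)] = P(w | u, v) are assumed to be pre-estimated probability dictionaries; smoothing of unseen n-grams is omitted here.

import math

def trigram_logprob(e, p1, p2, p3):
    """Score an English sentence e = [e_1, ..., e_l] with the trigram
    approximation of P(e_1^l) shown above."""
    logp = math.log(p1[e[0]])                             # P(e_1)
    if len(e) > 1:
        logp += math.log(p2[(e[0], e[1])])                # P(e_2 | e_1)
    for i in range(2, len(e)):
        logp += math.log(p3[(e[i - 2], e[i - 1], e[i])])  # P(e_i | e_{i-2}, e_{i-1})
    return logp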

Slide 17: Translation Modeling

Determines the probability that the foreign word f is a translation of the English word e.
How to compute P(f | e) from a parallel corpus?
Statistical approaches rely on the co-occurrence of e and f in the parallel data: if e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another.

Slide 18: Finding Translations in a Parallel Corpus

Into which foreign words f1, . . . , fm does e translate?
Commonly, four factors are used:
How often do e and f co-occur? (translation)
How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language.
How likely is e to translate into more than one word? (fertility) For example: "defeated" can translate into "eine Niederlage erleiden".
How likely is a foreign word to be spuriously generated? (null translation)

Slide 19: Translation Steps

Slide 20: IBM Models 1-5

Model 1: bag of words
  Unique local maximum
  Efficient EM algorithm (Models 1-2)
Model 2: general alignment: $a(e_{pos} \mid f_{pos}, e_{length}, f_{length})$
Model 3: fertility: n(k | e)
  No full EM, count only neighbors (Models 3-5)
  Deficient (Models 3-4)
Model 4: relative distortion, word classes
Model 5: extra variables to avoid deficiency

Slide 21: IBM Model 1

Given an English sentence e1 . . . el and a foreign sentence f1 . . . fm, we want to find the best alignment a, where a is a set of pairs of the form {. . . , (i, j), . . .} with 0 ≤ i ≤ l and 1 ≤ j ≤ m (i = 0 corresponds to the NULL word).

Slide 22: IBM Model 1

Simplest of the IBM models
Does not consider word order (bag-of-words approach)
Does not model one-to-many alignments
Computationally inexpensive
Useful for parameter estimations that are passed on to more elaborate models

Slide 23: IBM Model 1

Translation probability in terms of alignments:

$P(f \mid e) = \sum_{a \in A} P(f, a \mid e)$

where:

$P(f, a \mid e) = P(a \mid e)\, P(f \mid a, e) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$

and:

$P(f \mid e) = \sum_{a \in A} \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$
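The sum over alignments above can be computed without enumerating them: because each f_j chooses its e_{a_j} independently, the sum factorizes into $\frac{1}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l} P(f_j \mid e_i)$. A small sketch, assuming a nested translation table p[e][f] like the one trained further below:

import math

def model1_logprob(f_sent, e_sent, p):
    """log P(f | e) under IBM Model 1, using the factorized form of the
    sum over alignments; position 0 is the NULL word."""
    e_words = ["NULL"] + e_sent
    logp = -len(f_sent) * math.log(len(e_words))   # the 1/(l+1)^m term
    for f in f_sent:
        logp += math.log(sum(p.get(e, {}).get(f, 0.0) for e in e_words))
    return logp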

Slide 24: IBM Model 1

We want to find the most likely alignment:

$\operatorname{argmax}_{a \in A} \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$

Since P(a | e) is the same for all a:

$\operatorname{argmax}_{a \in A} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$

Problem: we still have to enumerate all alignments.

Slide 25: IBM Model 1

Since P(f_j | e_i) is independent of P(f_j' | e_i'), we can find the maximum alignment by looking at the individual translation probabilities only.
Let $a = (a_1, \ldots, a_m)$; then for each $a_j$:

$a_j = \operatorname{argmax}_{0 \le i \le l} P(f_j \mid e_i)$

The best alignment can be computed in a quadratic number of steps: (l+1) × m.
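A direct Python sketch of this per-word argmax, assuming the same nested translation table p[e][f] as above; index 0 is the NULL word.

def model1_best_alignment(f_sent, e_sent, p):
    """For each foreign word f_j return a_j = argmax_{0<=i<=l} P(f_j | e_i),
    where position 0 is the NULL word."""
    e_words = ["NULL"] + e_sent
    return [max(range(len(e_words)),
                key=lambda i: p.get(e_words[i], {}).get(f, 0.0))
            for f in f_sent]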

Slide 26: Computing Model 1 Parameters

How to compute translation probabilities for Model 1 from a parallel corpus?
Step 1: Determine candidates. For each English word e, collect all foreign words f that co-occur at least once with e.
Step 2: Initialize P(f | e) uniformly, i.e., P(f | e) = 1 / (number of co-occurring foreign words).

Slide 27: Computing Model 1 Parameters

Step 3: Iteratively refine translation probabilities:

for n iterations:
    set tc to zero
    for each sentence pair (e, f) of lengths (l, m):
        for j = 1 to m:
            total = 0
            for i = 1 to l:
                total += P(f_j | e_i)
            for i = 1 to l:
                tc(f_j | e_i) += P(f_j | e_i) / total
    for each word e:
        total = 0
        for each word f s.t. tc(f | e) is defined:
            total += tc(f | e)
        for each word f s.t. tc(f | e) is defined:
            P(f | e) = tc(f | e) / total
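The pseudocode above translates almost line for line into the following Python sketch (the names tc and total follow the slide; the NULL word and the candidate collection of Steps 1 and 2 are included). On the two-sentence corpus of the next slides it reproduces, up to rounding, the probabilities reported after 5 iterations.

from collections import defaultdict

def train_model1(corpus, iterations=5):
    """EM training of IBM Model 1 translation probabilities P(f | e).

    corpus: list of (english_words, foreign_words) sentence pairs.
    A NULL word is prepended to every English sentence.
    """
    corpus = [(["NULL"] + e, f) for e, f in corpus]

    # Step 1 + 2: collect co-occurring candidates, initialize uniformly.
    candidates = defaultdict(set)
    for e_sent, f_sent in corpus:
        for e in e_sent:
            candidates[e].update(f_sent)
    p = {e: {f: 1.0 / len(fs) for f in fs} for e, fs in candidates.items()}

    # Step 3: iteratively refine the translation probabilities.
    for _ in range(iterations):
        tc = defaultdict(lambda: defaultdict(float))     # expected counts
        for e_sent, f_sent in corpus:
            for f in f_sent:
                total = sum(p[e][f] for e in e_sent)
                for e in e_sent:
                    tc[e][f] += p[e][f] / total
        for e, counts in tc.items():
            total = sum(counts.values())
            for f in counts:
                p[e][f] = counts[f] / total
    return p

corpus = [(["the", "dog"], ["le", "chien"]),
          (["the", "cat"], ["le", "chat"])]
p = train_model1(corpus)
print(p["the"]["le"], p["dog"]["chien"])   # ~0.7556 and ~0.8381 after 5 iterations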

Slide 28: IBM Model 1 Example

Parallel corpus:
the dog :: le chien
the cat :: le chat

Step 1+2 (collect candidates and initialize uniformly):
P(le | the) = P(chien | the) = P(chat | the) = 1/3
P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3

Slide 29: IBM Model 1 Example

Step 3: Iterate

NULL the dog :: le chien

j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 1
tc(le | NULL) += P(le | NULL)/1 = 0 += .333/1 = 0.333
tc(le | the)  += P(le | the)/1  = 0 += .333/1 = 0.333
tc(le | dog)  += P(le | dog)/1  = 0 += .333/1 = 0.333

j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
tc(chien | NULL) += P(chien | NULL)/1 = 0 += .333/1 = 0.333
tc(chien | the)  += P(chien | the)/1  = 0 += .333/1 = 0.333
tc(chien | dog)  += P(chien | dog)/1  = 0 += .333/1 = 0.333

Slide 30: IBM Model 1 Example

NULL the cat :: le chat

j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 1
tc(le | NULL) += P(le | NULL)/1 = 0.333 += .333/1 = 0.666
tc(le | the)  += P(le | the)/1  = 0.333 += .333/1 = 0.666
tc(le | cat)  += P(le | cat)/1  = 0 += .333/1 = 0.333

j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
tc(chat | NULL) += P(chat | NULL)/1 = 0 += .333/1 = 0.333
tc(chat | the)  += P(chat | the)/1  = 0 += .333/1 = 0.333
tc(chat | cat)  += P(chat | cat)/1  = 0 += .333/1 = 0.333

Slide 31: IBM Model 1 Example

Re-compute translation probabilities:

total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333
P(le | the)    = tc(le | the)/total(the)    = 0.666/1.333 = 0.5
P(chien | the) = tc(chien | the)/total(the) = 0.333/1.333 = 0.25
P(chat | the)  = tc(chat | the)/total(the)  = 0.333/1.333 = 0.25

total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.333 = 0.666
P(le | dog)    = tc(le | dog)/total(dog)    = 0.333/0.666 = 0.5
P(chien | dog) = tc(chien | dog)/total(dog) = 0.333/0.666 = 0.5

Slide 32: IBM Model 1 Example

Iteration 2:

NULL the dog :: le chien

j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/1.5 = 0 += .5/1.5 = 0.333
tc(le | the)  += P(le | the)/1.5  = 0 += .5/1.5 = 0.333
tc(le | dog)  += P(le | dog)/1.5  = 0 += .5/1.5 = 0.333

j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 0.25 + 0.25 + 0.5 = 1
tc(chien | NULL) += P(chien | NULL)/1 = 0 += .25/1 = 0.25
tc(chien | the)  += P(chien | the)/1  = 0 += .25/1 = 0.25
tc(chien | dog)  += P(chien | dog)/1  = 0 += .5/1 = 0.5

Slide 33: IBM Model 1 Example

NULL the cat :: le chat

j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/1.5 = 0.333 += .5/1.5 = 0.666
tc(le | the)  += P(le | the)/1.5  = 0.333 += .5/1.5 = 0.666
tc(le | cat)  += P(le | cat)/1.5  = 0 += .5/1.5 = 0.333

j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 0.25 + 0.25 + 0.5 = 1
tc(chat | NULL) += P(chat | NULL)/1 = 0 += .25/1 = 0.25
tc(chat | the)  += P(chat | the)/1  = 0 += .25/1 = 0.25
tc(chat | cat)  += P(chat | cat)/1  = 0 += .5/1 = 0.5

Slide 34: IBM Model 1 Example

Re-compute translation probabilities (iteration 2):

total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.166
P(le | the)    = tc(le | the)/total(the)    = 0.666/1.166 = 0.571
P(chien | the) = tc(chien | the)/total(the) = 0.25/1.166 = 0.214
P(chat | the)  = tc(chat | the)/total(the)  = 0.25/1.166 = 0.214

total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833
P(le | dog)    = tc(le | dog)/total(dog)    = 0.333/0.833 = 0.4
P(chien | dog) = tc(chien | dog)/total(dog) = 0.5/0.833 = 0.6

Slide 35: IBM Model 1 Example

After 5 iterations:
P(le | NULL)    = 0.755608028335301
P(chien | NULL) = 0.122195985832349
P(chat | NULL)  = 0.122195985832349
P(le | the)     = 0.755608028335301
P(chien | the)  = 0.122195985832349
P(chat | the)   = 0.122195985832349
P(le | dog)     = 0.161943319838057
P(chien | dog)  = 0.838056680161943
P(le | cat)     = 0.161943319838057
P(chat | cat)   = 0.838056680161943

Slide 36: IBM Model 1 Recap

IBM Model 1 allows for an efficient computation of translation probabilities.
No notion of fertility, i.e., it is possible that the same English word is the best translation for all foreign words.
No positional information, i.e., depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence.

Slide 37: IBM Model 3

IBM Model 3 offers two additional features compared to IBM Model 1:
How likely is an English word e to align to k foreign words (fertility)?
Positional information (distortion): how likely is a word in position i to align to a word in position j?

Slide 38: IBM Model 3: Fertility

The best Model 1 alignment could be that a single English word aligns to all foreign words.
This is clearly not desirable and we want to constrain the number of words an English word can align to.
Fertility models a probability distribution that word e aligns to k words: n(k, e).
Consequence: translation probabilities cannot be computed independently of each other anymore.
IBM Model 3 has to work with full alignments; note there are up to $(l+1)^m$ different alignments.

Slide 39: IBM Model 1 + Model 3

Iterating over all possible alignments is computationally infeasible.
Solution: compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging).
Model 3 takes this restricted set of alignments as input.

Slide 40: Pegging

Given an alignment a we can derive additional alignments from it by making small changes:
Changing a link (j, i) to (j, i')
Swapping a pair of links (j1, i1) and (j2, i2) to (j1, i2) and (j2, i1)
The resulting set of alignments is called the neighborhood of a (a small sketch of generating it follows below).
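A minimal sketch of neighborhood generation, representing an alignment as a list a where a[j] = i means foreign position j links to English position i (0 = NULL); the move and swap operations follow the slide's description.

def neighborhood(a, l):
    """All alignments reachable from a by changing one link (move) or
    exchanging the English positions of two links (swap)."""
    neighbors = []
    for j in range(len(a)):                 # moves: relink f_j to another e_i
        for i in range(l + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                neighbors.append(b)
    for j1 in range(len(a)):                # swaps: exchange a[j1] and a[j2]
        for j2 in range(j1 + 1, len(a)):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = a[j2], a[j1]
                neighbors.append(b)
    return neighbors

# Example: 2 foreign words, English length l = 2
print(len(neighborhood([1, 2], l=2)))   # 4 moves + 1 swap = 5 neighbors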

Slide 41: IBM Model 3: Distortion

The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences:

d(j | i, l, m)

Note: positions are absolute positions.

Slide 42: Deficiency

Problem with IBM Model 3: it assigns probability mass to impossible strings.
Well-formed string: "This is possible"
Ill-formed but possible string: "This possible is"
Impossible string:
Impossible strings are due to distortion values that generate different words at the same position.
Impossible strings can still be filtered out in later stages of the translation process.

Slide 43: Limitations of IBM Models

Only 1-to-N word mapping
Handling fertility-zero words (difficult for decoding)
Almost no syntactic information
Word classes
Relative distortion
Long-distance word movement
Fluency of the output depends entirely on the English language model

Slide 44: Decoding

How to translate new sentences?
A decoder uses the parameters learned on a parallel corpus:
  Translation probabilities
  Fertilities
  Distortions
In combination with a language model, the decoder generates the most likely translation.
Standard algorithms can be used to explore the search space (A*, greedy searching, ...).
Similar to the traveling salesman problem.