Lecture8-statmt

Transcript
Slide 1: Statistical Machine Translation

Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 8
October 27, 2004

Slide 2: Overview

Why MT
Statistical vs. rule-based MT
Computing translation probabilities from a parallel corpus
IBM Models 1-3

Slide 3: A Brief History

Machine translation was one of the first applications envisioned for computers.
Warren Weaver (1949): "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text."
First demonstrated by IBM in 1954 with a basic word-for-word translation system.

Slide 4: Interest in MT

Commercial interest:
U.S. has invested in MT for intelligence purposes
MT is popular on the web: it is the most used of Google's special features
The EU spends more than $1 billion on translation costs each year
(Semi-)automated translation could lead to huge savings

Slide 5: Interest in MT

Academic interest:
One of the most challenging problems in NLP research
Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling, ...
Being able to establish links between two languages allows for transferring resources from one language to another

Slide 6: Rule-Based vs. Statistical MT

Rule-based MT:
Hand-written transfer rules
Rules can be based on lexical or structural transfer
Pro: firm grip on complex translation phenomena
Con: often very labor-intensive -> lack of robustness

Statistical MT:
Mainly word- or phrase-based translations
Translations are learned from actual data
Pro: translations are learned automatically
Con: difficult to model complex translation phenomena

Slide 7: Parallel Corpus

Example from DE-News (8/1/1996):

English: Diverging opinions about planned tax reform
German:  Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues .
German:  Die Diskussion um die vorgesehene grosse Steuerreform dauert an .

English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German:  Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .

Slide 8: Word-Level Alignments

Given a parallel sentence pair we can link (align) words or phrases that are translations of each other.

Slide 9: Parallel Resources

Newswire: DE-News (German-English), Hong Kong News (Chinese-English), Xinhua News (Chinese-English)
Government: Canadian Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, ...)
Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages)
Web pages: STRAND project (Philip Resnik)

Slide 10: Sentence Alignment

If document De is a translation of document Df, how do we find the translation for each sentence?
The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df.
In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments.
Approximately 90% of the sentence alignments are 1:1.

Slide 11: Sentence Alignment (cont'd)

There are several sentence alignment algorithms:
Align (Gale & Church): aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well (a minimal length-based sketch follows below).
Char-align (Church): aligns based on shared character sequences. Works fine for similar languages or technical domains.
K-Vec (Fung & Church): induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs.
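To make the length-based idea concrete, here is a minimal sketch in Python. It is not the Gale & Church algorithm itself: it aligns two documents using only sentence character lengths and dynamic programming over 1:1, 1:0, and 0:1 links, with an illustrative squared-difference cost and an assumed skip penalty, whereas the real Align method uses a probabilistic model of the length ratio and also handles 2:1 and 1:2 merges.

def length_align(src_lens, tgt_lens, skip_penalty=10.0):
    """Toy length-based sentence alignment by dynamic programming.

    src_lens / tgt_lens: character lengths of the sentences in the two
    documents.  Returns the cheapest monotone path through the grid;
    steps that advance both indices are 1:1 links, the others 1:0 / 0:1.
    """
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1:1 link, cheap when lengths are similar
                c = cost[i][j] + (src_lens[i] - tgt_lens[j]) ** 2 / 100.0
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:            # 1:0 link (source sentence left unaligned)
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:            # 0:1 link (target sentence left unaligned)
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    path, ij = [(n, m)], (n, m)  # trace the best path back to (0, 0)
    while ij != (0, 0):
        ij = back[ij[0]][ij[1]]
        path.append(ij)
    return list(reversed(path))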

Slide 12: Computing Translation Probabilities

Given a parallel corpus we can estimate P(e | f). The maximum likelihood estimate of P(e | f) is:

$P(e \mid f) = \mathrm{freq}(e, f) / \mathrm{freq}(f)$

Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts!
P(e | f) could be re-defined as:

$P(e \mid f) \approx \prod_{f_j} \max_{e_i} P(e_i \mid f_j)$

Problem: the English words maximizing P(e | f) might not result in a readable sentence.
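As a small illustration of the maximum likelihood estimate above, the following Python sketch computes P(e | f) = freq(e, f) / freq(f) from a hypothetical list of already word-aligned (e, f) pairs. In practice such pairs are not observed directly, which is exactly why the IBM models later in the lecture estimate them with EM.

from collections import Counter

def mle_translation_table(aligned_pairs):
    """MLE estimate P(e | f) = freq(e, f) / freq(f) from word-aligned
    (e, f) pairs (hypothetical input, for illustration only)."""
    pair_freq = Counter(aligned_pairs)
    f_freq = Counter(f for _, f in aligned_pairs)
    return {(e, f): count / f_freq[f] for (e, f), count in pair_freq.items()}

# Example: P(the | le) = 2/3, P(dog | le) = 1/3
print(mle_translation_table([("the", "le"), ("the", "le"), ("dog", "le")]))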

Slide 13

Slide 14: Decoding

The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation:

$\operatorname{argmax}_e P(e \mid f) = \operatorname{argmax}_e P(f \mid e)\, P(e)$

The choice of word e as a translation of f depends on the translation probability P(f | e) and on the context, i.e., the other English words preceding e.
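A minimal sketch of the noisy-channel decision rule above, assuming hypothetical scoring functions tm_logprob (translation model, log P(f | e)) and lm_logprob (language model, log P(e)) and a pre-generated candidate list; real decoders search this space incrementally rather than enumerating full candidates.

def rank_translations(f, candidates, tm_logprob, lm_logprob):
    """Return the candidate e maximizing log P(f | e) + log P(e),
    i.e. the argmax of P(f | e) P(e) from the slide."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))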

Slide 15: Noisy Channel Model for Translation

Slide 16: Language Modeling

Determines the probability of some English sequence $e_1^l$ of length l.
P(e) is hard to estimate directly, unless l is very small.
P(e) is normally approximated as:

$P(e_1^l) = P(e_1) \prod_{i=2}^{l} P(e_i \mid e_1^{i-1}) \approx P(e_1)\, P(e_2 \mid e_1) \prod_{i=3}^{l} P(e_i \mid e_{i-m}^{i-1})$

where m is the size of the context, i.e., the number of previous words that are considered; normally m = 2 (a trigram language model).
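A small Python sketch of the trigram approximation (m = 2). The tables p1[w] = P(w), p2[(u, w)] = P(w | u), and p3[(u, v, w)] = P(w | u, v) are assumed to be pre-estimated probability dictionaries; smoothing of unseen n-grams is omitted here.

import math

def trigram_logprob(e, p1, p2, p3):
    """Score an English sentence e = [e_1, ..., e_l] with the trigram
    approximation of P(e_1^l) shown above."""
    logp = math.log(p1[e[0]])                             # P(e_1)
    if len(e) > 1:
        logp += math.log(p2[(e[0], e[1])])                # P(e_2 | e_1)
    for i in range(2, len(e)):
        logp += math.log(p3[(e[i - 2], e[i - 1], e[i])])  # P(e_i | e_{i-2}, e_{i-1})
    return logp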

Slide 17: Translation Modeling

Determines the probability that the foreign word f is a translation of the English word e.
How to compute P(f | e) from a parallel corpus?
Statistical approaches rely on the co-occurrence of e and f in the parallel data: if e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another.

Slide 18: Finding Translations in a Parallel Corpus

Into which foreign words f1, . . . , fm does e translate?
Commonly, four factors are used:
How often do e and f co-occur? (translation)
How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language.
How likely is e to translate into more than one word? (fertility) For example: "defeated" can translate into "eine Niederlage erleiden".
How likely is a foreign word to be spuriously generated? (null translation)

Slide 19: Translation Steps

Slide 20: IBM Models 1-5

Model 1: bag of words
  Unique local maximum
  Efficient EM algorithm (Models 1-2)
Model 2: general alignment: $a(e_{pos} \mid f_{pos}, e_{length}, f_{length})$
Model 3: fertility: n(k | e)
  No full EM, count only neighbors (Models 3-5)
  Deficient (Models 3-4)
Model 4: relative distortion, word classes
Model 5: extra variables to avoid deficiency

Slide 21: IBM Model 1

Given an English sentence e1 . . . el and a foreign sentence f1 . . . fm, we want to find the best alignment a, where a is a set of pairs of the form {. . . , (i, j), . . .} with 0 ≤ i ≤ l and 1 ≤ j ≤ m (i = 0 corresponds to the NULL word).

Slide 22: IBM Model 1

Simplest of the IBM models
Does not consider word order (bag-of-words approach)
Does not model one-to-many alignments
Computationally inexpensive
Useful for parameter estimations that are passed on to more elaborate models

Slide 23: IBM Model 1

Translation probability in terms of alignments:

$P(f \mid e) = \sum_{a \in A} P(f, a \mid e)$

where:

$P(f, a \mid e) = P(a \mid e)\, P(f \mid a, e) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$

and:

$P(f \mid e) = \sum_{a \in A} \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$
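The sum over alignments above can be computed without enumerating them: because each f_j chooses its e_{a_j} independently, the sum factorizes into $\frac{1}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l} P(f_j \mid e_i)$. A small sketch, assuming a nested translation table p[e][f] like the one trained further below:

import math

def model1_logprob(f_sent, e_sent, p):
    """log P(f | e) under IBM Model 1, using the factorized form of the
    sum over alignments; position 0 is the NULL word."""
    e_words = ["NULL"] + e_sent
    logp = -len(f_sent) * math.log(len(e_words))   # the 1/(l+1)^m term
    for f in f_sent:
        logp += math.log(sum(p.get(e, {}).get(f, 0.0) for e in e_words))
    return logp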

Slide 24: IBM Model 1

We want to find the most likely alignment:

$\operatorname{argmax}_{a \in A} \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$

Since P(a | e) is the same for all a:

$\operatorname{argmax}_{a \in A} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$

Problem: we still have to enumerate all alignments.

Slide 25: IBM Model 1

Since P(f_j | e_i) is independent of P(f_j' | e_i'), we can find the maximum alignment by looking at the individual translation probabilities only.
Let $a = (a_1, \ldots, a_m)$; then for each $a_j$:

$a_j = \operatorname{argmax}_{0 \le i \le l} P(f_j \mid e_i)$

The best alignment can be computed in a quadratic number of steps: (l+1) × m.
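A direct Python sketch of this per-word argmax, assuming the same nested translation table p[e][f] as above; index 0 is the NULL word.

def model1_best_alignment(f_sent, e_sent, p):
    """For each foreign word f_j return a_j = argmax_{0<=i<=l} P(f_j | e_i),
    where position 0 is the NULL word."""
    e_words = ["NULL"] + e_sent
    return [max(range(len(e_words)),
                key=lambda i: p.get(e_words[i], {}).get(f, 0.0))
            for f in f_sent]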

Slide 26: Computing Model 1 Parameters

How to compute translation probabilities for Model 1 from a parallel corpus?
Step 1: Determine candidates. For each English word e, collect all foreign words f that co-occur at least once with e.
Step 2: Initialize P(f | e) uniformly, i.e., P(f | e) = 1 / (number of co-occurring foreign words).

Slide 27: Computing Model 1 Parameters

Step 3: Iteratively refine translation probabilities:

for n iterations:
    set tc to zero
    for each sentence pair (e, f) of lengths (l, m):
        for j = 1 to m:
            total = 0
            for i = 1 to l:
                total += P(f_j | e_i)
            for i = 1 to l:
                tc(f_j | e_i) += P(f_j | e_i) / total
    for each word e:
        total = 0
        for each word f s.t. tc(f | e) is defined:
            total += tc(f | e)
        for each word f s.t. tc(f | e) is defined:
            P(f | e) = tc(f | e) / total
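The pseudocode above translates almost line for line into the following Python sketch (the names tc and total follow the slide; the NULL word and the candidate collection of Steps 1 and 2 are included). On the two-sentence corpus of the next slides it reproduces, up to rounding, the probabilities reported after 5 iterations.

from collections import defaultdict

def train_model1(corpus, iterations=5):
    """EM training of IBM Model 1 translation probabilities P(f | e).

    corpus: list of (english_words, foreign_words) sentence pairs.
    A NULL word is prepended to every English sentence.
    """
    corpus = [(["NULL"] + e, f) for e, f in corpus]

    # Step 1 + 2: collect co-occurring candidates, initialize uniformly.
    candidates = defaultdict(set)
    for e_sent, f_sent in corpus:
        for e in e_sent:
            candidates[e].update(f_sent)
    p = {e: {f: 1.0 / len(fs) for f in fs} for e, fs in candidates.items()}

    # Step 3: iteratively refine the translation probabilities.
    for _ in range(iterations):
        tc = defaultdict(lambda: defaultdict(float))     # expected counts
        for e_sent, f_sent in corpus:
            for f in f_sent:
                total = sum(p[e][f] for e in e_sent)
                for e in e_sent:
                    tc[e][f] += p[e][f] / total
        for e, counts in tc.items():
            total = sum(counts.values())
            for f in counts:
                p[e][f] = counts[f] / total
    return p

corpus = [(["the", "dog"], ["le", "chien"]),
          (["the", "cat"], ["le", "chat"])]
p = train_model1(corpus)
print(p["the"]["le"], p["dog"]["chien"])   # ~0.7556 and ~0.8381 after 5 iterations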

Slide 28: IBM Model 1 Example

Parallel corpus:
the dog :: le chien
the cat :: le chat

Step 1+2 (collect candidates and initialize uniformly):
P(le | the) = P(chien | the) = P(chat | the) = 1/3
P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3

Slide 29: IBM Model 1 Example

Step 3: Iterate

NULL the dog :: le chien

j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 1
tc(le | NULL) += P(le | NULL)/1 = 0 += .333/1 = 0.333
tc(le | the)  += P(le | the)/1  = 0 += .333/1 = 0.333
tc(le | dog)  += P(le | dog)/1  = 0 += .333/1 = 0.333

j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
tc(chien | NULL) += P(chien | NULL)/1 = 0 += .333/1 = 0.333
tc(chien | the)  += P(chien | the)/1  = 0 += .333/1 = 0.333
tc(chien | dog)  += P(chien | dog)/1  = 0 += .333/1 = 0.333

Slide 30: IBM Model 1 Example

NULL the cat :: le chat

j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 1
tc(le | NULL) += P(le | NULL)/1 = 0.333 += .333/1 = 0.666
tc(le | the)  += P(le | the)/1  = 0.333 += .333/1 = 0.666
tc(le | cat)  += P(le | cat)/1  = 0 += .333/1 = 0.333

j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
tc(chat | NULL) += P(chat | NULL)/1 = 0 += .333/1 = 0.333
tc(chat | the)  += P(chat | the)/1  = 0 += .333/1 = 0.333
tc(chat | cat)  += P(chat | cat)/1  = 0 += .333/1 = 0.333

Slide 31: IBM Model 1 Example

Re-compute translation probabilities:

total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333
P(le | the)    = tc(le | the)/total(the)    = 0.666/1.333 = 0.5
P(chien | the) = tc(chien | the)/total(the) = 0.333/1.333 = 0.25
P(chat | the)  = tc(chat | the)/total(the)  = 0.333/1.333 = 0.25

total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.333 = 0.666
P(le | dog)    = tc(le | dog)/total(dog)    = 0.333/0.666 = 0.5
P(chien | dog) = tc(chien | dog)/total(dog) = 0.333/0.666 = 0.5

Slide 32: IBM Model 1 Example

Iteration 2:

NULL the dog :: le chien

j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/1.5 = 0 += .5/1.5 = 0.333
tc(le | the)  += P(le | the)/1.5  = 0 += .5/1.5 = 0.333
tc(le | dog)  += P(le | dog)/1.5  = 0 += .5/1.5 = 0.333

j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 0.25 + 0.25 + 0.5 = 1
tc(chien | NULL) += P(chien | NULL)/1 = 0 += .25/1 = 0.25
tc(chien | the)  += P(chien | the)/1  = 0 += .25/1 = 0.25
tc(chien | dog)  += P(chien | dog)/1  = 0 += .5/1 = 0.5

Slide 33: IBM Model 1 Example

NULL the cat :: le chat

j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/1.5 = 0.333 += .5/1.5 = 0.666
tc(le | the)  += P(le | the)/1.5  = 0.333 += .5/1.5 = 0.666
tc(le | cat)  += P(le | cat)/1.5  = 0 += .5/1.5 = 0.333

j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 0.25 + 0.25 + 0.5 = 1
tc(chat | NULL) += P(chat | NULL)/1 = 0 += .25/1 = 0.25
tc(chat | the)  += P(chat | the)/1  = 0 += .25/1 = 0.25
tc(chat | cat)  += P(chat | cat)/1  = 0 += .5/1 = 0.5

Slide 34: IBM Model 1 Example

Re-compute translation probabilities (iteration 2):

total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.166
P(le | the)    = tc(le | the)/total(the)    = 0.666/1.166 = 0.571
P(chien | the) = tc(chien | the)/total(the) = 0.25/1.166 = 0.214
P(chat | the)  = tc(chat | the)/total(the)  = 0.25/1.166 = 0.214

total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833
P(le | dog)    = tc(le | dog)/total(dog)    = 0.333/0.833 = 0.4
P(chien | dog) = tc(chien | dog)/total(dog) = 0.5/0.833 = 0.6

Slide 35: IBM Model 1 Example

After 5 iterations:
P(le | NULL)    = 0.755608028335301
P(chien | NULL) = 0.122195985832349
P(chat | NULL)  = 0.122195985832349
P(le | the)     = 0.755608028335301
P(chien | the)  = 0.122195985832349
P(chat | the)   = 0.122195985832349
P(le | dog)     = 0.161943319838057
P(chien | dog)  = 0.838056680161943
P(le | cat)     = 0.161943319838057
P(chat | cat)   = 0.838056680161943

Slide 36: IBM Model 1 Recap

IBM Model 1 allows for an efficient computation of translation probabilities.
No notion of fertility, i.e., it is possible that the same English word is the best translation for all foreign words.
No positional information, i.e., depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence.

Slide 37: IBM Model 3

IBM Model 3 offers two additional features compared to IBM Model 1:
How likely is an English word e to align to k foreign words (fertility)?
Positional information (distortion): how likely is a word in position i to align to a word in position j?

Slide 38: IBM Model 3: Fertility

The best Model 1 alignment could be that a single English word aligns to all foreign words.
This is clearly not desirable and we want to constrain the number of words an English word can align to.
Fertility models a probability distribution that word e aligns to k words: n(k, e).
Consequence: translation probabilities cannot be computed independently of each other anymore.
IBM Model 3 has to work with full alignments; note there are up to $(l+1)^m$ different alignments.

Slide 39: IBM Model 1 + Model 3

Iterating over all possible alignments is computationally infeasible.
Solution: compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging).
Model 3 takes this restricted set of alignments as input.

Slide 40: Pegging

Given an alignment a we can derive additional alignments from it by making small changes:
Changing a link (j, i) to (j, i')
Swapping a pair of links (j1, i1) and (j2, i2) to (j1, i2) and (j2, i1)
The resulting set of alignments is called the neighborhood of a (a small sketch of generating it follows below).
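A minimal sketch of neighborhood generation, representing an alignment as a list a where a[j] = i means foreign position j links to English position i (0 = NULL); the move and swap operations follow the slide's description.

def neighborhood(a, l):
    """All alignments reachable from a by changing one link (move) or
    exchanging the English positions of two links (swap)."""
    neighbors = []
    for j in range(len(a)):                 # moves: relink f_j to another e_i
        for i in range(l + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                neighbors.append(b)
    for j1 in range(len(a)):                # swaps: exchange a[j1] and a[j2]
        for j2 in range(j1 + 1, len(a)):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = a[j2], a[j1]
                neighbors.append(b)
    return neighbors

# Example: 2 foreign words, English length l = 2
print(len(neighborhood([1, 2], l=2)))   # 4 moves + 1 swap = 5 neighbors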

Slide 41: IBM Model 3: Distortion

The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences:

d(j | i, l, m)

Note: positions are absolute positions.

Slide 42: Deficiency

Problem with IBM Model 3: it assigns probability mass to impossible strings.
Well-formed string: "This is possible"
Ill-formed but possible string: "This possible is"
Impossible string:
Impossible strings are due to distortion values that generate different words at the same position.
Impossible strings can still be filtered out in later stages of the translation process.

Slide 43: Limitations of IBM Models

Only 1-to-N word mapping
Handling fertility-zero words (difficult for decoding)
Almost no syntactic information
Word classes
Relative distortion
Long-distance word movement
Fluency of the output depends entirely on the English language model

Slide 44: Decoding

How to translate new sentences?
A decoder uses the parameters learned on a parallel corpus:
  Translation probabilities
  Fertilities
  Distortions
In combination with a language model, the decoder generates the most likely translation.
Standard algorithms can be used to explore the search space (A*, greedy searching, ...).
Similar to the traveling salesman problem.