Machine Translation- 5 Autumn 2008 Lecture 20 11 Sep 2008
Transcript
Page 1: Machine Translation- 5 Autumn 2008 Lecture 20 11 Sep 2008.

Machine Translation- 5

Autumn 2008

Lecture 20

11 Sep 2008

Page 2.

Decoding

Decoding… Given a trained model and a foreign sentence, produce argmax P(e|f). We can't use Viterbi: it's too restrictive. We need a reasonably efficient search technique that explores the sequence space based on how good the options look… A*

Page 3.

A*

Recall: for A* we need a goal state, operators, and a heuristic.

Page 4.

A*

Recall: for A* we need

Goal state: good coverage of the source

Operators: translation of phrases/words, distortions, deletions/insertions

Heuristic: probabilities (tweaked)

Page 5.

A* Decoding

Why not just use the probability as we go along? That turns it into uniform-cost search, not A*, and it favors shorter sequences over longer ones. We need to counter-balance the probability of the translation so far with its "progress towards the goal".

Page 6.

A*/Beam

Sorry… even that doesn't work, because the space is too large. So as we go, we'll prune the space as paths fall below some threshold.
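The pruned search described above can be sketched as a toy stack decoder: hypotheses are extended one source word at a time, and at each step only the highest-scoring few survive. This is a minimal sketch, not the lecture's actual decoder: the phrase table and probabilities below are invented, decoding is monotone (no distortion), and the A* heuristic is omitted.

```python
import math

# Toy phrase table: foreign word -> [(english, log probability)].
# The entries and probabilities are invented for illustration.
PHRASES = {
    "la": [("the", math.log(0.7)), ("it", math.log(0.3))],
    "maison": [("house", math.log(0.8)), ("home", math.log(0.2))],
}

def decode(foreign, beam=2):
    """Monotone stack decoding with histogram pruning: stacks[i] holds
    the best partial hypotheses covering the first i source words."""
    stacks = {0: [(0.0, ())]}  # list of (log score, english words so far)
    for i, f in enumerate(foreign):
        extended = [(score + logp, eng + (e,))
                    for score, eng in stacks[i]
                    for e, logp in PHRASES[f]]
        # Prune: keep only the `beam` highest-scoring hypotheses.
        extended.sort(key=lambda h: h[0], reverse=True)
        stacks[i + 1] = extended[:beam]
    score, eng = stacks[len(foreign)][0]
    return " ".join(eng)

print(decode(["la", "maison"]))  # best hypothesis surviving the beam
```

With a beam of 2, low-probability hypotheses such as "it home" are pruned before they can be extended, which is exactly the threshold-pruning idea above.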

Page 7.

A* Decoding

Page 8.

A* Decoding

Page 9.

A* Decoding

Page 10.

How do we evaluate MT? Human tests for fluency

Rating tests: give the raters a scale (1 to 5) and ask them to rate; or use distinct scales for clarity, naturalness, style; or check for specific problems:

Cohesion (lexical chains, anaphora, ellipsis): hand-checking for cohesion

Well-formedness: 5-point scale of syntactic correctness

Comprehensibility tests: noise test, multiple-choice questionnaire

Readability tests: cloze

Page 11.

How do we evaluate MT? Human tests for fidelity

Adequacy: does it convey the information in the original? Ask raters to rate on a scale.

Bilingual raters: give them the source and target sentence; ask how much information is preserved.

Monolingual raters: give them the target plus a good human translation.

Informativeness: task-based. Is there enough info to do some task? Give raters multiple-choice questions about content.

Page 12.

Evaluating MT: Problems

Asking humans to judge sentences on a 5-point scale for 10 factors takes time and money (weeks or months!)

We can't build language engineering systems if we can only evaluate them once every quarter!

We need a metric that we can run every time we change our algorithm.

It would be OK if it wasn't perfect, as long as it tended to correlate with the expensive human metrics, which we could still run quarterly.

Bonnie Dorr

Page 13.

Automatic evaluation

Miller and Beebe-Center (1958): assume we have one or more human translations of the source passage, and compare the automatic translation to these human translations.

BLEU, NIST, Meteor, Precision/Recall

Page 14.

Reference proximity methods

Assumption of Reference Proximity (ARP): "…the closer the machine translation is to a professional human translation, the better it is" (Papineni et al., 2002: 311)

Finding a distance between 2 texts: minimal edit distance, N-gram distance, …

Page 15.

Minimal edit distance

Minimal number of editing operations to transform text1 into text2:
deletions (sequence xy changed to x)
insertions (x changed to xy)
substitutions (x changed to y)
transpositions (sequence xy changed to yx)

Algorithm by Wagner and Fischer (1974). Edit distance implementation: the RED method (Akiba, Y., K. Imamura and E. Sumita, 2001).
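The Wagner-Fischer dynamic program is short enough to sketch directly. Adding the extra transposition case gives the restricted Damerau variant, which covers the xy → yx operation listed above; this works on strings or on word lists.

```python
def edit_distance(a, b):
    """Minimal number of deletions, insertions, substitutions, and
    adjacent transpositions needed to turn sequence a into sequence b."""
    # d[i][j] = cost of turning a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete everything
    for j in range(len(b) + 1):
        d[0][j] = j  # insert everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3: two substitutions, one insertion
```

For MT evaluation the sequences would be word lists rather than characters, e.g. `edit_distance(mt_tokens, ht_tokens)`.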

Page 16.

Problem with edit distance: Legitimate translation variation

ORI: De son côté, le département d'Etat américain, dans un communiqué, a déclaré: ‘Nous ne comprenons pas la décision’ de Paris.

HT-Expert: For its part, the American Department of State said in a communique that ‘We do not understand the decision’ made by Paris.

HT-Reference: For its part, the American State Department stated in a press release: We do not understand the decision of Paris.

MT-Systran: On its side, the American State Department, in an official statement, declared: ‘We do not include/understand the decision’ of Paris.

Page 17.

Legitimate translation variation

To which human translation should we compute the edit distance? Is it possible to integrate both human translations into a reference set?

Page 18.

N-gram distance

the number of common words (evaluating lexical choices);

the number of common sequences of 2, 3, 4 … N words (evaluating word order): 2-word sequences (bigrams), 3-word sequences (trigrams), 4-word sequences (4-grams), … N-word sequences (N-grams)

N-grams allow us to compute several parameters…
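Counting shared N-grams between two texts is compact with a multiset: collect each text's n-grams in a `Counter`, then take the multiset intersection. A small sketch with invented example sentences:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of all n-word sequences in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def shared_ngrams(ht, mt, n):
    """N-grams common to the human and machine translations (with counts)."""
    return ngram_counts(ht, n) & ngram_counts(mt, n)  # multiset intersection

ht = "the cat is on the mat".split()
mt = "the cat sat on the mat".split()
print(sum(shared_ngrams(ht, mt, 2).values()))  # shared bigrams: 3
```

Here the shared bigrams are "the cat", "on the", and "the mat"; raising n makes the match stricter, which is how word order gets evaluated.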

Page 19.

Matches of N-grams

[Diagram: overlapping sets of HT and MT N-grams. The intersection is the true positives; N-grams only in MT are false positives; N-grams only in HT are false negatives.]

Page 20.

Matches of N-grams (contd.)

              MT +             MT −
Human text +  true positives  false negatives
Human text −  false positives

Recall (avoiding false negatives) is read across the Human text + row; precision (avoiding false positives) is read down the MT + column.

Page 21.

Precision and Recall

Precision = how accurate is the answer? “Don’t guess, wrong answers are deducted!”

Recall = how complete is the answer? “Guess if not sure!”, don’t miss anything!

precision = TruePositives / (TruePositives + FalsePositives)

recall = TruePositives / (TruePositives + FalseNegatives)
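The two formulas are direct translations of the confusion-matrix cells; the counts in the example call below are invented:

```python
def precision(tp, fp):
    """Fraction of proposed N-grams that are correct (avoids false positives)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of reference N-grams that were found (avoids false negatives)."""
    return tp / (tp + fn)

# e.g. 8 shared N-grams, 2 spurious MT N-grams, 4 missed HT N-grams:
print(precision(8, 2), recall(8, 4))  # 0.8 and 2/3
```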

Page 22.

Translation variation and N-grams

N-gram distance to multiple human reference translations

Precision on the union of N-gram sets in HT1, HT2, HT3… (N-grams in all independent human translations taken together, with repetitions removed)

Recall on the intersection of N-gram sets (N-grams common to all sets, i.e. only repeated N-grams, which are the most stable across different human translations)

Page 23.

Human and automated scores

Empirical observations: precision on the union gives an indication of Fluency; recall on the intersection gives an indication of Adequacy.

Automated Adequacy evaluation is less accurate (harder).

The most successful N-gram proximity measure is now the BLEU evaluation measure (Papineni et al., 2002): BiLingual Evaluation Understudy.

Page 24.

BiLingual Evaluation Understudy (BLEU; Papineni, 2001)

An automatic technique, but… it requires the pre-existence of human (reference) translations. Approach:

Produce a corpus of high-quality human translations

Judge "closeness" numerically (word-error rate)

Compare n-gram matches between the candidate translation and 1 or more reference translations

http://www.research.ibm.com/people/k/kishore/RC22176.pdf

Slide from Bonnie Dorr

Page 25.

BLEU evaluation measure

computes Precision on the union of N-grams; accurately predicts Fluency; produces scores in the range [0,1]

Page 26.

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

BLEU Evaluation Metric (Papineni et al., ACL-2002)

• N-gram precision (score is between 0 and 1): what percentage of machine n-grams can be found in the reference translation? An n-gram is a sequence of n words. Not allowed to use the same portion of the reference translation twice (can't cheat by typing out "the the the the the").

• Brevity penalty: can't just type out the single word "the" (precision 1.0!)

*** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)

Slide from Bonnie Dorr

Page 27.


BLEU Evaluation Metric (Papineni et al., ACL-2002)

• BLEU4 formula (counts n-grams up to length 4):

exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4 - max(words-in-reference / words-in-machine - 1, 0))

p1 = 1-gram precision, p2 = 2-gram precision, p3 = 3-gram precision, p4 = 4-gram precision

Slide from Bonnie Dorr
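The slide's formula translates directly to code. Note that the weights 1.0, 0.5, 0.25, 0.125 follow the slide as written; the standard BLEU-4 definition (Papineni et al., 2002) instead weights each log precision equally at 0.25. A sketch:

```python
import math

def bleu4(precisions, cand_len, ref_len):
    """BLEU-4 as written on the slide: exponentiate the weighted sum of
    log n-gram precisions, minus a brevity penalty that only fires when
    the candidate is shorter than the reference."""
    if min(precisions) == 0.0:
        return 0.0  # any zero precision zeroes the geometric combination
    weights = (1.0, 0.5, 0.25, 0.125)  # per the slide; standard BLEU uses 0.25 each
    weighted = sum(w * math.log(p) for w, p in zip(weights, precisions))
    penalty = max(ref_len / cand_len - 1, 0)
    return math.exp(weighted - penalty)

print(bleu4([1.0, 1.0, 1.0, 1.0], 10, 10))  # perfect match, no penalty: 1.0
```

The penalty term equals exp(1 - r/c) for a short candidate, so a 5-word candidate against a 10-word reference is scaled down by a factor of e.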

Page 28.

Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .

Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Multiple Reference Translations


Slide from Bonnie Dorr

Page 29.

Bleu Comparison

Chinese-English Translation Example:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Slide from Bonnie Dorr

Page 30.

How do we compute BLEU scores? Intuition: "What percentage of words in the candidate occurred in some human translation?"

Proposal: count up the number of candidate translation words (unigrams) that occur in any reference translation, then divide by the total number of words in the candidate translation.

But we can't just count the total number of overlapping N-grams!
Candidate: the the the the the the
Reference 1: The cat is on the mat

Solution: a reference word should be considered exhausted after a matching candidate word is identified.

Slide from Bonnie Dorr

Page 31.

“Modified n-gram precision”

For each word, compute: (1) the total number of times it occurs in any single reference translation; (2) the number of times it occurs in the candidate translation.

Instead of using count #2 directly, use the minimum of #1 and #2, i.e. clip the counts at the maximum for any single reference translation.

Now use that modified count, and divide by the number of candidate words.

Slide from Bonnie Dorr
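The clipping procedure can be implemented generically for any n, and it reproduces the worked answers on the following slides (17/18 unigram and 10/17 bigram for Candidate 1). One assumption: tokens are lowercased with punctuation stripped before counting.

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clip each candidate n-gram's count at its maximum count in any
    single reference, then divide by the number of candidate n-grams."""
    cand = ngram_counts(candidate, n)
    ref_counts = [ngram_counts(r, n) for r in references]
    clipped = sum(min(c, max(rc[g] for rc in ref_counts))
                  for g, c in cand.items())
    return Fraction(clipped, sum(cand.values()))

cand1 = ("it is a guide to action which ensures that the military "
         "always obeys the commands of the party").split()
refs = [
    "it is a guide to action that ensures that the military will forever heed party commands".split(),
    "it is the guiding principle which guarantees the military forces always being under the command of the party".split(),
    "it is the practical guide for the army always to heed the directions of the party".split(),
]
print(modified_precision(cand1, refs, 1))  # 17/18, as on the next slide
```

Only "obeys" fails to match any reference, and "the" (3 occurrences) stays unclipped because a single reference contains it 4 times, giving 17 of 18 unigrams.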

Page 32.

Modified Unigram Precision: Candidate #1

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)

What’s the answer???

17/18

Slide from Bonnie Dorr

Page 33.

Modified Unigram Precision: Candidate #2

It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)

What’s the answer????

8/14

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Slide from Bonnie Dorr

Page 34.

Modified Bigram Precision: Candidate #1

It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)

What’s the answer????

10/17

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Slide from Bonnie Dorr

Page 35.

Modified Bigram Precision: Candidate #2

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)

What’s the answer????

1/13

Slide from Bonnie Dorr

Page 36.

Catching Cheaters

Reference 1: The cat is on the mat

Reference 2: There is a cat on the mat

Candidate: the the the the the the the → the(2): the count of "the" is clipped at 2, its maximum count in any single reference.

What’s the unigram answer?

2/7

What’s the bigram answer?

0/7

Slide from Bonnie Dorr
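The unigram answer above follows mechanically from clipping: the seven-word candidate matches only twice, since "the" occurs at most twice in any single reference. A self-contained check (lowercased tokens, punctuation stripped):

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Return (clipped matches, candidate length) for unigrams."""
    cand = Counter(candidate)
    matched = sum(min(c, max(Counter(r)[w] for r in references))
                  for w, c in cand.items())
    return matched, len(candidate)

cand = ["the"] * 7
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(clipped_unigram_precision(cand, refs))  # (2, 7)
```

At the bigram level the cheater scores zero outright, since no reference contains the bigram "the the".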

Page 37.

BLEU Tends to Predict Human Judgments

[Figure: scatter plot of NIST score (a variant of BLEU) against human judgments, with linear fits for Adequacy (R² = 88.0%) and Fluency (R² = 90.2%).]

slide from G. Doddington (NIST)

Page 38.

BLEU problems with sentence length

Candidate: of the

Solution: brevity penalty; prefer candidate translations which are the same length as one of the references.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Problem: modified unigram precision is 2/2, bigram 1/1!

Slide from Bonnie Dorr

Page 39.

NIST

NIST is based on the BLEU metric. Where BLEU calculates n-gram precision with equal weight on every n-gram, NIST also calculates how informative a particular n-gram is: when a correct n-gram is found, the rarer that n-gram is, the more weight it is given.

For example, if the bigram "on the" is correctly matched, it receives lower weight than a correct match of the bigram "interesting calculations", as the latter is less likely to occur.

NIST also calculates the brevity penalty differently.
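The informativeness weighting can be sketched from reference-corpus counts. To my understanding, NIST scores an n-gram w1…wn as log2(count(w1…wn-1) / count(w1…wn)), so rarer continuations score higher; the tiny corpus below is invented, and the real metric adds further machinery (multiple references, its own brevity penalty):

```python
import math
from collections import Counter

def bigram_info(bigram, ref_tokens):
    """Informativeness of (w1, w2): log2(count(w1) / count(w1 w2))
    over the reference corpus; rarer continuations get more weight."""
    unigrams = Counter(ref_tokens)
    bigrams = Counter(zip(ref_tokens, ref_tokens[1:]))
    return math.log2(unigrams[bigram[0]] / bigrams[bigram])

toks = "the cat on the mat on the hill with the interesting calculations".split()
# "the" is followed by "interesting" only once out of four occurrences,
# so that bigram is more informative than the routine "on the".
print(bigram_info(("the", "interesting"), toks))  # 2.0
print(bigram_info(("on", "the"), toks))           # 0.0
```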

Page 40.

Meteor (Metric for Evaluation of Translation with Explicit ORdering)

Based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.

Uses stemming and synonymy matching, along with standard exact word matching.
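The recall-heavy harmonic mean is simple to state: in the original METEOR (Banerjee and Lavie, 2005), recall is weighted 9:1 over precision, giving Fmean = 10PR / (R + 9P). The full metric multiplies this by a fragmentation penalty based on chunk ordering, which is omitted in this sketch:

```python
def meteor_fmean(p, r):
    """Harmonic mean of unigram precision p and recall r,
    with recall weighted 9 times more than precision."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 10 * p * r / (r + 9 * p)

print(meteor_fmean(0.5, 1.0))  # recall dominates: 10/11, about 0.909
```

Compare with the plain harmonic mean 2PR/(P+R), which would give 0.667 here; the weighting rewards translations that cover the reference even at some cost in precision.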

Page 41.

Recent developments in N-gram distance:

paraphrasing instead of multiple reference translations

more weight to more "important" words (relatively more frequent in a given text)

relations between different human scores

accounting for dynamic quality criteria