Evaluation of Text Generation: Automatic Evaluation vs. Variation

Amanda Stent, Mohit Singhai, Matthew Marge

Page 1

Evaluation of Text Generation: Automatic Evaluation vs. Variation

Amanda Stent, Mohit Singhai, Matthew Marge

Page 2

Natural Language Generation

Content Planning

Text Planning

Sentence Planning

Surface Realization

Page 3

Approaches to Surface Realization

Template-based: domain-specific; all output tends to be high quality because it is highly constrained

Grammar-based: typically one high-quality output per input

Forest-based: many outputs per input

Text-to-text: no need for other generation components

Page 4

Surface Realization Tasks

To communicate the input meaning as completely, clearly and elegantly as possible through careful:

Word selection

Word and phrase arrangement

Consideration of context

Page 5

Importance of Lexical Choice

I drove to Rochester.

I raced to Rochester.

I went to Rochester.

Page 6

Importance of Syntactic Structure

I picked up my coat three weeks later from the dry cleaners in Smithtown

In Smithtown I picked up my coat from the dry cleaners three weeks later

Page 7

Pragmatic Considerations

An Israeli civilian was killed and another wounded when Palestinian militants opened fire on a vehicle

Palestinian militant gunned down a vehicle on a road

Page 8

Evaluating Text Generators

Per-generator: coverage

Per-sentence: adequacy; fluency / syntactic accuracy; informativeness

Additional metrics of interest: range (ability to produce valid variants); readability; task-specific metrics (e.g. for dialog)

Page 9

Evaluating Text Generators

Human judgments

Parsing + interpretation

Automatic evaluation metrics, for generation or for machine translation: Simple string accuracy+, NIST*, BLEU*+, F-measure*, LSA

Page 10

Question

What is a “good” sentence?

readable

fluent

adequate

Page 11

Approach

Question: which evaluation metric or set of evaluation metrics least punishes variation? Two kinds of variation are considered: word choice variation and syntactic structure variation.

Procedure: correlation between human and automatic judgments of the variations; context is not included.
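In practice this procedure amounts to computing a correlation coefficient between each metric's scores and the human ratings over the same set of sentence pairs. A minimal sketch using SciPy's Pearson correlation; the score lists below are placeholders, not data from the experiments.

```python
from scipy.stats import pearsonr

# Placeholder scores: one entry per (candidate, reference) sentence pair.
bleu_scores = [0.72, 0.41, 0.55, 0.63, 0.38]   # automatic metric scores
human_adequacy = [5, 3, 4, 4, 2]               # 5-point human adequacy ratings

r, p_value = pearsonr(bleu_scores, human_adequacy)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```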

Page 12

String Accuracy

Simple String Accuracy = 1 - (I + D + S) / #words (Callaway 2003; Langkilde 2002; Rambow et al. 2002; Leusch et al. 2003)

Generation String Accuracy = 1 - (M + I' + D' + S) / #words (Rambow et al. 2002)
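To make the arithmetic concrete, here is a minimal Python sketch of simple string accuracy: insertions, deletions and substitutions come from a standard word-level edit-distance computation, and the resulting error count is divided by the reference length. This is an illustrative reimplementation, not the authors' code.

```python
def simple_string_accuracy(candidate, reference):
    """Simple string accuracy: 1 - (insertions + deletions + substitutions)
    divided by the reference length, with edit operations obtained from a
    standard word-level edit-distance dynamic program."""
    cand, ref = candidate.split(), reference.split()
    m, n = len(cand), len(ref)
    # dp[i][j] = minimum number of edits to turn cand[:i] into ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if cand[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return 1.0 - dp[m][n] / n

print(simple_string_accuracy("the dog saw the man",
                             "the man was seen by the dog"))
```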

Page 13

Simple String Accuracy

Candidate: The dog saw the man
Reference: The man was seen by the dog

(The slide shows the word alignment between the two strings, labelled with matches (M), substitutions (S), insertions (I) and deletions (D).)

Page 14

BLEU

Developed by Papineni et al. at IBM

Key idea: count matching n-grams between the candidate and reference sentences

Avoid counting matches multiple times by clipping

Punish differences in sentence length
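The following is a bare-bones, single-reference BLEU sketch (illustrative only, not the official implementation): clipped n-gram precisions for n = 1..4 are combined by a geometric mean and multiplied by a brevity penalty. On the slide's example pair it returns 0, because no trigram or 4-gram is shared, which already hints at how harshly single-reference n-gram metrics treat legitimate paraphrases.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_sketch(candidate, reference, max_n=4):
    """Single-reference BLEU sketch: clipped n-gram precisions combined by a
    geometric mean, multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipping: each candidate n-gram is credited at most as many times
        # as it occurs in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0   # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu_sketch("the dog saw the man", "the man was seen by the dog"))
```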

Page 15

BLEU

Candidate: The dog saw the man
Reference: The man was seen by the dog

Page 16

NIST ngram

Designed to fix two problems with BLEU:

The geometric mean penalizes large N

We might like to prefer n-grams that are more informative, i.e. less likely

Page 17

NIST ngram

Arithmetic average over all n-gram co-occurrences

Weight "less likely" n-grams more heavily

Use a brevity factor to punish varying sentence lengths
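A rough sketch of the information-weighting idea, under simplifying assumptions: n-gram information values are estimated from whatever tokenized reference corpus is supplied, clipping and the brevity factor are omitted, and the function names are invented for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def make_info(reference_corpus, max_n=5):
    """Estimate NIST-style information values from a tokenized corpus:
    info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn))."""
    counts = Counter()
    total_words = sum(len(sent) for sent in reference_corpus)
    for sent in reference_corpus:
        for n in range(1, max_n + 1):
            counts.update(ngrams(sent, n))

    def info(gram):
        if counts[gram] == 0:
            return 0.0   # unseen n-gram: no information estimate available
        prefix = counts[gram[:-1]] if len(gram) > 1 else total_words
        return math.log2(prefix / counts[gram])

    return info

def nist_like(candidate, reference, info, max_n=5):
    """Information-weighted n-gram matches, normalised (per n) by the number
    of candidate n-grams; clipping and the brevity factor are omitted."""
    cand, ref = candidate.split(), reference.split()
    score = 0.0
    for n in range(1, max_n + 1):
        ref_grams = set(ngrams(ref, n))
        cand_grams = ngrams(cand, n)
        if cand_grams:
            matched = (g for g in cand_grams if g in ref_grams)
            score += sum(info(g) for g in matched) / len(cand_grams)
    return score
```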

Page 18

F Measure

Idea due to Melamed 1995 and Turian et al. 2003

Same basic idea as the n-gram measures, but designed to eliminate the "double counting" done by n-gram measures

F = 2 * precision * recall / (precision + recall)

Precision(candidate | reference) = maximum matching size(candidate, reference) / |candidate|

Recall(candidate | reference) = maximum matching size(candidate, reference) / |reference|
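A small sketch of the F-measure computation. Here the maximum matching size is taken to be the size of the largest one-to-one matching between identical words (the sum of clipped word counts); the GTM metric of Turian et al. additionally rewards longer contiguous runs, which this simplification ignores.

```python
from collections import Counter

def f_measure(candidate, reference):
    """Word-overlap F-measure. The maximum matching size is the size of the
    largest one-to-one matching between identical words, i.e. the sum of
    clipped word counts; the run-length reward of GTM is ignored here."""
    cand, ref = candidate.split(), reference.split()
    mms = sum((Counter(cand) & Counter(ref)).values())  # multiset overlap
    if mms == 0:
        return 0.0
    precision = mms / len(cand)
    recall = mms / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f_measure("the dog saw the man", "the man was seen by the dog"))
```

On this pair the matching size is 4 ("the", "dog", "the", "man"), giving precision 0.8, recall 4/7 and F of roughly 0.67.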

Page 19

F Measure

Candidate: the dog saw the man
Reference: the man was seen by the dog

(The slide marks which candidate words find a match in the reference: "the", "dog", "the" and "man" match; "saw" does not.)

Page 20

LSA

Doesn't care about word order

Evaluates how similar two bags of words are with respect to a corpus

A good way of evaluating word choice?
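A toy sketch of an LSA-style comparison using scikit-learn: build a latent space from a background corpus (here a placeholder list; in practice a large document collection), project both sentences' bag-of-words vectors into it, and take their cosine similarity. The corpus, component count and function name are illustrative choices, not the setup used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder background corpus; in practice this would be a large collection.
background_corpus = [
    "a suicide bomber blew himself up at a bus stop",
    "militants opened fire on a vehicle on the road",
    "police and paramedics said the man was wounded",
]

vectorizer = TfidfVectorizer()
term_doc = vectorizer.fit_transform(background_corpus)
lsa = TruncatedSVD(n_components=2).fit(term_doc)  # tiny latent space for the toy corpus

def lsa_similarity(candidate, reference):
    """Cosine similarity between the two sentences' bag-of-words vectors
    projected into the latent space; word order plays no role."""
    vecs = lsa.transform(vectorizer.transform([candidate, reference]))
    return cosine_similarity(vecs[:1], vecs[1:])[0, 0]

print(lsa_similarity("the dog saw the man", "the man was seen by the dog"))
```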

Page 21

Summary of the evaluation metrics: how each measures fluency, adequacy and readability, and whether it punishes length differences.

SSA
Fluency: comparison against reference sentence
Adequacy: comparison against reference sentence
Readability: comparison against reference sentence from same context*
Punishes length differences? Yes (punishes deletions, insertions)

NIST n-gram, BLEU
Fluency: comparison against reference sentences (matching n-grams)
Adequacy: comparison against reference sentences
Readability: comparison against reference sentences from same context*
Punishes length differences? Yes (weights)

F measure
Fluency: comparison against reference sentences (longest matching substrings)
Adequacy: comparison against reference sentences
Readability: comparison against reference sentences from same context*
Punishes length differences? Yes (length factor)

LSA
Fluency: none
Adequacy: comparison against word co-occurrence frequencies learned from corpus
Readability: none
Punishes length differences? Not explicitly

Page 22

Experiment 1

Sentence data from Barzilay and Lee's paraphrase generation system (Barzilay and Lee 2002)

Includes word choice variation, e.g. "Another person was also seriously wounded in the attack" vs. "Another individual was also seriously wounded in the attack"

Includes word order variation, e.g. "A suicide bomber blew himself up at a bus stop east of Tel Aviv on Thursday, killing himself and wounding five bystanders, one of them seriously, police and paramedics said" vs. "A suicide bomber killed himself and wounded five, when he blew himself up at a bus stop east of Tel Aviv on Thursday"

Page 23

Paraphrase Generation

1. Cluster like sentences
• By hand or using word n-gram co-occurrence statistics
• May first remove certain details

2. Compute multiple-sequence alignment
• Choice points and regularities in input sentence pairs/sets in a corpus

3. Match lattices
• Match between corpora

4. Generate
• Lattice alignment

Page 24

Paraphrase Generation Issues

Sometimes words chosen for substitution carry unwanted connotations

Sometimes extra words are chosen for inclusion (or words removed) that change the meaning

Page 25

Experiment 1

To explore the impact of word choice variation, we used the baseline system data

To explore the impact of word choice and word order variation, we used the MSA data

Page 26

Human Evaluation

Barzilay and Lee used a binary rating for meaning preservation and difficulty

We used a five-point scale:

How much of the meaning expressed in Sentence A is also expressed in Sentence B? All / Most / Half / Some / None

How do you judge the fluency of Sentence B? It is Flawless / Good / Adequate / Poor / Incomprehensible

Page 27

Human Evaluation: Sentences

             Mean   St. Dev.
H1 Adequacy  4.29   0.71
H1 Fluency   4.20   0.90
H2 Adequacy  3.74   1.00
H2 Fluency   3.50   1.06
H3 Adequacy  3.99   0.78
H3 Fluency   4.68   0.58

Page 28

Results (Sentences): Automatic Evaluations

       BLEU    NIST    SSA     F       LSA
BLEU   1
NIST   0.91    1
SSA    0.894   0.863   1
F      0.927   0.9     0.955   1
LSA    0.725   0.727   0.742   0.795   1

Page 29

Results (Sentences): Human Evaluations

          Adequacy   Fluency
BLEU      0.388      -0.492
NIST      0.421      -0.563
SSA       0.412      -0.400
F         0.457      -0.412
LSA       0.467      -0.290
Adequacy  1.000      -0.032

Page 30

Discussion

These metrics achieve some level of correlation with human judgments of adequacy, but could we do better?

Most metrics are negatively correlated with fluency

Word order or constituent order variation requires a different metric

Page 31

Results (Sentences): Naïve Combining

               Adequacy   Fluency
LSA.plus.F     0.485      -0.391
LSA.plus.BLEU  0.428      -0.464
LSA.plus.SSA   0.457      -0.382

Page 32

Effect of sentence length

               Length.source   Length.target
BLEU           0.148           0.530
NIST           0.288           0.712
SSA            0.084           0.432
F              0.085           0.459
LSA            0.126           0.384
Adequacy       -0.131          0.134
Fluency        -0.099          -0.401
Length.source  1.000           0.723

Page 33

Baseline vs. MSA

Note: for the Baseline system, there is no significant correlation between human judgments of adequacy and fluency and the automatic evaluation scores.

For MSA, there is a positive correlation between human judgments of adequacy and the automatic evaluation scores, but not of fluency.

System    BLEU    NIST    SSA    F      LSA    Adequacy   Fluency
Baseline  0.753   4.156   0.86   0.89   0.954  0.833      0.756
MSA       0.29    1.945   0.42   0.53   0.845  0.77       0.897

Page 34

Discussion

Automatic evaluation metrics other than LSA punish word choice variation

Automatic evaluation metrics other than LSA punish word order variation:

They are not as affected by word order variation as by word choice variation

They punish legitimate and illegitimate word and constituent reorderings equally

Page 35

Discussion

Fluency: these metrics are not adequate for evaluating fluency in the presence of variation

Adequacy: these metrics are barely adequate for evaluating adequacy in the presence of variation

Readability: these metrics do not claim to evaluate readability

Page 36

A Preliminary Proposal

Modify automatic evaluation metrics as follows:

Do not punish legitimate word choice variation, e.g. using WordNet (see the sketch below); but the 'simple' approach doesn't work

Do not punish legitimate word order variation; but this requires a notion of constituency
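As a concrete illustration of the 'simple' WordNet relaxation the slide warns about, the sketch below (using NLTK's WordNet interface; illustrative only) treats two words as interchangeable whenever they share a synset and counts matches under that relaxation. One likely reason the simple approach falls short is that, without word sense disambiguation, unrelated senses can license a match.

```python
from nltk.corpus import wordnet as wn   # requires nltk and the 'wordnet' data package

def are_synonyms(w1, w2):
    """The 'simple' relaxation: two words count as interchangeable if they are
    identical or share at least one WordNet synset (no sense disambiguation)."""
    return w1 == w2 or bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

def synonym_aware_matches(candidate, reference):
    """Greedy one-to-one matching of candidate words to reference words,
    allowing exact matches or the synonym relaxation above."""
    unused = reference.split()
    matches = 0
    for w in candidate.split():
        for r in unused:
            if are_synonyms(w, r):
                unused.remove(r)
                matches += 1
                break
    return matches

print(synonym_aware_matches("another individual was seriously wounded",
                            "another person was seriously hurt"))
```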

Page 37

Another Preliminary Proposal

When using metrics that depend on a reference sentence, use:

A set of reference sentences; try to get as many of the word choice and word order variations as possible into the reference sentences (see the sketch after this slide)

Reference sentences from the same context as the candidate sentence, to approach an evaluation of readability

And combine with some other metric for fluency, for example a grammar checker
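One simple way to realize the reference-set idea, sketched below with an invented helper and a toy unigram F-measure as the underlying metric: score the candidate against each reference and keep the best score. BLEU and NIST handle multiple references differently, by pooling n-gram counts across them, so this is only a crude approximation of multi-reference evaluation.

```python
from collections import Counter

def unigram_f(candidate, reference):
    """Tiny stand-in metric: unigram-overlap F-measure (cf. the F-measure slide)."""
    cand, ref = candidate.split(), reference.split()
    mms = sum((Counter(cand) & Counter(ref)).values())
    if mms == 0:
        return 0.0
    p, r = mms / len(cand), mms / len(ref)
    return 2 * p * r / (p + r)

def best_over_references(candidate, references, metric=unigram_f):
    """Score a candidate against a set of references by taking the best
    single-reference score."""
    return max(metric(candidate, ref) for ref in references)

references = [
    "the man was seen by the dog",
    "a dog saw the man",
    "the dog spotted the man",
]
print(best_over_references("the dog saw the man", references))
```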

Page 38

A Proposal

To evaluate a generator:

Evaluate for coverage using recall or a related metric (see the sketch below)

Evaluate for 'precision' using separate metrics for fluency, adequacy and readability:

At this point in time, only fluency may be evaluable automatically, using a grammar checker

Adequacy can be approached using LSA or a related metric

Readability can only be evaluated using human judgments at this time
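The coverage half of this proposal is easy to state in code. The sketch below assumes a hypothetical generator interface (a callable returning a possibly empty list of realizations) and computes coverage as the fraction of inputs for which the generator produces at least one output.

```python
def coverage(generator, inputs):
    """Coverage as recall over inputs: the fraction of input representations
    for which the generator produces at least one realization. `generator`
    is assumed to be a callable returning a (possibly empty) list of strings."""
    realized = sum(1 for inp in inputs if generator(inp))
    return realized / len(inputs)

# Hypothetical usage with a toy template-based generator.
templates = {"greet": ["hello there"], "farewell": ["goodbye"]}
toy_generator = lambda inp: templates.get(inp, [])
print(coverage(toy_generator, ["greet", "farewell", "apologize"]))  # 2/3
```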

Page 39

Current and Future Work

Other potential evaluation metrics:

F measure plus WordNet

Parsing as a measure of fluency

F measure plus LSA

Multiple-sequence alignment as an evaluation metric

Metrics that evaluate readability