Evaluation of Text Generation: Automatic Evaluation vs. Variation
Amanda Stent, Mohit Singhai, Matthew Marge
Dec 14, 2015
Natural Language Generation
Content Planning
Text Planning
Sentence Planning
Surface Realization
Approaches to Surface Realization
• Template based: domain-specific; all output tends to be high quality because highly constrained
• Grammar based: typically one high-quality output per input
• Forest based: many outputs per input
• Text-to-text: no need for other generation components
Surface Realization Tasks
To communicate the input meaning as completely, clearly, and elegantly as possible through careful:
• word selection
• word and phrase arrangement
• consideration of context
Importance of Lexical Choice
I drove to Rochester.
I raced to Rochester.
I went to Rochester.
Importance of Syntactic Structure
I picked up my coat three weeks later from the dry cleaners in Smithtown
In Smithtown I picked up my coat from the dry cleaners three weeks later
Pragmatic Considerations
An Israeli civilian was killed and another wounded when Palestinian militants opened fire on a vehicle
Palestinian militant gunned down a vehicle on a road
Evaluating Text Generators
Per-generator: Coverage
Per-sentence: Adequacy, Fluency / syntactic accuracy, Informativeness
Additional metrics of interest: Range (ability to produce valid variants), Readability, Task-specific metrics (e.g. for dialog)
Evaluating Text Generators
• Human judgments
• Parsing + interpretation
• Automatic evaluation metrics, for generation or machine translation: Simple string accuracy+, NIST*, BLEU*+, F-measure*, LSA
Question
What is a “good” sentence?
readable
fluent
adequate
Approach
Question: Which evaluation metric, or set of evaluation metrics, least punishes variation?
• Word choice variation
• Syntactic structure variation
Procedure: correlation between human and automatic judgments of variations (context not included)
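The correlation procedure can be sketched in a few lines of Python. The Pearson coefficient below is the standard formula; the human ratings and metric scores are invented for illustration and are not the experiment's data.

```python
# Sketch of the evaluation procedure: correlate human judgments with an
# automatic metric's scores. The data here is hypothetical.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human_adequacy = [5, 4, 3, 2, 1]            # hypothetical 5-point ratings
metric_scores = [0.9, 0.7, 0.6, 0.4, 0.2]   # hypothetical metric output
print(round(pearson(human_adequacy, metric_scores), 3))
```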
String Accuracy
Simple String Accuracy: 1 - (I + D + S) / #Words (Callaway 2003; Langkilde 2002; Rambow et al. 2002; Leusch et al. 2003)
Generation String Accuracy: 1 - (M + I' + D' + S) / #Words (Rambow et al. 2002)
Simple String Accuracy
Reference: The dog saw the man
Candidate: The man was seen by the dog
The alignment labels each token as a match (M), substitution (S), insertion (I), or deletion (D); here almost every token counts as an error.
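A minimal sketch of Simple String Accuracy, assuming unit edit costs and taking the error count I + D + S as the token-level minimum edit distance over the reference length:

```python
# Simple String Accuracy sketch: 1 - (I + D + S) / N, with I, D, S the
# insertions, deletions, substitutions in the minimum edit script and
# N the reference length. Token-level Levenshtein distance via DP.
def simple_string_accuracy(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # dp[i][j] = edit distance between ref[:i] and cand[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == cand[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return 1.0 - dp[m][n] / m

print(simple_string_accuracy("the dog saw the man",
                             "the man was seen by the dog"))
```

The passive paraphrase scores 0.0: every legitimate reordering is counted as an error, which is exactly the weakness this talk examines.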
BLEU
Developed by Papineni et al. (2002) at IBM. Key idea: count matching subsequences (n-grams) between the reference and candidate sentences.
• Avoid counting matches multiple times by clipping
• Punish differences in sentence length (brevity penalty)
BLEU
Reference: The dog saw the man
Candidate: The man was seen by the dog
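A hedged per-sentence sketch of the idea. Real BLEU (Papineni et al. 2002) is corpus-level and typically uses up to 4-grams; this toy version stops at bigrams.

```python
# BLEU sketch: clipped n-gram precisions combined by a geometric mean,
# times a brevity penalty for short candidates.
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=2):
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipping: each candidate n-gram counts at most as often as it
        # appears in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / len(cand))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

print(round(bleu("the dog saw the man", "the man was seen by the dog"), 3))
```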
NIST ngram
Designed to fix two problems with BLEU:
• the geometric mean penalizes large N
• we might like to prefer n-grams that are more informative, i.e. less likely
NIST ngram
• Arithmetic average over all n-gram co-occurrences
• Weight "less likely" n-grams more
• Use a brevity factor to punish varying sentence lengths
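The information-weighting idea can be sketched from corpus counts; the tiny corpus below is invented for illustration.

```python
# NIST's information weight for a bigram (Doddington 2002 style):
# log2(count(w1) / count(w1 w2)), so rarer continuations weigh more.
from collections import Counter
from math import log2

corpus = "the dog saw the man . the man saw the dog .".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def info(w1, w2):
    """More surprising (rarer) bigrams get a higher weight."""
    return log2(unigram[w1] / bigram[(w1, w2)])

# "saw the" occurs every time "saw" does, so it carries no surprise;
# "the dog" follows only half the occurrences of "the", so it carries 1 bit.
print(info("saw", "the"), info("the", "dog"))
```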
F Measure
Idea due to Melamed (1995) and Turian et al. (2003). Same basic idea as the n-gram measures, but designed to eliminate the "double counting" they do.
F = 2 * precision * recall / (precision + recall)
Precision(candidate | reference) = maximum matching size(candidate, reference) / |candidate|
Recall(candidate | reference) = maximum matching size(candidate, reference) / |reference|
F Measure
Reference: the dog saw the man
Candidate: the man was seen by the dog
Matched words: "the" (twice), "dog", and "man"; the maximum matching covers four of the five reference words.
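A simplified sketch, assuming the maximum matching size reduces to per-word overlap counts (Turian et al.'s measure additionally rewards contiguous matched runs):

```python
# F measure sketch: precision and recall over a word-level maximum
# matching; with no reward for contiguity, the matching size is just the
# sum of per-word overlap counts.
from collections import Counter

def f_measure(reference, candidate):
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    match = sum((ref & cand).values())   # maximum matching size
    precision = match / sum(cand.values())
    recall = match / sum(ref.values())
    if match == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure("the dog saw the man",
                      "the man was seen by the dog"), 3))
```

For the passive example above: 4 matched words give precision 4/7 and recall 4/5, so F = 2/3.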
LSA
• Doesn't care about word order
• Evaluates how similar two bags of words are with respect to a corpus
• A good way of evaluating word choice?
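Full LSA requires an SVD over a term-document matrix built from a corpus; as a crude, order-insensitive stand-in, this sketch compares two bags of words directly by cosine similarity.

```python
# Bag-of-words cosine similarity: a simplified, order-insensitive
# comparison in the spirit of LSA (without the corpus-trained SVD).
from collections import Counter
from math import sqrt

def bag_cosine(a, b):
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values()) *
                sum(v * v for v in cb.values()))
    return dot / norm

# A pure reordering scores a perfect 1.0, so this kind of measure never
# punishes word order variation -- and so cannot capture fluency.
print(bag_cosine("the dog saw the man", "the man saw the dog"))
```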
SSA
  Means of measuring fluency: comparison against reference sentence
  Means of measuring adequacy: comparison against reference sentence
  Means of measuring readability: comparison against reference sentence from same context*
  Punishes length differences? Yes (punishes deletions, insertions)
NIST n-gram, BLEU
  Means of measuring fluency: comparison against reference sentences (matching n-grams)
  Means of measuring adequacy: comparison against reference sentences
  Means of measuring readability: comparison against reference sentences from same context*
  Punishes length differences? Yes (weights)
F measure
  Means of measuring fluency: comparison against reference sentences (longest matching substrings)
  Means of measuring adequacy: comparison against reference sentences
  Means of measuring readability: comparison against reference sentences from same context*
  Punishes length differences? Yes (length factor)
LSA
  Means of measuring fluency: none
  Means of measuring adequacy: comparison against word co-occurrence frequencies learned from corpus
  Means of measuring readability: none
  Punishes length differences? Not explicitly
Experiment 1
Sentence data from Barzilay and Lee's paraphrase generation system (Barzilay and Lee 2002)
Includes word choice variation, e.g. "Another person was also seriously wounded in the attack" vs. "Another individual was also seriously wounded in the attack"
Includes word order variation, e.g. "A suicide bomber blew himself up at a bus stop east of Tel Aviv on Thursday, killing himself and wounding five bystanders, one of them seriously, police and paramedics said" vs. "A suicide bomber killed himself and wounded five, when he blew himself up at a bus stop east of Tel Aviv on Thursday"
Paraphrase Generation
1. Cluster like sentences
   • by hand or using word n-gram co-occurrence statistics
   • may first remove certain details
2. Compute multiple-sequence alignment
   • choice points and regularities in input sentence pairs/sets in a corpus
3. Match lattices
   • match between corpora
4. Generate
   • lattice alignment
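The alignment idea in step 2 can be illustrated for a single sentence pair using Python's difflib: shared blocks correspond to regularities, and differing spans are the choice points where a lattice would branch. (Barzilay and Lee align whole sentence sets, not just pairs; this is only a toy illustration.)

```python
# Pairwise word alignment with difflib, as a stand-in for one step of
# multiple-sequence alignment over a sentence cluster.
from difflib import SequenceMatcher

a = "another person was also seriously wounded in the attack".split()
b = "another individual was also seriously wounded in the attack".split()
sm = SequenceMatcher(a=a, b=b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "equal":
        print("regularity  :", " ".join(a[i1:i2]))
    else:
        print("choice point:", a[i1:i2], "<->", b[j1:j2])
```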
Paraphrase Generation Issues
Sometimes words chosen for substitution carry unwanted connotations
Sometimes extra words are chosen for inclusion (or words removed) that change the meaning
Experiment 1
To explore the impact of word choice variation, we used the baseline system data
To explore the impact of word choice and word order variation, we used the MSA data
Human Evaluation
Barzilay and Lee used a binary rating for meaning preservation and difficulty.
We used a five-point scale:
• How much of the meaning expressed in Sentence A is also expressed in Sentence B? All/Most/Half/Some/None
• How do you judge the fluency of Sentence B? It is Flawless/Good/Adequate/Poor/Incomprehensible
Human Evaluation: Sentences
             Mean   St. Dev.
H1 Adequacy  4.29   0.71
H1 Fluency   4.20   0.90
H2 Adequacy  3.74   1.00
H2 Fluency   3.50   1.06
H3 Adequacy  3.99   0.78
H3 Fluency   4.68   0.58
Results Sentences:Automatic Evaluations
       BLEU   NIST   SSA    F      LSA
BLEU   1
NIST   0.91   1
SSA    0.894  0.863  1
F      0.927  0.9    0.955  1
LSA    0.725  0.727  0.742  0.795  1
Results Sentences:Human Evaluations
          Adequacy  Fluency
BLEU      0.388     -0.492
NIST      0.421     -0.563
SSA       0.412     -0.400
F         0.457     -0.412
LSA       0.467     -0.290
Adequacy  1.000     -0.032
Discussion
These metrics achieve some level of correlation with human judgments of adequacy, but could we do better?
Most metrics are negatively correlated with fluency; word order or constituent order variation requires a different metric.
Results Sentences: Naïve Combining
               Adequacy  Fluency
LSA.plus.F     0.485     -0.391
LSA.plus.BLEU  0.428     -0.464
LSA.plus.SSA   0.457     -0.382
Effect of sentence length
               Length.source  Length.target
BLEU           0.148          0.530
NIST           0.288          0.712
SSA            0.084          0.432
F              0.085          0.459
LSA            0.126          0.384
Adequacy       -0.131         0.134
Fluency        -0.099         -0.401
Length.source  1.000          0.723
Baseline vs. MSA
Note: For the Baseline system, there is no significant correlation between human judgments of adequacy and fluency and automatic evaluation scores.
For MSA, there is a positive correlation between human judgments of adequacy and automatic evaluation scores, but not fluency.
System    BLEU   NIST   SSA   F     LSA    Adequacy  Fluency
Baseline  0.753  4.156  0.86  0.89  0.954  0.833     0.756
MSA       0.29   1.945  0.42  0.53  0.845  0.77      0.897
Discussion
Automatic evaluation metrics other than LSA punish word choice variation.
Automatic evaluation metrics other than LSA punish word order variation:
• they are not as affected by word order variation as by word choice variation
• they punish legitimate and illegitimate word and constituent reorderings equally
Discussion
Fluency: these metrics are not adequate for evaluating fluency in the presence of variation.
Adequacy: these metrics are barely adequate for evaluating adequacy in the presence of variation.
Readability: these metrics do not claim to evaluate readability.
A Preliminary Proposal
Modify automatic evaluation metrics so that they:
• do not punish legitimate word choice variation, e.g. using WordNet (but the 'simple' approach doesn't work)
• do not punish legitimate word order variation (but this requires a notion of constituency)
Another Preliminary Proposal
When using metrics that depend on a reference sentence, use:
• a set of reference sentences, capturing as many of the word choice and word order variations as possible
• reference sentences from the same context as the candidate sentence, to approach an evaluation of readability
And combine with some other metric for fluency, for example a grammar checker.
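The multiple-reference idea can be sketched as taking the best score over the reference set, so a legitimate variant that appears among the references is not punished. The token-overlap F1 here is a hypothetical stand-in for any reference-based metric.

```python
# Multi-reference scoring sketch: score a candidate against every
# reference and keep the best match.
from collections import Counter

def overlap_f1(ref, cand):
    """Illustrative token-overlap F1 between two sentences."""
    r, c = Counter(ref.split()), Counter(cand.split())
    m = sum((r & c).values())
    return 0.0 if m == 0 else 2 * m / (sum(r.values()) + sum(c.values()))

def best_over_references(references, candidate):
    return max(overlap_f1(r, candidate) for r in references)

# With the passive variant included in the reference set, the passive
# candidate is no longer punished.
refs = ["the dog saw the man", "the man was seen by the dog"]
print(best_over_references(refs, "the man was seen by the dog"))
```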
A Proposal
To evaluate a generator:
• Evaluate for coverage using recall or a related metric
• Evaluate for 'precision' using separate metrics for fluency, adequacy, and readability
At this point in time, only fluency may be evaluable automatically, using a grammar checker.
Adequacy can be approached using LSA or a related metric.
Readability can only be evaluated using human judgments at this time.
Current and Future Work
Other potential evaluation metrics:
• F measure plus WordNet
• Parsing as a measure of fluency
• F measure plus LSA
• Multiple-sequence alignment as an evaluation metric
• Metrics that evaluate readability