The Pyramid Method at DUC05

Ani Nenkova
Becky Passonneau
Kathleen McKeown

Other team members: David Elson, Advaith Siddharthan, Sergey Siegelman

Overview

Review of Pyramids (Kathy)
Characteristics of the responses
Analyses (Ani)
  Scores and significant differences
  Reliability of Pyramid scoring
    Comparisons between annotators
    Impact of editing on scores
    Impact of Weight 1 SCUs
  Correlation with responsiveness and Rouge
Lessons learned

Pyramids

Uses multiple human summaries
  Previous data indicated 5 needed for score stability
Information is ranked by its importance
Allows for multiple good summaries
A pyramid is created from the human summaries
Elements of the pyramid are content units
System summaries are scored by comparison with the pyramid

Summarization Content Units

Near-paraphrases from different human summaries

Clause or less

Avoids explicit semantic representation

Emerges from analysis of human summaries

SCU: A cable car caught fire (Weight = 4)

A. The cause of the fire was unknown.

B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.

C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.

D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

SCU: The cause of the fire is unknown (Weight = 1)

(Summaries A to D as in the first SCU example.)




SCU: The accident happened in the Austrian Alps (Weight = 3)

(Summaries A to D as in the first SCU example.)




Idealized representation

Tiers of differentially weighted SCUs

Top: few SCUs, high weight

Bottom: many SCUs, low weight

[Figure: pyramid of SCU tiers labeled W=3 (top), W=2, W=1 (bottom)]
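A minimal sketch of how tiers could be derived from annotated SCUs, assuming each SCU's weight is the number of human summaries that express it. The function name `build_pyramid` and the contributor sets are illustrative assumptions, chosen only to match the weights on the example slides (4, 3, 1).

```python
from collections import defaultdict

# Hypothetical annotation: SCU label -> set of human summaries expressing it.
# The contributor sets are assumptions; only the weights come from the slides.
scu_contributors = {
    "A cable car caught fire": {"A", "B", "C", "D"},
    "The accident happened in the Austrian Alps": {"B", "C", "D"},
    "The cause of the fire is unknown": {"A"},
}

def build_pyramid(contributors):
    """Group SCUs into tiers keyed by weight (number of contributing summaries)."""
    tiers = defaultdict(list)
    for scu, summaries in contributors.items():
        tiers[len(summaries)].append(scu)
    return dict(tiers)

pyramid = build_pyramid(scu_contributors)
for weight in sorted(pyramid, reverse=True):  # top tier first
    print(f"W={weight}: {pyramid[weight]}")
```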

Creation of pyramids

Done for each of 20 out of 50 sets

Primary annotator, secondary checker

Held round-table discussions of problematic constructions that occurred in this data set

  Comma-separated lists
    Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation.
  General vs. specific
    Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey

Characteristics of the Responses

Proportion of SCUs of Weight 1 is large: 44% (D324) to 81% (D695)

Mean SCU weight: 1.9

Agreement among human responders is quite low

SCU Weights

[Figure: number of SCUs at each weight]

Pyramids: DUC 2003

100-word summaries (vs. 250-word)
10 500-word articles per cluster (vs. 30 720-word articles)
3 clusters (vs. 20 clusters)
Mean SCU weight (7 models)
  2005: avg 1.9
  2003: avg 2.4
Proportion of SCUs of W=1
  2005: avg 60%, range 44% to 81%
  2003: avg 40%, range 37% to 47%
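The two statistics compared above (mean SCU weight and share of weight-1 SCUs) can be computed from a pyramid's list of SCU weights. This is a sketch only; the function name `scu_weight_stats` and the weight list are hypothetical, not DUC data.

```python
def scu_weight_stats(weights):
    """Return (mean SCU weight, fraction of SCUs with weight 1)."""
    mean_weight = sum(weights) / len(weights)
    share_w1 = sum(1 for w in weights if w == 1) / len(weights)
    return mean_weight, share_w1

# Hypothetical pyramid: six SCUs of weight 1, two of weight 2,
# one of weight 3 and one of weight 5.
weights = [1] * 6 + [2] * 2 + [3] + [5]
mean_weight, share_w1 = scu_weight_stats(weights)
print(f"mean weight = {mean_weight:.1f}, weight-1 share = {share_w1:.0%}")
```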

[Figure: DUC03 vs. DUC05 comparison]

Computing pyramid scores: Ideally informative summary

Does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well
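The property above amounts to filling a summary from the top tier downward. A small sketch, assuming the pyramid is given as a mapping from weight to the number of SCUs in that tier; `ideal_summary` and the tier sizes are hypothetical names and values.

```python
def ideal_summary(tier_sizes, n_scus):
    """Pick n_scus SCUs top-down; return {weight: count taken from that tier}."""
    taken = {}
    remaining = n_scus
    for weight in sorted(tier_sizes, reverse=True):
        if remaining == 0:
            break
        k = min(tier_sizes[weight], remaining)
        taken[weight] = k
        remaining -= k
    return taken

# Hypothetical pyramid: 1 SCU of weight 3, 2 of weight 2, 4 of weight 1.
tier_sizes = {3: 1, 2: 2, 1: 4}
print(ideal_summary(tier_sizes, 4))  # {3: 1, 2: 2, 1: 1}
```

A lower tier is touched only once every higher tier is exhausted, which is the condition stated on the slide.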

[Figure: ideally informative summary]

Original Pyramid Score

SCORE = D/MAX

D: Sum of the weights of the SCUs in a summary

MAX: Sum of the weights of the SCUs in an ideally informative summary

Measures the proportion of good information in the summary: precision
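A sketch of SCORE = D/MAX. It assumes MAX is taken for an ideally informative summary with the same number of SCUs as the summary being scored; the function names and the example weights are hypothetical.

```python
def max_weight(pyramid_weights, n_scus):
    """Weight of an ideally informative summary containing n_scus SCUs."""
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])

def original_pyramid_score(summary_weights, pyramid_weights):
    """SCORE = D / MAX, with MAX computed at the summary's own SCU count (assumption)."""
    d = sum(summary_weights)
    mx = max_weight(pyramid_weights, len(summary_weights))
    return d / mx if mx else 0.0

pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]  # hypothetical pyramid
summary_weights = [4, 2, 1]                 # weights of the SCUs found in one summary
print(original_pyramid_score(summary_weights, pyramid_weights))  # 7 / 10 = 0.7
```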

Modified pyramid score (recall)

EN = average number of SCUs in the human models
  This is the number of content units humans chose to convey about the story
W = weight of a maximally informative summary of size EN
D/W is the modified pyramid score
  Shows the proportion of expected good information
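A sketch of the modified score D/W under stated assumptions: EN is rounded to the nearest integer before selecting SCUs (the slide does not say how a fractional average is handled), and all names and numbers below are hypothetical.

```python
def modified_pyramid_score(summary_weights, pyramid_weights, model_scu_counts):
    """D / W, where W is the weight of the best possible summary with EN SCUs."""
    # EN: average number of SCUs per human model summary (rounding is an assumption).
    en = round(sum(model_scu_counts) / len(model_scu_counts))
    w = sum(sorted(pyramid_weights, reverse=True)[:en])
    return sum(summary_weights) / w if w else 0.0

pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]  # hypothetical pyramid
model_scu_counts = [5, 4, 6, 5]             # SCUs per human summary, so EN = 5
summary_weights = [4, 2, 1]                 # D = 7
# W = 4 + 3 + 3 + 2 + 2 = 14, so the score is 7 / 14 = 0.5
print(modified_pyramid_score(summary_weights, pyramid_weights, model_scu_counts))
```

Because W depends only on the human models, a summary that expresses few SCUs is penalized here, whereas D/MAX rewards it as long as the few SCUs are highly weighted.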

Scoring Methods

Presents scores for the 20 pyramid sets
Recompute Rouge for comparison
  We compute Rouge using only 7 models
  8 and 9 reserved for computing human performance
  Best because of significant topic effect
Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4
  Pyramid scores computed from multiple humans
  Responsiveness is just one human's judgment
  Rouge-SU4 equivalent to Rouge-2

Preview of Results

Manual metrics
  Large differences between humans and machines
  No single system the clear winner
  But a top group identified by all metrics
Significant differences
  Different predictions from manual and automatic metrics
Correlations between metrics
  Some correlation, but one cannot be substituted for another
  This is good

Human performance / Best system

             Pyramid     Modified    Resp      ROUGE-SU4
Humans       B: 0.5472   B: 0.4814   A: 4.895  A: 0.1722
             A: 0.4969   A: 0.4617   B: 4.526  B: 0.1552
Best system  14: 0.2587  10: 0.2052  4: 2.85   15: 0.139

Best system ~50% of human performance on manual metrics

Best system ~80% of human performance on ROUGE
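The rough percentages can be checked by dividing the best system score by the best human score for each metric. A quick sketch using the figures from the table; the dictionary names are illustrative.

```python
# Best human and best system score per metric, as reported on the slide.
human_best = {"Pyramid": 0.5472, "Modified": 0.4814, "Resp": 4.895, "ROUGE-SU4": 0.1722}
system_best = {"Pyramid": 0.2587, "Modified": 0.2052, "Resp": 2.85, "ROUGE-SU4": 0.139}

ratios = {metric: system_best[metric] / human_best[metric] for metric in human_best}
for metric, ratio in ratios.items():
    print(f"{metric}: best system at {ratio:.0%} of best human")
```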

Pyramid original  Modified    Resp      Rouge-SU4
14: 0.2587        10: 0.2052  4: 2.85   15: 0.139
17: 0.2492        17: 0.1972  14: 2.8   4: 0.134
15: 0.2423        14: 0.1908  10: 2.65  17: 0.1346
10: 0.2379        7: 0.1852   15: 2.6   19: 0.1275
4: 0.2321         15: 0.1808  17: 2.55  11: 0.1259
7: 0.2297         4: 0.177    11: 2.5   10: 0.1278
16: 0.2265        16: 0.1722  28: 2.45  6: 0.1239
6: 0.2197         11: 0.1703  21: 2.45  7: 0.1213
32: 0.2145        6: 0.1671   6: 2.4    14: 0.1264
21: 0.2127        12: 0.1664  24: 2.4   25: 0.1188
12: 0.2126        19: 0.1636  19: 2.4   21: 0.1183
11: 0.2116        21: 0.1613  6: 2.4    16: 0.1218
26: 0.2106        32: 0.1601  27: 2.35  24: 0.118
19: 0.2072        26: 0.1464  12: 2.35  12: 0.116
28: 0.2048        3: 0.145    7: 2.3    3: 0.1198
13: 0.1983        28: 0.1427  25: 2.2   28: 0.1203
3: 0.1949         13: 0.1424  32: 2.15  27: 0.110
1: 0.1747         25: 0.1406  3: 2.1    13: 0.1097

Significant Differences

Manual metrics
  Few differences between systems
    Pyramid: 23 is worse
    Responsiveness: 23 and 31 are worse
  Both humans better than all systems
Automatic (Rouge-SU4)
  Many differences between systems
  One human indistinguishable from 5 systems

Multiple and pairwise comparisons

Multiple comparisons
  Tukey's method
  Controls for the experiment-wise type I error
  Shows fewer significant differences
Pairwise comparisons
  Wilcoxon paired test
  Controls the error for individual comparisons
  Appropriate for judging how your system did during development
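For readers who want to try a pairwise comparison on their own per-set scores, here is a hand-rolled sketch of an exact two-sided Wilcoxon signed-rank test. It is not the script used for the DUC analysis; the function name and the example scores are hypothetical, and zero differences are simply dropped.

```python
from collections import Counter

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test; returns (statistic, p_value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # doubled ranks, so tied (average) ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of the 1-based ranks i+1 .. j+1
        i = j + 1
    total2 = sum(ranks2)
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    stat2 = min(w_plus2, total2 - w_plus2)
    # Null distribution of the doubled positive-rank sum: each sign is a fair coin flip.
    dist = Counter({0: 1})
    for r in ranks2:
        shifted = Counter({s + r: c for s, c in dist.items()})
        dist = dist + shifted
    extreme = sum(c for s, c in dist.items() if min(s, total2 - s) <= stat2)
    return stat2 / 2, extreme / 2 ** n

# Hypothetical per-set scores for two systems on the same six document sets.
system_x = [0.25, 0.31, 0.22, 0.28, 0.35, 0.27]
system_y = [0.20, 0.21, 0.24, 0.19, 0.26, 0.20]
statistic, p_value = wilcoxon_signed_rank(system_x, system_y)
print(f"W = {statistic}, p = {p_value:.4f}")
```

The test is paired by document set, which matters here because of the topic effect noted earlier in the talk.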

Modified pyramid: significant differences

• One system accounts for most of the differences
• Humans significantly better than all systems

Peer                                     Better than
21, 32, 6, 12, 19, 11, 16, 4, 15, 7, 14  23
17, 10                                   23 20
A, B                                     23 20 30 24 31 1 27 25 28 13 26 3 21 32 6 12 19 11 16 4 15 7 14 17 10

Responsiveness 1: Significant differences

• Differences primarily between 2 systems
• Differences between humans and each system

Peer                              Better than
26, 13, 20, 3, 32, 25, 7, 12, 27  23
6, 16, 19, 24, 21, 28, 11, 17     23 31
15, 10                            23 31 1
14                                23 31 1 30 26 13 20
4                                 23 31 1 30 26 13 20 3
B, A                              23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4

Responsive-2

• Similar shape to original

Peer                     Better than
16, 12, 15, 28, 3, 7, 4  23
14, 17, 10               23 31 20
B, A                     23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4

Skip-bigram: significant differences

• Many more differences between systems than any manual metric
• No difference between human and 5 systems

Peer                                                Better than
20, 31, 26, 1                                       23
32                                                  23 20
11, 28, 13, 30, 27, 3                               23 20 31
16, 21, 12, 24, 25, 7, 14, 6, 19, 10, 17, 4, 15     23 20 31 26 1
B                                                   23 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6
A                                                   23 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6 19 10 17 4 15

Pairwise comparisons: Modified Pyramid

Peer  Better than
10    25 27 31 24 30 20 23
17    3 25 27 24 30 20 23
14    25 27 1 24 30 20 23
7     13 25 27 31 24 30 20 23
15    3 25 27 1 24 30 20 23
4     25 27 31 24 30 20 23
16    24 30 23
11    24 30 23
19    24 30 23
12    30 23
6     31 30 23
32    24 30 20 23
21    24 30 23
3     30 23
26    23
13    23
28    30 20 23

Agreement between annotators

                   Overall  Low  High
Percent agreement  95%      90%  96%
Kappa              .57      .46  .62
Alpha              .57      .41  .59
Alpha-Dice         .67      .49  .68
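The gap between percent agreement and kappa in the table is what chance correction produces when one label dominates. The sketch below uses Cohen's kappa for two annotators as an illustration; the slide does not say which kappa variant was computed, so that choice, the function name, and the toy labels are all assumptions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance if both annotators labeled independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy data: 20 items, label 1 is rare, annotators disagree on two items.
annotator_1 = [0] * 18 + [1, 1]
annotator_2 = [0] * 17 + [1] + [1, 0]
agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / 20
print(f"percent agreement = {agreement:.2f}")                       # 0.90
print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")       # 0.44
```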

Editing of participant annotations

To correct obvious errors
Ensures uniform checking
Predominantly involved correctly splitting non-matching SCUs
Average paired differences
  Original: 0.0043
  Modified: 0.0005
Average magnitude of the difference
  Original: 0.0115
  Modified: 0.0032
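The two statistics differ in whether positive and negative changes cancel. A minimal sketch, assuming scores are paired per summary before and after editing; the function name and the scores are hypothetical.

```python
def paired_difference_stats(before, after):
    """Return (mean signed difference, mean absolute difference) of paired scores."""
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = sum(diffs) / len(diffs)
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    return mean_diff, mean_abs_diff

# Hypothetical scores for four summaries before and after editing.
before = [0.20, 0.25, 0.15, 0.30]
after = [0.21, 0.24, 0.15, 0.31]
mean_diff, mean_abs_diff = paired_difference_stats(before, after)
print(f"mean difference = {mean_diff:.4f}, mean magnitude = {mean_abs_diff:.4f}")
```

A mean signed difference near zero with a larger mean magnitude, as on the slide, indicates that editing moved scores in both directions rather than shifting them systematically.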

Excluding weight 1 SCUs

Removing weight 1 SCUs improves agreement
  Kappa: 0.64 (was 0.57)
Annotating without weight 1 has negligible impact on scores
  Set D324 done without weight 1 SCUs
  Ave. magnitude between paired differences
    On average 0.07 difference

Correlations: Pearson's, 25 systems

          Pyr-mod  Resp-1  Resp-2  R-2   R-SU4
Pyr-orig  0.96     0.77    0.86    0.84  0.80
Pyr-mod            0.81    0.90    0.90  0.86
Resp-1                     0.83    0.92  0.92
Resp-2                             0.88  0.87
R-2                                      0.98


Questionable that responsiveness could be a gold standard

Pyramid and responsiveness

High correlation, but the metrics are not mutually substitutable

Pyramid and Rouge

High correlation, but the metrics are not mutually substitutable
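For reference, Pearson's r between two metrics is computed over per-system scores. The sketch below uses the original and modified pyramid scores of five systems from the ranking table (14, 17, 15, 10, 4); with only five systems it is an illustration and is not expected to reproduce the 25-system values above. The function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Scores for systems 14, 17, 15, 10, 4, taken from the ranking table.
pyr_orig = [0.2587, 0.2492, 0.2423, 0.2379, 0.2321]
pyr_mod = [0.1908, 0.1972, 0.1808, 0.2052, 0.1770]
r = pearson_r(pyr_orig, pyr_mod)
print(f"Pearson's r on this five-system subset: {r:.2f}")
```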

Lessons Learned

Comparing content is hard
  All kinds of judgment calls
  We didn't evaluate the NIST assessors in previous years
Paraphrases
  VP vs. NP
    Ministers have been exchanged
    Reciprocal ministerial visits
  Length and constituent type
    Robotics assists doctors in the medical operating theater
    Surgeons started using robotic assistants

Modified scores better

Easier peer annotation
  Can drop weight 1 SCUs
Better agreement
  No emphasis on splitting non-matching SCUs

Agreement between annotators

Participants can perform peer annotation reliably
Absolute difference between scores
  Original: 0.0555
  Modified: 0.0617
  Empirical prediction of difference: 0.06 (HLT 2004)

Correlations

Original and modified can substitute for each other

High correlation between manual and automatic, but automatic not yet a substitute

Similar patterns between pyramid and responsiveness

Current Directions

Automated identification of SCUs (Harnly et al 05)

Applied to DUC05 pyramid data set

Correlation of .91 with modified pyramid scores

Questions

What was the experience annotating pyramids?
  Does it shed insight on the problem?
  Are people willing to do it again?
  Would you have been willing to go through training?
If you've done pyramid analysis, can you share your insights?

Annotators  Setid  Alpha  Dice  Alpha-dice
102:218     324    0.59   0.71  0.67
108:120     400    0.45   0.72  0.53
109:122     407    0.41   0.59  0.49
112:126     426    0.54   0.74  0.63
116:124     633    0.58   0.87  0.68
121:125     695    0.51   0.75  0.61
102:123     324    0.6    0.82  0.69
218:123     324    0.49   0.66  0.56

Correlations of Scores on Matched Sets

Annotators  Set Id  Pearson's w/ Orig  Pearson's w/ Modif
102:218     324     0.76 (.54-.89)     0.83 (.66-.92)
108:120     400     0.84 (.67-.92)     0.89 (.77-.95)
109:122     407     0.92 (.83-.96)     0.91 (.80-.96)
112:126     426     0.9 (.78-.95)      0.95 (.90-.98)
116:124     633     0.81 (.62-.91)     0.78 (.57-.90)
121:125     695     0.91 (.81-.96)     0.92 (.83-.96)
102:123     324     0.7 (.44-.85)      0.73 (.48-.87)
218:123     324     0.6 (.29-.80)      0.77 (.55-.89)
