Single-document and multi-document summary evaluation using Relative Utility

Dragomir R. Radev (1,2), Daniel Tam (1), and Güneş Erkan (1)
(1) Department of Electrical Engineering and Computer Science
(2) School of Information
University of Michigan, Ann Arbor, MI 48109
{radev,dtam,gerkan}@umich.edu
Abstract
We present a series of experiments to demonstrate the validity of Relative Utility (RU) as a measure for evaluating extractive summarization systems. Like some other evaluation metrics, it compares sentence selection between machine and reference summarizers. Additionally, RU is applicable in both single-document and multi-document summarization, is extendable to arbitrary compression rates with no extra annotation effort, and takes into account both random system performance and interjudge agreement. RU also provides an option for penalizing summaries that include sentences with redundant information. Our results are based on the JHU summary corpus and indicate that Relative Utility is a reasonable, and often superior, alternative to several common summary evaluation metrics. We also give a comparison of RU with some other well-known metrics with respect to their correlation with human judgements on the DUC corpus.
1 Introduction
One major bottleneck in the development of text summarization systems is the absence of well-defined and standardized evaluation metrics. In this paper we will discuss Relative Utility (RU), a method for evaluating extractive summarizers, both single-document and multi-document. We will address some advantages of RU over existing co-selection metrics such as Precision, Recall, Percent Agreement, and Kappa. We will present some experiments performed on a large text corpus to discuss how RU is affected by interjudge agreement, compression rate (or summary length), and summarization method.
The main problem with traditional co-selection metrics (thus named because they measure the degree of overlap between the list of sentences selected by a judge and an automatically produced extract) such as Precision, Recall, and Percent Agreement for evaluating extractive summarizers is that human judges often disagree about which sentences make up the top n% most important in a document or cluster, and yet there appears to be an implicit, judge-independent importance value for all sentences. We base this observation on an experiment in which we asked three judges to give scores from 0 to 10 to each sentence in a multi-document cluster. Even though the relative rankings of the sentences based on the judge-assigned importance vary significantly from judge to judge, their absolute importance scores are highly correlated. We measured the utility correlation for three judges on 3,932 sentences from 200 documents from the HK News corpus. The average pairwise Pearson correlation was 0.71, which is indicative of high agreement.
In the next section, we will formally introduce the Relative Utility method. The following two sections discuss our evaluation framework. Our goal was to understand what properties of multi-document extractive summaries make them hard to evaluate using co-selection metrics and how Relative Utility can be used to give credit to summaries in which equally important sentences are substituted for one another. Section 3 describes our experimental setup while Section 4 summarizes our results and our analysis of these results. In Section 5, we discuss the fact that the presence of a sentence in a multi-document summary may readjust the importance score of another sentence (e.g., when the two sentences are paraphrases of each other or when the included sentence subsumes the other sentences). We propose a variant of Relative Utility (RU with subsumption) which addresses this problem by giving only partial credit for redundant sentences that are included in a summary. Section 6 compares RU with other well-known metrics on DUC data, and Section 7 summarizes our conclusions and sets the agenda for future research.
2 The Relative Utility evaluation method
Extractive summarization is the process of identifying highly salient units (usually words, phrases, sentences, or paragraphs) within a cluster of documents. When a cluster consists of one document, the process is called single-document extractive summarization; otherwise it is called multi-document extractive summarization.

Extractive summarization is the only scalable and domain-independent method for text summarization.
It is used in a variety of systems (e.g., [Luh58], [JMBE98]).
One common class of evaluation metrics for extractive summaries based on text unit overlap includes Precision and Recall (P&R), Percent Agreement (PA), and Kappa. The generic name for this class of evaluation methods is co-selection, as they measure to what extent an automatic extract overlaps with manual extracts.
Using metrics such as P&R or PA [JMBE98, GKMC99] to evaluate summaries creates the possibility that two equally good extracts are judged very differently.
Suppose that a manual summary contains sentences {1 2} from a document. Suppose also that two systems, A and B, produce summaries consisting of sentences {1 2} and {1 3}, respectively. Using P&R or PA, system A will be ranked much higher than system B. It is quite possible, however, that for the purpose of summarization, sentences 2 and 3 are equally important, in which case the two systems should get the same score. It is known from the literature on summarization (e.g., [JMBE98]) that given a target summary length, judges often pick different sentences. We will call this observation the principle of Summary Sentence Substitutability (SSS).
The Relative Utility (RU) method [RJB00] allows ideal summaries to consist of sentences with variable membership. With RU, the ideal summary represents all sentences of the input document(s) with confidence values for their inclusion in the summary. It directly addresses the SSS problem because it allows sentences in different summaries of the same input to be substituted (at a small cost) for one another.
For example, a document with five sentences {1 2 3 4 5} is represented as {1/10 2/9 3/9 4/2 5/4}. The second number in each pair indicates the degree to which the given sentence should be part of the summary according to a human judge. We call this number the utility of the sentence. Utility depends on the input documents and the judge. It does not depend on the summary length. In the example, the system that selects sentences {1 2} will not get a higher score than a system that chooses sentences {1 3}, given that both summaries {1 2} and {1 3} carry the same number of utility points (10+9). Given that no other combination of two sentences carries a higher utility, both systems {1 2} and {1 3} produce optimal extracts at the given target length of two sentences.
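To make the arithmetic concrete, here is a minimal sketch (our own illustration, in Python, reusing the utilities from this example) that enumerates all two-sentence extracts and confirms that {1 2} and {1 3} are the two maximum-utility choices:

from itertools import combinations

# Judge-assigned utilities from the example above: sentence id -> utility
utilities = {1: 10, 2: 9, 3: 9, 4: 2, 5: 4}

def extract_utility(extract):
    """Total utility of an extract (an iterable of sentence ids)."""
    return sum(utilities[s] for s in extract)

e = 2  # target extract length
best = max(extract_utility(c) for c in combinations(utilities, e))
optimal = [set(c) for c in combinations(utilities, e)
           if extract_utility(c) == best]
print(best, optimal)  # 19 [{1, 2}, {1, 3}]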
In Relative Utility experiments, judges are asked to assign numerical scores to individual sentences from a single document or a cluster of related documents. A score of 10 indicates that a sentence is central to the topic of the cluster while a score of 0 marks a totally irrelevant sentence.
2.1 An example
S#  J1 util  J2 util  Text
2   9        8        The preliminary investigations showed that at this stage, human-to-human transmission of the H5N1 influenza A virus has not been proven and further investigations will be made to study this possibility, the Special Working Group on H5N1 announced today (Sunday)
3   7        4        The initial findings also showed that the four H5 cases did not share a common source, nor was the virus transmitted from one case to the others.
7   7        6        However, there is no cause for panic as available evidence does not suggest that the disease is widespread.
9   7        7        The WHO has been asked to alert vaccine production centres in the world in the case investigation to follow developments here with a view to preparing the necessary vaccines.
14  8        8        He said the Department would disseminate to doctors, medical professionals, colleges and health care workers available information about the H5 virus through letters and the Department of Health's homepage on the Internet (http:/www.info.gov.hk/dh/).

Figure 1: A 5-sentence extractive summary created from document D-19971207-001 (in cluster 398) by LDC Judge J1.
S#  J1 util  J2 util  Text
11  8        10       To further enhance surveillance in Hong Kong, Dr Saw said, the Department of Health would extend surveillance coverage to all General Out-patient Clinics.
12  4        10       The Hospital Authority would also set up surveillance in public hospitals.
13  6        10       In the meantime, Dr Saw said, the Agriculture and Fisheries Department had also increased surveillance in poultry in collaboration with The University of Hong Kong.
19  7        10       Dr Saw advised members of the public that the best way to combat influenza infection was to build up body resistance by having a proper diet with adequate exercise and rest.
20  8        10       Good ventilation should be maintained to avoid the spread of respiratory tract infection.

Figure 2: A 5-sentence extractive summary created from document D-19971207-001 (in cluster 398) by LDC Judge J2.
The following example illustrates an advantage that Relative Utility has over Precision/Recall. The two summaries shown in Figures 1 and 2 are 5-sentence extractive summaries created from the same document by two different judges. Because the two summaries are composed entirely of different sentences, the interjudge agreement as measured by Precision/Recall or Percent Agreement is 0, despite the fact that both summaries are reasonable.
Note, however, that both judges gave each other's sentences fairly high utility scores. In fact, the interjudge agreement as measured by RU for this example is 0.76. RU agreement (see next section) is defined as the relative score that one judge would get given his own extract and the other judge's sentence judgements. For example, if judge 1 picks a single sentence in his extract, the score that judge 2 gives to that same sentence is 8, and judge 2's top-ranked sentence has a score of 10, then one can say that judge 1's score relative to judge 2 is 0.80 (or 8/10).
The 0.76 score is also markedly higher than the lowest possible score a summarizer could receive. Although not depicted in the example, a summarizer could have an RU agreement with judge J1 as low as 0.14 and an agreement with judge J2 as low as 0.38. In other words, given that interjudge agreement is significantly less than 1.0 but significantly more than the worst score possible, an automatic summarizer might score as low as .70 and still be almost as good as the judges themselves.
A related paper [DDM00] suggested that one problem with classic approaches to summary evaluation is that different collections of extracts rank differently when one ground truth (judgement) is substituted for another. In their experiments, recall for the same summary varied from 25% to 50% depending on what manual extract it was compared against. Our results strongly confirm Donaway et al.'s claims and suggest that RU is a viable evaluation alternative.
2.2 Defining Relative Utility
In this section, we will formally define Relative Utility (RU). To compute RU, a number of judges, N (N ≥ 1), are asked to assign utility scores to all n sentences in a cluster of documents (which can consist of one or more documents). The top e sentences according to utility score are then called a sentence extract of size e (in the case of ties, some arbitrary but consistent mechanism is used to decide which sentences should be included in the summary). The formulas below assume that n is the number of sentences in a cluster of documents, e is the number of sentences in the desired extract, and N is the number of human judges providing utility scores.
We can then define the following metrics:
$\vec{U}_i = \{u_{i,1}, u_{i,2}, \ldots, u_{i,n}\}$ = sentence utility scores for judge $i$ for all $n$ sentences in the cluster

$\vec{U}'_i = \{\delta_{i,1} \cdot u_{i,1}, \delta_{i,2} \cdot u_{i,2}, \ldots, \delta_{i,n} \cdot u_{i,n}\}$ = extractive utility scores for judge $i$

In the formula for $\vec{U}'_i$, $\delta_{i,j}$ is the summary characteristic function for judge $i$ and sentence $j$. It is equal to 1 for the $e$ highest-utility sentences for a given judge, allowing us to adjust the summary size. For example, if $e = 2$ and $\vec{U}_i = \{10, 8, 9, 2, 4\}$, then $\delta_{i,1} = \delta_{i,3} = 1$ and $\delta_{i,2} = \delta_{i,4} = \delta_{i,5} = 0$. Note that $\sum_{j=1}^{n} \delta_{i,j} = e$.
We can now define some additional quantities:
$U_i = \sum_{j=1}^{n} u_{i,j}$ = total self-utility for judge $i$

$U'_i = \sum_{j=1}^{n} \delta_{i,j} \cdot u_{i,j}$ = total extractive self-utility for judge $i$ (computed over all $n$ sentences)

$U_{i,k} = \sum_{j=1}^{n} \delta_{i,j} \cdot u_{k,j}$ = total extractive cross-utility for judges $i$ and $k$ ($i \neq k$)

$U_{i,avg} = \frac{1}{N-1} \sum_{k=1, k \neq i}^{N} U_{i,k}$ = (non-symmetric) judge utility for judge $i$

$J = U_{avg} = \frac{1}{N} \sum_{i=1}^{N} U_{i,avg}$ = interjudge performance (average extractive cross-utility of all judges)
$U = \sum_{j=1}^{n} \sum_{i=1}^{N} u_{i,j}$ = total utility for all judges
$U' = \sum_{j=1}^{n} \varepsilon_j \cdot \sum_{i=1}^{N} u_{i,j}$ = total extractive utility for all judges

In the formula for $U'$, $\varepsilon_j$ (the multi-judge summary characteristic function) is 1 for the top $e$ sentences according to the sum of utility scores from all judges. $U'$ is the maximum utility that any system can achieve at a given summary length $e$.
Note that $\sum_{j=1}^{n} \varepsilon_j = e$. Note also that $N = 1$ implies $U' = U'_1$ (the single-judge case).

A summarizer producing an extract of length $e$ can be thought of as an additional judge. Its (non-normalized) RU is computed as its performance against the human judges divided by the maximum possible performance, in other words, the ratio of the sum of its cross-utility with the totality of human judges to the maximum utility $U'$ achievable at the given summary length $e$. As a result, a summary can be judged based on its utility relative to the maximum possible against the set of judges, hence the name of the method, RU.
$S = \frac{\sum_{j=1}^{n} \delta_{s,j} \cdot \sum_{i=1}^{N} u_{i,j}}{U'}$ = system performance ($\delta_{s,j}$ is equal to 1 for the top $e$ sentences extracted by the system)

In the formula for $S$, $\sum_{i=1}^{N} u_{i,j}$ is the utility assigned by the totality of judges to a given sentence $j$ extracted by the summarizer.
$R = \frac{1}{\binom{n}{e}} \sum_{t=1}^{\binom{n}{e}} S_t$ = random performance (computed over all $\binom{n}{e}$ possible extracts of length $e$)
R is practically a lower bound on S while J is the corresponding upper bound. In order to factor in the difficulty of a given cluster, one can normalize the system performance S between J and R:

$D = \frac{S - R}{J - R}$ = normalized Relative Utility (normalized system performance)

Assuming that R ≠ J (which holds in practice), D = 1 only when S = J (the system is as good as the interjudge agreement) and D = 0 when S = R (the system is no better than random).
Reporting S values in the absence of the corresponding J and R values is not very informative. One should therefore either report S together with J and R for comparison, or report D alone.
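The definitions above are straightforward to implement. The sketch below is our own minimal illustration (not the authors' released code): it takes an N x n matrix of judge utilities and a system extract of size e, and returns S, J, R, and D. Two points are assumptions worth flagging: averaging S over all $\binom{n}{e}$ extracts is replaced, by linearity of expectation, with weighting every sentence by e/n, and the cross-judge scores are normalized by each scoring judge's own best extract, following the worked example in Section 2.1 (one plausible reading of the definitions).

import numpy as np

def top_e(scores, e):
    """Indices of the e highest-scoring sentences (ties broken by position)."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:e]

def relative_utility(judge_utils, system_extract, e):
    """judge_utils: N x n array, one row of sentence utilities per judge.
    system_extract: indices of the e sentences chosen by the summarizer."""
    U = np.asarray(judge_utils, dtype=float)
    N, n = U.shape
    totals = U.sum(axis=0)                        # summed utility of each sentence

    # U': maximum utility achievable at length e (top-e sentences by total utility)
    U_prime = totals[top_e(totals, e)].sum()

    # S: system performance, normalized by U'
    S = totals[list(system_extract)].sum() / U_prime

    # J: interjudge performance.  Each judge's extract is scored by every other
    # judge and divided by that judge's own best extract (cf. Section 2.1).
    pair_scores = []
    for i in range(N):
        extract_i = top_e(U[i], e)
        for k in range(N):
            if k != i:
                pair_scores.append(U[k, extract_i].sum() /
                                   U[k, top_e(U[k], e)].sum())
    J = float(np.mean(pair_scores)) if N > 1 else 1.0

    # R: random performance.  Averaging over all C(n, e) extracts equals
    # including every sentence with probability e/n.
    R = (e / n) * totals.sum() / U_prime

    D = (S - R) / (J - R)                         # normalized Relative Utility
    return S, J, R, D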
3 Experimental setup
We used the Hong Kong News summary corpus created at Johns Hopkins University in 2001. The original corpus consists of 18,146 aligned articles (on the document level) in plain text in English and Chinese without any markup [1]. We annotated the corpus with information about sentence and word boundaries for both English and Chinese, and part of speech and morphological information for articles in English only.
3.1 Clusters
The Linguistic Data Consortium (LDC) developed 40 queries that cover a variety of subjects ("narcotics rehabilitation", "natural disaster victims aided", "customs staff doing good job", etc.). Using an in-house information retrieval engine and human revision, documents highly relevant to the queries were obtained, and the 10 most relevant (according to human assessors) were used to construct clusters. These 40 clusters of documents were used during the workshop for training and some specific evaluations. Figure 3 shows the first 20 queries that were used in our experiments.
The three human annotators from LDC judged each sentence within the 10 relevant documents in each cluster. They assigned each sentence a score on a scale from 0 to 10, expressing the importance of this sentence for the summary. This annotation allows us to compile human-generated 'ideal' summaries at different target lengths, and it is the basis for our different measures of sentence-based agreement, both among the human annotators and between the system and the human annotators.
[1] http://www.ldc.upenn.edu
Group 125   Narcotics Rehabilitation
Group 241   Fire safety, building management concerns
Group 323   Battle against disc piracy
Group 551   Natural disaster victims aided
Group 112   Autumn and sports carnivals
Group 199   Intellectual Property Rights
Group 398   Flu results in Health Controls
Group 883   Public health concerns cause food-business closings
Group 1014  Traffic Safety Enforcement
Group 1197  Museums: exhibits/hours
Group 447   Housing (Amendment) Bill Brings Assorted Improvements
Group 827   Health education for youngsters
Group 885   Customs combats contraband/dutiable cigarette operations
Group 2     Meetings with foreign leaders
Group 46    Improving Employment Opportunities
Group 54    Illegal immigrants
Group 60    Customs staff doing good job
Group 61    Permits for charitable fund raising
Group 62    Y2K readiness
Group 1018  Flower shows

Figure 3: 20 queries produced by the LDC.
In addition to RU scores, we can in fact produce any co-selection metric, such as P/R and Kappa, using the top-ranked sentences.
Each query-based cluster contains 10 documents. Figure 4 shows the contents of cluster 125. The document IDs come from the HK News corpus and indicate the year, month, day, and story number for each document.
Figure 4: Sample cluster.
3.2 Sentence utility judgements
All sentence utility scores given by the judges for a given cluster are represented in a so-called sentjudge. An example is shown in Figure 5. The total number of sentences in cluster 125 is 232. By convention, a 10% summary will contain 24 sentences (23.2 rounded up). (Note that in the case where a cluster contains a single document, sentjudges can be used for single-document summarization.)
While we have not studied the cost of acquiring such sentjudges, it appears to be comparable to that of generating human reference summaries for the other co-selection evaluation schemes. In the case where it is impossible to have human judges assign utility scores to each sentence, one could produce such judgements automatically from manual abstracts, which we discuss in Section 6.
DOC:SENT          JUDGE1  JUDGE2  JUDGE3  TOTAL
19980306 007:1    4       6       9       19
19980306 007:2    5       10      9       24
19980306 007:3    4       9       7       20
19980306 007:4    4       9       8       21
19980306 007:5    5       8       8       21
19980306 007:6    4       9       5       18
19980306 007:7    4       9       6       19
19980306 007:8    5       7       8       20
...
20000408 011:13   1       5       3       9
20000408 011:14   6       4       2       12
20000408 011:15   2       6       6       14
...
Figure 5: Sentjudge: sentence utilities as assigned by the
judges.
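As a concrete illustration, the sketch below (ours; the exact file layout is an assumption based on Figure 5, with three judge columns followed by a total) parses a sentjudge and builds the 'ideal' extract at a given target length:

import math

def read_sentjudge(path):
    """Parse a sentjudge file with lines 'DOC:SENT j1 j2 j3 total' (cf. Figure 5)."""
    utilities = {}
    with open(path) as f:
        next(f)                                   # skip the header line
        for line in f:
            fields = line.split()
            if len(fields) < 5:
                continue                          # skip blank or truncated lines
            *id_parts, j1, j2, j3, _total = fields
            utilities[" ".join(id_parts)] = [int(j1), int(j2), int(j3)]
    return utilities

def ideal_extract(utilities, rate=0.10):
    """Top `rate` fraction of sentences by summed utility, rounded up
    (24 sentences out of 232 for cluster 125)."""
    e = math.ceil(rate * len(utilities))
    ranked = sorted(utilities, key=lambda s: sum(utilities[s]), reverse=True)
    return ranked[:e]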
3.3 Summarizers
For evaluation, we used two summarization systems that were available to us.
One summarizer that we used in the experiments is WEBSUMM [MB99]. It represents texts in terms of graphs where the nodes are occurrences of words or phrases and the edges are relations of repetition, synonymy, and co-reference. WEBSUMM assumes that nodes which are connected to many other nodes are likely to carry salient information, and it builds its summary correspondingly.
The second summarizer is the centroid-based summarizer MEAD [RJB00]. MEAD ranks sentences in a cluster of documents based on their positions in a document and the cosine similarity between them and the sentence centroid, a pseudo-sentence (bag of words) that is closest to all sentences in the cluster. MEAD has a built-in facility which keeps out of the summary sentences that are too similar (lexically) to the rest of the summary.
3.4 Extracts
An extract contains a list of the highest-scoring sentences that will be used in the summary. After the top sentences are picked, they are sorted into the order in which they appear in the source, as sketched below.
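A minimal sketch (ours) of this final reordering step:

def order_extract(selected, doc_order):
    """Sort the selected sentence ids back into their original document order."""
    position = {sid: i for i, sid in enumerate(doc_order)}
    return sorted(selected, key=position.__getitem__)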
We produced a large number of automatic extracts (at 10 target lengths, using a number of algorithms) for all 20 clusters and for all 18,146 documents in the corpus.
In the following example, we will evaluate one of the summarizers, MEAD, using RU. Table 21 presents seven different 10% extracts produced from the same cluster (Cluster 125). An excerpt from the actual judgement scores is shown in Figure 5. As one can see, when all judges are taken into account, one sentence with high salience is sentence 2 from article 19980306 007, with a total utility score of 24. Given that MEAD includes that sentence in its 10% extract, it will get the maximum possible utility for this sentence. On the other hand, not all sentences extracted by MEAD have such a high utility. For example, sentence 3 from 19990802 006, which was also picked by MEAD, only carries a utility of 15. If MEAD had picked a different sentence instead (e.g., sentence 2 from 20000408 011, with a utility of 28), its RU would be higher.
In this example, the total self-utility U_1 for judge 1 is 1218. The total self-utilities for judges 2 and 3 are 1380 and 1130, respectively. The values of the total extractive self-utility U'_i for the three judges are 237, 218, and 224, respectively.
Table 1 shows the values for extractive cross-judge utility. The average, 0.73, is equal to the interjudge agreement J.
          Judge 1  Judge 2  Judge 3  Average
Judge 1   1.00     0.74     0.74     0.74
Judge 2   0.64     1.00     0.74     0.69
Judge 3   0.72     0.81     1.00     0.77
Table 1: Cross-judge utilities.
Using the formulas in the previous section, one can compute the value for random performance, which is 0.57. The performance of MEAD is 0.70 (compared to random = 0.57 and interjudge agreement = 0.73). When normalized, MEAD's performance is 0.80 on a scale from 0 to 1.
3.5 Comparing Relative Utility with P/R
Given an ideal extract E1 consisting of e1 sentences, one can measure how similar another extract E2, containing e2 sentences, is to it. Precision (P) is the fraction of the e2 sentences in E2 that are also included in E1, while Recall (R) is the fraction of the e1 sentences in E1 that are also included in E2. It can be trivially shown that if e1 = e2 = e and the two extracts have a sentences in common, P = R = a/e.
Percent agreement (PA) measures the proportion of sentence-level decisions on which two judges agree. If a is the number of sentences extracted by both judges, d is the number of sentences in the input document (or cluster) that were extracted by neither judge, and the input has n sentences, then PA is defined as (a + d)/n.
Suppose, for example, that two judges produce 10% extracts from a document containing 50 sentences. If the same three sentences are extracted by both judges, then P = R = 3/5 = 60% and PA = (3 + 43)/50 = 92%. PA is known to significantly overestimate agreement (due to the inclusion of non-summary sentences in the evaluation) for both very short and very long extracts, while P and R underestimate agreement (due to the Summary Sentence Substitutability principle).
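These co-selection scores reduce to a few lines of set arithmetic; a small sketch (ours) that reproduces the worked example above:

def co_selection_scores(extract, reference, n):
    """Precision, Recall, and Percent Agreement between two sentence extracts
    drawn from a document (or cluster) with n sentences."""
    E2, E1 = set(extract), set(reference)
    a = len(E1 & E2)                    # sentences selected by both judges
    d = n - len(E1 | E2)                # sentences selected by neither judge
    return a / len(E2), a / len(E1), (a + d) / n

# Two 10% extracts (5 sentences each) of a 50-sentence document sharing
# three sentences, as in the example above:
print(co_selection_scores({1, 2, 3, 4, 5}, {1, 2, 3, 6, 7}, n=50))
# (0.6, 0.6, 0.92)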
We can now compare the RU values with those for Precision and Recall. Let's first look at judges 1 and 2. Out of 24 sentences, only four overlap between the two judges (19980306 007:2, 19990802 006:8, 19990802 006:9, and 19990829 012:2), or in other words, P = R = 4/24 = .17. (Note that when the two extracts are of the same length, Precision trivially equals Recall.) Let's now look at judges 1 and 3. They overlap on only three sentences (P = R = .13). Similarly, P = R = .17 for judges 2 and 3.
Let's now turn to the performance of MEAD. MEAD has P = R = 2/24 = .08 with judge 1. The values for P and R are .13 and .17 when comparing MEAD with judge 2 and judge 3, respectively.
Such low numbers could indicate that it is impossible to reach consensus on extractive summaries. The numbers above are for multi-document extracts, although similar numbers hold for single-document extracts as well.
     A     B     C     D     E     F     G     H     I     J
R    .648  .650  .652  .465  .626  .727  .508  .497  .644  .566
J    .715  .666  .859  .726  .876  .944  .909  .776  .710  .808

Table 2: Relative Utility: interjudge agreement (J) and random performance (R) for cluster 125, per document, 5% target length.
     A     B     C     D     E     F     G     H     I     J
R    .690  .685  .679  .523  .642  .741  .541  .553  .699  .595
J    .827  .730  .866  .828  .838  .913  .861  .876  .736  .874

Table 3: Relative Utility: interjudge agreement (J) and random performance (R) for cluster 125, per document, 20% target length.
For example, the average interjudge P/R for 10% extracts of each of the ten single documents comprising cluster 125 is .22 for judges 1 and 2, .33 for judges 2 and 3, and .26 for judges 3 and 1.
Past work on evaluating extractive summaries [JMBE98, GKMC99] has indicated such low agreement for single-document extracts. We claim that Relative Utility is a better metric than P/R because it does not underestimate agreement in the case where multiple sentences are almost equally useful for an extract and the summarizer has to choose one over the other.
4 Experiments
We ran four experiments to compute Relative Utility values for a number of summarizers at ten summary lengths. We also produced Relative Utility values for a few baselines: lead-based and random summaries.
4.1 Single-document J/R values
In the experiments below, J is the upper bound and R is the lower bound on the performance of an extractive summarizer. Reasonable summarizers are expected to have Relative Utility S in the range between R and J. Note that occasionally (on a particular input and at a particular summary length) a summarizer can score worse than random or better than J. However, when averaging over a number of clusters, these outliers cancel out.
Tables 2, 3 and 4 show how single-document J and R vary by document within a cluster. The first table is for 5% extracts, the second for 20% extracts, and the third for 40% extracts.
Tables 12 and 13 show how J and R vary by compression rate in single-document and multi-document summaries. They also describe the performance of MEAD (S) and of the other single-document summarizer, WEBSUMM.
     A     B     C     D     E     F     G     H     I     J
R    .74   .738  .724  .653  .695  .77   .647  .679  .764  .664
J    .836  .754  .878  .954  .91   .952  .919  .954  .811  .904

Table 4: Relative Utility: interjudge agreement (J) and random performance (R) for cluster 125, per document, 40% target length.
The value for LEAD is for a lead-based summarizer (that is, a summarizer that only includes the top n% of the sentences of a document or cluster).
4.2 Single-document RU evaluation
We computed J (interjudge agreement), R (random performance), S (system performance), and D (normalized system performance) over all 20 clusters (200 documents in total). The results are presented in Table 5.
We explored different summarization technologies that work in both single- and multi-document mode. We included two baseline methods in our framework: random summaries (RANDOM, constructed from sentences picked at random from the source) and lead-based summaries (LEAD, produced from sentences appearing at the beginning of the text).
We should note that the concept of a random summary, produced by picking random sentences at a given summary length, is different from the idea of R as described above. To produce R, we average over all possible $\binom{n}{e}$ combinations of e sentences out of n, whereas the random summary method produces only one such combination. It should be expected, over a large sample, that RANDOM extracts perform as poorly as R, and our experiments show that such is indeed the case.
Random summaries give a lower bound on the performance that any system should achieve, while lead-based summaries give a simple baseline that sometimes obtains very good performance for specific tasks [BMR95]. To provide a basis for comparison, we evaluated WEBSUMM in addition to MEAD.
The single-document results tables compare MEAD with WEBSUMM and the two baselines, RANDOM and LEAD.
Several interesting observations can be made by looking at the data in Table 5. First, random performance is quite high although certainly beatable, as shown in Tables 2 and 3. Second, both the lower bound (R) and the upper bound (J) increase with summary length. The average value of R across all documents at the 5% target length is 0.598 while the average value of J is 0.799. The corresponding values for the 20% target length are R = 0.635 and J = 0.835. Third, even though the performances of MEAD and WEBSUMM (S) also increase with summary length, MEAD's normalized score (D) decreases slowly with summary length until the two summarizers score about the same on both S and D for longer summaries. Fourth, for summary lengths of 80% and above, R gets very close to J, showing that reasonable summarization which significantly beats random at such summary lengths is quite difficult. Fifth, MEAD outperforms LEAD at lower compression rates. This last observation is very valuable given that some previous studies (e.g., [BMR95]) had indicated that lead-based extracts are at least as good as more intelligent extracts. The fact that a public-domain summarizer, not specifically trained for the particular type of documents used in this experiment, can outperform LEAD indicates that even though the first few sentences in a document are indeed rather important, there are other sentences, further down in a document, whose utility exceeds that of the sentences in the lead extracts.
                        MEAD           RANDOM         LEAD           WEBSUMM
Percent   J      R      S      D       S      D       S      D       S      D
05        0.80   0.66   0.78   0.88    0.67   0.05    0.72   0.41    0.72   0.44
10        0.81   0.68   0.79   0.84    0.67  -0.02    0.73   0.42    0.73   0.44
20        0.83   0.71   0.79   0.68    0.71   0.01    0.77   0.52    0.76   0.43
30        0.85   0.74   0.81   0.64    0.75   0.10    0.80   0.55    0.79   0.44
40        0.87   0.76   0.83   0.63    0.77   0.03    0.83   0.64    0.82   0.51
50        0.89   0.79   0.85   0.61    0.79   0.01    0.86   0.63    0.85   0.55
60        0.92   0.83   0.88   0.59    0.83   0.02    0.89   0.63    0.87   0.42
70        0.94   0.86   0.91   0.58    0.87   0.08    0.92   0.69    0.90   0.48
80        0.96   0.91   0.93   0.45    0.91   0.05    0.94   0.66    0.93   0.36
90        0.98   0.96   0.97   0.37    0.96   0.04    0.98   0.68    0.97   0.53
Table 5: Single-document Relative Utility.
4.3 Single-document P/R evaluation
It is interesting to compare single-document RU with single-document P/R results. Table 6 shows how P/R varies by summarizer and summary length. For the lengths that make most sense in real life (5-30%), P/R agreement is quite low, both among judges and between systems and judges, whereas RU agreement is much higher.
Percent   J0+J1   J1+J2   J2+J0   ALL JUDGE PAIRS   MEAD   RANDOM   LEAD   WEBSUMM
05        0.22    0.25    0.14    0.20              0.17   0.08     0.30   0.23
10        0.25    0.29    0.25    0.26              0.23   0.12     0.35   0.24
20        0.35    0.37    0.43    0.38              0.34   0.23     0.43   0.32
30        0.46    0.49    0.51    0.49              0.44   0.34     0.49   0.41
40        0.57    0.60    0.59    0.59              0.53   0.43     0.58   0.51
50        0.67    0.68    0.66    0.67              0.62   0.52     0.65   0.59
60        0.75    0.76    0.75    0.76              0.72   0.63     0.73   0.68
70        0.84    0.82    0.83    0.83              0.80   0.74     0.81   0.77
80        0.91    0.89    0.89    0.90              0.87   0.83     0.88   0.85
90        0.96    0.96    0.96    0.96              0.95   0.94     0.96   0.95
Table 6: Single-document Precision/Recall (P = R).
4.4 Single-document content-based evaluation
To further calibrate RU results, we compared them with a number of content-based measures [DDM00]. These include word-based cosine between two summaries, word overlap, bigram overlap, and LCS (longest common subsequence). These metrics are all based on the actual text of the extracts (unlike P/R, Kappa, and RU, which are all computed on the sentence co-selection vectors). The content-based metrics are described in more detail in [RTS+03]. To compute the content-based scores, we obtained manual abstracts at variable lengths. The comparative results are shown in Tables 7-10.
Some interesting observations can be made from this comparison. First, the three content-based metrics rank LEAD ahead of both MEAD and WEBSUMM. Second, MEAD and WEBSUMM score approximately the same on all metrics, with MEAD doing slightly better on the word overlap, bigram overlap, and longest common subsequence measures and WEBSUMM doing slightly better on the cosine metric. Contrasting these findings with the results using RU, one can conclude that RU is better able than the content-based measures to give proper credit for substitutable sentences that are not lexically similar to the manual extracts.
Percent   LEAD   MEAD   RANDOM   WEBSUMM
10        0.55   0.46   0.31     0.52
20        0.65   0.61   0.47     0.60
30        0.70   0.70   0.60     0.68
40        0.79   0.78   0.69     0.77
50        0.84   0.83   0.75     0.82

Table 7: Similarity between Machine Extracts and Human Extracts. Measure: Cosine.
Percent   LEAD   MEAD   RANDOM   WEBSUMM
10        0.42   0.30   0.22     0.35
20        0.47   0.40   0.31     0.36
30        0.48   0.46   0.41     0.41
40        0.57   0.55   0.47     0.51
50        0.61   0.61   0.52     0.58

Table 8: Similarity between Machine Extracts and Human Extracts. Measure: Word overlap.
Percent   LEAD   MEAD   RANDOM   WEBSUMM
10        0.35   0.22   0.12     0.25
20        0.38   0.31   0.20     0.25
30        0.41   0.37   0.29     0.31
40        0.51   0.46   0.36     0.42
50        0.56   0.53   0.43     0.50

Table 9: Similarity between Machine Extracts and Human Extracts. Measure: Bigram overlap.
Percent   LEAD   MEAD   RANDOM   WEBSUMM
10        0.47   0.37   0.25     0.39
20        0.55   0.52   0.38     0.45
30        0.60   0.61   0.50     0.53
40        0.70   0.70   0.58     0.64
50        0.75   0.76   0.64     0.71

Table 10: Similarity between Machine Extracts and Human Extracts. Measure: Longest common subsequence.
4.5 Multi-document RU evaluation
In this section, we provide multi-document RU results. Given that MEAD was the only multi-document summarizer available to us, in Table 11 we only include MEAD-specific results, in addition to the two baselines, RANDOM and LEAD.
As one can see from the table, multi-document RU is slightly lower than single-document RU. We believe that this can be explained by the fact that the distribution of scores by the same judge across different articles in the same cluster is not uniform. Some documents contain only a small number of high-utility sentences and contribute to the increase in RU for single-document vs. multi-document summarization. In addition to RU, the lower bound (R) and the upper bound (J) are also slightly lower for multi-document extracts. As a result, the normalized performance (D) is almost exactly the same in both cases.
                        MEAD           RANDOM         LEAD
Percent   J      R      S      D       S      D       S      D
05        0.76   0.64   0.73   0.81    0.63  -0.08    0.71   0.62
10        0.78   0.66   0.75   0.76    0.65  -0.01    0.71   0.47
20        0.81   0.69   0.78   0.74    0.71   0.15    0.76   0.55
30        0.83   0.72   0.79   0.65    0.72   0.01    0.79   0.67
40        0.85   0.74   0.81   0.62    0.74  -0.06    0.82   0.72
50        0.87   0.77   0.82   0.58    0.79   0.11    0.84   0.70
60        0.88   0.80   0.84   0.52    0.81   0.00    0.86   0.66
70        0.91   0.82   0.86   0.49    0.85   0.06    0.88   0.59
80        0.92   0.84   0.88   0.45    0.89   0.03    0.90   0.55
90        0.93   0.86   0.89   0.36    0.93  -0.04    0.91   0.52

Table 11: Multi-document Relative Utility.
           5    10   20   30   40   50   60   70   80   90
R         .66  .68  .71  .74  .76  .79  .83  .86  .91  .96
RANDOM    .67  .67  .71  .75  .77  .79  .83  .87  .91  .96
WEBSUMM   .72  .73  .76  .79  .82  .85  .87  .90  .93  .97
LEAD      .72  .73  .77  .80  .83  .86  .89  .92  .94  .98
MEAD      .78  .79  .79  .81  .83  .85  .88  .91  .93  .97
J         .80  .81  .83  .85  .87  .89  .92  .94  .96  .98

Table 12: (non-normalized) RU per summarizer and summary length (single-document).
           5    10   20   30   40   50   60   70   80   90
R         .64  .66  .69  .72  .74  .77  .80  .82  .84  .86
RANDOM    .63  .65  .71  .72  .74  .79  .81  .85  .89  .93
LEAD      .71  .71  .76  .79  .82  .85  .87  .90  .93  .97
MEAD      .73  .75  .78  .79  .81  .82  .84  .86  .88  .89
J         .76  .78  .81  .83  .85  .87  .88  .91  .92  .93

Table 13: (non-normalized) RU per summarizer and summary length (multi-document).
4.6 Multi-document content-based evaluation
We will now present a short summary of the multi-document content-based evaluation. In Table 14 we show a comparison between the performance of MEAD and that of manual extracts (in this case, lengths of 50, 100, and 200 words were chosen for pragmatic reasons, as these are the lengths used in the DUC evaluation [DUC00]) when both are compared to manual abstracts. Except for the cosine measure, all metrics show that MEAD's performance is quite comparable to that of the human extracts.
          COSINE           OVERLAP          BIGRAM           LCS
LENGTH    HUMAN   MEAD     HUMAN   MEAD     HUMAN   MEAD     HUMAN   MEAD
50        0.36    0.17     0.20    0.17     0.06    0.04     0.23    0.20
100       0.44    0.22     0.20    0.17     0.07    0.04     0.25    0.21
200       0.50    0.43     0.20    0.20     0.08    0.07     0.25    0.23

Table 14: MEAD and manual extracts, both compared to manual summaries.
5 Relative Utility with Subsumption
One important property of multi-document summaries that unmodified RU does not address well is subsumption. Unlike sentence substitutability, which holds between sentences that are equally worthy of inclusion in a summary but which may be very different in content, subsumption deals with pairs of sentences that have a significant amount of content overlap. In the extreme case, they could be paraphrases of each other or outright copies. It is not hard to see that sentences with similar content are (a) likely to obtain similar utility scores independently of one another, and (b) once one of them is included in a summary, the utility of the other sentence automatically drops.
We extended RU to deal with subsumption by introducing conditional sentence utility values [RJB00], which depend on the presence of other sentences in the summary.
Informational subsumption deals with the fact that the utility of a sentence may depend on the other sentences already included in a summary. Two sentences may be almost identical in content and get the same utility scores from a judge, and yet they should not both be included in the summary at the same time.
Figure 6 shows an example. The sentence extracted from D-19990527-022 (S1) subsumes that from D-19980601-013 (S2), because S1 has the additional information that the hygiene facilities were provided by the "Provisional Regional Council". Since S1 contains all the information provided by S2, an extractive summary selecting both S1 and S2 should be penalized. This has been implemented as an option in our RU system.
RU penalizes summarizers that include subsumed sentences by scaling down the judge utility scores for those sentences by a parameter α, which takes a value from 0 to 1. When α is 1, subsumed sentences retain their original utility scores; when α is 0, their utility score becomes 0. The utility scores of sentences that subsume others (S1 in our example) are not modified. In general, the utility score of a subsumed sentence in an extract is reduced according to the formula

$U_{subsumed} = \alpha \cdot U_{orig}$

In our experiment, informational subsumption is identified by human judges. This is, understandably, very time-consuming. [ZOR03] studied methods to automatically identify subsumption, as well as other Cross-document Structural Relationships.
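A minimal sketch of this penalty (ours; the subsumption pairs are given here as an explicit set, whereas in the experiments they come from human annotation):

def penalized_utilities(extract, utilities, subsumes, alpha=0.5):
    """Apply the subsumption penalty U_subsumed = alpha * U_orig within an extract.
    utilities: sentence id -> utility score (for one judge, or summed over judges)
    subsumes:  set of (a, b) pairs meaning sentence a subsumes sentence b"""
    selected = set(extract)
    penalized = {}
    for s in selected:
        subsumed = any((a, s) in subsumes for a in selected if a != s)
        penalized[s] = alpha * utilities[s] if subsumed else utilities[s]
    return penalized

# With the judge-2 scores from Figure 6: picking both sentences leaves S1 at 10
# but reduces S2 from 6 to 3; picking S2 alone would leave its score untouched.
print(penalized_utilities({"S1", "S2"}, {"S1": 10, "S2": 6},
                          subsumes={("S1", "S2")}, alpha=0.5))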
S#   J1 util  J2 util  J3 util (of 10)   Text
S1   9        10       9                 Two to Four students studying in schools in the New Territories and outlying islands now have a chance to gain more environmental hygiene knowledge through visits to a number of Provisional Regional Council (Pro RC)'s hygiene facilities and participation in a lifeskill training camp during the summer holiday.
S2   5        6        9                 Two to Four students studying in schools in the New Territories and outlying islands can now have a chance to gain more knowledge on their environmental hygiene facilities and at the same time take part in a challenging lifeskill training camp during this summer holiday.

Figure 6: An illustration of subsumption from documents D-19990527-022 and D-19980601-013 (in cluster 827).
We obtained subsumption data for 12 clusters and experimented with various α values. Note that since the subsumption penalty is applied to all utility scores, both J and R are recomputed. For example, it may no longer be possible to achieve a very high J if that would require including sentences that subsume one another. Compared with Table 13, where subsumed sentences are not penalized, MEAD and RANDOM both performed significantly better. Tables 15-17 illustrate the results of RU with subsumption for different values of α.
6 Experiments on DUC data
We have shown that relative utility gives higher interjudge agreement than other metrics, which is strong evidence that relative utility correlates better with human judgements.
                        MEAD           RANDOM         LEAD
Percent   J      R      S      D       S      D       S      D
10        0.77   0.63   0.79   1.47    0.68   0.55    0.68   0.61
20        0.80   0.66   0.81   1.18    0.72   0.55    0.74   0.69
30        0.82   0.69   0.82   1.13    0.74   0.39    0.79   0.88
40        0.84   0.71   0.84   1.26    0.74   0.36    0.82   1.15
50        0.86   0.74   0.86   1.40    0.78   0.42    0.84   1.25

Table 15: Multi-document Relative Utility with subsumption penalty 0.25.
                        MEAD           RANDOM         LEAD
Percent   J      R      S      D       S      D       S      D
10        0.78   0.62   0.86   1.89    0.75   0.92    0.66   0.28
20        0.80   0.64   0.88   1.46    0.76   0.77    0.74   0.62
30        0.83   0.67   0.88   1.42    0.78   0.67    0.80   0.84
40        0.85   0.70   0.90   1.52    0.77   0.57    0.83   0.99
50        0.86   0.73   0.92   1.65    0.80   0.55    0.85   1.01

Table 16: Multi-document Relative Utility with subsumption penalty 0.5.
                        MEAD           RANDOM         LEAD
Percent   J      R      S      D       S      D       S      D
10        0.84   0.65   1.03   1.90    0.90   1.13    0.64   0.23
20        0.82   0.63   0.97   1.60    0.82   0.96    0.74   0.57
30        0.84   0.66   0.97   1.69    0.84   0.81    0.82   0.83
40        0.86   0.69   1.00   1.76    0.81   0.73    0.85   0.98
50        0.88   0.72   1.01   1.94    0.82   0.75    0.87   0.98

Table 17: Multi-document Relative Utility with subsumption penalty 0.75.
However, since the only summarizer systems available to us are MEAD and WEBSUMM, the results in Section 4 are not enough to conclude that relative utility is a better metric than the others in this sense. In this section, we compare relative utility with other metrics used in evaluating summaries.
6.1 Data sets and automatic sentence utility judgements
DUC data is well suited to our purpose since it includes many automatic summaries by different participant systems, together with human judge rankings of these systems. We used the generic multi-document summarization tasks of DUC 2003 and 2004 in our experiments. However, DUC provides no manual utility score for each sentence, which is an obstacle to computing relative utility. To get sentence utilities, we applied the automatic sentence scoring algorithm described in [ROQT03].
In this method, manual abstracts are used to score the sentences: the utility of a sentence is computed by looking at how similar the sentence is to the manual abstracts. We used cosine similarity for this purpose, although the idea can be extended to any similarity metric.
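A minimal sketch of this idea (ours; [ROQT03] describes the actual procedure, and taking the maximum cosine over the abstracts with plain bag-of-words vectors is an assumption here):

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts represented as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def automatic_utilities(sentences, abstracts):
    """Pseudo-utility of each sentence: its highest cosine similarity to any of
    the manual abstracts, used as a stand-in for human utility judgements."""
    return [max(cosine(s, a) for a in abstracts) for s in sentences]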
6.2 Correlations of different metrics against human judgements
A total of 18 participant systems and 10 human summarizers were ranked in DUC 2003, and 17 participant systems and 8 human summarizers in DUC 2004. Each human summarizer is also judged by the other humans and placed in the ranking. Table 18 shows the Spearman rank order correlation coefficients, on the DUC 2003 and 2004 multi-document summarization data, between the human rankings and different automatic content-based metrics. We also include BLEU [PRWZ01], which is a widely used evaluation metric in the machine translation community.
           Cosine   Word    Bigram   LCS     BLEU
DUC 2003   0.822    0.877   0.914    0.902   0.865
DUC 2004   0.754    0.878   0.803    0.839   0.804

Table 18: Spearman rank order correlation coefficients of DUC multi-document summarization data between human rankings and some automatic content-based evaluation metrics (in order: cosine, word overlap, bigram overlap, longest common subsequence, and BLEU).
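The rank correlations in Tables 18 and 19 can be computed directly from per-system score lists; a small sketch (ours, with made-up scores for illustration):

from scipy.stats import spearmanr

def rank_correlation(human_scores, metric_scores):
    """Spearman rank-order correlation between the human ranking of systems
    and the ranking induced by an automatic metric (one score per system)."""
    rho, _pvalue = spearmanr(human_scores, metric_scores)
    return rho

# Hypothetical scores for five systems under the human evaluation and a metric:
print(rank_correlation([0.9, 0.7, 0.6, 0.4, 0.2],
                       [0.88, 0.75, 0.50, 0.45, 0.30]))  # 1.0 (same ordering)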
Multi-document summaries are bounded by 100 words in DUC 2003 and by 665 bytes in DUC 2004, which corresponds to 2-5% of the document clusters. There are 30 document clusters in DUC 2003 and 50 clusters in DUC 2004, with 10 documents in each cluster. To get a better comparison with the content-based metrics, we produced 2% extracts from the automatically created sentence utilities, as well as 5%, 10%, and 20% extracts. Table 19 shows the Spearman rank order correlation coefficients between human rankings and different extractive evaluation metrics. RU gives higher correlation than P/R and Kappa in all cases. In comparison with the content-based metrics, RU correlates with human judgements as well as the other metrics do on DUC 2004. However, it is hard to say that this is the case for DUC 2003. There are at least three reasons for RU's worse performance there. First of all, interjudge agreement in DUC 2003 is lower than in DUC 2004 (Table 20). Considering that the judges are the same individuals as the manual summarizers, this may result in inconsistent rankings among different judges.
Second, our method of producing extracts from sentence utility scores simply takes the sentences with the highest scores. Since we do not consider information subsumption among the selected sentences, the extract may suffer from repeated information. This has a crucial effect on human rankings, which are based on the coverage of the summaries with respect to the manual summaries. Finally, our automatic sentence scoring algorithm is not as accurate as human scoring, which clearly affects the accuracy of RU. The last two reasons apply to DUC 2004 as well, which means that we could achieve even higher correlation if we had human sentence utility scores and used RU with subsumption.
           DUC 2003                    DUC 2004
Percent    P/R     Kappa   RU          P/R     Kappa   RU
2          0.664   0.663   0.718       0.780   0.782   0.826
5          0.737   0.743   0.761       0.844   0.844   0.882
10         0.723   0.726   0.753       0.827   0.827   0.868
20         0.795   0.789   0.801       0.812   0.789   0.845

Table 19: Spearman rank order correlation coefficients of DUC multi-document summarization data between human rankings and some automatic extractive evaluation metrics (in order: Precision/Recall, Kappa, and relative utility).
           Percentage
           02      05      10      20
DUC 2003   0.647   0.705   0.743   0.796
DUC 2004   0.715   0.740   0.770   0.810

Table 20: Relative Utility: average interjudge agreement (J) for DUC multi-document summarization data.
7 Conclusions and Future work
Since interjudge agreement as measured by Precision, Recall, and percent agreement is quite low for extractive summaries, it is practically impossible to write summarizers which are optimized for these measures. Relative Utility provides an intuitive mechanism which takes into account the fact that even though human judges may disagree on exactly which sentences belong in a summary, they tend to agree on the overall salience of each sentence.
By moving from binary decisions to variable-membership decisions, it is possible to capture that agreement and produce better summarizers.
Relative Utility has several additional advantages over P/R/PA. First, in a way similar to Kappa [SC88], it takes into account the difficulty of a problem by factoring in random and interjudge performance.
Second (and unlike Kappa), it can be used for evaluation at multiple compression rates (summary lengths). In one pass, judges assign salience scores to all sentences in a cluster (or in a single document). It is then possible to simulate extraction at a fixed compression rate by ranking sentences by utility. As a result, RU is a more informative measure of sentence salience than the alternative metrics.
Third, the RU method can be further expanded to allow sentences or paragraphs to exert negative reinforcement on one another, that is, to allow for cases in which the inclusion of a given sentence makes another redundant. A system that includes both will then be penalized more than a system which includes only one of the two "equivalent" sentences together with another, perhaps less informative, sentence.
In current work, we are investigating the connection between RU, subsumption, and the taxonomy of cross-document relationships (such as paraphrase, follow-up, elaboration, etc.) set forth in Cross-Document Structure Theory (CST) [Rad00, ZBGR02].
The subsumption-based RU model will need further adjustment to address sentences which mutually increase each other's importance. For example, sentences with anaphoric expressions (e.g., "He then said...") will have a higher utility if the sentence containing the antecedent of the anaphor is also included.
Finally, we should mention that the use of Relative Utility is not limited to the evaluation of sentence extracts. We will investigate its applicability to other evaluation tasks, such as ad-hoc retrieval and word sense disambiguation. One particularly promising area of application is the evaluation of non-extractive summaries. In recent DUC conferences, abstractive summaries have been evaluated using model unit recall (also known as MLAC, mean length-adjusted coverage). In this model, human reference summaries are split into atomic content pieces called model units. Example model units could be "Teachers went on strike in France" or "Two new SARS cases have been reported in Hong Kong". The current DUC evaluation only measures recall, i.e., whether the right model units are included in the system summary. We will investigate assigning relative utility scores to model units in order to capture fact salience.
8 Acknowledgments
This work was partially supported by the National Science Foundation's Information Technology Research program (ITR) under grant IIS-0082884. All opinions, findings, conclusions and recommendations in any material resulting from this workshop are those of the participants, and do not necessarily reflect the views of the National Science Foundation.
We would also like to thank the CLAIR (Computational Linguistics And Information Retrieval) group at the University of Michigan and, more specifically, Adam Winkel, Sasha Blair-Goldensohn, Jahna Otterbacher, Naomi Daniel, and Timothy Allison for useful feedback.
References
[BMR95] Ron Brandow, Karl Mitze, and Lisa F. Rau. Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31(5):675–685, 1995.

[DDM00] R. L. Donaway, K. W. Drummey, and L. A. Mather. A comparison of rankings produced by summarization evaluation measures. In Proceedings of the Workshop on Automatic Summarization, ANLP-NAACL 2000, pages 69–78. Association for Computational Linguistics, 30 April 2000.

[DUC00] Proceedings of the Workshop on Text Summarization (DUC 2000), New Orleans, LA, 2000.

[GKMC99] Jade Goldstein, Mark Kantrowitz, Vibhu O. Mittal, and Jaime G. Carbonell. Summarizing text documents: Sentence selection and evaluation metrics. In SIGIR 1999, pages 121–128, Berkeley, California, 1999.

[JMBE98] Hongyan Jing, Kathleen McKeown, Regina Barzilay, and Michael Elhadad. Summarization evaluation methods: Experiments and analysis. In Intelligent Text Summarization. Papers from the 1998 AAAI Spring Symposium, Technical Report SS-98-06, pages 60–68, Stanford, CA, USA, March 23–25 1998. The AAAI Press.

[Luh58] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.

[MB99] Inderjeet Mani and Eric Bloedorn. Summarizing similarities and differences among related documents. Information Retrieval, 1(1):35–67, 1999.

[PRWZ01] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. Research Report RC22176, IBM, 2001.

[Rad00] Dragomir Radev. A common theory of information fusion from multiple text sources, step one: Cross-document structure. In Proceedings of the 1st ACL SIGDIAL Workshop on Discourse and Dialogue, Hong Kong, October 2000.
[RJB00] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In ANLP/NAACL Workshop on Summarization, Seattle, WA, April 2000.

[ROQT03] Dragomir R. Radev, Jahna Otterbacher, Hong Qi, and Daniel Tam. MEAD ReDUCs: Michigan at DUC 2003. In Proceedings of DUC 2003, Edmonton, AB, Canada, 2003.

[RTS+03] Dragomir R. Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Hong Qi, Arda Çelebi, Danyu Liu, and Elliott Drabek. Evaluation challenges in large-scale multi-document summarization: the MEAD project. In Proceedings of ACL 2003, Sapporo, Japan, 2003.

[SC88] Sidney Siegel and N. John Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 2nd edition, 1988.

[ZBGR02] Zhu Zhang, Sasha Blair-Goldensohn, and Dragomir Radev. Towards CST-enhanced summarization. In AAAI 2002, August 2002.

[ZOR03] Zhu Zhang, Jahna Otterbacher, and Dragomir R. Radev. Learning cross-document structural relationships using boosting. In Proceedings of ACM CIKM 2003, New Orleans, LA, November 2003.
MEAD:       19980306 007:2, 19980306 007:15, 19980306 007:26, 19980306 007:27, 19980430 016:17, 19980430 016:20, 19980430 016:38, 19990211 009:2, 19990211 009:4, 19990211 009:6, 19990218 009:4, 19990425 009:2, 19990425 009:6, 19990425 009:7, 19990425 009:9, 19990425 009:13, 19990729 008:3, 19990729 008:8, 19990729 008:13, 19990802 006:3, 19990802 006:16, 19990802 006:17, 19990829 012:7, 19990927 011:9

LEAD:       19980306 007:1, 19980306 007:2, 19980430 016:1, 19980430 016:2, 19990211 009:1, 19990211 009:2, 19990218 009:1, 19990218 009:2, 19990218 009:3, 19990425 009:1, 19990425 009:2, 19990425 009:3, 19990729 008:1, 19990729 008:2, 19990802 006:1, 19990802 006:2, 19990829 012:1, 19990829 012:2, 19990927 011:1, 19990927 011:2, 19990927 011:3, 20000408 011:1, 20000408 011:2, 20000408 011:3

RANDOM:     19980306 007:4, 19980306 007:6, 19980306 007:19, 19980306 007:22, 19980430 016:1, 19980430 016:3, 19980430 016:20, 19980430 016:24, 19980430 016:42, 19990218 009:14, 19990425 009:18, 19990729 008:4, 19990729 008:13, 19990802 006:19, 19990802 006:23, 19990829 012:16, 19990927 011:11, 19990927 011:14, 19990927 011:18, 19990927 011:21, 19990927 011:26, 20000408 011:15, 20000408 011:20, 20000408 011:21

JUDGE1:     19980306 007:2, 19980306 007:3, 19980306 007:4, 19980306 007:6, 19980306 007:7, 19980306 007:9, 19980306 007:11, 19980306 007:12, 19980306 007:13, 19990425 009:7, 19990425 009:10, 19990802 006:7, 19990802 006:8, 19990802 006:9, 19990802 006:10, 19990829 012:2, 19990829 012:5, 19990829 012:6, 19990829 012:12, 19990829 012:13, 19990927 011:4, 19990927 011:5, 19990927 011:6, 20000408 011:2

JUDGE2:     19980306 007:1, 19980306 007:2, 19980306 007:18, 19990425 009:1, 19990425 009:2, 19990729 008:12, 19990802 006:2, 19990802 006:6, 19990802 006:8, 19990802 006:9, 19990802 006:13, 19990802 006:16, 19990829 012:1, 19990829 012:2, 19990927 011:1, 19990927 011:2, 19990927 011:10, 19990927 011:11, 19990927 011:12, 19990927 011:13, 19990927 011:18, 19990927 011:20, 19990927 011:21, 20000408 011:1

JUDGE3:     19980306 007:15, 19980306 007:17, 19980430 016:1, 19980430 016:2, 19980430 016:13, 19980430 016:14, 19980430 016:16, 19980430 016:17, 19980430 016:19, 19990211 009:3, 19990218 009:2, 19990218 009:4, 19990425 009:1, 19990425 009:3, 19990425 009:8, 19990425 009:12, 19990729 008:8, 19990802 006:13, 19990829 012:2, 19990829 012:6, 19990829 012:13, 19990927 011:14, 20000408 011:13, 20000408 011:15

ALL JUDGES: 19980306 007:2, 19980306 007:15, 19980430 016:13, 19980430 016:16, 19990425 009:1, 19990425 009:2, 19990425 009:3, 19990425 009:7, 19990425 009:8, 19990729 008:8, 19990802 006:8, 19990802 006:9, 19990802 006:10, 19990802 006:13, 19990802 006:16, 19990829 012:2, 19990829 012:6, 19990829 012:13, 19990927 011:11, 19990927 011:12, 20000408 011:1, 20000408 011:2, 20000408 011:4, 20000408 011:5

Table 21: Seven 10% extracts (document-id:sentence-id) produced from the same cluster. Note: order within an extract is not relevant.