Extraction
Chapter 3 in Automatic Summarization
Han Kyoung-Soo (한경수)
2001-11-08
Korea University Natural Language Processing Lab (고려대학교 자연어처리연구실)
한경수 Extraction 2
Contents
– Introduction
– The Edmundsonian paradigm
– Corpus based sentence extraction
  o General considerations
  o Aspects of learning approaches
– Coherence of extracts
– Conclusion
Extraction (discussed here)
– The analysis phase dominates, and the analysis is relatively shallow.
– Discourse-level information, if used at all, is mostly for …
  o establishing coreference between proper names
  o pronoun resolution
Extraction is not appropriate for every summarization task.
– At high compression rates, extraction seems less likely to be effective, unless some pre-existing highly compressed summary material is found.
– In multi-document summarization, both differences and similarities between documents need to be characterized.
– Human abstractors produce abstracts, not extracts.
Introduction
Extraction element
The basic unit of extraction is the sentence.
– Practical reason for preferring the sentence to the paragraph
  o It offers better control over compression.
– Linguistic motivation
  o The sentence has historically served as a prominent unit in syntactic and semantic analysis.
  o Logical accounts of meaning offer precise notions of sentential meaning: sentences can be represented in a logical form, and taken to denote propositions.
– Extraction of elements below the sentence level
  o Such extracts will often be fragmentary in nature.
The sentence seems a natural unit to consider in the general case.
Classic work of Edmundson (1969)
– Used a corpus of 200 scientific papers on chemistry, each between 100 and 3,900 words long.
– Target extracts were prepared manually.
Features
– Title words
  o Words from the title, subtitles, and headings
  o Given a hand-assigned weight
– Cue words
  o Extracted from the training corpus based on the selection ratio
    • Selection ratio = # of occurrences in extracts / # of occurrences in all sentences of the corpus
  o Bonus words
    • Evidence for selection: above an upper selection-ratio threshold
    • Comparatives, superlatives, adverbs of conclusion, value terms, relative interrogatives, causality terms
  o Stigma words
    • Evidence for non-selection: below a lower selection-ratio cutoff
    • Anaphoric expressions, belittling expressions, insignificant-detail expressions, hedging expressions
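The selection-ratio test can be sketched as follows. The 0.6/0.1 thresholds and whitespace tokenization are illustrative assumptions, not Edmundson's published settings:

```python
from collections import Counter

def classify_cue_words(extract_sents, all_sents, upper=0.6, lower=0.1):
    """Classify corpus words as bonus or stigma by selection ratio:
    ratio = occurrences in extract sentences
            / occurrences in all sentences of the corpus.
    Words above `upper` are bonus words (evidence for selection);
    words below `lower` are stigma words (evidence against)."""
    in_extract = Counter(w for s in extract_sents for w in s.lower().split())
    in_corpus = Counter(w for s in all_sents for w in s.lower().split())
    bonus = {w for w, n in in_corpus.items() if in_extract[w] / n > upper}
    stigma = {w for w, n in in_corpus.items() if in_extract[w] / n < lower}
    return bonus, stigma
```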
The Edmundsonian paradigm
Classic work of Edmundson (1969): Features (continued)
– Key words
  o Word frequencies were tabulated in descending order, until a given cutoff percentage of all the word occurrences in the document was reached.
  o Non-cue words above that threshold were extracted as key words.
  o Each key word's weight is its frequency in the document.
– Sentence location
  o Heading weight
    • A short list of particular section headings, like "Introduction" and "Conclusion", was constructed.
    • Sentences occurring under such headings were assigned a positive weight.
  o Ordinal weight
    • Sentences were assigned weights based on their ordinal position: if they occurred in the first or last paragraph, or if they were the first or last sentences of paragraphs, they were assigned a positive weight.
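The key-word procedure above can be sketched as follows; the 50% default cutoff is a placeholder, not Edmundson's value:

```python
from collections import Counter

def edmundson_keywords(words, cue_words, cutoff=0.5):
    """Tabulate word frequencies in descending order, and take
    non-cue words until the words seen so far cover `cutoff` of
    all occurrences; each key word's weight is its frequency."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    covered, keywords = 0, {}
    for w, c in counts.most_common():
        if covered / total >= cutoff:
            break
        covered += c
        if w not in cue_words:
            keywords[w] = c
    return keywords
```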
Classic work of Edmundson (1969): Sentence scoring
– Based on a linear function of the weights of each feature:

  W(s) = αC(s) + βK(s) + γL(s) + δT(s)

  where C, K, L, and T are the cue, key, location, and title feature values.
– Edmundson adjusted the feature weights and the tuning parameters by hand, based on feedback from comparisons against manually created training extracts.
Evaluation
– Key words were poorer than the other three features.
– The combination cue–title–location was the best.
– The best individual feature was location; the worst was key words.
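A minimal sketch of the linear scoring scheme; the weight values are illustrative, since Edmundson tuned his by hand:

```python
def edmundson_score(features, weights):
    """Linear sentence score W(s) = a*C(s) + b*K(s) + c*L(s) + d*T(s).
    `features` maps the feature names C, K, L, T to their values for
    the sentence; `weights` holds the hand-tuned coefficients."""
    return sum(weights[f] * v for f, v in features.items())

def extract_top(scored_sentences, n):
    """Keep the n highest-scoring sentences as the extract."""
    ranked = sorted(scored_sentences, key=lambda p: p[1], reverse=True)
    return [s for s, _ in ranked[:n]]
```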
Feature reinterpretation: cue words
Cue words → cue phrases
– Expressions like "I conclude by", "this paper is concerned with", …
– Bonus words, stigma words
– In-text summary cues (indicator phrases)
  o E.g., beginning with "in summary"
  o Useful for specific technical domains
  o Indicator phrases can be extracted by a pattern-matching process: Black (1990), p. 49 example
Feature reinterpretation: key words
Key words → presence of thematic term features
– Selected based on term frequency
– Includes Edmundson's key words
Thematic term assumption: relatively more frequent terms are more salient.
– Luhn (1958)
  o Find content words in a document by filtering against a stoplist of function words.
  o Arrange them by frequency.
  o Suitable high-frequency and low-frequency cutoffs were estimated from a collection of articles and their abstracts.
– A variant of the thematic term assumption: tf*idf
  o Its use in automatic summarization is somewhat less well-motivated.
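Luhn's filtering-plus-cutoffs procedure can be sketched like this; the toy stoplist and both cutoff values are illustrative assumptions, not the estimates from Luhn's article collection:

```python
from collections import Counter

STOPLIST = {"the", "a", "of", "in", "and", "to", "is"}  # toy stoplist

def thematic_terms(words, low=2, high=0.5):
    """Luhn-style thematic terms: drop function words via a stoplist,
    count the rest, and keep terms whose frequency lies between a low
    cutoff (absolute count) and a high cutoff (fraction of content
    tokens)."""
    content = [w.lower() for w in words if w.lower() not in STOPLIST]
    counts = Counter(content)
    cap = max(1, int(high * len(content)))
    return {w for w, c in counts.items() if low <= c <= cap}
```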
Feature reinterpretation: location
Baxendale (1958)
– Found that important sentences were located at the beginning or end of paragraphs.
– Salient sentences were likely to occur as …
  o the first sentence in the paragraph 85% of the time
  o the last sentence 7% of the time
Brandow et al. (1995)
– Compared their thematic-term-based extraction system for news (ANES) against Searchable Lead, a system which just output sentences in order.
– Searchable Lead outperformed ANES.
  o Acceptable 87% to 96% of the time
  o Unacceptable cases: anecdotal, human-interest-style lead-ins; documents that contained multiple news stories; stories with unusual structural/stylistic features; …
Feature reinterpretation: location (continued)
Lin & Hovy (1997)
– Defined the Optimal Position Policy (OPP): a list of positions in the text in which salient sentences were likely to occur.
– For 13,000 Ziff-Davis news articles: title, 1st sentence of 2nd paragraph, 1st sentence of 3rd paragraph, …
– For the Wall Street Journal: title, 1st sentence of 1st paragraph, 2nd sentence of 1st paragraph, …
Feature reinterpretation: title
Title words → Add-Term
– Weight is assigned to a sentence based on terms in it that are also present in the title, article headline, or the user's profile or query.
– A user-focused summary
  o A relatively heavy weight for this feature will favor the relevance of the summary to the query or topic.
  o It must be balanced against fidelity to the source document: the need for the summary to represent information in the document.
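A minimal sketch of the Add-Term feature; the whitespace tokenization and plain case folding are simplifications (a real system would normalize and stem):

```python
def add_term_weight(sentence, focus_terms):
    """Add-Term feature: count how many of the given focus terms
    (title, headline, or user query/profile terms) also occur as
    words of the sentence."""
    words = {w.lower() for w in sentence.split()}
    return sum(1 for t in focus_terms if t.lower() in words)
```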
Criticism
The Edmundsonian equation is inadequate for summarization, for the following reasons.
– It extracts only single elements in isolation, rather than extracting sequences of elements.
  o This yields incoherent summaries.
  o Knowing that a particular sentence has been selected should affect the choice of subsequent sentences.
– The compression rate isn't directly referenced in the equation.
  o The compression rate should be part of the summarization process, not just an afterthought.
  o E.g., suppose the most salient concept A is covered by sentences s1 and s2, and the next-to-most salient concept B by s3. The best one-sentence summary is s3; the best two-sentence summary is s1 and s2.
Criticism (continued)
– A linear equation may not be a powerful enough model for summarization.
  o A non-linear model is required for certain applications: spreading activation between words, other probabilistic models.
– It uses only shallow, morphological-level features for words and phrases in the sentence, along with the sentence's location.
  o There has been a body of work which explores different linear combinations of syntactic, semantic, and discourse-level features.
– It is rather ad hoc: it doesn't tell us anything theoretically interesting about what makes a summary a summary.
General considerations
– The most interesting empirical work in the Edmundsonian paradigm has used some variant of Edmundson's equation, leveraging a corpus to estimate the weights.
– Basic methodology for a corpus-based approach to sentence extraction: Figure 3.1 (p. 54)
Corpus based sentence extraction
Labeling
A training extract is preferred to a training abstract because it is somewhat less likely to vary across human summarizers.
Producing an extract from an abstract
– Mani & Bloedorn (1998)
  o Treat the abstract as a query, and rank the sentences for similarity to the abstract.
  o Combined-match: each source sentence is matched against the entire abstract treated as a single sentence. Equation 3.2 (p. 56)
  o Individual-match: each source sentence is compared against each sentence of the abstract.
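A sketch of the combined-match labeling step, using cosine overlap as the similarity measure; cosine is one plausible choice, and the book's Equation 3.2 may differ in detail:

```python
import math
from collections import Counter

def cosine_sim(a_tokens, b_tokens):
    """Cosine similarity between two bags of words."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_match(source_sents, abstract_sents, n):
    """Rank source sentences against the whole abstract treated as
    one long sentence; keep the top n as extract labels."""
    abstract = [w for s in abstract_sents for w in s.split()]
    scored = sorted(source_sents,
                    key=lambda s: cosine_sim(s.split(), abstract),
                    reverse=True)
    return scored[:n]
```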
Labeling: Producing an extract from an abstract (continued)
– Marcu (1999)
  o Prunes away the clause of the source that is least similar to the abstract.
– Jing & McKeown (1999)
  o Word-sequence alignment using an HMM
  o Refer to section 3 in Kyoung-Soo's Technical Note KS-TN-200103.
These methods can result in a score for each sentence rather than a yes/no label; labeling can be left as a continuous function.
Learning representation
The result of learning can be represented as …
– rules
– mathematical functions
If a human is to trust a machine's summaries, the machine has to have some way of explaining why it produced the summary it did.
– For this reason, logical rules are usually preferred to mathematical functions.
Compression & Evaluation
Compression
– Typically, the compression rate is applied at testing time.
– It is possible to train a summarizer for a particular compression rate: different feature combinations may be used for different compression rates.
Evaluation
– Precision, recall, accuracy, F-measure
– Table 3.1/3.2 (p. 59)
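The standard extract evaluation measures can be sketched by treating the extract as a set of selected sentence indices:

```python
def extract_eval(predicted, gold):
    """Precision, recall, and F1 for sentence extraction, comparing
    the predicted extract against the gold extract, both given as
    collections of sentence indices."""
    p_set, g_set = set(predicted), set(gold)
    tp = len(p_set & g_set)
    precision = tp / len(p_set) if p_set else 0.0
    recall = tp / len(g_set) if g_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```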
Aspects of learning approaches
Sentence extraction as Bayesian classification
Kupiec et al. (1995)
– 188 full-text/summary pairs
  o Drawn from 21 different collections of scientific articles
  o Each summary was written by a professional abstractor and was 3 sentences long on average.
– Features
  o Sentence length, presence of fixed cue phrases, location, presence of thematic terms, presence of proper names
– Bayesian classifier (Equation 3.4, p. 60)
– Producing an extract from the abstract
  o Direct match (79%): identical, or considered to have the same content
  o Direct join (3%): two or more document sentences appear to have the same content as a single summary sentence.
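The classifier's scoring step can be sketched as a naive Bayes combination in the style of Equation 3.4; here the probability tables are supplied by the caller, whereas a real system estimates them from the labeled training corpus:

```python
def kupiec_score(features, p_in_summary, p_feat_given_summary, p_feat):
    """Naive Bayes sentence score:
    P(s in S | F1..Fk) = P(s in S) * prod_j P(Fj | s in S) / prod_j P(Fj),
    assuming the features Fj are independent given the class."""
    score = p_in_summary
    for f in features:
        score *= p_feat_given_summary[f] / p_feat[f]
    return score
```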
Sentence extraction as Bayesian classification (cont'd)
Evaluation
– 43% recall
– As the summaries were lengthened, performance improved: 84% recall at 25% of the full text length.
– Location was the best individual feature.
– Location + cue phrase + sentence length was the best combination.
Classifier combination
Myaeng & Jang (1999)
– Tagged each sentence in the Introduction and Conclusion sections according to whether it represented …
  o background
  o the main theme
  o explanation of the document structure
  o description of future work
– 96% of the summary sentences were main-theme sentences.
– Training method
  o Used a Bayesian classifier to determine whether a sentence belonged to the main theme
  o Combined evidence from multiple Bayesian feature classifiers using a voting scheme
  o Applied a filter to eliminate redundant sentences
– Evaluation
  o Cue words + location + title words was the best combination.
  o Suggests that the Edmundsonian features are not language-specific.
Term aggregation
In a document about a certain topic …
– There will be many references to that topic.
– The references need not result in verbatim repetition: synonyms, more specialized words, related terms, …
Aone et al. (1999)
– Different methods of term aggregation can impact summarization performance.
  o Treat morphological variants, synonyms, and name aliases as instances of the same term.
– Performance can be improved when place names and organization names are identified as terms, and when person names are filtered out.
  o Reason: document topics are generally not about people.
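Term aggregation amounts to mapping variants onto one canonical term before counting. A minimal sketch, in which the alias table and the strip-trailing-'s' stemmer are toy stand-ins for real aliasing and morphology components:

```python
def aggregate_terms(tokens, aliases, stem=lambda w: w.rstrip("s")):
    """Map name aliases and morphological variants onto a canonical
    form, so that frequency counts treat them as the same term."""
    return [stem(aliases.get(t, t).lower()) for t in tokens]
```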
Topic-focused summaries
Lin (1999)
– Used a corpus called the Q&A corpus
  o 120 texts (4 topics × 30 relevant docs/topic)
  o Human-created, topic-focused passage-extraction summaries
– Features
  o Add-Term: query terms
    • Sentences are weighted based on the number of query terms they contain.
  o Additional relevance feature
    • Relevance-feedback weight for terms that occurred in the documents most relevant to the topic
  o Presence of proper names, sentence length
  o Cohesion features
    • Number of terms shared with other sentences
  o Numerical expressions, pronouns, adjectives, references to specific weekdays or months, presence of quoted speech
Topic-focused summaries (continued)
Lin (1999) (continued)
– Feature combination
  o Naïve combination with each feature given equal weight
  o Decision-tree learner
– The naïve method outperformed the decision-tree learner on 3 out of 4 topics.
– A baseline method (based on sentence order) also performed well on all topics.
Topic-focused summaries (continued)
Mani & Bloedorn (1998)
– Cmp-lg corpus: a set of 198 pairs of full-text docs/abstracts
– Labeling
  o The overall information need for a user was defined by a set of docs: a subject was told to pick a sample of 10 docs that matched his interests.
  o Top content words were extracted from each doc, and the words for the 10 docs were sorted by their scores.
  o All words more than 2.5 standard deviations above the mean of these words' scores were treated as a representation of the user's interest, or topic. There were 72 such words.
  o Relevance match
    • Used spreading activation based on cohesion information to weight word occurrences in the document related to the topic
    • Each sentence was weighted based on the average of its word weights.
    • The top C% of these sentences were picked as positive examples.
Topic-focused summaries (continued)
Mani & Bloedorn (1998) (continued)
– Features
  o 2 additional user-interest-specific features
    • Number of reweighted words (topic keywords) in the sentence
    • Number of topic keywords / number of content words in the sentence
    • Specific topic keywords weren't used as features, since it is preferable to learn rules that could transfer across user interests.
    • Topic keywords are similar to the 'relevance feedback' terms in Lin's study.
  o Location, thematic features
  o Cohesion features
    • Synonymy: judged by using WordNet
    • Statistical cooccurrence: scores between content words i and j up to 40 words apart were computed using mutual information. Equation 3.5 (p. 65)
    • The association table only stores scores for tf counts greater than 10 and association scores greater than 10.
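The cooccurrence score can be sketched as pointwise mutual information over window counts; this is the standard PMI form, and the book's Equation 3.5 may normalize differently:

```python
import math

def cooccurrence_score(count_ij, count_i, count_j, n):
    """Pointwise mutual information between content words i and j,
    from counts of their cooccurrence within a window (here, up to
    40 words apart), out of n total observations:
    MI = log2(P(i,j) / (P(i) * P(j)))."""
    return math.log2((count_ij / n) / ((count_i / n) * (count_j / n)))
```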
Topic-focused summaries (continued)
Mani & Bloedorn (1998) (continued)
– Evaluation
  o In user-focused summaries, the number of topic keywords in a sentence was the single most influential feature.
  o The cohesion features contributed the least, perhaps because the cohesion calculation was too imprecise.
– Some sample rules (Table 3.4, p. 66)
  o The learned rules are highly intelligible, and can perhaps be edited in accordance with human intuitions.
  o The discretization of the features degraded performance by about 15%: there is a tradeoff between accuracy and transparency.
Case study: Noisy channel model
There has been a surge of interest in language modeling approaches to summarization (Berger & Mittal 2000).
– The problem of automatic summarization is cast as a translation problem: translating between a verbose language (of source documents) and a succinct language (of summaries).
– This idea is related to the notion of the abstractor reconstructing the author's ideas in order to produce a summary.
Generic summarization
– The summary s is viewed as having passed through a noisy channel, producing the document d with probability P(d|s); the decoder recovers the most likely summary s*:

  s* = argmax_s P(s|d) = argmax_s P(d|s) P(s)
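The decoding step can be sketched as follows; the two model callbacks are placeholders for the channel model P(d|s) and the summary language model P(s), which Berger & Mittal estimate from document/summary pairs:

```python
def decode_summary(candidates, channel_logprob, prior_logprob):
    """Noisy-channel decoding: pick s* = argmax_s P(d|s) * P(s),
    computed in log space to avoid underflow. `channel_logprob(s)`
    returns log P(d|s) for the fixed document d; `prior_logprob(s)`
    returns log P(s)."""
    return max(candidates,
               key=lambda s: channel_logprob(s) + prior_logprob(s))
```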
Case study: Noisy channel model (continued)
User-focused summarization conditions on the query q as well:

  s* = argmax_s P(s|d,q) = argmax_s P(q|s,d) P(s|d) ≈ argmax_s P(q|s) P(s|d)

where P(q|s) is the relevance term and P(s|d) is the fidelity term. With unigram models and length models l_d, l_s, for a summary s = s_1 … s_m and a query q = q_1 … q_k:

  P(s|d) = l_d(m) ∏_{i=1..m} P_d(s_i)   (fidelity)
  P(q|s) = l_s(k) ∏_{i=1..k} P_s(q_i)   (relevance)
Case study: Noisy channel model (continued)
Training
– Used FAQ pages on the WWW
  o Each lists a sequence of question-answer pairs (10,395 pairs)
  o Culled from 201 Usenet FAQs and 4 call-center FAQs
  o Each answer is viewed as the query-focused summary of the document.
Evaluation
– Assigns the correct summary, on average, a rank of 1.41 for Usenet and 4.3 for the call-center data.
Criticism
– The noisy channel model is appealing because it decomposes the summarization problem for generic and user-focused summarization in a theoretically interesting way.
– However, the model tends to rely on large quantities of training data.
Conclusion
The corpus-based approach to sentence extraction is attractive because …
– It allows one to tune the summarizer to the characteristics of the corpus or genre of text.
– It is well-established.
– It is capable of learning interesting and often quite intelligible rules.
But …
– There are lots of design choices and parameters involved in training.
Issues
– How is the training to be utilized in an end application?
– Learning sequences of sentences to extract deserves more attention.
– Evaluation
Coherence of extracts
When extracting sentences from a source, an obvious problem is preserving context: picking sentences out of context can result in incoherent summaries.
Coherence problems
– Dangling anaphors
  o If an anaphor is present in a summary extract, the extract may not be entirely intelligible if the referent isn't included as well.
– Gaps
  o Breaking the connection between the ideas in a text can cause problems.
– Structured environments
  o Itemized lists, tables, logical arguments, etc., cannot be arbitrarily divided.
Conclusion
Abstracts vs. extracts
– The most important aspect of an abstract is not so much that it paraphrases the input in its own words, but that some level of abstraction of the input has been carried out.
  o This provides a degree of compression.
  o It requires knowledge of the meaning of the information talked about, and the ability to make inferences at the semantic level.
– Extraction methods, while knowledge-poor, are not entirely knowledge-free.
  o Knowledge about a particular domain is represented in terms of features specific to that domain, and in the particular rules or functions learned for that domain.
  o The knowledge here is entirely internal.
– There is a fundamental limitation to the capabilities of extraction systems.
  o Current attention is focused on the opportunity to achieve compression in a more effective way by producing abstracts automatically.