Summary Generation Keith Trnka

The approach

● Apply Marcu's basic summarizer (1999) to perform content selection

● Re-generate the selected content so that it's more natural

RST Refresher

● A text is composed of elementary discourse units (EDUs)
  – What constitutes an EDU varies from author to author
  – Common consensus that they are no larger than sentences
● Text spans
  – An EDU is a text span
  – A sequence of adjacent text spans in some rhetorical relation is a text span

RST Refresher (cont'd)

● A rhetorical relation is the relationship between text spans
  – Some relations have the notion of nuclearity: one sub-span (nucleus) is the one to which all other sub-spans (satellites) relate
    ● These relations are called mononuclear
    ● Example: [When I got home,] circumstance-for [I was tired]
  – Other relations are called multinuclear
    ● There is no most-important sub-span
    ● Example: [Cats scratch] contrast-with [, but dogs bite.]

RST Discourse Treebank

● RST analyses of 385 WSJ articles from Penn Treebank

● Available from LDC (http://www.ldc.upenn.edu)
● Overview can be found in (Carlson et al. 2001)
● Annotation manual is (Carlson, Marcu 2001)
● Thanks to the department for buying it

● Notes about the annotation
  – EDUs are clause-like
  – Mono-nuclear relations were forced to be binary
  – Relative clauses and appositives can be embedded relations

RST Discourse Treebank (cont'd)

● Statistical analysis of 335 training documents
  – 98% of spans are binary (two children)
  – For binary mononuclear relations:
    ● Nucleus-satellite order can be predicted with 87% accuracy, given the relation, using predict-majority

Relation                         Frequency   N-S Order   S-N Order
Elaboration-additional           20.44%      99.79%      0.17%
Attribution                      17.19%      32.34%      67.42%
Elaboration-object-attribute-e   16.13%      99.96%      0.04%
Elaboration-additional-e         5.22%       99.06%      0.94%
Circumstance                     3.95%       55.26%      44.56%
Explanation-argumentative        3.61%       96.88%      2.34%
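
The predict-majority baseline mentioned above can be sketched roughly as follows. This is only an illustration on made-up toy observations, not the treebank counts; the function names `train_majority` and `accuracy` are hypothetical.

```python
from collections import Counter, defaultdict


def train_majority(observations):
    """Map each relation to its most frequent nucleus-satellite order."""
    counts = defaultdict(Counter)
    for relation, order in observations:
        counts[relation][order] += 1
    return {rel: c.most_common(1)[0][0] for rel, c in counts.items()}


def accuracy(model, observations):
    """Fraction of observations whose order matches the majority prediction."""
    correct = sum(1 for rel, order in observations if model.get(rel) == order)
    return correct / len(observations)


# Toy (relation, order) pairs, invented for illustration only.
observations = (
    [("attribution", "S-N")] * 2
    + [("attribution", "N-S")] * 1
    + [("elaboration-additional", "N-S")] * 3
)

model = train_majority(observations)
score = accuracy(model, observations)
print(model, score)
```

On the toy data the model predicts S-N for attribution and N-S for elaboration-additional, so five of the six observations are predicted correctly.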

Marcu's Content Selection Algorithm

● Described in (Marcu 1999)
● Promotion sets

– The promotion set of each span is the union of all promotion sets of nuclear sub-spans

– The promotion set of an EDU is the EDU itself
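
A minimal sketch of the promotion-set definition, assuming a simple tree representation. The `Span` class and its field names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Span:
    """A node of an RS-tree; a leaf carries an EDU identifier."""
    nuclear: bool = True            # is this span a nucleus of its parent?
    edu: Optional[str] = None       # set only on leaves
    children: List["Span"] = field(default_factory=list)


def promotion_set(span: Span) -> Set[str]:
    """An EDU promotes itself; a larger span promotes what its nuclei promote."""
    if span.edu is not None:
        return {span.edu}
    promoted: Set[str] = set()
    for child in span.children:
        if child.nuclear:
            promoted |= promotion_set(child)
    return promoted


# A satellite EDU "A" attached to a multinuclear span over EDUs "B" and "C".
tree = Span(children=[
    Span(nuclear=False, edu="A"),
    Span(nuclear=True, children=[Span(edu="B"), Span(edu="C")]),
])
promoted = promotion_set(tree)
print(sorted(promoted))
```

Here the satellite "A" is not promoted, while both nuclei of the multinuclear span reach the root.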

Marcu's Content Selection Algorithm (cont'd)

● Build a partial ordering of EDUs*

– For each EDU, find the topmost span in which it's in the promotion set. Let d be the tree depth of this span.

– The rank of each EDU is
  ● If the EDU is in an embedded relation, d + 1
  ● Otherwise, d

– Example of the partial ordering

*re-worded from Marcu's description
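
The ranking rule can be sketched as below. Assumptions made for this illustration: the root has depth 0, a lower rank means a more important EDU, and embedded relations are flagged on the leaf. The `Span` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Span:
    nuclear: bool = True            # is this span a nucleus of its parent?
    edu: Optional[str] = None       # set only on leaves
    embedded: bool = False          # leaf takes part in an embedded relation
    children: List["Span"] = field(default_factory=list)


def promotion_set(span: Span) -> Set[str]:
    if span.edu is not None:
        return {span.edu}
    promoted: Set[str] = set()
    for child in span.children:
        if child.nuclear:
            promoted |= promotion_set(child)
    return promoted


def rank_edus(root: Span) -> Dict[str, int]:
    """Rank each EDU by the depth of the topmost span that promotes it."""
    ranks: Dict[str, int] = {}

    def visit(span: Span, depth: int) -> None:
        # Ancestors are visited first, so the first depth recorded is the topmost.
        for edu in promotion_set(span):
            ranks.setdefault(edu, depth)
        if span.edu is not None and span.embedded:
            ranks[span.edu] += 1    # embedded relations are pushed one level down
        for child in span.children:
            visit(child, depth + 1)

    visit(root, 0)                  # assumption: the root is at depth 0
    return ranks


tree = Span(children=[
    Span(nuclear=False, edu="A"),
    Span(nuclear=True, children=[
        Span(nuclear=True, edu="B"),
        Span(nuclear=False, edu="C", embedded=True),
    ]),
])
ranks = rank_edus(tree)
print(ranks)
```

In this toy tree "B" is promoted to the root and gets rank 0, "A" gets rank 1, and the embedded satellite "C" gets 2 + 1 = 3. EDUs that share a rank form one group in the partial ordering.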

Marcu's Content Selection Algorithm (cont'd)

● Given a summary length requirement

– Select the topmost EDU groups until it isn't possible to select more and honor the length requirement

– Effect: can't always generate a summary as close to desired length as possible
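
A rough sketch of group-wise selection under a length limit. It assumes length is counted in whitespace-separated words and that a lower rank is more important; `select_edus` and its parameters are hypothetical names.

```python
from itertools import groupby


def select_edus(edus, ranks, max_words):
    """Add whole rank groups, best first, while the word budget allows.

    edus: dict of EDU id -> text, in document order.
    ranks: dict of EDU id -> rank (lower is more important).
    """
    ordered = sorted(edus, key=lambda e: ranks[e])
    selected, used = [], 0
    for _, group in groupby(ordered, key=lambda e: ranks[e]):
        group = list(group)
        cost = sum(len(edus[e].split()) for e in group)
        if used + cost > max_words:
            break                   # a group is taken whole or not at all
        selected.extend(group)
        used += cost
    chosen = set(selected)
    return [e for e in edus if e in chosen]   # restore document order


edus = {
    "A": "When I got home,",
    "B": "I was tired",
    "C": "which surprised nobody",
}
ranks = {"B": 0, "A": 1, "C": 3}

print(select_edus(edus, ranks, 7))
print(select_edus(edus, ranks, 6))
print(select_edus(edus, ranks, 2))
```

With a budget of 7 words both "A" and "B" fit. With 6 words only "B" is selected, leaving three words unused, and with 2 words nothing is selected. This is the effect noted above.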

Generation desiderata

● Removal of problems
  – Dangling references
  – Dangling discourse markers
● Introduction of coherence
  – Generate smaller referring expressions
  – Generate discourse markers when appropriate

Example

Claude Bebear, chairman and chief executive officer, of Axa-Midi Assurances, pledged to retain employees and management of Farmers Group Inc.. Mr. Bebear made his remarks at a breakfast meeting with reporters here yesterday as part of a tour. Farmers was quick yesterday to point out the many negative aspects. For one, Axa plans to do away with certain tax credits.

The theoretical approach

● Content selection
  – Marcu's summarization algorithm
● Paragraph generation
  – Organize sentences into paragraphs
● Sentence generation
  – Construct complete sentences from EDUs

The theoretical approach (cont'd)

● Discourse marker generation
  – Remove discourse markers that refer to removed text spans
  – Generate discourse markers when none exists and one is appropriate
● Referring expression generation
  – Generate the best unambiguous referring expressions
    ● Shorter is better
    ● Faster to interpret is better

The implemented approach

● Content selection
  – Marcu's algorithm as stated
● Paragraph generation
  – Not implemented

Implementation: Sentence “generation”

● If a selected group of EDUs is an entire text span
  – Select them all as-is, capitalize the first letter and make sure it ends with punctuation
● If a selected group of EDUs is an entire text span, except for some embedded relations
  – Remove punctuation associated with embeddings, add sentence terminators from embeddings
● If a selected group of EDUs is a sentence
  – Select as-is
● If a selected EDU isn't part of such a group
  – Capitalize the first letter and end with punctuation

Implementation: Discourse marker generation

● Train to see which discourse markers go with which relations
● In generation, select discourse markers with a probability > 80%

Training on discourse markers

● Discourse markers identified by string matching at beginning and ending of each EDU

● List of markers taken from (Knott 1994)
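
Marker identification by string matching might look roughly like this. The marker list here is a small illustrative subset, not the actual list from (Knott 1994), and `find_markers` is a hypothetical name.

```python
import re

# Illustrative subset only; the real list is much longer.
MARKERS = ["but", "when", "because", "although", "for example"]


def find_markers(edu, markers=MARKERS):
    """Return (marker, position) pairs found at the start or end of an EDU."""
    text = edu.lower().strip(" ,.;:!?\"'")
    found = []
    for marker in markers:
        # \b keeps "but" from matching the beginning of "butter".
        if re.match(re.escape(marker) + r"\b", text):
            found.append((marker, "start"))
        if re.search(r"\b" + re.escape(marker) + r"$", text):
            found.append((marker, "end"))
    return found


print(find_markers(", but dogs bite."))
print(find_markers("When I got home,"))
print(find_markers("Butter is good"))
```

Surrounding punctuation is stripped before matching, so the contrast example from the RST refresher yields "but" at the start of its second EDU.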

Training on discourse markers (cont'd)

● Three statistics trained on binary, atomic spans with zero or one markers
  – Inclusion: $P(\text{include a marker} \mid \text{relation})$
  – Usage: $P(\text{marker} = m \mid \text{include}, \text{relation})$
  – Position: $P(\text{position} \in \{1, 2\} \times \{\text{start}, \text{end}\} \mid \text{marker}, \text{include}, \text{n-s order})$
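
A minimal sketch of estimating the inclusion and usage statistics by relative frequency, on made-up observations. Applying the 80% threshold to the product of inclusion and usage is an assumption of this sketch, since the slides do not say which probability is thresholded. The position statistic is omitted, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def train(observations):
    """observations: (relation, marker or None) pairs from binary, atomic spans."""
    total, included = Counter(), Counter()
    marker_counts = defaultdict(Counter)
    for relation, marker in observations:
        total[relation] += 1
        if marker is not None:
            included[relation] += 1
            marker_counts[relation][marker] += 1
    # Inclusion: P(include a marker | relation)
    inclusion = {r: included[r] / total[r] for r in total}
    # Usage: P(marker = m | include, relation)
    usage = {
        r: {m: c / included[r] for m, c in ms.items()}
        for r, ms in marker_counts.items()
    }
    return inclusion, usage


def choose_marker(relation, inclusion, usage, threshold=0.8):
    """Return the most probable marker if it clears the threshold, else None."""
    best = None
    for marker, p_use in usage.get(relation, {}).items():
        p = inclusion.get(relation, 0.0) * p_use   # assumed joint probability
        if p > threshold and (best is None or p > best[1]):
            best = (marker, p)
    return best[0] if best else None


# Toy observations, invented for illustration only.
observations = (
    [("contrast", "but")] * 9
    + [("contrast", None)]
    + [("elaboration", None)] * 2
    + [("elaboration", "and")]
)
inclusion, usage = train(observations)
print(choose_marker("contrast", inclusion, usage))
print(choose_marker("elaboration", inclusion, usage))
```

On the toy data "contrast" includes a marker 90% of the time and always uses "but", so "but" is chosen. "elaboration" falls below the threshold and gets no marker.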

Rough evaluation

● Sentence “generation” isn't much different from not changing it at all
  – Except embedded relation removal
● Out of 347 summaries, a discourse marker was only generated once
  – Ms. Johnson is awed by the earthquake's destructive force. "It really brings you down to a human level," Though "It's hard to accept all the suffering but you have to.

Desired approach: Content selection

● Marcu's algorithm can only select groups of EDUs
  – Sometimes produces overly short summaries or nothing at all
  – If a preferential ordering could be defined within each equivalence class, summaries could meet the desired length better
● EDUs tied to more salient EDUs have their score boosted

Desired approach: Paragraph generation

● Paragraphs in the source document are marked
  – Leave paragraph boundaries intact if they form large enough paragraphs
  – A shallow method, but has potential
● Correlate paragraph boundaries with something
  – RS-tree structure
  – Co-reference chain beginnings/endings
  – Topical text segments, by an extension of Hearst's text segmentation algorithm (Hearst 1994)

Desired approach: Sentence generation

● Apply shallow parsing to understand the rough syntactic structure of an EDU
● Relative clauses can be attached and full sentences generated like (Siddharthan 2004)

Desired approach: Discourse marker generation

● The probabilities computed in DM training aren't the best
  – Need to attach discourse markers and recompute, repeat until stable
  – The attachment algorithm involves a constraint-satisfaction problem
● DM attachment needed to perform DM removal
● A DM generator should understand syntax better
  – When should commas be included and where?

Desired approach: Referring expression generation

● Requires good co-reference resolution

– A reference resolver requires (at least) a base noun phrase chunker

– EDUs might be used in conjunction with a shallow parse to approximate Hobbs' naïve approach

● Mitkov (2002) describes Hobbs' naïve approach

● Generation algorithm only adds the creation of a list of referring expressions, ordered by preference

Conclusions

● Document length is poorly defined
  – Quite a bit of variation between EDU length, word length, and character length
● Attaching discourse markers to the relation they realize is tough
● Representing natural language in programs can be tough
● Summarization of quotations requires special treatment

References

● Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski (2001). Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001.

● Lynn Carlson and Daniel Marcu. (2001). Discourse Tagging Manual. ISI Tech Report ISI-TR-545. July 2001.

● Marti Hearst (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.

● Alistair Knott and Robert Dale (1994). Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes 18(1): 35-62.

● William Mann and Sandra Thompson (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.

References (cont'd)

● Daniel Marcu (1999). Discourse trees are good indicators of importance in text. In I. Mani and M. Maybury editors, Advances in Automatic Text Summarization, pages 123-136, The MIT Press.

– I think this is a cleanup of his earlier work from 1997.

● Ruslan Mitkov (2002). Anaphora Resolution. Pearson Education.

● Advaith Siddharthan (2004). Syntactic Simplification and Text Cohesion. To appear in the Journal of Language and Computation, Kluwer Academic Publishers, the Netherlands.
