Graz University of Technology 706.048 Information Search and Retrieval WS 2009 Automatic Text Summarization Group 10 Rejhan Basagic [0330740] Damir Krupic [0231316] Bojan Suzic [0631814] Supervisor: Dipl.-Ing. Dr.techn. Christian Gütl Institute for Information Systems and Computer Media, Graz
Copyright (C) 2009 Rejhan Basagic, Damir Krupic, Bojan Suzic.
This work may be used by anyone under the terms of the License for Free Content (Lizenz für Freie Inhalte).
The license terms are available at http://www.uvm.nrw.de/opencontent
or from the office of the Kompetenznetzwerk Universitätsverbund MultiMedia
ABSTRACT
One of the problems that has arisen with the rapid growth of the web and of information availability in general (sometimes referred to as information overload) is the increased need for effective and powerful text summarization. In this document we present a short historical overview of the advancement of automatic text summarization and the most relevant approaches currently used in this area, covering both single-document and multi-document summarization. Furthermore, examples of the use of the technology are summarized, and a short conclusion and future directions are given.
ZUSAMMENFASSUNG
One of the problems that has arisen with the rapid growth of the Internet and the availability of information in general is the increased need for effective and powerful text summarization. This document presents a short historical overview and the development of automatic text summarization. The most relevant and important concepts are presented, with a focus on single- and multi-document summarization. In addition, examples of the use of the technology are summarized, and a short conclusion and future directions are given.
TABLE OF CONTENTS
Abstract ..................................................................................................................................................................................... III
Zusammenfassung ................................................................................................................................................................ III
Table of Contents .................................................................................................................................................................... IV
1.1 Applications of the automatic summarization ......................................................................................... 6
1.2 Scope of this work ................................................................................................................................................ 7
- N is the total number of sentences in the document
- n(si) is the number of sentences with semantic similarity above a predefined threshold
- P(si) is the sentence position weight
- Sim(si,T) and Sim(si,Q) are the semantic similarities of the sentence with the document title and with the user query, respectively
- NE is the number of named entities in the document
- FNE(si) is the number of named entities contained in sentence si
The summary is generated by choosing the most important sentences in the document (those with the highest scores) and arranging them in chronological order. Multi-document summaries can be generated in a similar way by calculating sentence scores for each document separately and then choosing the highest-scoring sentences from all documents.
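This extraction step can be sketched as follows. The scoring function is assumed to be given (for instance, the weighted feature combination described above); the toy length-based score here is purely illustrative, not part of the original system:

```python
def summarize(sentences, score, k=3):
    """Pick the k highest-scoring sentences and restore document order.

    sentences: list of sentence strings, in original (chronological) order
    score:     function mapping a sentence index to its importance score
    """
    # Rank sentence indices by score and keep the top k
    top = sorted(range(len(sentences)), key=score, reverse=True)[:k]
    # Re-arrange the chosen sentences in their original (chronological) order
    return [sentences[i] for i in sorted(top)]

# Toy usage with sentence length as a stand-in scoring metric
doc = ["Short one.",
       "A much longer and more detailed sentence.",
       "Mid-sized sentence here."]
summary = summarize(doc, score=lambda i: len(doc[i]), k=2)
```

The same routine covers the multi-document case: concatenate the per-document sentence lists and let the score function carry each document's separately computed weights.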
A system built on top of this method was presented at TAC 2008. The system demonstrated better performance at finding relevant content than at removing irrelevant content. Additional weighting of the user query also yielded better performance than weighting based on headlines. For future work, the authors plan to implement redundancy checking and the removal of redundant information. This should be done on two levels: first, removing repeated or non-essential content within sentences by adding a new metric for a redundancy penalty, and second, maximizing the diversity of content information by introducing additional threshold metrics. Co-reference resolution and sentence compression, by means of syntactic trimming, are also planned.
4.3 NEURAL NETWORK BASED APPROACH
The NetSum system, developed at Microsoft Research [Svore et al. 2007], utilizes a machine-learning method based on the neural-network ranking algorithm RankNet. The system is tailored to extracting summaries of news articles consisting of three highlighted sentences. The goal is pure extraction, without any sentence compression or sentence generation; the system is thus designed to extract the three sentences from a single document that best match the document's three highlights.
To rank the extracted sentences, RankNet, a ranking algorithm based on neural networks, is used. The system is trained on pairs of sentences from a single document, such that the first sentence in each pair should be ranked higher than the second. Training uses a modified back-propagation algorithm for two-layer networks; RankNet is implemented as a two-layer neural network.
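The pairwise idea behind RankNet can be illustrated with a minimal sketch (this is the general RankNet pair cost, not Microsoft's implementation): for a training pair where sentence i should outrank sentence j, the score difference is mapped to a probability through a logistic function and penalized with a cross-entropy cost.

```python
import math

def ranknet_pair_cost(score_i, score_j):
    """Cross-entropy cost for a pair where i should be ranked above j.

    P(i > j) is modeled as the logistic function of the score difference;
    the cost -log P(i > j) vanishes as score_i - score_j grows large.
    """
    diff = score_i - score_j
    prob_i_first = 1.0 / (1.0 + math.exp(-diff))
    return -math.log(prob_i_first)

# A correctly ordered pair is cheap; an inverted pair is expensive
low = ranknet_pair_cost(2.0, -1.0)   # i scored well above j
high = ranknet_pair_cost(-1.0, 2.0)  # i scored well below j
```

Back-propagating the gradient of this cost through the two-layer network is what adjusts the learned sentence scores.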
This model utilizes the following features:
Symbol     Feature Name
F(Si)      Is First Sentence
Pos(Si)    Sentence Position
SB(Si)     SumBasic Score (5)
SBB(Si)    SumBasic Bigram Score
Sim(Si)    Title Similarity Score (6)
TABLE 1: FEATURES USED IN THE SYSTEM
(5) Describes sentence importance based on word frequency
(6) Based on the relative probability that a term in the particular sentence is present in the document title
Symbol     Feature Name (7)
NT(Si)     Average News Query Term Score
NT+(Si)    News Query Term Sum Score
NTr(Si)    Relative News Query Term Score
WE(Si)     Average Wikipedia Entity Score
WE+(Si)    Wikipedia Entity Sum Score
TABLE 2: FEATURES USED IN THE SYSTEM, BASED ON EXTERNAL DATA SOURCES
The news query logs are gathered from Microsoft's news search engine (8), while article titles are taken from Wikipedia (9). Hence, if parts of a news search query or a Wikipedia title appear frequently in the news article to be summarized, a higher importance score is attached to the sentences containing these terms.
The results of this summarization approach are encouraging. Based on evaluation using ROUGE and comparison against a baseline system, this system performs better than all previous systems for news article summarization from the DUC workshops. In feature ablation studies the authors confirmed that the inclusion of news search queries and Wikipedia titles improves performance: the gains of NetSum with external features are statistically significant at the 95% confidence level. However, NetSum also performs better than the baseline even without external features.
For future work the authors recommend a more detailed investigation of external features for possible inclusion. It is also expected that separate feature selection for each of the three highlight sentences could further improve performance, as the highlights have different characteristics. Further development of sentence simplification, splicing, and merging features is also anticipated.
(7) Features dependent on external data sources, namely query logs and Wikipedia
(8) http://search.live.com/news
(9) http://www.wikipedia.org
5 MULTI-DOCUMENT SUMMARIZATION
Nowadays it is enormously important to develop procedures for finding text efficiently. Single-document summarization systems support the automatic generation of extracts, abstracts, or query-based summaries. Single-document summaries provide limited information about the contents of a single document and support the user in deciding whether or not to read the document.
Consider a situation in which a user makes an inquiry about a topic that has recently been covered in the news. Such a query may return hundreds of documents. Although they differ in some respects, many of these documents provide the same information. A summary of each individual document would be of little help in this case, since the summaries themselves would be semantically similar. In today's world, in which time plays an important role, multi-document summarizers therefore play an essential role in such situations.
5.1 HISTORY OF MULTI-DOCUMENT SUMMARIZATION
The extraction of a summary from multiple documents became popular in the mid-1990s, mostly in the domain of news articles. Several web-based news clustering systems (10) were inspired by research on multi-document summarization. As already noted, the difference between single-document and multi-document summarization is that the latter involves multiple sources of information. The key task of multi-document summarization is not just identifying redundancy across documents, but also recognizing novelty and ensuring that the final summary is both coherent and complete [Das and Martins 2007].
This field of automatic summarization was pioneered by the NLP group at Columbia University [McKeown and Radev, 1995], where a summarization system called SUMMONS was developed by extending existing technology for template-driven message understanding systems.
5.1.1 SUMMONS
SUMMONS is a knowledge-based multi-document summarization system that produces summaries from a series of articles in the domain of terrorism. A set of semantic templates, previously extracted by a message understanding system, is supplied as input to the system. The system recognizes specific patterns across these templates, such as changes of perspective, contradictions, refinements, definitions, and elaborations. The techniques used in SUMMONS require a large amount of knowledge engineering effort, even for relatively small text domains.
(10) Google News, Columbia Newsblaster, News in Essence (previously referenced)
Its architecture is based on a language generator. The generator in SUMMONS consists mainly of two components: a content planner, which selects the information to be added to the text, and a linguistic component, which selects the words used to express the selected concepts. The content planner decides which information derived from the templates should be included. The linguistic component uses a grammar and a dictionary to determine the syntactic form of the summary. The length of the summary is determined by input parameters. Information is taken from several articles and evaluated in order of importance.
FIGURE 2: ARCHITECTURE OF SUMMONS [MCKEOWN AND RADEV, 1995]
5.2 ABSTRACTION
Abstraction method of automatic text summarization is, in difference to extractive method, based
on text generation techniques. In this approach the summaries contain the words not present in the
original document. Generally as of language complexity and ambiguity it is hard task for computer
research to solve this task successfully. The following example shows the usage of SUMMONS in
creating of abstract from four articles containing similar information.
These articles are provided to the message understanding system. The system generates templates, which are stored in a database. SUMMONS then uses the templates to generate the summary text.
Example: Four articles
FIGURE 3: ARTICLES PROVIDED TO THE SYSTEM
JERUSALEM - A Muslim suicide bomber blew apart 18 people on a Jerusalem bus and
wounded 10 in a mirror-image of an attack one week ago. The carnage could rob Israel's
Prime Minister Shimon Peres of the May 29 election victory he needs to pursue Middle East
peacemaking. Peres declared all-out war on Hamas but his tough talk did little to impress
stunned residents of Jerusalem who said the election would turn on the issue of personal
security.
JERUSALEM - A bomb at a busy Tel Aviv shopping mall killed at least 10 people and
wounded 30, Israel radio said quoting police. Army radio said the blast was apparently
caused by a suicide bomber. Police said there were many wounded.
A bomb blast ripped through the commercial heart of Tel Aviv Monday, killing at
least 13 people and wounding more than 100. Israeli police say an Islamic suicide bomber
blew himself up outside a crowded shopping mall. It was the fourth deadly bombing in
Israel in nine days. The Islamic fundamentalist group Hamas claimed responsibility for the
attacks, which have killed at least 54 people. Hamas is intent on stopping the Middle
East peace process. President Clinton joined the voices of international condemnation
after the latest attack. He said the "forces of terror shall not triumph" over
peacemaking efforts.
TEL AVIV (Reuter) - A Muslim suicide bomber killed at least 12 people and wounded
105, including children, outside a crowded Tel Aviv shopping mall Monday, police said.
Sunday, a Hamas suicide bomber killed 18 people on a Jerusalem bus. Hamas has now killed
at least 56 people in four attacks in nine days.
The windows of stores lining both sides of Dizengoff Street were shattered, the charred
skeletons of cars lay in the street, the sidewalks were strewn with blood. The last attack
on Dizengoff was in October 1994 when a Hamas suicide bomber killed 22 people on a bus.
MESSAGE: ID           TST-REU-0002
SECSOURCE: SOURCE     Reuters
SECSOURCE: DATE       March 4, 1996 07:20
PRIMSOURCE: SOURCE    Israel Radio
INCIDENT: DATE        March 4, 1996
INCIDENT: LOCATION    Tel Aviv
INCIDENT: TYPE        Bombing
HUM TGT: NUMBER       "killed: at least 10"
                      "wounded: more than 100"
PERP: ORGANIZATION ID

MESSAGE: ID           TST-REU-0001
SECSOURCE: SOURCE     Reuters
SECSOURCE: DATE       March 3, 1996 11:30
PRIMSOURCE: SOURCE
INCIDENT: DATE        March 3, 1996
INCIDENT: LOCATION    Jerusalem
INCIDENT: TYPE        Bombing
HUM TGT: NUMBER       "killed: 18"
                      "wounded: 10"
PERP: ORGANIZATION ID
FIGURE 4: TEMPLATES GENERATED FROM THE PROVIDED ARTICLES [RADEV 2004]
The content planner selects the information to include in the summary through a combination of the input templates. The linguistic generator selects the right words to express the information in grammatical and coherent text. Some of the operations performed by the content planner require resolving conflicts, for example contradictory information among different sources or time instants; others complete pieces of information that are included in some articles but not in others. Finally, the linguistic generator gathers all the combined information and uses connective phrases to synthesize a summary:
FIGURE 5: SUMMARY SYNTHETISED BY THE SYSTEM [RADEV 2004]
This method is very promising when the domain is narrow, but generalization to broader domains is problematic. This was later improved by McKeown and Barzilay.
MESSAGE: ID           TST-REU-0004
SECSOURCE: SOURCE     Reuters
SECSOURCE: DATE       March 4, 1996 14:30
PRIMSOURCE: SOURCE
INCIDENT: DATE        March 4, 1996
INCIDENT: LOCATION    Tel Aviv
INCIDENT: TYPE        Bombing
HUM TGT: NUMBER       "killed: at least 12"
                      "wounded: 105"
PERP: ORGANIZATION ID

MESSAGE: ID           TST-REU-0003
SECSOURCE: SOURCE     Reuters
SECSOURCE: DATE       March 4, 1996 14:20
PRIMSOURCE: SOURCE
INCIDENT: DATE        March 4, 1996
INCIDENT: LOCATION    Tel Aviv
INCIDENT: TYPE        Bombing
HUM TGT: NUMBER       "killed: at least 13"
                      "wounded: more than 100"
PERP: ORGANIZATION ID "Hamas"
Reuters reported that 18 people were killed on Sunday in a bombing in Jerusalem. The next
day, a bomb in Tel Aviv killed at least 10 people and wounded 30 according to Israel radio.
Reuters reported that at least 12 people were killed and 105 wounded in the second incident.
Later the same day, Reuters reported that Hamas has claimed responsibility for the act.
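The kind of conflict resolution the content planner performs on these templates can be illustrated with a toy sketch. The template fields follow the example above, but the merging rule and the generated phrasing are simplified assumptions for illustration, not SUMMONS itself:

```python
def combine(template_a, template_b):
    """Merge two incident templates assumed to describe the same event.

    If a later report revises the casualty count, emit a connective
    phrase flagging the update, as a content planner might.
    All field names here are illustrative.
    """
    same_event = (template_a["location"] == template_b["location"]
                  and template_a["date"] == template_b["date"])
    if same_event and template_a["killed"] != template_b["killed"]:
        return ("{src} initially reported {a} killed in the {loc} bombing; "
                "later reports put the toll at {b}.").format(
                    src=template_b["source"], a=template_a["killed"],
                    b=template_b["killed"], loc=template_a["location"])
    return None  # no conflict to resolve

# Simplified versions of templates 2 and 3 from the figure above
t2 = {"source": "Reuters", "date": "March 4, 1996",
      "location": "Tel Aviv", "killed": "at least 10"}
t3 = {"source": "Reuters", "date": "March 4, 1996",
      "location": "Tel Aviv", "killed": "at least 13"}
sentence = combine(t2, t3)
```

A full content planner would apply many such operators (contradiction, refinement, elaboration) and hand the result to the linguistic component for realization.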
5.3 TOPIC DRIVEN SUMMARIZATION AND MMR
MMR (maximal marginal relevance) is based on the vector-space model of text retrieval, and is
well-suited to query-based and multi-document summarization. In MMR, sentences are chosen
according to a weighted combination of their relevance to a query and their redundancy with the
sentences that have already been extracted.
"Let Q be a query or user profile and R a ranked list of documents retrieved by a search engine.
Consider an incremental procedure that selects documents, one at a time, and adds them to a set S.
So let S be the set of already selected documents in a particular step, and R \ S the set of yet
unselected documents in R. For each candidate document Di ∈ R\S, its marginal relevance MR(Di) is
computed as:" [Das and Martins 2007:14]
MR(Di) = λ · Sim1(Di, Q) − (1 − λ) · max_{Dj ∈ S} Sim2(Di, Dj)

In this formula, λ is a parameter that controls the relative importance given to relevance versus redundancy. Sim1 and Sim2 are two similarity measures, both set to the standard cosine similarity used in the vector space model:

Sim1(x, y) = Sim2(x, y) = (x · y) / (|x| × |y|)
The document with the highest marginal relevance is then selected, added to S, and the procedure continues until a maximum number of documents has been selected. It has been found that dynamically changing the parameter λ gives more effective results than keeping it fixed.
To perform summarization, documents can first be segmented into sentences; after a query is submitted, the MMR algorithm can be applied. The top-ranking passages (sentences) are selected, reordered according to their positions in the original documents, and presented as the summary.
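Under these definitions, MMR selection can be sketched directly. This is an illustrative implementation, with term-weight vectors as plain dicts and cosine similarity for both Sim1 and Sim2:

```python
import math

def cosine(x, y):
    """Standard vector-space similarity over dicts of term weights."""
    dot = sum(x[t] * y.get(t, 0.0) for t in x)
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0

def mmr_select(candidates, query, k, lam=0.5):
    """Greedily pick k items maximizing
    lam * Sim1(d, Q) - (1 - lam) * max over s in S of Sim2(d, s)."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def marginal(i):
            redundancy = max((cosine(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lam * cosine(candidates[i], query) - (1 - lam) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy vectors: two exact duplicates relevant to the query, one distinct item
docs = [{"bomb": 1.0, "attack": 1.0},
        {"bomb": 1.0, "attack": 1.0},   # duplicate of the first
        {"bomb": 0.5, "peace": 1.0}]
query = {"bomb": 1.0}
picked = mmr_select(docs, query, k=2)   # picks [0, 2]: duplicate is penalized
```

Note how the redundancy term suppresses the duplicate document even though it is just as relevant to the query as the first pick.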
5.4 CENTROID BASED SUMMARIZATION
MEAD is a centroid-based summarizer which uses several algorithms (position-based, TF-IDF (11), longest common subsequence, and keywords) to generate summaries. As a result, it returns centroid-based summaries. A centroid is a set of words that is statistically significant for a cluster of documents.
(11) Term frequency / inverse document frequency
Centroids are used both for the classification of relevant documents and for the identification of important phrases in a cluster.
The MEAD summarizer consists mainly of three components:
- a feature extractor,
- a sentence scorer, and
- a sentence reranker.
The feature extractor calculates the values of each feature, as defined by the user, for every sentence. The sentence scorer then assigns a score (a linear combination of the feature values) to every sentence. After that, the sentences are sorted according to their scores. The task of the sentence reranker is to insert sentences into the resulting summary, beginning with the highest ranked. In addition, the reranker checks each sentence for similarity against those already in the summary: if the degree of similarity exceeds a certain threshold, the reranker ignores the sentence and moves on to the next one.
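This three-stage pipeline can be sketched compactly. The feature names, weights, and the word-overlap similarity below are illustrative stand-ins, not MEAD's actual features or settings:

```python
def score(features, weights):
    """Sentence scorer: linear combination of a sentence's feature values."""
    return sum(weights[name] * value for name, value in features.items())

def word_overlap(a, b):
    """Illustrative similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def rerank(sentences, scores, similarity, threshold=0.5):
    """Add sentences by descending score, skipping any sentence
    too similar to one already included in the summary."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    summary = []
    for i in order:
        if all(similarity(sentences[i], sentences[j]) < threshold for j in summary):
            summary.append(i)
    return summary

sents = ["the bomb killed ten people",
         "the bomb killed ten people in tel aviv",   # near-duplicate
         "peace talks continue this week"]
weights = {"position": 1.0, "centroid": 2.0}
feats = [{"position": 1.0, "centroid": 0.9},
         {"position": 0.8, "centroid": 0.9},
         {"position": 0.5, "centroid": 0.3}]
scores = [score(f, weights) for f in feats]
summary = rerank(sents, scores, word_overlap)   # the near-duplicate is skipped
```

The reranker admits the top sentence and the distinct third sentence, while the near-duplicate second sentence is rejected by the similarity threshold.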
Similar documents are grouped into a cluster using an algorithm described in detail by Radev. Each document is represented as a weighted vector of TF-IDF values. CIDR (a system for the automatic placement of text documents into clusters) then generates a centroid using the first document in the cluster.
As new documents are processed, their TF-IDF vectors are compared with the centroid, which is computed as:

cj = ( Σ_{d ∈ Cj} d~ ) / |Cj|

In this formula:
- cj is the centroid of the j-th cluster,
- Cj is the set of documents that belong to the cluster,
- d~ is a "truncated version" of d that vanishes on those words whose TF-IDF scores are below a threshold.
If the similarity to the centroid cj is above a threshold, the new document is added to the cluster.
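The centroid computation and membership test can be sketched as follows; the truncation floor and admission threshold are illustrative values, not CIDR's settings:

```python
import math

def truncate(tfidf, floor=0.1):
    """d~ : drop terms whose TF-IDF weight falls below a threshold."""
    return {t: w for t, w in tfidf.items() if w >= floor}

def centroid(cluster):
    """cj = sum of truncated document vectors divided by |Cj|."""
    acc = {}
    for doc in cluster:
        for term, w in truncate(doc).items():
            acc[term] = acc.get(term, 0.0) + w
    n = len(cluster)
    return {t: w / n for t, w in acc.items()}

def cosine(x, y):
    dot = sum(x[t] * y.get(t, 0.0) for t in x)
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0

cluster = [{"bomb": 0.8, "attack": 0.6, "noise": 0.05},
           {"bomb": 0.7, "attack": 0.5}]
c = centroid(cluster)                   # the low-weight "noise" term is dropped
new_doc = {"bomb": 0.9, "attack": 0.4}
joins = cosine(new_doc, c) > 0.6        # admit the document if similar enough
```

Averaging only the truncated vectors keeps the centroid focused on the statistically significant words of the cluster.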
6 LESSONS LEARNED
The area of text summarization has been an active focus of the research community since the 1990s. Although several families of methods exist, in recent times the most widely applied ones have been based on various machine learning and statistical techniques, including binary classifiers, Bayesian methods, and heuristic methods with weighted feature vectors. Graph-based methods have also been employed, as have ones based on neural-network training. [Svore et al. 2007]
According to the methods presented in this work and the research results so far, many researchers agree that in order to gain better performance, multi-feature approaches should be used, especially ones based on external sources of information. Namely, recently cited works use Wikipedia titles or news queries from search engines to improve summarization performance. Which other sources of information and features should be used remains to be investigated. It is also important to note that summarization systems depend on the other modules used in the preprocessing and processing stages.
Where external NLP tools are used, the summarization system depends strongly on the quality and performance of the underlying POS taggers, chunkers, lemmatizers, stemmers, and sentence detectors. For many languages such tools are not fully available or are at an immature stage of development.
Where external ontology-based sources are used, the quality and depth of the ontology also play an important role. It has been shown [Farzindar and Lapalme 2004; Verma et al. 2007] that domain-specific systems give better overall results for document summarization. Works based on user query input [Svore et al. 2007] also show promising performance, in that summaries can be adjusted and built according to user requirements and expectations.
7 CONCLUSION
In this work a short introduction to automatic text summarization has been given. Interesting uses of this technology were described, as were the issues confronting the successful application of such methods. Further, a short historical background of research in the area has been elaborated, while most attention has been paid to the general approaches to summarization currently used and proposed. These approaches include single-document and multi-document summarization methods.
For single-document summarization, three methods relying on machine learning techniques were analyzed: an ontology-knowledge approach, an approach based on feature appraisal and the application of NLP in summarization, and, finally, a recent development applying neural networks to extractive text summarization, together with related findings and recommendations.
For multi-document summarization, a simple example of abstractive summary generation has been given, and topic-driven and centroid-based summarization were analyzed.
REFERENCES
[Bawakid and Oussalah 2008] Bawakid, A., Oussalah, M.: "A Semantic Summarization System: University of Birmingham at TAC 2008"; Proceedings of the First Text Analysis Conference (2008)
[Das and Martins 2007] Das, D., Martins, A. F.T.: "A Survey on Automatic Text Summarization"; Literature Survey for the Language and Statistics II course at CMU (2007)
[Edmundson 1969] Edmundson, H.P.: "New Methods in Automatic Extracting"; Journal of the ACM (JACM), Volume 16, Issue 2 (April 1969), 264-285
[Farzindar and Lapalme 2004] Farzindar, A. and Lapalme, G.: "LetSum, an automatic Legal Text Summarizing system"; Legal Knowledge and Information Systems. Jurix 2004: The Seventeenth Annual Conference. Amsterdam. IOS Press (2004), 11-18
[Heting 2007] Heting, C.: "Information Representation and Retrieval in the Digital Age"; Asist Monograph Series (2007), 7-8
[Jezek and Steinberger 2008] Jezek, K. and Steinberger, J.: "Automatic Text Summarization (The state of the art 2007 and new challenges)"; Znalosti 2008. FIIT STU Bratislava, Slovakia (2008), 1-12
[Lin et al. 2009] Lin, J., Ozsu, M. Tamer, Liu, L.: "Summarization"; Encyclopedia of Database Systems, Springer (2009)
[McKeown and Radev, 1995] McKeown, K.R. and Radev, D.R.: "Generating summaries of multiple news articles."; In Proceedings of SIGIR '95 (1995), 74-82
[Nenkova 2005] Nenkova, A.: "Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference"; American Association for Artificial Intelligence (2005)
[Neto et al. 2002] Neto, J.L., Freitas, A.A., Kaestner C.A.A.: "Automatic Text Summarization Using a Machine Learning Approach"; SBIA 2002. Springer Verlag Berlin Heidelberg (2002), 205-215
[Reimer and Hahn 1997] Reimer, U. and Hahn, U.: "A Formal Model of Text Summarization Based on Condensation Operators of a Terminological Logic"; Advances in Automatic Text Summarization (1999), 97-104
[Song et al. 2007] Song, X., Chi, Y., Hino, K., Tseng, B. L.: "Summarization System by Identifying Influential Blogs"; ICWSM 2007 (2007)
[Svore et al. 2007] Svore, K.M., Vanderwende, L., Burges, C.J.C.: "Enhancing Single-document Summarization by Combining RankNet and Third-party Sources"; Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2007), Prague, Czech Republic, 448-457
[Verma et al. 2007] Verma, R., Chen, P., Lu, W.: "A Semantic Free-text Summarization System Using Ontology Knowledge"; Proceedings of the Document Understanding Conference (2007)
[Zhiqi et al. 2005] Zhiqi, W., Yongcheng, W., Chuanhan, L., Derong, L.: "An Automatic Summarization Service System Based on Web Services"; Proceedings of the 2005 The Fifth International Conference on Computer and Information Technology (2005)
ABBREVIATIONS
DUC Document Understanding Conference
MEAD Platform for multi-document, multi-lingual text summarization
NE Named Entity
NLP Natural Language Processing
PDA Personal Digital Assistant
ROUGE Recall-Oriented Understudy for Gisting Evaluation
TAC Text Analysis Conference
UMLS Unified Medical Language System
FIGURES
Figure 1: The summarizer Architecture [ ] ................................................................................................................ 12
Figure 2: Architecture of SUMMONS [McKeown and Radev, 1995] ................................................................... 17
Figure 3: Articles provided to the system ................................................................................................................... 18
Figure 4: Templates generated from the provided articles [Radev, ACM SIGIR Tutorial] ...................... 19
Figure 5: Summary synthetised by the system [Radev, ACM SIGIR Tutorial] .............................................. 19
TABLES
Table 1: Features used in the system ............................................................................................................................ 14
Table 2: Features used in the system, based on external data sources........................................................... 15