Hierarchical Transformers for Multi-Document Summarization
Yang Liu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
[email protected], [email protected]
Abstract
In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments a previously proposed Transformer architecture (Liu et al., 2018) with the ability to encode documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows us to share information, as opposed to simply concatenating text spans and processing them as a flat sequence. Our model learns latent dependencies among textual units, but can also take advantage of explicit graph representations focusing on similarity or discourse relations. Empirical results on the WikiSum dataset demonstrate that the proposed architecture brings substantial improvements over several strong baselines.1
1 Introduction
Automatic summarization has enjoyed renewed interest in recent years, thanks to the popularity of neural network models and their ability to learn continuous representations without recourse to preprocessing tools or linguistic annotations. The availability of large-scale datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018) containing hundreds of thousands of document-summary pairs has driven the development of neural architectures for summarizing single documents. Several approaches have shown promising results with sequence-to-sequence models that encode a source document and then decode it into an abstractive summary (See et al., 2017; Celikyilmaz et al., 2018; Paulus et al., 2018; Gehrmann et al., 2018).
Multi-document summarization, the task of producing summaries from clusters of thematically related documents, has received significantly less attention, partly due to the paucity of suitable data for the application of learning methods. High-quality multi-document summarization datasets (i.e., document clusters paired with multiple reference summaries written by humans) have been produced for the Document Understanding and Text Analysis Conferences (DUC and TAC), but are relatively small (in the range of a few hundred examples) for training neural models. In an attempt to drive research further, Liu et al. (2018) tap into the potential of Wikipedia and propose a methodology for creating a large-scale dataset (WikiSum) for multi-document summarization with hundreds of thousands of instances. Wikipedia articles, specifically lead sections, are viewed as summaries of various topics indicated by their title, e.g., "Florence" or "Natural Language Processing". Documents cited in the Wikipedia articles or web pages returned by Google (using the section titles as queries) are seen as the source cluster which the lead section purports to summarize.

1 Our code and data are available at https://github.com/nlpyang/hiersumm.
Aside from the difficulties in obtaining training data, a major obstacle to the application of end-to-end models to multi-document summarization is the sheer size and number of source documents, which can be very large. As a result, it is practically infeasible (given memory limitations of current hardware) to train a model which encodes all of them into vectors and subsequently generates a summary from them. Liu et al. (2018) propose a two-stage architecture, where an extractive model first selects a subset of salient passages, and subsequently an abstractive model generates the summary while conditioning on the extracted subset. The selected passages are concatenated into a flat sequence and the Transformer (Vaswani et al., 2017), an architecture well-suited to language modeling over long sequences, is used to decode the summary.
Although the model of Liu et al. (2018) takes an important first step towards abstractive multi-document summarization, it still considers the multiple input documents as a concatenated flat sequence, remaining agnostic to the hierarchical structures and the relations that might exist among documents. For example, different web pages might repeat the same content, include additional content, present contradictory information, or discuss the same fact in a different light (Radev, 2000). The realization that cross-document links are important in isolating salient information, eliminating redundancy, and creating overall coherent summaries has led to the widespread adoption of graph-based models for multi-document summarization (Erkan and Radev, 2004; Christensen et al., 2013; Wan, 2008; Parveen and Strube, 2014). Graphs conveniently capture the relationships between textual units within a document collection and can be easily constructed under the assumption that text spans represent graph nodes and edges are semantic links between them.
In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments the previously proposed Transformer architecture with the ability to encode multiple documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows the model to share information across multiple documents, as opposed to simply concatenating text spans and feeding them as a flat sequence to the model. In this way, the model automatically learns richer structural dependencies among textual units, thus incorporating well-established insights from earlier work. Advantageously, the proposed architecture can easily benefit from information external to the model, i.e., by replacing inter-document attention with a graph matrix computed on the basis of lexical similarity (Erkan and Radev, 2004) or discourse relations (Christensen et al., 2013).
We evaluate our model on the WikiSum dataset and show experimentally that the proposed architecture brings substantial improvements over several strong baselines. We also find that the addition of a simple ranking module which scores documents based on their usefulness for the target summary can greatly boost the performance of a multi-document summarization system.
2 Related Work
Most previous multi-document summarization methods are extractive, operating over graph-based representations of sentences or passages. Approaches vary depending on how edge weights are computed, e.g., based on cosine similarity with tf-idf weights for words (Erkan and Radev, 2004) or on discourse relations (Christensen et al., 2013), and the specific algorithm adopted for ranking text units for inclusion in the final summary. Several variants of the PageRank algorithm have been adopted in the literature (Erkan and Radev, 2004) in order to compute the importance or salience of a passage recursively based on the entire graph. More recently, Yasunaga et al. (2017) propose a neural version of this framework, where salience is estimated using features extracted from sentence embeddings and graph convolutional networks (Kipf and Welling, 2017) applied over the relation graph representing cross-document links.
Abstractive approaches have met with limited success. A few systems generate summaries based on sentence fusion, a technique which identifies fragments conveying common information across documents and combines these into sentences (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Bing et al., 2015). Although neural abstractive models have achieved promising results on single-document summarization (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Celikyilmaz et al., 2018), the extension of sequence-to-sequence architectures to multi-document summarization is less straightforward. Apart from the lack of sufficient training data, neural models also face the computational challenge of processing multiple source documents. Previous solutions include model transfer (Zhang et al., 2018; Lebanoff and Liu, 2018), where a sequence-to-sequence model is pretrained on single-document summarization data and fine-tuned on DUC (multi-document) benchmarks, or unsupervised models relying on reconstruction objectives (Ma et al., 2016; Chu and Liu, 2018).
Liu et al. (2018) propose a methodology for constructing large-scale summarization datasets and a two-stage model which first extracts salient information from source documents and then uses a decoder-only architecture (that can attend to very long sequences) to generate the summary. We follow their setup in viewing multi-document summarization as a supervised machine learning problem and for this purpose assume access to large, labeled datasets (i.e., source document-summary pairs). In contrast to their approach, we use a learning-based ranker and our abstractive model can hierarchically encode the input documents, with the ability to learn latent relations across documents and additionally incorporate information encoded in well-known graph representations.

[Figure 1: Pipeline of our multi-document summarization system. L source paragraphs are first ranked and the L′-best ones serve as input to an encoder-decoder model which generates the target summary.]
3 Model Description
We follow Liu et al. (2018) in treating the generation of lead Wikipedia sections as a multi-document summarization task. The input to a hypothetical system is the title of a Wikipedia article and a collection of source documents, while the output is the Wikipedia article's first section. Source documents are webpages cited in the References section of the Wikipedia article and the top 10 search results returned by Google (with the title of the article as the query). Since source documents could be relatively long, they are split into multiple paragraphs by line-breaks. More formally, given title T and L input paragraphs {P_1, ..., P_L} (retrieved from Wikipedia citations and a search engine), the task is to generate the lead section D of the Wikipedia article.
Our summarization system is illustrated in Figure 1. Since the input paragraphs are numerous and possibly lengthy, instead of directly applying an abstractive system, we first rank them and summarize the L′-best ones. Our summarizer follows the very successful encoder-decoder architecture (Bahdanau et al., 2015), where the encoder encodes the input text into hidden representations and the decoder generates target summaries based on these representations. In this paper, we focus exclusively on the encoder part of the model; our decoder follows the Transformer architecture introduced in Vaswani et al. (2017), generating a summary token by token while attending to the source input. We also use beam search and a length penalty (Wu et al., 2016) in the decoding process to generate more fluent and longer summaries.
3.1 Paragraph Ranking

Unlike Liu et al. (2018), who rank paragraphs based on their similarity with the title (using tf-idf-based cosine similarity), we adopt a learning-based approach. A logistic regression model is applied to each paragraph to calculate a score indicating whether it should be selected for summarization. We use two recurrent neural networks with Long Short-Term Memory units (LSTM; Hochreiter and Schmidhuber 1997) to represent title T and source paragraph P:

{u_{t1}, ..., u_{tm}} = lstm_t({w_{t1}, ..., w_{tm}})    (1)
{u_{p1}, ..., u_{pn}} = lstm_p({w_{p1}, ..., w_{pn}})    (2)

where w_{ti}, w_{pj} are word embeddings for tokens in T and P, and u_{ti}, u_{pj} are the updated vectors for each token after applying the LSTMs.

A max-pooling operation is then used over title vectors to obtain a fixed-length representation \hat{u}_t:

\hat{u}_t = maxpool({u_{t1}, ..., u_{tm}})    (3)

We concatenate \hat{u}_t with the vector u_{pi} of each token in the paragraph and apply a non-linear transformation to extract features for matching the title and the paragraph. A second max-pooling operation yields the final paragraph vector \hat{p}:

p_i = tanh(W_1([u_{pi}; \hat{u}_t]))    (4)
\hat{p} = maxpool({p_1, ..., p_n})    (5)

Finally, to estimate whether a paragraph should be selected, we use a linear transformation and a sigmoid function:

s = sigmoid(W_2 \hat{p})    (6)

where s is the score indicating whether paragraph P should be used for summarization.
All input paragraphs {P_1, ..., P_L} receive scores {s_1, ..., s_L}. The model is trained by minimizing the cross-entropy loss between s_i and ground-truth scores y_i denoting the relatedness of a paragraph to the gold-standard summary. We adopt ROUGE-2 recall (of paragraph P_i against the gold target text D) as y_i. At test time, input paragraphs are ranked based on the model-predicted scores and an ordering {R_1, ..., R_L} is generated. The first L′ paragraphs {R_1, ..., R_{L′}} are selected as input to the second, abstractive stage.
3.2 Paragraph Encoding
Instead of treating the selected paragraphs as a very long sequence, we develop a hierarchical model based on the Transformer architecture (Vaswani et al., 2017) to capture inter-paragraph relations. The model is composed of several local and global transformer layers which can be stacked freely. Let t_{ij} denote the j-th token in the i-th ranked paragraph R_i; the model takes vectors x^0_{ij} (for all tokens) as input. For the l-th transformer layer, the input is x^{l-1}_{ij} and the output is written as x^l_{ij}.
3.2.1 Embeddings
Input tokens are first represented by word embeddings. Let w_{ij} ∈ R^d denote the embedding assigned to t_{ij}. Since the Transformer is a non-recurrent model, we also assign a special positional embedding pe_{ij} to t_{ij}, to indicate the position of the token within the input.

To calculate positional embeddings, we follow Vaswani et al. (2017) and use sine and cosine functions of different frequencies. The embedding e_p for the p-th element in a sequence is:

e_p[2i] = sin(p / 10000^{2i/d})    (7)
e_p[2i+1] = cos(p / 10000^{2i/d})    (8)

where e_p[i] indicates the i-th dimension of the embedding vector. Because each dimension of the positional encoding corresponds to a sinusoid, for any fixed offset o, e_{p+o} can be represented as a linear function of e_p, which enables the model to distinguish relative positions of input elements.

In multi-document summarization, token t_{ij} has two positions that need to be considered, namely i (the rank of the paragraph) and j (the position of the token within the paragraph). Positional embedding pe_{ij} ∈ R^d represents both positions (via concatenation) and is added to word embedding w_{ij} to obtain the final input vector x^0_{ij}:

pe_{ij} = [e_i; e_j]    (9)
x^0_{ij} = w_{ij} + pe_{ij}    (10)
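As a concrete illustration of Equations (7)-(10), the sketch below builds the hierarchical positional embedding; it assumes the paragraph-rank encoding e_i and the token-position encoding e_j each occupy d/2 dimensions so that their concatenation lies in R^d, and the function names are ours.

```python
import math
import torch

def sinusoid(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of Eqs. (7)-(8) for a 1-D tensor of positions (dim assumed even)."""
    half = dim // 2
    freq = torch.exp(-torch.arange(half, dtype=torch.float) * 2.0 / dim * math.log(10000.0))
    angles = positions.float().unsqueeze(-1) * freq          # (len, dim/2)
    enc = torch.zeros(positions.size(0), dim)
    enc[:, 0::2] = torch.sin(angles)                          # even dimensions, Eq. (7)
    enc[:, 1::2] = torch.cos(angles)                          # odd dimensions, Eq. (8)
    return enc

def hierarchical_pos_embedding(num_paras: int, para_len: int, d: int) -> torch.Tensor:
    """pe_ij = [e_i; e_j] (Eq. 9): paragraph rank i and token position j, d/2 dims each."""
    e_i = sinusoid(torch.arange(num_paras), d // 2)           # (num_paras, d/2)
    e_j = sinusoid(torch.arange(para_len), d // 2)            # (para_len, d/2)
    return torch.cat([
        e_i.unsqueeze(1).expand(-1, para_len, -1),            # broadcast rank over tokens
        e_j.unsqueeze(0).expand(num_paras, -1, -1),           # broadcast position over paragraphs
    ], dim=-1)                                                # (num_paras, para_len, d)

# Eq. (10): x0 = word_embeddings + hierarchical_pos_embedding(m, n, d)
```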
3.2.2 Local Transformer Layer
A local transformer layer is used to encode contextual information for tokens within each paragraph. The local transformer layer is the same as the vanilla transformer layer (Vaswani et al., 2017) and is composed of two sub-layers:

h = LayerNorm(x^{l-1} + MHAtt(x^{l-1}))    (11)
x^l = LayerNorm(h + FFN(h))    (12)

where LayerNorm is the layer normalization proposed in Ba et al. (2016); MHAtt is the multi-head attention mechanism introduced in Vaswani et al. (2017), which allows each token to attend to other tokens with different attention distributions; and FFN is a two-layer feed-forward network with ReLU as its hidden activation function.
3.2.3 Global Transformer Layer
A global transformer layer is used to exchange information across multiple paragraphs. As shown in Figure 2, we first apply a multi-head pooling operation to each paragraph. Different heads will encode paragraphs with different attention weights. Then, for each head, an inter-paragraph attention mechanism is applied, where each paragraph can collect information from other paragraphs by self-attention, generating a context vector to capture contextual information from the whole input. Finally, context vectors are concatenated, linearly transformed, added to the vector of each token, and fed to a feed-forward layer, updating the representation of each token with global information.
Multi-head Pooling  To obtain fixed-length paragraph representations, we apply a weighted-pooling operation; instead of using only one representation for each paragraph, we introduce a multi-head pooling mechanism, where for each paragraph, weight distributions over tokens are calculated, allowing the model to flexibly encode paragraphs in different representation subspaces by attending to different words.

Let x^{l-1}_{ij} ∈ R^d denote the output vector of the last transformer layer for token t_{ij}, which is used as input for the current layer. For each paragraph R_i and for each head z ∈ {1, ..., n_{head}}, we first transform the input vectors into attention scores a^z_{ij} and value vectors b^z_{ij}. Then, for each head, we calculate a probability distribution \hat{a}^z_{ij} over tokens within the paragraph based on the attention scores:

a^z_{ij} = W^z_a x^{l-1}_{ij}    (13)
b^z_{ij} = W^z_b x^{l-1}_{ij}    (14)
\hat{a}^z_{ij} = exp(a^z_{ij}) / \sum_{j'=1}^{n} exp(a^z_{ij'})    (15)

where W^z_a ∈ R^{1×d} and W^z_b ∈ R^{d_{head}×d} are weights, d_{head} = d / n_{head} is the dimension of each head, and n is the number of tokens in R_i.
We next apply a weighted summation with another linear transformation and layer normalization to obtain vector head^z_i for the paragraph:

head^z_i = LayerNorm(W^z_c \sum_{j=1}^{n} \hat{a}^z_{ij} b^z_{ij})    (16)

where W^z_c ∈ R^{d_{head}×d_{head}} is the weight. The model can flexibly incorporate multiple heads, with each paragraph having multiple attention distributions, thereby focusing on different views of the input.
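A compact sketch of the multi-head pooling in Equations (13)-(16); it is our own re-implementation under the stated dimensions. For brevity, the per-head projections W^z_a and W^z_b are folded into single linear layers (which is equivalent), a single W_c is shared across heads (the paper learns one per head), and padding over short paragraphs is omitted; all names are ours.

```python
import torch
import torch.nn as nn

class MultiHeadPooling(nn.Module):
    """Multi-head pooling of Eqs. (13)-(16): one weighted token average per head."""

    def __init__(self, d: int = 256, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d // n_heads
        self.w_a = nn.Linear(d, n_heads, bias=False)   # Eq. (13): one scalar score per head
        self.w_b = nn.Linear(d, d, bias=False)         # Eq. (14): d_head values per head
        self.w_c = nn.Linear(self.d_head, self.d_head, bias=False)  # Eq. (16), shared across heads here
        self.norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_paras, para_len, d) -> per-head paragraph vectors: (num_paras, n_heads, d_head)
        p, n, _ = x.size()
        a = self.w_a(x).transpose(1, 2)                # (p, n_heads, n) attention scores
        a_hat = torch.softmax(a, dim=-1)               # Eq. (15): distribution over tokens
        b = self.w_b(x).view(p, n, self.n_heads, self.d_head).transpose(1, 2)  # (p, n_heads, n, d_head)
        pooled = torch.einsum("phn,phnd->phd", a_hat, b)   # weighted sum over tokens
        return self.norm(self.w_c(pooled))             # Eq. (16): head_i^z
```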
Inter-paragraph Attention  We model the dependencies across multiple paragraphs with an inter-paragraph attention mechanism. Similar to self-attention, inter-paragraph attention allows each paragraph to attend to other paragraphs by calculating an attention distribution:

q^z_i = W^z_q head^z_i    (17)
k^z_i = W^z_k head^z_i    (18)
v^z_i = W^z_v head^z_i    (19)
context^z_i = \sum_{i'=1}^{m} [ exp((q^z_i)^T k^z_{i'}) / \sum_{o=1}^{m} exp((q^z_i)^T k^z_o) ] v^z_{i'}    (20)

where q^z_i, k^z_i, v^z_i ∈ R^{d_{head}} are query, key, and value vectors that are linearly transformed from head^z_i as in Vaswani et al. (2017); context^z_i ∈ R^{d_{head}} represents the context vector generated by a self-attention operation over all paragraphs; and m is the number of input paragraphs. Figure 2 provides a schematic view of inter-paragraph attention.
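Correspondingly, a minimal sketch of the inter-paragraph attention of Equations (17)-(20), operating on the per-head paragraph vectors produced by multi-head pooling. Names are ours; the projection weights are shared across heads here for brevity (the paper learns separate weights per head), and, following Eq. (20), no scaling factor is applied to the dot products.

```python
import torch
import torch.nn as nn

class InterParagraphAttention(nn.Module):
    """Self-attention over paragraph vectors, per head (Eqs. 17-20)."""

    def __init__(self, d_head: int = 32):
        super().__init__()
        self.w_q = nn.Linear(d_head, d_head, bias=False)   # Eq. (17)
        self.w_k = nn.Linear(d_head, d_head, bias=False)   # Eq. (18)
        self.w_v = nn.Linear(d_head, d_head, bias=False)   # Eq. (19)

    def forward(self, heads: torch.Tensor) -> torch.Tensor:
        # heads: (n_heads, m, d_head), where m is the number of paragraphs
        q, k, v = self.w_q(heads), self.w_k(heads), self.w_v(heads)
        scores = torch.matmul(q, k.transpose(-2, -1))    # (n_heads, m, m): (q_i)^T k_{i'}
        weights = torch.softmax(scores, dim=-1)          # normalize over source paragraphs
        return torch.matmul(weights, v)                  # Eq. (20): context_i^z
```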
Feed-forward Networks  We next update token representations with contextual information. We first fuse information from all heads by concatenating all context vectors and applying a linear transformation with weight W_c ∈ R^{d×d}:

c_i = W_c [context^1_i; ...; context^{n_{head}}_i]    (21)
[Figure 2: A global transformer layer. Different colors indicate different heads in multi-head pooling and inter-paragraph attention.]
We then add c_i to each input token vector x^{l-1}_{ij} and feed the result to a two-layer feed-forward network with ReLU as the activation function, followed by a residual connection and layer normalization on top:

g_{ij} = W_{o2} ReLU(W_{o1}(x^{l-1}_{ij} + c_i))    (22)
x^l_{ij} = LayerNorm(g_{ij} + x^{l-1}_{ij})    (23)

where W_{o1} ∈ R^{d_{ff}×d} and W_{o2} ∈ R^{d×d_{ff}} are the weights and d_{ff} is the hidden size of the feed-forward layer. In this way, each token within paragraph R_i can collect information from other paragraphs in a hierarchical and efficient manner.
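The fusion step of Equations (21)-(23) can be sketched as follows: given the per-head context vectors and the token vectors of the previous layer, it returns the updated token representations. This is a simplified sketch with our own names; dropout is omitted.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Fuse per-head paragraph contexts into every token (Eqs. 21-23)."""

    def __init__(self, d: int = 256, d_ff: int = 1024):
        super().__init__()
        self.w_c = nn.Linear(d, d, bias=False)   # Eq. (21): merge concatenated head contexts
        self.w_o1 = nn.Linear(d, d_ff)           # Eq. (22)
        self.w_o2 = nn.Linear(d_ff, d)
        self.norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor, contexts: torch.Tensor) -> torch.Tensor:
        # x: (m, n, d) token vectors; contexts: (n_heads, m, d_head) with n_heads * d_head = d
        m = x.size(0)
        c = self.w_c(contexts.transpose(0, 1).reshape(m, -1))      # Eq. (21): (m, d)
        g = self.w_o2(torch.relu(self.w_o1(x + c.unsqueeze(1))))   # Eq. (22): c_i broadcast over tokens
        return self.norm(g + x)                                    # Eq. (23)
```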
3.2.4 Graph-informed Attention

The inter-paragraph attention mechanism can be viewed as learning a latent graph representation (self-attention weights) of the input paragraphs. Although previous work has shown that similar latent representations are beneficial for downstream NLP tasks (Liu and Lapata, 2018; Kim et al., 2017; Williams et al., 2018; Niculae et al., 2018; Fernandes et al., 2019), much work in multi-document summarization has taken advantage of explicit graph representations, each focusing on different facets of the summarization task (e.g., capturing redundant information or representing passages referring to the same event or entity). One advantage of the hierarchical transformer is that we can easily incorporate graphs external to the model to generate better summaries.
We experimented with two well-established graph representations which we discuss briefly below. However, there is nothing inherent in our model that restricts us to these; any graph modeling relationships across paragraphs could have been used instead. Our first graph aims to capture lexical relations; graph nodes correspond to paragraphs and edge weights are cosine similarities based on tf-idf representations of the paragraphs. Our second graph aims to capture discourse relations (Christensen et al., 2013); it builds an Approximate Discourse Graph (ADG) (Yasunaga et al., 2017) over paragraphs, where edges between paragraphs are drawn by counting (a) co-occurring entities and (b) discourse markers (e.g., however, nevertheless) connecting two adjacent paragraphs (see the Appendix for details on how ADGs are constructed).
We represent such graphs with a matrix G, where G_{ii'} is the weight of the edge connecting paragraphs i and i'. We can then inject this graph into our hierarchical transformer by simply substituting one of its (learned) heads z' with G. Equation (20) for calculating the context vector for this head is modified as:

context^{z'}_i = \sum_{i'=1}^{m} ( G_{ii'} / \sum_{o=1}^{m} G_{io} ) v^{z'}_{i'}    (24)
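In code, injecting an external graph amounts to replacing the softmax weights of one head with the row-normalized graph matrix, as in Equation (24). A minimal sketch, assuming G is a dense m x m tensor of non-negative edge weights (the small epsilon guarding empty rows is our own addition):

```python
import torch

def graph_informed_context(G: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (24): context_i = sum_i' (G_ii' / sum_o G_io) * v_i'.

    G: (m, m) non-negative edge weights; v: (m, d_head) value vectors of head z'.
    The eps term is a hypothetical safeguard for paragraphs with no outgoing edges.
    """
    weights = G / (G.sum(dim=-1, keepdim=True) + eps)  # row-normalize the graph
    return torch.matmul(weights, v)                    # (m, d_head)
```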
4 Experimental Setup
WikiSum Dataset  We used the scripts and URLs provided in Liu et al. (2018) to crawl Wikipedia articles and source reference documents. We successfully crawled 78.9% of the original documents (some URLs have become invalid and the corresponding documents could not be retrieved). We further removed clone paragraphs (which are exact copies of some parts of the Wikipedia articles); these were paragraphs in the source documents whose bigram recall against the target summary was higher than 0.8. On average, each input has 525 paragraphs, and each paragraph has 70.1 tokens. The average length of the target summary is 139.4 tokens. We split the dataset into 1,579,360 instances for training, 38,144 for validation, and 38,205 for test.
Method       L′ = 5   L′ = 10   L′ = 20   L′ = 40
Similarity   24.86    32.43     40.87     49.49
Ranking      39.38    46.74     53.84     60.42

Table 1: ROUGE-L recall against the target summary for the L′-best paragraphs obtained with tf-idf cosine similarity and our ranking model.
For both the ranking and summarization stages, we encode source paragraphs and target summaries using subword tokenization with SentencePiece (Kudo and Richardson, 2018). Our vocabulary consists of 32,000 subwords and is shared between source and target.
Paragraph Ranking  To train the regression model, we calculated the ROUGE-2 recall (Lin, 2004) of each paragraph against the target summary and used this as the ground-truth score. The hidden size of the two LSTMs was set to 256, and dropout (with dropout probability of 0.2) was used before all linear layers. Adagrad (Duchi et al., 2011) with learning rate 0.15 was used for optimization. We compare our ranking model against the method proposed in Liu et al. (2018), who use the tf-idf cosine similarity between each paragraph and the article title to rank the input paragraphs. We take the first L′ paragraphs from the ordered paragraph sets produced by our ranker and the similarity-based method, respectively. We concatenate these paragraphs and calculate their ROUGE-L recall against the gold target text. The results are shown in Table 1. We can see that our ranker effectively extracts related paragraphs and produces more informative input for the downstream summarization task.
Training Configuration  In all abstractive models, we apply dropout (with probability 0.1) before all linear layers; label smoothing (Szegedy et al., 2016) with smoothing factor 0.1 is also used. Training follows the traditional sequence-to-sequence paradigm with maximum likelihood estimation. The optimizer was Adam (Kingma and Ba, 2014) with learning rate 2, β1 = 0.9, and β2 = 0.998; we also applied learning rate warmup over the first 8,000 steps and decay as in Vaswani et al. (2017). All transformer-based models had 256 hidden units; the feed-forward hidden size was 1,024 for all layers. All models were trained on 4 GPUs (NVIDIA TITAN Xp) for 500,000 steps. We used gradient accumulation to keep training time for all models approximately consistent. We selected the 5 best checkpoints based on performance on the validation set and report averaged results on the test set.

Model                                               ROUGE-1  ROUGE-2  ROUGE-L
Lead                                                38.22    16.85    26.89
LexRank                                             36.12    11.67    22.52
FT (600 tokens, no ranking)                         35.46    20.26    30.65
FT (600 tokens)                                     40.46    25.26    34.65
FT (800 tokens)                                     40.56    25.35    34.73
FT (1,200 tokens)                                   39.55    24.63    33.99
T-DMCA (3,000 tokens)                               40.77    25.60    34.90
HT (1,600 tokens)                                   40.82    25.99    35.08
HT (1,600 tokens) + Similarity Graph                40.80    25.95    35.08
HT (1,600 tokens) + Discourse Graph                 40.81    25.95    35.24
HT (train on 1,600 tokens / test on 3,000 tokens)   41.53    26.52    35.76

Table 2: Test set results on the WikiSum dataset using ROUGE F1.
During decoding we use beam search with beam size 5 and a length penalty with α = 0.4 (Wu et al., 2016); we decode until an end-of-sequence token is reached.
Comparison Systems  We compared the proposed hierarchical transformer against several strong baselines:

Lead is a simple baseline that concatenates the title and ranked paragraphs, and extracts the first k tokens; we set k to the length of the ground-truth target.

LexRank (Erkan and Radev, 2004) is a widely-used graph-based extractive summarizer; we build a graph with paragraphs as nodes and edges weighted by tf-idf cosine similarity; we run a PageRank-like algorithm on this graph to rank and select paragraphs until the length of the ground-truth summary is reached.

Flat Transformer (FT) is a baseline that applies a Transformer-based encoder-decoder model to a flat token sequence. We used a 6-layer transformer. The title and ranked paragraphs were concatenated and truncated to 600, 800, and 1,200 tokens.

T-DMCA is the best performing model of Liu et al. (2018) and a shorthand for Transformer Decoder with Memory Compressed Attention; they only used a Transformer decoder and compressed the keys and values in self-attention with a convolutional layer. The model has 5 layers as in Liu et al. (2018). Its hidden size is 512 and its feed-forward hidden size is 2,048. The title and ranked paragraphs were concatenated and truncated to 3,000 tokens.

Hierarchical Transformer (HT) is the model proposed in this paper. The model architecture is a 7-layer network (with 5 local-attention layers at the bottom and 2 global attention layers at the top). The model takes the title and L′ = 24 paragraphs as input to produce a target summary, which leads to approximately 1,600 input tokens per instance.
5 Results
Automatic Evaluation  We evaluated summarization quality using ROUGE F1 (Lin, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency.
Table 2 summarizes our results. The first block in the table includes extractive systems (Lead, LexRank), the second block includes several variants of Flat Transformer-based models (FT, T-DMCA), while the rest of the table presents the results of our Hierarchical Transformer (HT). As can be seen, abstractive models generally outperform extractive ones. The Flat Transformer achieves best results when the input length is set to 800 tokens, while longer input (i.e., 1,200 tokens) actually hurts performance. The Hierarchical Transformer with 1,600 input tokens outperforms FT, and even T-DMCA when the latter is presented with 3,000 tokens. Adding an external graph also seems to help the summarization process. The similarity graph does not have an obvious influence on the results, while the discourse graph boosts ROUGE-L by 0.16.

Model       R1     R2     RL
HT          40.82  25.99  35.08
HT w/o PP   40.21  24.54  34.71
HT w/o MP   39.90  24.34  34.61
HT w/o GT   39.01  22.97  33.76

Table 3: Hierarchical Transformer and versions thereof without (w/o) paragraph position (PP), multi-head pooling (MP), and the global transformer layer (GT).
We also found that the performance of the Hierarchical Transformer further improves when the model is presented with longer input at test time.2 As shown in the last row of Table 2, when testing on 3,000 input tokens, summarization quality improves across the board. This suggests that the model can potentially generate better summaries without increasing training time.

2 This was not the case with the other Transformer models.
Table 3 summarizes ablation studies aiming to assess the contribution of individual components. Our experiments confirmed that encoding paragraph position in addition to token position within each paragraph is beneficial (see row w/o PP), as is multi-head pooling (w/o MP is a model where the number of heads is set to 1) and the global transformer layer (w/o GT is a model with only 5 local transformer layers in the encoder).
Human Evaluation  In addition to automatic evaluation, we also assessed system performance by eliciting human judgments on 20 randomly selected test instances. Our first evaluation study quantified the degree to which summarization models retain key information from the documents following a question-answering (QA) paradigm (Clarke and Lapata, 2010; Narayan et al., 2018). We created a set of questions based on the gold summary under the assumption that it contains the most important information from the input paragraphs. We then examined whether participants were able to answer these questions by reading system summaries alone, without access to the gold summary. The more questions a system can answer, the better it is at summarization. We created 57 questions in total, varying from two to four questions per gold summary. Examples of questions and their answers are given in Table 5. We adopted the same scoring mechanism used in Clarke and Lapata (2010), i.e., correct answers are marked with 1, partially correct ones with 0.5, and 0 otherwise. A system's score is the average of all question scores.

Model    QA     Rating
Lead     31.59  -0.383
FT       35.69   0.000
T-DMCA   43.14   0.147
HT       54.11   0.237

Table 4: System scores based on questions answered by AMT participants and summary quality rating.
Our second evaluation study assessed the overall quality of the summaries by asking participants to rank them taking into account the following criteria: Informativeness (does the summary convey important facts about the topic in question?), Fluency (is the summary fluent and grammatical?), and Succinctness (does the summary avoid repetition?). We used Best-Worst Scaling (Louviere et al., 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Participants were presented with the gold summary and summaries generated from 3 out of 4 systems and were asked to decide which summary was the best and which one was the worst in relation to the gold standard, taking into account the criteria mentioned above. The rating of each system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. Ratings range from -1 (worst) to 1 (best).
Both evaluations were conducted on the Amazon Mechanical Turk platform with 5 responses per HIT. Participants evaluated summaries produced by the Lead baseline, the Flat Transformer, T-DMCA, and our Hierarchical Transformer. All evaluated systems were variants that achieved the best performance in automatic evaluations. As shown in Table 4, on both evaluations, participants overwhelmingly prefer our model (HT). All pairwise comparisons among systems are statistically significant (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). Examples of system output are provided in Table 5.
Pentagoet Archeological District

GOLD: The Pentagoet Archeological District is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Castine, Maine. It is the site of Fort Pentagoet, a 17th-century fortified trading post established by fur traders of French Acadia. From 1635 to 1654 this site was a center of trade with the local Abenaki, and marked the effective western border of Acadia with New England. From 1654 to 1670 the site was under English control, after which it was returned to France by the Treaty of Breda. The fort was destroyed in 1674 by Dutch raiders. The site was designated a National Historic Landmark in 1993. It is now a public park.

QA: What is the Pentagoet Archeological District? [a National Historic Landmark District] Where is it located? [Castine, Maine] What did the Abenaki Indians use the site for? [trading center]

LEAD: The Pentagoet Archeological District is a National Historic Landmark District located in Castine, Maine. This district forms part of the traditional homeland of the Abenaki Indians, in particular the Penobscot tribe. In the colonial period, Abenakis frequented the fortified trading post at this site, bartering moosehides, sealskins, beaver and other furs in exchange for European commodities. "Pentagoet Archeological district" is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Treaty Of Breda.

FT: the Pentagoet Archeological district is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Treaty Of Breda. It was listed on the national register of historic places in 1983.

T-DMCA: The Pentagoet Archeological District is a national historic landmark district located in castine, maine. this district forms part of the traditional homeland of the abenaki indians, in particular the Penobscot tribe. The district was listed on the national register of historic places in 1982.

HT: The Pentagoet Archeological district is a National Historic Landmark District located in Castine, Maine. This district forms part of the traditional homeland of the Abenaki Indians, in particular the Penobscot tribe. In the colonial period, Abenaki frequented the fortified trading post at this site, bartering moosehides, sealskins, beaver and other furs in exchange for European commodities.

Melanesian Whistler

GOLD: The Melanesian whistler or Vanuatu whistler (Pachycephala chlorura) is a species of passerine bird in the whistler family Pachycephalidae. It is found on the Loyalty Islands, Vanuatu, and Vanikoro in the far south-eastern Solomons.

QA: What is the Melanesian Whistler? [a species of passerine bird in the whistler family Pachycephalidae] Where is it found? [Loyalty Islands, Vanuatu, and Vanikoro in the far south-eastern Solomons]

LEAD: The Australian golden whistler (Pachycephala pectoralis) is a species of bird found in forest, woodland, mallee, mangrove and scrub in Australia (except the interior and most of the north). Most populations are resident, but some in south-eastern Australia migrate north during the winter.

FT: The Melanesian whistler (P. Caledonica) is a species of bird in the family Muscicapidae. It is endemic to Melanesia.

T-DMCA: The Australian golden whistler (Pachycephala chlorura) is a species of bird in the family Pachycephalidae, which is endemic to Fiji.

HT: The Melanesian whistler (Pachycephala chlorura) is a species of bird in the family Pachycephalidae, which is endemic to Fiji.

Table 5: GOLD human-authored summaries, questions based on them (answers shown in square brackets), and automatic summaries produced by the LEAD-3 baseline, the Flat Transformer (FT), T-DMCA (Liu et al., 2018), and our Hierarchical Transformer (HT).
6 Conclusions
In this paper we conceptualized abstractive multi-document summarization as a machine learning problem. We proposed a new model which is able to encode multiple input documents hierarchically, learn latent relations across them, and additionally incorporate structural information from well-known graph representations. We have also demonstrated the importance of a learning-based approach for selecting which documents to summarize. Experimental results show that our model produces summaries which are both fluent and informative, outperforming competitive systems by a wide margin. In the future we would like to apply our hierarchical transformer to question answering and related textual inference tasks.
Acknowledgments
We would like to thank Laura Perez-Beltrachini for her help with preprocessing the dataset. This research is supported by a Google PhD Fellowship to the first author. The authors gratefully acknowledge the financial support of the European Research Council (award number 681760).
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California.

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–327.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1587–1597, Beijing, China.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana.

Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2013. Towards coherent multi-document summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1173, Atlanta, Georgia. Association for Computational Linguistics.

Eric Chu and Peter J. Liu. 2018. Unsupervised neural multi-document abstractive summarization. arXiv preprint arXiv:1810.05739.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured neural summarization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, Louisiana.

Katja Filippova and Michael Strube. 2008. Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 177–185, Honolulu, Hawaii.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 465–470, Vancouver, Canada.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Logan Lebanoff and Fei Liu. 2018. Automatic detection of vague words and sentences in privacy policies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3508–3517, Brussels, Belgium.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.

Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. 2016. An unsupervised multi-document summarization framework based on neural document model. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1514–1523, Osaka, Japan.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana.

Vlad Niculae, André F. T. Martins, and Claire Cardie. 2018. Towards dynamic computation graphs via sparse latent structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905–911, Brussels, Belgium.

Daraksha Parveen and Michael Strube. 2014. Multi-document summarization using bipartite graphs. In Proceedings of TextGraphs-9: the Workshop on Graph-based Methods for Natural Language Processing, pages 15–24, Doha, Qatar.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Dragomir Radev. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. In 1st SIGdial Workshop on Discourse and Dialogue, pages 74–83, Hong Kong, China.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Xiaojun Wan. 2008. An exploration of document impact on graph-based multi-document summarization. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 755–762, Honolulu, Hawaii.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 452–462, Vancouver, Canada.

Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Adapting neural single-document summarization model for abstractive multi-document summarization: A pilot study. In Proceedings of the International Conference on Natural Language Generation.
A Appendix
We describe here how the similarity and discourse graphs discussed in Section 3.2.4 were created. These graphs were added to the hierarchical transformer model as a means to enhance summary quality (see Section 5 for details).
A.1 Similarity Graph
The similarity graph S is based on tf-idf cosine similarity. The nodes of the graph are paragraphs. We first represent each paragraph p_i as a bag of words. Then, we calculate the tf-idf value v_{ik} for each token t_{ik} in a paragraph:

v_{ik} = N_w(t_{ik}) log(N_d / N_{dw}(t_{ik}))    (25)

where N_w(t) is the count of word t in the paragraph, N_d is the total number of paragraphs, and N_{dw}(t) is the total number of paragraphs containing the word. We thus obtain a tf-idf vector for each paragraph. Then, for all paragraph pairs <p_i, p_{i'}>, we calculate the cosine similarity of their tf-idf vectors and use this as the weight S_{ii'} for the edge connecting the pair in the graph. We remove edges with weights lower than 0.2.
A.2 Discourse Graphs
To build the Approximate Discourse Graph (ADG) D, we follow Christensen et al. (2013) and Yasunaga et al. (2017). The original ADG makes use of several complex features. Here, we create a simplified version with only two features (nodes in this graph are again paragraphs).
Co-occurring Entities  For each paragraph p_i, we extract a set of entities E_i in the paragraph using the spaCy3 NER recognizer. We only use entities with type {PERSON, NORP, FAC, ORG, GPE, LOC, EVENT, WORK_OF_ART, LAW}. For each paragraph pair <p_i, p_j>, we count e_{ij}, the number of entities with exact match.
Discourse Markers  We use the following 36 explicit discourse markers to identify edges between two adjacent paragraphs in a source webpage:

again, also, another, comparatively, furthermore, at the same time, however, immediately, indeed, instead, to be sure, likewise, meanwhile, moreover, nevertheless, nonetheless, notably, otherwise, regardless, similarly, unlike, in addition, even, in turn, in exchange, in this case, in any event, finally, later, as well, especially, as a result, example, in fact, then, the day before
3 https://spacy.io/api/entityrecognizer
If two paragraphs <p_i, p_{i'}> are adjacent in one source webpage and they are connected with one of the above 36 discourse markers, m_{ii'} is set to 1; otherwise it is 0.

The final edge weight D_{ii'} is the weighted sum of e_{ii'} and m_{ii'}:

D_{ii'} = 0.2 * e_{ii'} + m_{ii'}    (26)
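A simplified sketch of the ADG edge weight of Equation (26), combining entity co-occurrence with a binary adjacent-marker indicator. It assumes spaCy with a pretrained English model for NER; the marker set shown is a deliberately abbreviated subset of the 36 markers above, single-word markers are matched only at the start of the second paragraph, and all names are ours.

```python
import spacy

# Hypothetical abbreviation of the 36 markers listed above.
MARKERS = {"however", "moreover", "meanwhile", "nevertheless", "instead", "finally"}
ENTITY_TYPES = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "EVENT", "WORK_OF_ART", "LAW"}

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def entities(paragraph: str) -> set:
    """Set of entity strings of the allowed types in one paragraph."""
    return {ent.text for ent in nlp(paragraph).ents if ent.label_ in ENTITY_TYPES}

def adg_weight(p_i: str, p_j: str, adjacent: bool) -> float:
    """D_ii' = 0.2 * e_ii' + m_ii' (Eq. 26), with simplified marker detection."""
    e = len(entities(p_i) & entities(p_j))                    # co-occurring entities (exact match)
    words = p_j.split()
    first_word = words[0].lower().strip(",.;") if words else ""
    m = 1.0 if adjacent and first_word in MARKERS else 0.0    # marker at the start of the next paragraph
    return 0.2 * e + m
```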