Journal of Artificial Intelligence Research 34 (2009) 637-674
Submitted 07/08; published 04/09
Sentence Compression as Tree Transduction
Trevor Cohn [email protected]
Mirella Lapata [email protected]
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
Abstract
This paper presents a tree-to-tree transduction method for sentence compression. Our model is based on synchronous tree substitution grammar, a formalism that allows local distortion of the tree topology and can thus naturally capture structural mismatches. We describe an algorithm for decoding in this framework and show how the model can be trained discriminatively within a large margin framework. Experimental results on sentence compression bring significant improvements over a state-of-the-art model.
1. Introduction
Recent years have witnessed increasing interest in text-to-text generation methods for many natural language processing applications, ranging from text summarisation to question answering and machine translation. At the heart of these methods lies the ability to perform rewriting operations. For instance, text simplification identifies which phrases or sentences in a document will pose reading difficulty for a given user and substitutes them with simpler alternatives (Carroll, Minnen, Pearce, Canning, Devlin, & Tait, 1999; Chandrasekar & Srinivas, 1996). In question answering, questions are often paraphrased in order to achieve more flexible matching with potential answers (Lin & Pantel, 2001; Hermjakob, Echihabi, & Marcu, 2002). Another example concerns reformulating written language so as to render it more natural sounding for speech synthesis applications (Kaji, Okamoto, & Kurohashi, 2004).
Sentence compression is perhaps one of the most popular text-to-text rewriting methods. The aim is to produce a summary of a single sentence that retains the most important information while remaining grammatical (Jing, 2000). The appeal of sentence compression lies in its potential for summarization and more generally for document compression, e.g., for displaying text on small screens such as mobile phones or PDAs (Vandeghinste & Pan, 2004). Much of the current work in the literature focuses on a simplified formulation of the compression task which does not allow any rewriting operations other than word deletion. Given an input source sentence of words x = x₁, x₂, …, xₙ, a target compression y is formed by removing any subset of these words (Knight & Marcu, 2002).
Despite being restricted to word deletion, the compression task remains challenging from a modeling perspective. Figure 1 illustrates a source sentence and its target compression taken from one of the compression corpora used in our experiments (see Section 5 for details). In this case, a hypothetical compression system must apply a series of rewrite rules in order
[Parse trees omitted: (a) the source tree for "exactly what records made it and which ones are involved" and (b) the target tree for "what records are involved".]
Figure 1: Example of sentence compression showing the source and target trees. The bold source nodes show the terminals that need to be removed to produce the target string.
[Tree-fragment pairs for rules (1)–(5) omitted.]
Figure 2: Example transduction rules, each displayed as a pair of tree fragments. The left (source) fragment is matched against a node in the source tree, and the matching part is then replaced by the right (target) fragment. Dotted lines denote variable correspondences, and ε denotes node deletion.
to obtain the target, e.g., delete the leaf nodes "exactly" and "and", delete the subtrees "made it" and "which ones", and merge the subtrees corresponding to "records" and "are involved". More concretely, the system must have access to rules like those shown in Figure 2. The rules are displayed as a pair of tree fragments where the left fragment corresponds to the source and the right to the target. For instance, rule (1) states that a wh-noun phrase (WHNP) consisting of an adverb (RB) and a wh-pronoun (WP) (e.g., "exactly what") can be rewritten as just a wh-pronoun (without the adverb). There are two things to note here. First, syntactic information plays an important role, since deletion decisions are not limited to individual words but often span larger constituents. Secondly, there can be a large number of compression rules of varying granularity and complexity (see rule (5) in Figure 2).
Previous solutions to the compression problem have been cast mostly in a supervised learning setting (for unsupervised methods see Clarke & Lapata, 2008; Hori & Furui, 2004; Turner & Charniak, 2005). Sentence compression is often modeled in a generative framework
where the aim is to estimate the joint probability P(x, y) of source sentence x having the target compression y (Knight & Marcu, 2002; Turner & Charniak, 2005; Galley & McKeown, 2007). These approaches essentially learn rewrite rules similar to those shown in Figure 2 from a parsed parallel corpus and subsequently use them to find the best compression from the set of all possible compressions for a given sentence. Other approaches model compression discriminatively as subtree deletion (Riezler, King, Crouch, & Zaenen, 2003; Nguyen, Horiguchi, Shimazu, & Ho, 2004; McDonald, 2006).
Despite differences in formulation, existing models are specifically designed with sentence compression in mind and are not generally applicable to other tasks requiring more complex rewrite operations such as substitutions, insertions, or reordering. A common assumption underlying previous work is that the tree structures representing the source sentences and their target compressions are isomorphic, i.e., there exists an edge-preserving bijection between the nodes in the two trees. This assumption is valid for sentence compression but does not hold for other rewriting tasks. Consequently, sentence compression models are too restrictive; they cannot be readily adapted to other generation problems since they are not able to handle structural and lexical divergences. A related issue concerns the deletion operations themselves which often take place without considering the structure of the target compression (the goal is to generate a compressed string rather than the tree representing it). Without a syntax-based language model (Turner & Charniak, 2005) or an explicit generation mechanism that licenses tree transformations there is no guarantee that the compressions will have well-formed syntactic structures. And it will not be straightforward to process them for subsequent generation or analysis tasks.
In this paper we present a sentence compression model that is not deletion-specific but can account for ample rewrite operations and scales to other rewriting tasks. We formulate the compression problem as tree-to-tree rewriting using a synchronous grammar (with rules like those shown in Figure 2). Specifically, we adopt the synchronous tree substitution grammar (STSG) formalism (Eisner, 2003) which can model non-isomorphic tree structures while having efficient inference algorithms. We show how such a grammar can be induced from a parallel corpus and propose a discriminative model for the rewriting task which can be viewed as a weighted tree-to-tree transducer. Our learning framework makes use of the large margin algorithm put forward by Tsochantaridis, Joachims, Hofmann, and Altun (2005) which efficiently learns a prediction function to minimize a given loss function. We also develop an appropriate algorithm that can be used in both training (i.e., learning the model weights) and decoding (i.e., finding the most plausible compression under the model). Beyond sentence compression, we hope that some of the work described here might be of relevance to other tasks involving structural matching (see the discussion in Section 8).
The remainder of this paper is structured as follows. Section 2 provides an overview of related work. Section 3 presents the STSG framework and the compression model we employ in our experiments. Section 5 discusses our experimental set-up and Section 6 presents our results. Discussion of future work concludes the paper.
2. Related Work
Synchronous context-free grammars (SCFGs, Aho & Ullman, 1969) are a generalization of the context-free grammar (CFG) formalism to simultaneously produce strings in two
languages. They have been used extensively in syntax-based statistical MT. Examples include inversion transduction grammar (Wu, 1997), head transducers (Alshawi, Bangalore, & Douglas, 2000), hierarchical phrase-based translation (Chiang, 2007), and several variants of tree transducers (Yamada & Knight, 2001; Graehl & Knight, 2004).
Sentence compression bears some resemblance to machine translation. Instead of translating from one language into another, we are translating long sentences into shorter ones within the same language. It is therefore not surprising that previous work has also adopted SCFGs for the compression task. Specifically, Knight and Marcu (2002) proposed a noisy-channel formulation of sentence compression. Their model consists of two components: a language model P(y) whose role is to guarantee that the compression output is grammatical and a channel model P(x|y) capturing the probability that the source sentence x is an expansion of the target compression y. Their decoding algorithm searches for the compression y which maximizes P(y)P(x|y). The channel model is a stochastic SCFG, the rules of which are extracted from a parsed parallel corpus and their weights estimated using maximum likelihood. Galley and McKeown (2007) show how to obtain improved SCFG probability estimates through Markovization. Turner and Charniak (2005) note that SCFG rules are not expressive enough to model structurally complicated compressions as they are restricted to trees of depth 1. They remedy this by supplying their synchronous grammar with a set of more general special rules. For example, they allow rules of the form ⟨NP, NP⟩ → ⟨[NP NP₁ CC NP₂], NP₁⟩ (the subscript indices are added to distinguish between the two NPs).
Our own work formulates sentence compression in the framework of synchronous tree-substitution grammar (STSG, Eisner, 2003). STSG can describe non-isomorphic tree pairs (the grammar rules can comprise trees of arbitrary depth) and is thus suited to text-rewriting tasks which typically involve a number of local modifications to the input text, especially if each modification can be described succinctly in terms of syntactic transformations, such as dropping an adjectival phrase or converting a passive verb phrase into active form. STSG is a restricted version of synchronous tree adjoining grammar (STAG, Shieber & Schabes, 1990) without an adjunction operation. STAG affords mild context sensitivity, however at an increased cost of inference. SCFG and STSG are weakly equivalent, that is, their string languages are identical but they do not produce equivalent tree pairs. For example, in Figure 2, rules (1)–(4) can be expressed as SCFG rules, but rule (5) cannot because both the source and target fragments are two-level trees. In fact it would be impossible to describe the trees in Figure 1 using a SCFG. Our grammar rules are therefore more general than those obtained by Knight and Marcu (2002) and can account for more elaborate tree divergences. Moreover, by adopting a more expressive grammar formalism, we can naturally model syntactically complex compressions without having to specify additional rules (as in Turner & Charniak, 2005).
A synchronous grammar will license a large number of compressions for a given source tree. Each grammar rule typically has a score from which the overall score of a compression y for sentence x can be derived. Previous work estimates these scores generatively as discussed above. We opt for a discriminative training procedure which allows for the incorporation of all manner of powerful features. We use the large margin technique proposed by Tsochantaridis et al. (2005). The framework is attractive in that it supports a configurable loss function, which describes the extent to which a predicted target tree differs from
the reference tree. By devising suitable loss functions the model can be straightforwardly adapted to text rewriting tasks besides sentence compression.
McDonald (2006) also presents a sentence compression model that uses a discriminative large margin algorithm. The model has a rich feature set defined over compression bigrams including parts of speech, parse trees, and dependency information, without however making explicit use of a synchronous grammar. Decoding in this model amounts to finding the combination of bigrams that maximizes a scoring function defined over adjacent words in the compression and the intervening words which were dropped. Our model differs from McDonald's in two important respects. First, we can capture more complex tree transformations that go beyond bigram deletion. Being tree-based, our decoding algorithm is better able to preserve the grammaticality of the compressed output. Second, the tree-based representation allows greater modeling flexibility, e.g., by defining a wide range of loss functions over the tree or its string yield. In contrast, McDonald can only define loss functions over the final compression.
Although the bulk of research on sentence compression relies on parallel corpora for modeling purposes, a few approaches use no training data at all, or only a small amount. An example is the work of Hori and Furui (2004), who propose a model for automatically transcribed spoken text. Their method scores candidate compressions using a language model combined with a significance score (indicating whether a word is topical or not), and a score representing the speech recognizer's confidence in transcribing a given word correctly. Despite being conceptually simple and knowledge lean, their model operates at the word level. Since it does not take syntax into account, it has no means of deleting constituents spanning several subtrees (e.g., relative clauses). Clarke and Lapata (2008) show that such unsupervised models can be greatly improved when linguistically motivated constraints are used during decoding.
3. Problem Formulation
As mentioned earlier, we formulate sentence compression as a tree-to-tree rewriting problem using a weighted synchronous grammar coupled with a large margin training process. Our model learns from a parallel corpus of input (uncompressed) and output (compressed) pairs (x₁, y₁), …, (xₙ, yₙ) to predict a target labeled tree y from a source labeled tree x. We capture the dependency between x and y as a weighted STSG which we define in the following section. Section 3.2 discusses how we extract such a grammar from a parallel corpus. Each rule has a score, as does each ngram in the output tree, from which the overall score of a compression y for sentence x can be derived. We introduce our scoring function in Section 3.3 and explain our training algorithm in Section 3.5. In this framework decoding amounts to finding the best target tree licensed by the grammar given a source tree. We present a chart-based decoding algorithm in Section 3.4.
3.1 Synchronous Grammar
A synchronous grammar defines a space of valid source and target tree pairs, much as a regular grammar defines a space of valid trees. Synchronous grammars can be treated as tree transducers by reasoning over the space of possible sister trees for a given tree, that is, all the trees which can be produced alongside the given tree. This is essentially a transducer
Algorithm 1 Generative process for creating a pair of trees.
initialize source tree, x = RS
initialize target tree, y = RT
initialize stack of frontier nodes, F = [(RS, RT)]
for all node pairs, (vS, vT) ∈ F do
  choose a rule ⟨vS, vT⟩ → ⟨α, β, γ⟩
  rewrite node vS in x as α
  rewrite node vT in y as β
  for all variables, u ∈ γ do
    find aligned child nodes, (cS, cT), under vS and vT corresponding to u
    push (cS, cT) on to F
  end for
end for
x and y are now complete
which takes a tree as input and produces a tree as output. The grammar rules specify the steps taken by the transducer in recursively mapping tree fragments of the input tree into fragments in the target tree. From the many families of synchronous grammars (see Section 2), we elect to use a synchronous tree-substitution grammar (STSG). This is one of the simpler formalisms, and consequently has efficient inference algorithms, while still being complex enough to model a rich suite of tree edit operations.
A STSG is a 7-tuple, G = (N_S, N_T, Σ_S, Σ_T, P, R_S, R_T), where N are the non-terminals and Σ are the terminals, with the subscripts S and T indicating source and target respectively, P are the productions and R_S ∈ N_S and R_T ∈ N_T are the distinguished root symbols. Each production is a rewrite rule for two aligned non-terminals X ∈ N_S and Y ∈ N_T in the source and target:

⟨X, Y⟩ → ⟨α, β, γ⟩    (1)

where α and β are elementary trees rooted with the symbols X and Y respectively. Note that a synchronous context free grammar (SCFG) limits α and β to one-level elementary trees, but is otherwise identical to a STSG, which imposes no such limits. Non-terminal leaves of the elementary trees are referred to as frontier nodes or variables. These are the points of recursion in the transductive process. A one-to-one alignment between the frontier nodes in α and β is specified by γ. The alignment can represent deletion (or insertion) by aligning a node with the special symbol ε, which indicates that the node is not present in the other tree. Only nodes in α can be aligned to ε, which allows for subtrees to be deleted during transduction. We disallow the converse, ε-aligned nodes in β, as these would license unlimited insertion in the target tree, independently of the source tree. This capability would be of limited use for sentence compression, while also increasing the complexity of inference.
The grammar productions can be used in a generative setting to produce pairs of trees, or in a transductive setting to produce a target tree when given a source tree. Algorithms 1 and 2 present pseudo-code for both processes. The generative process (Algorithm 1) starts with the two root symbols and applies a production which rewrites the symbols as the production's elementary trees. These elementary trees might contain frontier nodes, in
Algorithm 2 The transduction of a source tree into a target tree.
Require: complete source tree, x, with root node labeled RS
initialize target tree, y = RT
initialize stack of frontier nodes, F = [(root(x), RT)]
for all node pairs, (vS, vT) ∈ F do
  choose a rule ⟨vS, vT⟩ → ⟨α, β, γ⟩ where α matches the sub-tree rooted at vS in x
  rewrite vT as β in y
  for all variables, u ∈ γ do
    find aligned child nodes, (cS, cT), under vS and vT corresponding to u
    push (cS, cT) on to F
  end for
end for
y is now complete
which case the aligned pairs of frontier nodes are pushed on to the stack, and later rewritten using another production. The process continues in a recursive fashion until the stack is empty (there are no frontier nodes remaining), at which point the two trees are complete. The sequence of rewrite rules is referred to as a derivation, from which the source and target tree can be recovered deterministically.
Our model uses a STSG in a transductive setting, where the source tree is given and it is only the target tree that is generated. This necessitates a different rewriting process, as shown in Algorithm 2. We start with the source tree, and RT, the target root symbol, which is aligned to the root node of the source, denoted root(x). Then we choose a production to rewrite the pair of aligned non-terminals such that the production's source side, α, matches the source tree. The target symbol is then rewritten using β. For each variable in γ the matching node in the source and its corresponding leaf node in the target tree are pushed on to the stack for later processing.¹ The process repeats until the stack is empty, and therefore the source tree has been covered. We now have a complete target tree. As before we use the term derivation to refer to this sequence of production applications. The target string is the yield of the target tree, given by reading the terminals from the tree in a left to right manner.
Let us consider again the compression example from Figure 1. The tree editing rules from Figure 2 are encoded as STSG productions in Figure 3 (see rules (1)–(5)). Production (1) reproduces tree pair (1) from Figure 2, production (2) tree pair (2), and so on. The notation in Figure 3 (primarily for space reasons) uses brackets ([]) to indicate constituent boundaries. Brackets surround a constituent's non-terminal and its child nodes, which can each be terminals, non-terminals or bracketed subtrees. The subscript indices are short-hand notation for the alignment, γ. For example, in rule (1) they specify that the two WP non-terminals are aligned and the RB node occurs only in the source tree (i.e., heads a deleted sub-tree). The grammar rules allow for differences in non-terminal category between the source and target, as seen in rules (2)–(4). They also allow arbitrarily deep elementary trees,
1. Special care must be taken for ε-aligned variables. Nodes in α which are ε-aligned signify that the source sub-tree below this point can be deleted without affecting the target tree. For this reason we can safely ignore source nodes deleted in this manner.
Rules which perform major tree edits
(1) ⟨WHNP, WHNP⟩ → ⟨[WHNP RB WP₁], [WHNP WP₁]⟩
(2) ⟨S, NP⟩ → ⟨[S NP₁ VP], NP₁⟩
(3) ⟨S, VP⟩ → ⟨[S NP VP₁], VP₁⟩
(4) ⟨S, VP⟩ → ⟨[S WHNP S₁], VP₁⟩
(5) ⟨S, S⟩ → ⟨[S [S WHNP₁ S₂] [CC and] S₃], [S WHNP₁ [S NP₂ VP₃]]⟩

Rules which preserve the tree structure
(6) ⟨WP, WP⟩ → ⟨[WP what], [WP what]⟩
(7) ⟨NP, NP⟩ → ⟨[NP NNS₁], [NP NNS₁]⟩
(8) ⟨NNS, NNS⟩ → ⟨[NNS records], [NNS records]⟩
(9) ⟨VP, VP⟩ → ⟨[VP VBP₁ VP₂], [VP VBP₁ VP₂]⟩
(10) ⟨VBP, VBP⟩ → ⟨[VBP are], [VBP are]⟩
(11) ⟨VP, VP⟩ → ⟨[VP VBN₁], [VP VBN₁]⟩
(12) ⟨VBN, VBN⟩ → ⟨[VBN involved], [VBN involved]⟩
Figure 3: The rules in a Synchronous Tree Substitution Grammar (STSG) capable of generating the sentence pair from Figure 1. Equivalently, this grammar defines a transducer which can convert the source tree (Figure 1(a)) into the target tree (Figure 1(b)). Each rule rewrites a pair of non-terminals into a pair of subtrees, shown in bracketed notation.
as evidenced by rule (5) which has trees of depth two. Rules (6)–(12) complete the toy grammar which describes the tree pair from Figure 1. These rules copy parts of the source tree into the target, be they terminals (e.g., rule (6)) or internal nodes with children (e.g., rule (9)).
Figure 4 shows how this grammar can be used to transduce the source tree into the target tree from Figure 1. The first few steps of the derivation are also shown graphically in Figure 5. We start with the source tree, and seek to transduce its root symbol into the target root symbol, denoted S/S. The first rule to be applied is rule (5) in Figure 3; its source side, α = [S [S WHNP S] [CC and] S], matches the root of the source tree and it has the requisite target category, Y = S. The matching part of the source tree is rewritten using the rule's target elementary tree, β = [S WHNP [S NP VP]]. The three variables are now annotated to reflect the category transformations required for each node, WHNP/WHNP, S/NP and S/VP. The process now continues for the leftmost of these nodes, labeled WHNP/WHNP. Rule (1) (from Figure 3) is then applied, which deletes the node's left child, shown as RB/ε, and retains its right child. The subsequent rule completes the transduction of the WHNP node by matching the string "what". The algorithm continues to visit each variable node and finishes when there are no variable nodes remaining, resulting in the desired target tree.
3.2 Grammar
The previous section outlined the STSG formalism we employ in our sentence compression model, save one important detail: the grammar itself. For example, we could obtain a
[Derivation steps omitted: starting from the source tree [S/S [S [WHNP exactly what] [S [NP records] [VP made it]]] [CC and] [S [WHNP which] [S [NP ones] [VP are involved]]]], the rules (5), (1), (6), (2), (8), (4), (3), (9), (10), (11) and (12) are applied in turn, ending with the target tree [S [WHNP [WP what]] [S [NP [NNS records]] [VP [VBP are] [VP [VBN involved]]]]].]
Figure 4: Derivation of the example sentence pair from Figure 1. Each line shows a rewrite step, denoted →ᵢ where the subscript i identifies which rule was used. The frontier nodes are shown in bold with X/Y indicating that symbol X must be transduced into Y in subsequent steps. For the sake of clarity, some internal nodes have been omitted.
synchronous grammar by hand, automatically from a corpus, or by some combination. Our only requirement is that the grammar allows the source trees in the training set to be transduced into their corresponding target trees. For maximum generality, we devised an automatic method to extract a grammar from a parsed, word-aligned parallel compression corpus. The method maps the word alignment into a constituent level alignment between nodes in the source and target trees. Pairs of aligned subtrees are next generalized to create tree fragments (elementary trees) which form the rules of the grammar.
The first step of the algorithm is to find the constituent alignment, which we define as the set of source and target constituent pairs whose yields are aligned to one another under the word alignment. We base our approach on the alignment template method (Och & Ney, 2004), which uses word alignments to define alignments between ngrams (called phrases in the SMT literature). This method finds pairs of ngrams where at least one word in one of the ngrams is aligned to a word in the other, but no word in either ngram is aligned to a word outside the other ngram. In addition, we require that these ngrams are syntactic constituents. More formally, we define constituent alignment as:

C = {(vS, vT) | (∃(s, t) ∈ A : s ∈ Y(vS) ∧ t ∈ Y(vT)) ∧ (∄(s, t) ∈ A : s ∈ Y(vS) ⊻ t ∈ Y(vT))}    (2)

where vS and vT are source and target tree nodes (subtrees), A = {(s, t)} is the set of word alignments (pairs of word-indices), Y(·) returns the yield span for a subtree (the minimum and maximum word index in its yield) and ⊻ is the exclusive-or operator. Figure 6 shows
[Tree diagrams omitted: three stages showing the source tree on the left and the partial target tree being built on the right.]
Figure 5: Graphical depiction of the first two steps of the derivation in Figure 4. The source tree is shown on the left and the partial target tree on the right. Variable nodes are shown in bold face and dotted lines show their alignment.
the word alignment and the constituent alignments that are licensed for the sentence pair from Figure 1.
The next step is to generalize the aligned subtree pairs by replacing aligned child subtrees with variable nodes. For example, in Figure 6 when we consider the pair of aligned subtrees [S which ones are involved] and [VP are involved], we could extract the rule:

⟨S, VP⟩ → ⟨[S [WHNP [WP which]] [S [NP [NNS ones]] [VP [VBP are] [VP [VBN involved]]]]], [VP [VBP are] [VP [VBN involved]]]⟩    (3)
However, this rule is very specific and consequently will not be very useful in a transduction model. In order for it to be applied, we must see the full S subtree, which is highly unlikely to occur in another sentence. Ideally, we should generalize the rule so as to match many more source trees, and thereby allow transduction of previously unseen structures. In the example, the node pairs labeled (VP₁, VP₁), (VBP, VBP), (VP₂, VP₂) and (VBN, VBN) can all be generalized as these nodes are aligned constituents (subscripts added to distinguish
[Diagram omitted: the source tree for "exactly what records made it and which ones are involved" and the target tree for "what records are involved", with their word alignment shown as a binary matrix.]
Figure 6: Tree pair with word alignments shown as a binary matrix. A dark square indicates an alignment between the words on its row and column. The overlaid rectangles show constituent alignments which are inferred from the word alignment.
between the two VP nodes). In addition, the nodes WHNP, WP, NP and NNS in the source are unaligned, and therefore can be generalized using ε-alignment to signify deletion. If we were to perform all possible generalizations for the above example,² we would produce the rule:

⟨S, VP⟩ → ⟨[S WHNP S₁], VP₁⟩    (4)

There are many other possible rules which can be extracted by applying different legal combinations of the generalizations (there are 45 in total for this example).
Algorithm 3 shows how the minimal (most general) rules are extracted.³ This results in the minimal set of synchronous rules which can describe each tree pair.⁴ These rules are minimal in the sense that they cannot be made smaller (e.g., by replacing a subtree with a variable) while still honoring the word-alignment. Figure 7 shows the resulting minimal set of synchronous rules for the example from Figure 6. As can be seen from the example, many of the rules extracted are overly general. Ideally, we would extract every rule with every legal combination of generalizations, however this leads to a massive number of rules, exponential in the size of the source tree. We address this problem by allowing a limited number of generalizations to be skipped in the extraction process. This is equivalent to altering lines 4 and 7 in Algorithm 3 to first make a non-deterministic decision whether to match or ignore the match and continue descending the source tree. The recursion depth limits the number of matches that can be ignored in this way. For example, if we allow one
2. Where some generalizations are mutually exclusive, we take the highest match in the trees.
3. The non-deterministic matching step in line 8 allows the matching of all options individually. This is implemented as a mutually recursive function which replicates the algorithm state to process each different match.
4. Algorithm 3 is an extension of Galley, Hopkins, Knight, and Marcu's (2004) technique for extracting a SCFG from a word-aligned corpus consisting of (tree, string) pairs.
Algorithm 3 extract(x, y, A): extracts minimal rules from constituent-aligned trees
Require: source tree, x, target tree, y, and constituent-alignment, A
1: initialize source and target sides of rule, α = x, β = y
2: initialize frontier alignment, γ = ∅
3: for all nodes vS ∈ α, top-down do
4:   if vS is null-aligned then
5:     γ ← γ ∪ (vS, ε)
6:     delete children of vS
7:   else if vS is aligned to some target node(s) then
8:     choose target node, vT {non-deterministic choice}
9:     call extract(vS, vT, A)
10:    γ ← γ ∪ (vS, vT)
11:    delete children of vS
12:    delete children of vT
13:  end if
14: end for
15: emit rule ⟨root(α), root(β)⟩ → ⟨α, β, γ⟩
level of recursion when extracting rules from the (S, VP) pair from Figure 6, we get the additional rules:

⟨S, VP⟩ → ⟨[S [WHNP WP] S₁], VP₁⟩
⟨S, VP⟩ → ⟨[S WHNP [S NP VP₁]], VP₁⟩

while at two levels of recursion, we also get:

⟨S, VP⟩ → ⟨[S [WHNP [WP which]] S₁], VP₁⟩
⟨S, VP⟩ → ⟨[S [WHNP [WP which]] [S NP VP₁]], VP₁⟩
⟨S, VP⟩ → ⟨[S WHNP [S [NP NNS] VP₁]], VP₁⟩
⟨S, VP⟩ → ⟨[S WHNP [S NP [VP VBP₁ VP₂]]], [VP VBP₁ VP₂]⟩
Compared to rule (4) we can see that the specialized rules above add useful structure and lexicalisation, but are still sufficiently abstract to generalize to new sentences, unlike rule (3). The number of rules is exponential in the recursion depth, but with a fixed depth it is polynomial in the size of the source tree fragment. We set the recursion depth to a small number (one or two) in our experiments.
There is no guarantee that the induced rules will have good coverage on unseen trees. Tree fragments containing previously unseen terminals or non-terminals, or even an unseen sequence of children for a parent non-terminal, cannot be matched by any grammar productions. In this case the transduction algorithm (Algorithm 2) will fail as it has no way of covering the source tree. However, the problem can be easily remedied by adding new rules to the grammar to allow the source tree to be fully covered.⁵ For each node in the
5. There are alternative, equally valid, techniques for improving coverage which simplify the syntax trees. For example, this can be done explicitly by binarizing large productions (e.g., Petrov, Barrett, Thibaux, & Klein, 2006) or implicitly with a Markov grammar over grammar productions (e.g., Collins, 1999).
⟨S, S⟩ → ⟨[S [S WHNP₁ S₂] CC S₃], [S WHNP₁ [S NP₂ VP₃]]⟩
⟨WHNP, WHNP⟩ → ⟨[WHNP RB WP₁], [WHNP WP₁]⟩
⟨WP, WP⟩ → ⟨[WP what], [WP what]⟩
⟨S, NP⟩ → ⟨[S NP₁ VP], NP₁⟩
⟨NP, NP⟩ → ⟨[NP NNS₁], [NP NNS₁]⟩
⟨NNS, NNS⟩ → ⟨[NNS records], [NNS records]⟩
⟨S, VP⟩ → ⟨[S WHNP S₁], VP₁⟩
⟨S, VP⟩ → ⟨[S NP VP₁], VP₁⟩
⟨VP, VP⟩ → ⟨[VP VBP₁ VP₂], [VP VBP₁ VP₂]⟩
⟨VBP, VBP⟩ → ⟨[VBP are], [VBP are]⟩
⟨VP, VP⟩ → ⟨[VP VBN₁], [VP VBN₁]⟩
⟨VBN, VBN⟩ → ⟨[VBN involved], [VBN involved]⟩

Figure 7: The minimal set of STSG rules extracted from the aligned trees in Figure 6.
source tree, a rule is created to copy that node and its child nodes into the target tree. For example, if we see the fragment [NP DT JJ NN] in the source tree, we add the rule:

⟨NP, NP⟩ → ⟨[NP DT₁ JJ₂ NN₃], [NP DT₁ JJ₂ NN₃]⟩

With these rules, each source node is copied into the target tree, and therefore the transduction algorithm can trivially recreate the original tree. Of course, the other grammar rules can work in conjunction with the copying rules to produce other target trees.
While the copy rules solve the coverage problem on unseen data, they do not solve the related problem of under-compression. This occurs when there are unseen CFG productions in the source tree and therefore the only applicable grammar rules are copy rules, which copy all child nodes into the target. None of the child subtrees can be deleted unless the parent node can itself be deleted by a higher-level rule, in which case all the children are deleted. Clearly, it would add considerable modelling flexibility to be able to delete some, but not all, of the children. For this reason, we add explicit deletion rules for each source CFG production which allow subsets of the child nodes to be deleted in a linguistically plausible manner.
The deletion rules attempt to preserve the most important child nodes. We measure importance using the head-finding heuristic from Collins' parser (Appendix A, Collins, 1999). Collins' method finds the single head child of a CFG production using hand-coded tables for each non-terminal type. As we desire a set of child nodes, we run the algorithm to find all matches rather than stopping after the first match. The order in which each match is found is used as a ranking of the importance of each child. The ordered list of child nodes is then used to create synchronous rules which retain head 1, heads 1–2, . . . , all heads.
For the fragment [NP DT JJ NN], the heads are found in the following order (NN, DT, JJ). Therefore we create rules to retain children (NN); (DT, NN); and (DT, JJ, NN):

⟨NP, NP⟩ → ⟨[NP DT JJ NN₁], [NP NN₁]⟩
⟨NP, NN⟩ → ⟨[NP DT JJ NN₁], NN₁⟩
⟨NP, NP⟩ → ⟨[NP DT₁ JJ NN₂], [NP DT₁ NN₂]⟩
⟨NP, NP⟩ → ⟨[NP DT₁ JJ₂ NN₃], [NP DT₁ JJ₂ NN₃]⟩

Note that when only one child remains, the rule is also produced without the parent node, as seen in the second rule above.
3.3 Linear Model
While an STSG defines a transducer capable of mapping a source tree into many possible target trees, it is of little use without some kind of weighting towards grammatical trees which have been constructed using sensible STSG productions and which yield fluent compressed target sentences. Ideally the model would define a scoring function over target trees or strings, however we instead operate on derivations. In general, there may be many derivations which all produce the same target tree, a situation referred to as spurious ambiguity. To fully account for spurious ambiguity would require aggregating all derivations which produce the same target tree. This would break the polynomial-time dynamic program used for inference, rendering the inference problem NP-complete (Knight, 1999). To this end, we define a scoring function over derivations:

score(d; w) = ⟨Ψ(d), w⟩    (5)

where d is a derivation⁶ consisting of a sequence of rules, w are the model parameters, Ψ is a vector-valued feature function and the operator ⟨·, ·⟩ is the inner product. The parameters, w, are learned during training, described in Section 3.5.
The feature function, Ψ, is defined as:

Ψ(d) = Σ_{r∈d} ψ(r, source(d)) + Σ_{m∈ngrams(d)} ψ(m, source(d))    (6)

where r are the rules of a derivation, ngrams(d) are the ngrams in the yield of the target tree and ψ is a feature function returning a vector of feature values for each rule. Note that the feature function has access to not only the rule, r, but also the source tree, source(d); as this is a conditional model, doing so has no overhead in terms of modeling assumptions or the complexity of inference.
In the second summand in (6), m are the ngrams in the yield of the target tree and ψ is a feature function over these ngrams. Traditional (weighted) synchronous grammars only allow features which decompose with the derivation (i.e., can be expressed using the first summand in (6)). However, this is a very limiting requirement, as the ngram features allow the modeling of local coherence and are commonly used in the sentence compression literature (Knight & Marcu, 2002; Turner & Charniak, 2005; Galley & McKeown, 2007;
6. The derivation, d, fully specifies both the source, x = source(d), and the target tree, y = target(d).
Clarke & Lapata, 2008; Hori & Furui, 2004; McDonald, 2006). For instance, when deleting a sub-tree with left and right siblings, it is critical to know not only that the new siblings are in a grammatical configuration, but also that their yield still forms a coherent string. For this reason, we allow ngram features, specifically the conditional log-probability of an ngram language model. Unfortunately, this comes at a price as the ngram features significantly increase the complexity of inference used for training and decoding.
3.4 Decoding
Decoding aims to find the best target tree licensed by the grammar given a source tree. As mentioned above, we deal with derivations in place of target trees. Decoding finds the maximizing derivation, d*:

d* = argmax_{d : source(d)=x} score(d; w)    (7)

where x is the (given) source tree, source(d) extracts the source tree from the derivation d and score is defined in (5). The maximization is performed over the space of derivations for the given source tree, as defined by the transduction process shown in Algorithm 2.
The maximization problem in (7) is solved using the chart-based dynamic program shown in Algorithm 4. This extends earlier inference algorithms for weighted STSGs (Eisner, 2003) which assume that the scoring function must decompose with the derivation, i.e., features apply to rules but not to terminal ngrams. Relaxing this assumption leads to additional complications and increased time and space complexity. This is equivalent to using as our grammar the intersection between the original grammar and an ngram language model, as explained by Chiang (2007) in the context of string transduction with an SCFG.
The algorithm defines a chart, C, to record the best scoring (partial) target tree for each source node vS and with root non-terminal t. The back-pointers, B, record the maximizing rule and store pointers to the child chart cells filling each variable in the rule. The chart is also indexed by the n − 1 terminals at the left and right edges of the target tree's yield to allow scoring of ngram features.⁷ The terminal ngrams provide sufficient context to evaluate ngram features overlapping the cell's boundary when the chart cell is combined in another rule application (this is the operation performed by the boundary-ngrams function on line 15). This is best illustrated with an example. Using trigram features, n = 3, if a node were rewritten as [NP the fast car] then we must store the ngram context (the fast, fast car) in its chart entry. Similarly [VP skidded to a halt] would have ngram context (skidded to, a halt). When applying a parent rule [S NP VP] which rewrites these two trees as adjacent siblings we need to find the ngrams on the boundary between the NP and VP. These can be easily retrieved from the two chart cells' contexts. We combine the right edge of the NP context, "fast car", with the left edge of the VP context, "skidded to", to get the two trigrams "fast car skidded" and "car skidded to". The other trigrams "the fast car", "skidded to a" and "to a halt" will have already been evaluated in the child chart cells. The new combined S chart cell is now given the context (the fast, a halt) by taking the left and right
7. Strictly speaking, only the terminals on the right edge are required for a compression model which would create the target string in a left-to-right manner. However, our algorithm is more general in that it allows reordering rules such as ⟨PP, PP⟩ → ⟨[PP IN₁ NP₂], [PP NP₂ IN₁]⟩. Such rules are required for most other text-rewriting tasks besides sentence compression.
Algorithm 4 Exact chart based decoding algorithm.
Require: complete source tree, x, with root node labeled RS
1: let C[v, t, l] ∈ ℝ be a chart representing the score of the best derivation transducing the tree rooted at v to a tree with root category t and ngram context l
2: let B[v, t, l] ∈ (P, x × NT × L) be the corresponding back-pointers, each consisting of a production and the source node, target category and ngram context for each of the production's variables
3: initialize chart, C[·, ·, ·] = −∞
4: initialize back-pointers, B[·, ·, ·] = none
5: for all source nodes, vS ∈ x, bottom-up do
6:   for all rules, r = ⟨vS, Y⟩ → ⟨α, β, γ⟩, where α matches the sub-tree rooted at vS do
7:     let m be the target ngrams wholly contained in β
8:     let features vector, Ψ ← ψ(r, x) + ψ(m, x)
9:     let l be an empty ngram context
10:    let score, q ← 0
11:    for all variables, u ∈ γ do
12:      find source child node, cu, under vS corresponding to u
13:      let tu be the non-terminal for the target child node under β corresponding to u
14:      choose child chart entry, qu = C[cu, tu, lu] {non-deterministic choice of lu}
15:      let m ← boundary-ngrams(r, lu)
16:      update features, Ψ ← Ψ + ψ(m, x)
17:      update ngram context, l ← merge-ngram-context(l, lu)
18:      update score, q ← q + qu
19:    end for
20:    update score, q ← q + ⟨Ψ, w⟩
21:    if q > C[vS, Y, l] then
22:      update chart, C[vS, Y, l] ← q
23:      update back-pointers, B[vS, Y, l] ← (r, {(cu, tu, lu) ∀u})
24:    end if
25:  end for
26: end for
27: find best root chart entry, l* ← argmax_l C[root(x), RT, l]
28: create derivation, d*, by traversing back-pointers from B[root(x), RT, l*]
edges of the two child cells. This merging process is performed by the merge-ngram-context function on line 17. Finally we add an artificial root node to the target tree with n − 1 artificial start terminals and one end terminal. This allows the ngram features to be applied over boundary ngrams at the beginning and end of the target string.
The decoding algorithm processes the source tree in a post-order traversal, finding the set of possible trees and their ngram contexts for each source node and inserting these into the chart. The rules which match the node are processed in lines 6–24. The feature vector, Ψ, is calculated on the rule and the ngrams therein (line 8), and for ngrams bordering child cells filling the rule's variables (line 16). Note that the feature vector only includes those features specific to the rule and the boundary ngrams, but not those wholly contained in
the child cell. For this reason the score is the sum of the scores for each child cell (line 18) and the feature vector and the model weights (line 20). The new ngram context, l, is calculated by combining the rule's frontier and the ngram contexts of the child cells (line 17). Finally the chart entry for this node is updated if the score betters the previous value (lines 21–24).
When choosing the child chart cell entry in line 14, there can be many different entries each with a different ngram context, lu. This affects the ngram features, Ψ, and consequently the ngram context, l, and the score, q, for the rule. The non-determinism means that every combination of child chart entries is chosen for each variable, and these combinations are then evaluated and inserted into the chart. The number of combinations is the product of the number of child chart entries for each variable. This can be bounded by O(|T_T|^{2(n−1)V}) where |T_T| is the size of the target lexicon and V is the number of variables. Therefore the asymptotic time complexity of decoding is O(SR|T_T|^{2(n−1)V}) where S is the number of source nodes and R is the number of matching rules for each node. This high complexity clearly makes exact decoding infeasible, especially so when either n or V are large.
We adopt a popular approach in syntax-inspired machine translation to address this problem (Chiang, 2007). Firstly, we use a beam-search, which limits the number of different ngram contexts stored in each chart cell to a constant, W. This changes the base in the complexity term, leading to an improved O(SRW^V) which is still exponential in the number of variables. In addition, we use Chiang's cube-pruning heuristic to further limit the number of combinations. Cube-pruning uses a heuristic scoring function which approximates the conditional log-probability from an ngram language model with the log-probability from a unigram model.⁸ This allows us to visit the combinations in best-first order under the heuristic scoring function until the beam is filled. The beam is then rescored using the correct scoring function. This can be done cheaply in O(WV) time, leading to an overall time complexity of decoding of O(SRWV). We refer the interested reader to the work of Chiang (2007) for further details.
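A simplified sketch in the spirit of cube pruning: combinations of per-variable candidates are popped from a heap in best-first order under the (heuristic) scores until the beam width W is reached, instead of scoring the full cross-product. The popped beam would then be rescored with the true ngram features, which the heuristic scores ignore.

```python
import heapq

def best_first_combinations(cands, W):
    """Enumerate index tuples over the grid of per-variable candidates
    in best-first order of summed heuristic scores, stopping after W
    combinations. cands[v] is a list of (item, score) pairs sorted by
    descending score."""
    start = (0,) * len(cands)
    heap = [(-sum(c[0][1] for c in cands), start)]
    seen = {start}
    popped = 0
    while heap and popped < W:
        neg, idx = heapq.heappop(heap)
        yield [cands[v][i][0] for v, i in enumerate(idx)], -neg
        popped += 1
        for v in range(len(idx)):             # push the grid neighbours
            i = idx[v] + 1
            if i < len(cands[v]):
                nxt = idx[:v] + (i,) + idx[v + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    delta = cands[v][i][1] - cands[v][idx[v]][1]
                    heapq.heappush(heap, (neg - delta, nxt))

# e.g. two variables, two candidates each, beam of three:
for items, s in best_first_combinations(
        [[("the", -0.1), ("a", -0.5)], [("car", -0.2), ("cab", -1.0)]], 3):
    print(items, s)   # ['the','car'] -0.3; ['a','car'] -0.7; ['the','cab'] -1.1
```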
3.5 Training
We now turn to the problem of how derivations are scored in our model. For a given source tree, the space of sister target trees implied by the synchronous grammar is often very large, and the majority of these trees are ungrammatical or poor compressions. It is the job of the training algorithm to find weights such that the reference target trees have high scores and the many other target trees licensed by the grammar are given lower scores.
As explained in Section 3.3 we define a scoring function over derivations. This function was given in (5) and (7), and is reproduced below:

f(x; w) = argmax_{d : source(d)=x} ⟨w, Ψ(d)⟩    (8)

Equation (8) finds the best scoring derivation, d, for a given source, x, under a linear model. Recall that d is a derivation which generates the source tree x and a target tree. The goal
8. We use the conditional log-probability of an ngram language model as our only ngram feature. In order to use other ngram features, such as binary identity features for specific ngrams, it would first be advisable to construct an approximation which decomposes with the derivation for use in the cube-pruning heuristic.
of the training procedure is to find a parameter vector w which satisfies the condition:

∀i, ∀d : source(d) = xᵢ ∧ d ≠ dᵢ : ⟨w, Ψ(dᵢ) − Ψ(d)⟩ ≥ 0    (9)

where xᵢ, dᵢ are the ith training source tree and reference derivation. This condition states that for all training instances the reference derivation is at least as high scoring as any other derivation. Ideally, we would also like to know the extent to which a predicted target tree differs from the reference tree. For example, a compression that differs from the gold standard with respect to one or two words should be treated differently from a compression that bears no resemblance to it. Another important factor is the length of the compression. Compressions whose length is similar to the gold standard should be preferable to longer or shorter output. A loss function Δ(yᵢ, y) quantifies the accuracy of prediction y with respect to the true output value yᵢ.
There is a plethora of different discriminative training frameworks which can optimize a linear model. Possibilities include perceptron training (Collins, 2002), log-linear optimisation of the conditional log-likelihood (Berger, Pietra, & Pietra, 1996) and large margin methods. We base our training on Tsochantaridis et al.'s (2005) framework for learning Support Vector Machines (SVMs) over structured output spaces, using the SVMstruct implementation.⁹ The framework supports a configurable loss function which is particularly appealing in the context of sentence compression and more generally text-to-text generation. It also has an efficient training algorithm and powerful regularization. The latter is critical for discriminative models with large numbers of features, which would otherwise over-fit the training sample at the expense of generalization accuracy. We briefly summarize the approach below; for a more detailed description we refer the interested reader to the work of Tsochantaridis et al. (2005).
Traditionally SVMs learn a linear classifier that separates two or more classes with the largest possible margin. Analogously, structured SVMs attempt to separate the correct structure from all other structures with a large margin. The learning objective for the structured SVM uses the soft-margin formulation which allows for errors in the training set via the slack variables, ξᵢ:

min_{w,ξ} (1/2)||w||² + (C/n) Σⁿᵢ₌₁ ξᵢ,  ξᵢ ≥ 0    (10)
s.t. ∀i, ∀d : source(d) = xᵢ ∧ d ≠ dᵢ : ⟨w, Ψ(dᵢ) − Ψ(d)⟩ ≥ Δ(dᵢ, d) − ξᵢ

The slack variables, ξᵢ, are introduced here for each training example, xᵢ, and C is a constant that controls the trade-off between training error minimization and margin maximization. Note that slack variables are combined with the loss incurred in each of the linear constraints. This means that a high loss output must be separated by a larger margin than a low loss output, or have a much larger slack variable to satisfy the constraint. Alternatively, the loss function can be used to rescale the slack parameters, in which case the constraints in (10) are replaced with ⟨w, Ψ(dᵢ) − Ψ(d)⟩ ≥ 1 − ξᵢ/Δ(dᵢ, d). Margin rescaling is theoretically less desirable as it is not scale invariant, and therefore requires the tuning of an additional hyperparameter compared to slack rescaling. However, empirical results show
9. http://svmlight.joachims.org/svm_struct.html
little difference between the two rescaling methods (Tsochantaridis et al., 2005). We use margin rescaling for the practical reason that it can be approximated more accurately than can slack rescaling by our chart based inference method.
The optimization problem in (10) is approximated using an algorithm proposed by Tsochantaridis et al. (2005). The algorithm finds a small set of constraints from the full-sized optimization problem that ensures a sufficiently accurate solution. Specifically, it constructs a nested sequence of successively tighter relaxations of the original problem using a (polynomial time) cutting plane algorithm. For each training instance, the algorithm keeps track of the selected constraints defining the current relaxation. Iterating through the training examples, it proceeds by finding the output that most radically violates a constraint. In our case, the optimization crucially relies on finding the derivation which is both high scoring and has high loss compared to the gold standard. This requires finding the maximizer of:
H(d) = Δ(dᵢ, d) − ⟨w, Ψ(dᵢ) − Ψ(d)⟩    (11)
The search for the maximizer of H(d) in (11) can be performed by the decoding algorithm presented in Section 3.4 with some extensions. Firstly, by expanding (11) to H(d) = Δ(dᵢ, d) − ⟨Ψ(dᵢ), w⟩ + ⟨Ψ(d), w⟩ we can see that the second term is constant with respect to d, and thus does not influence the search. The decoding algorithm maximizes the last term, so all that remains is to include the loss function into the search process.
Loss functions which decompose with the rules or target ngrams in the derivation, Δ(dᵢ, d) = Σ_{r∈d} Δ_R(dᵢ, r) + Σ_{n∈ngrams(d)} Δ_N(dᵢ, n), can be easily integrated into the decoding algorithm. This is done by adding the partial loss, Δ_R(dᵢ, r) + Δ_N(dᵢ, n), to each rule's score in line 20 of Algorithm 4 (the ngrams are recovered from the ngram contexts in the same manner used to evaluate the ngram features).
However, many of our loss functions do not decompose with the rules or the ngrams. In order to calculate these losses the chart must be stratified by the loss function's arguments (Joachims, 2005). For example, unigram precision measures the ratio of correctly predicted tokens to total predicted tokens and therefore its loss arguments are the pair of counts, (TP, FP), for true and false positives. They are initialized to (0, 0) and are then updated for each rule used in a derivation. This equates to checking whether each target terminal is in the reference string and incrementing the relevant value. The chart is extended (stratified) to store the loss arguments in the same way that ngram contexts are stored for decoding. This means that a rule accessing a child chart cell can get multiple entries, each with different loss argument values as well as multiple ngram contexts (line 14 in Algorithm 4). The loss argument for a rule application is calculated from the rule itself and the loss arguments of its children. This is then stored in the chart and the back-pointer list (lines 22–23 in Algorithm 4). Although this loss can only be evaluated correctly for complete derivations, we also evaluate the loss on partial derivations as part of the cube-pruning heuristic. Losses with a large space of argument values will be more coarsely approximated by the beam search, which prunes the number of chart entries to a constant size. For this reason, we have focused mainly on simple loss functions which have a relatively small space of argument values, and also use a wide beam during the search (200 unique items or 500 items, whichever comes first).
Algorithm 5 Find the gold standard derivation for a pair of trees (i.e., alignment).
Require: source tree, x, and target tree, y
1: let C[vS, vT] ∈ ℝ be a chart representing the maximum number of rules used to align nodes vS ∈ x and vT ∈ y
2: let B[vS, vT] ∈ (P, x × y) be the corresponding back-pointers, consisting of a production and a pair of aligned nodes for each of the production's variables
3: initialize chart, C[·, ·] = −∞
4: initialize back-pointers, B[·, ·] = none
5: for all source nodes, vS ∈ x, bottom-up do
6:   for all rules, r = ⟨vS, Y⟩ → ⟨α, β, γ⟩, where α matches the sub-tree rooted at vS do
7:     for all target nodes, vT ∈ y, matching β do
8:       let rule count, j ← 1
9:       for all variables, u ∈ γ do
10:        find aligned child nodes, (cS, cT), under vS and vT corresponding to u
11:        update rule count, j ← j + C[cS, cT]
12:      end for
13:      if j greater than previous value in chart then
14:        update chart, C[vS, vT] ← j
15:        update back-pointers, B[vS, vT] ← (r, {(cS, cT) ∀u})
16:      end if
17:    end for
18:  end for
19: end for
20: if C[root(x), root(y)] ≠ −∞ then
21:   success; create derivation by traversing back-pointers from B[root(x), root(y)]
22: end if
In our discussion so far we have assumed that we are given a gold standard derivation, dᵢ, glossing over the issue of how to find it. Spurious ambiguity in the grammar means that there are often many derivations linking the source and target, none of which are clearly correct. We select the derivation using the maximum number of rules, each of which will be small, and therefore should provide maximum generality.¹⁰ This is found using Algorithm 5, a chart-based dynamic program similar to the alignment algorithm for inversion transduction grammars (Wu, 1997). The algorithm has time complexity O(S²R) where S is the size of the larger of the two trees and R is the number of rules which can match a node.
3.6 Loss Functions

The training algorithm described above is highly modular and in theory can support a wide range of loss functions. There is no widely accepted evaluation metric for text compression. A zero-one loss would be straightforward to define but inappropriate for our problem,
as it would always penalize target derivations that differ even slightly from the reference derivation. Ideally, we would like a loss with a wider scoring range that can discriminate between derivations that differ from the reference. Some of these may be good compressions whereas others may be entirely ungrammatical. For this reason we have developed a range of loss functions which draw inspiration from various metrics used for evaluating text-to-text rewriting tasks such as summarization and machine translation.
Loss functions are defined over derivations and can look at any accessible item, including tokens, ngrams and CFG rules. Our first class of loss functions calculates the Hamming distance between unordered bags of items. It measures the number of predicted items that did not appear in the reference, along with a penalty for short output:

    hamming(d, \hat{d}) = FP + \max(l - (TP + FP), 0)    (12)

where TP and FP are the number of true and false positives, respectively, when comparing the predicted target, target(d), with the reference, target(\hat{d}), and l is the length of the reference. We include the second term to penalize overly short output, as otherwise predicting very little or nothing would incur no penalty.
We have created three instantiations of the loss function in (12) over: 1) tokens, 2) ngrams (n ≤ 3), and 3) CFG productions. In each case, the loss argument space is quadratic in the size of the source tree. Our Hamming ngram loss is an attempt at defining a loss function similar to BLEU (Papineni, Roukos, Ward, & Zhu, 2002). The latter is defined over documents rather than individual sentences, and is thus not directly applicable to our problem. Now, since these losses all operate on unordered bags they may reward erroneous predictions; for example, a permutation of the reference tokens will have zero token-loss. This is less of a problem for the CFG and ngram losses whose items overlap, thereby encoding a partial order. Another problem with the loss functions just described is that they do not penalize multiply predicting an item that occurred only once in the reference. This could be a problem for function words which are common in most sentences.
Therefore we developed two additional loss functions which take multiple predictions into account. The first measures the edit distance, i.e., the number of insertions and deletions, between the predicted and the reference compressions, both treated as bags-of-tokens. In contrast to the previous loss functions, it requires the true positive counts to be clipped to the number of occurrences of each type in the reference. The edit distance is given by:

    edit(d, \hat{d}) = p + q - 2 \sum_i \min(p_i, q_i)    (13)

where p and q denote the number of target tokens in the predicted tree, target(d), and the reference, y = target(\hat{d}), respectively, and p_i and q_i are the counts for type i. The loss arguments for the edit distance consist of a vector of counts for each item type in the reference, {p_i, ∀i}. The space of possible values is exponential in the size of the source tree, compared to quadratic for the Hamming losses. Consequently, we expect beam search to result in many more search errors when using the edit distance loss.
Our last loss function is the F1 measure, a harmonic mean between precision and recall, measured over bags-of-tokens. As with the edit distance, its calculation requires the counts to be clipped to the number of occurrences of each terminal type in the reference.
Ref:  [S [WHNP [WP what]] [S [NP [NNS records]] [VP [VBP are] [VP [VBN involved]]]]]
Pred: [S [WHNP [WP what]] [S [NP [NNS ones]] [VP [VBP are] [VBN involved]]]]

Loss             Arguments               Value
Token Hamming    TP = 3, FP = 1          1/4
3-gram Hamming   TP = 8, FP = 5          5/14
CFG Hamming      TP = 8, FP = 1          1/9
Edit distance    p = (1, 0, 1, 1, 1)     2
F1               p = (1, 0, 1, 1, 1)     1/4

Table 1: Loss arguments and values for the example predicted and reference compressions. Note that loss values should not be compared between different loss functions; these values are purely illustrative.
We therefore use the same loss arguments for its calculation. The F1 loss is given by:

    F_1(d, \hat{d}) = 1 - \frac{2 \cdot precision \cdot recall}{precision + recall}    (14)

where precision = \sum_i \min(p_i, q_i) / p and recall = \sum_i \min(p_i, q_i) / q. As F1 shares the same arguments with the edit distance loss, it also has the same exponential space of loss argument values and will consequently be subject to severe pruning during the beam search used in training.
To illustrate the above loss functions, we present an example in Table 1. Here, the prediction (Pred) and reference (Ref) have the same length (4 tokens) and identical syntactic structure, but differ by one word (ones versus records). Correspondingly, there are three correct tokens and one incorrect, which forms the arguments for the token Hamming loss, resulting in a loss of 1/4. The ngram loss is measured for n ≤ 3, and the start and end of the string are padded with special symbols to allow evaluation of the boundary ngrams. The CFG loss records only one incorrect CFG production (the preterminal [NNS ones]) from the total of nine productions. The last two losses use the same arguments: a vector with values for the counts of each reference type. The first four cells correspond to what, records, are and involved; the last cell records all other types. For the example, the edit distance is two (one deletion and one insertion) while the F1 loss is 1/4 (precision and recall are both 3/4).
4. Features

Our feature space is defined over source trees, x, and target derivations, d. We devised two broad classes of features, applying to grammar rules and to ngrams of target terminals. We defined only a single ngram feature, the conditional log-probability of a trigram language model. This was trained on the BNC (100 million words) using the SRI Language Modeling toolkit (Stolcke, 2002), with modified Kneser-Ney smoothing.
For each rule ⟨X, Y⟩ → ⟨α, γ, ∼⟩, we extract features according to the templates detailed below. Our templates give rise to binary indicator features, except where explicitly stated. These features perform a boolean test, returning value 1 when the test succeeds and 0 otherwise. An example rule and its corresponding features are shown in Table 2.
Type: Whether the rule was extracted from the training set, created as a copy rule and/or created as a delete rule. This allows the model to learn a preference for each of the three sources of grammar rules (see row Type in Table 2).

Root: The root categories of the source, X, and target, Y, and their conjunction, X ∧ Y (see rows Root in Table 2).

Identity: The source side, α, target side, γ, and the full rule, ⟨α, γ, ∼⟩. This allows the model to learn weights on individual rules or those sharing an elementary tree. Another feature checks if the rule's source and target elementary trees are identical, α = γ (see rows Identity in Table 2).

Unlexicalised Identity: The identity feature templates above are replicated for unlexicalised elementary trees, i.e., with the terminals removed from their frontiers (see rows UnlexId in Table 2).

Rule count: This feature is always 1, allowing the model to count the number of rules used in a derivation (see row Rule count in Table 2).

Word count: Counts the number of terminals in γ, allowing a global preference for shorter or longer output. Additionally, we record the number of terminals in the source tree, which can be used with the target terminal count to find the number of deleted terminals (see rows Word count in Table 2).

Yield: These features compare the terminal yield of the source, Y(α), and target, Y(γ). The first feature checks the identity of the two sequences, Y(α) = Y(γ). We use identity features for each terminal in both yields, and for each terminal only in the source (see rows Yield in Table 2). We also replicate these feature templates for the sequence of non-terminals on the frontier (pre-terminals or variable non-terminals).

Length: Records the difference in the lengths of the frontiers of α and γ, and whether the target's frontier is shorter than that of the source (see rows Length in Table 2). A sketch of how a few of these templates might look in code is given below.
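As a rough illustration of how such templates could be instantiated, the sketch below maps one rule to a sparse feature dict. The rule object and its attributes (origin, src_root, src_tree, frontiers, yields) are hypothetical; binary templates become string-keyed indicators with value 1, while the count templates are real-valued.

    def rule_features(rule, num_source_terminals):
        """Instantiate a few of the templates above for one rule."""
        f = {}
        f["type=" + rule.origin] = 1                           # Type
        f["rootX=" + rule.src_root] = 1                        # Root
        f["rootY=" + rule.tgt_root] = 1
        f["rootXY=" + rule.src_root + "^" + rule.tgt_root] = 1
        f["rule=" + rule.src_tree + "->" + rule.tgt_tree] = 1  # Identity
        f["rule_count"] = 1                                    # Rule count
        f["target_terminals"] = len(rule.tgt_yield)            # Word count
        f["source_terminals"] = num_source_terminals
        f["length_diff"] = len(rule.src_frontier) - len(rule.tgt_frontier)
        if len(rule.tgt_frontier) < len(rule.src_frontier):    # Length
            f["target_shorter"] = 1
        return f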
The features listed above are defined for all the rules in the grammar. This includes the copy and delete rules, as described in Section 3.2, which were added to address the problem of unseen words or productions in the source trees at test time. Many of these rules cannot be applied to the training set, but will receive some weight because they share features with rules that can be used in training. However, in training the model learns to disprefer these coverage rules as they are unnecessary to model the training set, which can be described perfectly using the extracted transduction rules. Our dual use of the training set for grammar extraction and parameter estimation results in a bias against the coverage rules. The bias could be addressed by extracting the grammar from a separate corpus, in which case the coverage rules would then be useful in modeling both the training set and the testing sets. However, this solution has its own problems, namely that many of the target trees in the training set may no longer be reachable. This bias and its possible solutions form an interesting research problem and deserve further work.
Rule: ⟨NP, NNS⟩ → ⟨[NP CD ADJP [NNS activists]], [NNS activists]⟩

Type         type = training set                                       1
Root         X = NP                                                    1
Root         Y = NNS                                                   1
Root         X = NP ∧ Y = NNS                                          1
Identity     α = [NP CD ADJP [NNS activists]]                          1
Identity     γ = [NNS activists]                                       1
Identity     α = [NP CD ADJP [NNS activists]] ∧ γ = [NNS activists]    1
UnlexId.     unlex. α = [NP CD ADJP NNS]                               1
UnlexId.     unlex. γ = NNS                                            1
UnlexId.     unlex. α = [NP CD ADJP NNS] ∧ γ = NNS                     1
Rule count                                                             1
Word count   target terminals                                          1
Word count   source terminals                                          1
Yield        source = [activists] ∧ target = [activists]               1
Yield        terminal activists in both source and target              1
Yield        non-terms. source = [CD, ADJP, NNS] ∧ target = [NNS]      1
Yield        non-terminal CD in source and not target                  1
Yield        non-terminal ADJP in source and not target                1
Yield        non-terminal NNS in both source and target                1
Length       difference in length                                      2
Length       target shorter                                            1

Table 2: Features instantiated for the synchronous rule shown above. Only features with non-zero values are displayed. The number of source terminals is calculated using the source tree at the time the rule is applied.
5. Experimental Set-up

In this section we present our experimental set-up for assessing the performance of the sentence compression model described above. We give details of the corpora used, briefly introduce McDonald's (2006) model used for comparison with our approach, and explain how system output was evaluated.
5.1 Corpora

We evaluated our system on three publicly available corpora. The first is the Ziff-Davis corpus, a popular choice in the sentence compression literature. The corpus originates from a collection of news articles on computer products. It was created automatically by matching sentences that occur in an article with sentences that occur in an abstract (Knight & Marcu, 2002). The other two corpora11 were created manually; annotators were asked to produce target compressions by deleting extraneous words from the source without changing the word order (Clarke & Lapata, 2008). One corpus was sampled from written sources,
the British National Corpus (BNC) and the American News Text corpus, whereas the other was created from manually transcribed broadcast news stories. We will henceforth refer to these two corpora as CLwritten and CLspoken, respectively. The sizes of these three corpora are shown in Table 3.

Corpus       Articles   Sentences   Training   Development   Testing
CLspoken         50        1370        882          78          410
CLwritten        82        1433        908          63          462
Ziff-Davis       —         1084       1020          32           32

Table 3: Sizes of the various corpora, measured in articles or sentence pairs. The data split into training, development and testing sets is measured in sentence pairs.

11. Available from http://homepages.inf.ed.ac.uk/s0460084/data/.
These three corpora pose different challenges to a hypothetical sentence compression system. Firstly, they are representative of different domains and text genres. Secondly, they have different compression requirements. The Ziff-Davis corpus is more aggressively compressed in comparison to CLspoken and CLwritten (Clarke & Lapata, 2008). As CLspoken is a speech corpus, it often contains incomplete and ungrammatical utterances and speech artefacts such as disfluencies, false starts and hesitations. Its utterances have varying lengths; some are very wordy whereas others cannot be reduced any further. This means that a compression system should leave some sentences uncompressed. Finally, we should note that CLwritten has on average longer sentences than Ziff-Davis or CLspoken. Parsers are more likely to make mistakes on long sentences, which could potentially be problematic for syntax-based systems like the one presented here.
Although our model is capable of performing any editing operation, such as reordering or substitution, it will not learn to do so from the training corpora. These corpora contain only deletions, and therefore the model will not learn transduction rules encoding, e.g., reordering. Instead the rules encode only the deletion and insertion of terminals and the restructuring of internal nodes of the syntax tree. However, the model is capable of general text rewriting, and given the appropriate training set will learn to perform these additional edits. This is demonstrated by our recent results from adapting the model to abstractive compression (Cohn & Lapata, 2008), where any edit is permitted, not just deletion.
Our experiments on CLspoken and CLwritten followed Clarke and Lapata's (2008) partition of training, test, and development sets. The partition sizes are shown in Table 3. In the case of the Ziff-Davis corpus, Knight and Marcu (2002) had not defined a development set. Therefore we randomly selected (and held out) 32 sentence pairs from their training set to form our development set.
5.2 Comparison with State-of-the-Art

We evaluated our results against McDonald's (2006) discriminative model. In this approach, sentence compression is formalized as a classification task: pairs of words from the source sentence are classified as being adjacent or not in the target compression. Let x = x_1, . . . , x_N denote a source sentence with a target compression y = y_1, . . . , y_M, where each y_i occurs in x. The function L(y_i) ∈ {1 . . . N} maps word y_i in the target to the index of the word in
the source (subject to the constraint that L(y_i) < L(y_{i+1})). McDonald defines the score of a compression y for a sentence x as the dot product between a high dimensional feature representation, f, over bigrams and a corresponding weight vector, w:

    score(x, y; w) = \sum_{j=2}^{M} \langle w, f(x, L(y_{j-1}), L(y_j)) \rangle    (15)

Decoding in this framework amounts to finding the combination of bigrams that maximizes the scoring function in (15). The maximization is solved using a semi-Markov Viterbi algorithm (McDonald, 2006).
The model parameters are estimated using the Margin Infused Relaxed Algorithm (MIRA; Crammer & Singer, 2003), a discriminative large-margin online learning technique. McDonald (2006) uses a similar loss function to our Hamming loss (see (12)) but without an explicit length penalty. This loss function counts the number of words falsely retained or dropped in the predicted target relative to the reference. McDonald employs a rich feature set defined over words, parts of speech, phrase structure trees, and dependencies. These are gathered over adjacent words in the compression and the words which were dropped.
Clarke and Lapata (2008) reformulate McDonald's (2006) model in the context of integer linear programming (ILP) and augment it with constraints ensuring that the compressed output is grammatically and semantically well formed. For example, if the source sentence has negation, this must be included in the compression; if the source verb has a subject, this must also be retained in the compression. They generate and solve an ILP for every source sentence using the branch-and-bound algorithm. Since they obtain performance improvements over McDonald's model on several corpora, we also use it for comparison against our model.
To summarize, we believe that McDonald's (2006) model is a good basis for comparison for several reasons. First, it has good performance and can be treated as a state-of-the-art model. Secondly, it is similar to our model in many respects, namely its training algorithm and feature space, but differs in one very important respect: compression is performed on strings and not trees. McDonald's system does make use of syntax trees, but only peripherally via the feature set. In contrast, the syntax tree is an integral part of our model.
5.3 Evaluation

In line with previous work we assessed our model's output by eliciting human judgments. Following Knight and Marcu (2002), we conducted two separate experiments. In the first experiment participants were presented with a source sentence and its target compression and asked to rate how well the compression preserved the most important information from the source sentence. In the second experiment, they were asked to rate the grammaticality of the compressed outputs. In both cases they used a five point rating scale where a high number indicates better performance. We randomly selected 20 sentences from the test portion of each corpus. These sentences were compressed automatically by our system and McDonald's (2006) system. We also included gold standard compressions. Our materials thus consisted of 180 (20 × 3 × 3) source-target sentences. A Latin square design ensured that subjects did not see two different compressions of the same sentence. We collected
ratings from 30 unpaid volunteers, all self-reported native English speakers. Both studies were conducted over the Internet using WebExp,12 a software package for running Internet-based experiments.
We also report results using F1 computed over grammatical relations (Riezler et al., 2003). Although F1 conflates grammaticality and importance into a single score, it nevertheless has been shown to correlate reliably with human judgments (Clarke & Lapata, 2006). Furthermore, it can be usefully employed during development for feature engineering and parameter optimization experiments. We measured F1 over directed and labeled dependency relations. For all models the compressed output was parsed using the RASP dependency parser (Briscoe & Carroll, 2002). Note that we could extract dependencies directly from the output of our model since it generates trees in addition to strings. However, we refrained from doing this in order to compare all models on an equal footing.
6. Results

The framework presented in Section 3 is quite flexible. Depending on the grammar extraction strategy, choice of features, and loss function, different classes of models can be derived. Before presenting our results on the test set we discuss the specific model employed in our experiments and explain how its parameters were instantiated.
6.1 Model Selection

All our parameter tuning and model selection experiments were conducted on the development set of the CLspoken corpus. We obtained syntactic analyses for source and target sentences with Bikel's (2002) parser. The corpus was automatically aligned using an algorithm which finds the set of deletions which transform the source into the target. This is equivalent to the minimum edit distance script when only deletion operations are permitted.
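Because a valid target is a subsequence of the source in this setting, the alignment reduces to a left-to-right scan. The sketch below (a simplification, ignoring tokenisation details) returns the source index of each target token, or None if the pair is not a pure deletion pair.

    def align_deletions(source_tokens, target_tokens):
        alignment, i = [], 0
        for tok in target_tokens:
            while i < len(source_tokens) and source_tokens[i] != tok:
                i += 1                    # this source word was deleted
            if i == len(source_tokens):
                return None               # target is not a subsequence
            alignment.append(i)
            i += 1
        return alignment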
As expected, the predicted parse trees contained a number of errors, although we did not have gold standard trees with which to quantify this error or its effect on prediction output. We did notice, however, that errors in the source trees in the test set did not always negatively affect the performance of the model. In many instances the model was able to recover from these errors and still produce good output compressions. Of these recoveries, most cases involved either deleting the erroneous structure or entirely preserving it. While this often resulted in a poor output tree, the string yield was acceptable in most cases. Less commonly, the model corrected the errors in the source using tree transformation rules. These rules were acquired from the training set where there were errors in the source tree but not in the target tree. For example, one transformation allows a prepositional phrase to be moved from a high VP attachment to an object NP attachment.
We obtained a synchronous tree substitution grammar from the CLspoken corpus using the method described in Section 3.2. We extracted all maximally general synchronous rules. These were complemented with specified rules allowing recursion up to one ancestor for any given node.13 Grammar rules were represented by the features described in Section 4. An important parameter in our modeling framework is the choice of loss function.

12. See http://www.webexp.info/.
13. Rules were pruned so as to have no more than 5 variables and 15 nodes.
Losses              Rating   Std. dev
Hamming (tokens)     3.38      1.05
Hamming (ngram)      3.28      1.13
Hamming (CFG)        3.22      0.91
Edit Distance        3.30      1.20
F1                   3.15      1.13
Reference            4.28      0.70

Table 4: Mean ratings on system output (CLspoken development set) while using different loss functions.
We evaluated the loss functions presented in Section 3.6 as follows. We performed a grid search for the hyper-parameters (a regularization parameter and a feature scaling parameter, which balances the magnitude of the feature vectors with the scale of the loss function)14 which minimized the relevant loss on the development set, and used the corresponding system output. The gold standard derivation was selected using the maximum number of rules heuristic, as described in Section 3.5. The beam was limited to 100 unique items or 200 items in total. The grammar was filtered to allow no more than 50 target elementary trees for every source elementary tree.
We next asked two human judges to rate on a scale of 1 to 5 the system's compressions when optimized for the different loss functions. To get an idea of the quality of the output we also included human-authored reference compressions. Sentences given high numbers were both grammatical and preserved the most important information. The mean ratings are shown in Table 4. As can be seen, the differences among the losses are not very large, and the standard deviation is high. The Hamming loss over tokens performed best with a mean rating of 3.38, closely followed by the edit distance (3.30). We chose the former over the latter as it is less coarsely approximated during search. All subsequent experiments report results using the token-based Hamming loss.
We also wanted to investigate how the synchronous grammar influences performance. The default system described above used general rules together with specialized rules where the recursion depth was limited to one. We also experimented with a grammar that uses specialised rules with a maximum recursion depth of two and a grammar that uses solely the maximally general rules. In Table 5 we report the average compression rate, relations-based F1 and the Hamming loss over tokens for these different grammars. We see that adding the specified rules allows for better F1 (and loss) despite the fact that the search space remains the same. We observe a slight degradation in performance moving to depth 2 rules. This is probably due to the increase in spurious ambiguity affecting search quality, and also allowing greater overfitting of the training data. The number of transduction rules in the grammar also grows substantially with the increased depth, from 20,764 for the maximally general extraction technique to 33,430 and 62,116 for specified rules with depths 1 and 2, respectively.
14. We found that setting the regularization parameter C = 0.01 and the scaling parameter to 1 generally yields good performance across loss functions.
Model                      Compression rate   Relations F1   Loss
max general rules               80.79             65.04       341
depth 1-specified rules*        79.72             68.56       315
depth 2-specified rules         79.71             66.44       328

max rules*                      79.72             68.56       315
max scoring                     81.03             65.54       344

unigram LM                      76.83             59.05       336
bigram LM                       83.12             67.71       317
trigram LM*                     79.72             68.56       315

all features*                   79.72             68.56       315
only rule features              83.06             67.51       346
only token features             85.10             68.31       341

Table 5: Parameter exploration and feature ablation studies (CLspoken development set). The default system is shown with an asterisk.
The growth in grammar size is exponential in the specification depth and therefore only small values should be used.
We also inspected the rules obtained with the maximally general extraction technique to better assess how our rules differ from those obtained from a vanilla SCFG (see Knight & Marcu, 2002). Many of these rules (12%) have deeper structure and therefore would not be licensed by an SCFG. This is due to structural divergences between the source and target syntax trees in the training set. A further 13% of the rules describe a change of syntactic category (X ≠ Y), and therefore only the remaining 76% of the rules would be allowable in Knight and Marcu's transducer. The proportion of SCFG rules decreases substantially as the rule specification depth is increased.
Recall from Section 3.3 that our scoring function is defined over derivations rather than target trees or strings, and that we treat the derivation using the maximum number of rules as the gold standard derivation. As a sanity check, we also experimented with selecting the derivation with the maximum score under the model. The results in Table 5 indicate that the latter strategy is not as effective as selecting the derivation with the maximum number of rules. Again we conjecture this is due to overfitting. As the training data is used to extract the grammar, the derivations with the maximum score may consist of rules with rare features which model the data well but do not generalize to unseen instances.
Finally, we conducted a feature ablation study to assess which features are more useful to our task. We were particularly interested to see if the ngram features would bring any benefit, especially since they increase computational complexity during decoding and training. We experimented with a unigram, bigram, and trigram language model. Note that the unigram language model is not as computationally expensive as the other two models because there is no need to record ngram contexts in the chart. As shown in Table 5, the unigram language model is substantially worse than the bigram and trigram, which deliver similar performances. We also examined the impact of the other features by grouping them into two broad classes, those defined over rules and those defined over tokens. Our aim was to see whether the underlying grammar (represented by rule-based features) contributes
to better compression output. The results in Table 5 reveal that the two feature groups perform comparably. However, the model using only token-based features tends to compress less. These features are highly lexicalized, and the model is not able to generalize well on unseen data. In conclusion, the full feature set does better on all counts than the two ablation sets, with a better compression rate.
The results reported have all been measured over string output. This was done by first stripping the tree structure from the compression output, reparsing, extracting dependency relations and finally comparing to the dependency relations in the reference. However, we may wish to measure the quality of the trees themselves, not just their string yield. A simple way to measure this15 would be to extract dependency relations directly from the phrase-structure tree output.16 Compared to dependencies extracted from the predicted parses using Bikel's (2002) parser on the output string, we observe that the relation F1 score increases uniformly for all tasks, by between 2.50% and 4.15% absolute. Therefore the system's tree output better encodes the syntactic dependencies than the tree resulting from re-parsing the string output. If the system is part of an NLP pipeline, and its output is destined for down-stream processing, then having an accurate syntax tree is extremely important. This is also true for related tasks where the desired output is a tree, e.g., semantic parsing.
7. Model Comparison

In this section we present our results on the test set using the best performing model from the previous section. This model uses a grammar with unlexicalized and lexicalized rules (recursion depth 1), a Hamming loss based on tokens, and all the features from Section 4. The model was trained separately on each corpus (training portion). We first discuss our results using relations F1 and then move on to the human study.
Table 6 illustrates the performance of our model (Transducer1) on CLspoken, CLwritten, and Ziff-Davis. We also report results on the same corpora using McDonald's (2006) model (McDonald) and the improved version (Clarke ILP) put forward by Clarke and Lapata (2008). We also present the compression rate for each system and the reference gold standard. In all cases our tree transducer model outperforms McDonald's original model and the improved ILP-based version.
Nevertheless, it may be argued that our model has an unfair advantage here since it tends to compress less than the other models, and is therefore less likely to make many mistakes. To ensure that this is not the case, we created a version of our model with a c