Journal of Artificial Intelligence Research 34 (2009) 637-674
Submitted 07/08; published 04/09
Sentence Compression as Tree Transduction
Trevor Cohn [email protected]
Mirella Lapata [email protected]
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
Abstract
This paper presents a tree-to-tree transduction method for sentence compression. Our model is based on synchronous tree substitution grammar, a formalism that allows local distortion of the tree topology and can thus naturally capture structural mismatches. We describe an algorithm for decoding in this framework and show how the model can be trained discriminatively within a large margin framework. Experimental results on sentence compression bring significant improvements over a state-of-the-art model.
1. Introduction
Recent years have witnessed increasing interest in text-to-text generation methods for many natural language processing applications, ranging from text summarisation to question answering and machine translation. At the heart of these methods lies the ability to perform rewriting operations. For instance, text simplification identifies which phrases or sentences in a document will pose reading difficulty for a given user and substitutes them with simpler alternatives (Carroll, Minnen, Pearce, Canning, Devlin, & Tait, 1999; Chandrasekar & Srinivas, 1996). In question answering, questions are often paraphrased in order to achieve more flexible matching with potential answers (Lin & Pantel, 2001; Hermjakob, Echihabi, & Marcu, 2002). Another example concerns reformulating written language so as to render it more natural sounding for speech synthesis applications (Kaji, Okamoto, & Kurohashi, 2004).
Sentence compression is perhaps one of the most popular text-to-text rewriting methods. The aim is to produce a summary of a single sentence that retains the most important information while remaining grammatical (Jing, 2000). The appeal of sentence compression lies in its potential for summarization and more generally for document compression, e.g., for displaying text on small screens such as mobile phones or PDAs (Vandeghinste & Pan, 2004). Much of the current work in the literature focuses on a simplified formulation of the compression task which does not allow any rewriting operations other than word deletion. Given an input source sentence of words x = x₁, x₂, …, xₙ, a target compression y is formed by removing any subset of these words (Knight & Marcu, 2002).
Despite being restricted to word deletion, the compression task remains challenging from a modeling perspective. Figure 1 illustrates a source sentence and its target compression taken from one of the compression corpora used in our experiments (see Section 5 for details). In this case, a hypothetical compression system must apply a series of rewrite rules in order
[Parse trees omitted: (a) the source tree for "exactly what records made it and which ones are involved" and (b) the target tree for "what records are involved".]
Figure 1: Example of sentence compression showing the source and target trees. The bold source nodes show the terminals that need to be removed to produce the target string.
[Tree-fragment pairs for rules (1)–(5) omitted.]
Figure 2: Example transduction rules, each displayed as a pair of tree fragments. The left (source) fragment is matched against a node in the source tree, and the matching part is then replaced by the right (target) fragment. Dotted lines denote variable correspondences, and ε denotes node deletion.
to obtain the target, e.g., delete the leaf nodes "exactly" and "and", delete the subtrees "made it" and "which ones", and merge the subtrees corresponding to "records" and "are involved". More concretely, the system must have access to rules like those shown in Figure 2. The rules are displayed as a pair of tree fragments where the left fragment corresponds to the source and the right to the target. For instance, rule (1) states that a wh-noun phrase (WHNP) consisting of an adverb (RB) and a wh-pronoun (WP) (e.g., "exactly what") can be rewritten as just a wh-pronoun (without the adverb). There are two things to note here. First, syntactic information plays an important role, since deletion decisions are not limited to individual words but often span larger constituents. Secondly, there can be a large number of compression rules of varying granularity and complexity (see rule (5) in Figure 2).
Previous solutions to the compression problem have been cast mostly in a supervised learning setting (for unsupervised methods see Clarke & Lapata, 2008; Hori & Furui, 2004; Turner & Charniak, 2005). Sentence compression is often modeled in a generative framework
where the aim is to estimate the joint probability P(x, y) of source sentence x having the target compression y (Knight & Marcu, 2002; Turner & Charniak, 2005; Galley & McKeown, 2007). These approaches essentially learn rewrite rules similar to those shown in Figure 2 from a parsed parallel corpus and subsequently use them to find the best compression from the set of all possible compressions for a given sentence. Other approaches model compression discriminatively as subtree deletion (Riezler, King, Crouch, & Zaenen, 2003; Nguyen, Horiguchi, Shimazu, & Ho, 2004; McDonald, 2006).
Despite differences in formulation, existing models are specifically designed with sentence compression in mind and are not generally applicable to other tasks requiring more complex rewrite operations such as substitutions, insertions, or reordering. A common assumption underlying previous work is that the tree structures representing the source sentences and their target compressions are isomorphic, i.e., there exists an edge-preserving bijection between the nodes in the two trees. This assumption is valid for sentence compression but does not hold for other rewriting tasks. Consequently, sentence compression models are too restrictive; they cannot be readily adapted to other generation problems since they are not able to handle structural and lexical divergences. A related issue concerns the deletion operations themselves which often take place without considering the structure of the target compression (the goal is to generate a compressed string rather than the tree representing it). Without a syntax-based language model (Turner & Charniak, 2005) or an explicit generation mechanism that licenses tree transformations there is no guarantee that the compressions will have well-formed syntactic structures. And it will not be straightforward to process them for subsequent generation or analysis tasks.
In this paper we present a sentence compression model that is not deletion-specific but can account for ample rewrite operations and scales to other rewriting tasks. We formulate the compression problem as tree-to-tree rewriting using a synchronous grammar (with rules like those shown in Figure 2). Specifically, we adopt the synchronous tree substitution grammar (STSG) formalism (Eisner, 2003) which can model non-isomorphic tree structures while having efficient inference algorithms. We show how such a grammar can be induced from a parallel corpus and propose a discriminative model for the rewriting task which can be viewed as a weighted tree-to-tree transducer. Our learning framework makes use of the large margin algorithm put forward by Tsochantaridis, Joachims, Hofmann, and Altun (2005) which efficiently learns a prediction function to minimize a given loss function. We also develop an appropriate algorithm that can be used in both training (i.e., learning the model weights) and decoding (i.e., finding the most plausible compression under the model). Beyond sentence compression, we hope that some of the work described here might be of relevance to other tasks involving structural matching (see the discussion in Section 8).
The remainder of this paper is structured as follows. Section 2 provides an overview of related work. Section 3 presents the STSG framework and the compression model we employ in our experiments. Section 5 discusses our experimental set-up and Section 6 presents our results. Discussion of future work concludes the paper.
2. Related Work
Synchronous context-free grammars (SCFGs, Aho & Ullman, 1969) are a generalization of the context-free grammar (CFG) formalism to simultaneously produce strings in two
languages. They have been used extensively in syntax-based statistical MT. Examples include inversion transduction grammar (Wu, 1997), head transducers (Alshawi, Bangalore, & Douglas, 2000), hierarchical phrase-based translation (Chiang, 2007), and several variants of tree transducers (Yamada & Knight, 2001; Graehl & Knight, 2004).
Sentence compression bears some resemblance to machine translation. Instead of translating from one language into another, we are translating long sentences into shorter ones within the same language. It is therefore not surprising that previous work has also adopted SCFGs for the compression task. Specifically, Knight and Marcu (2002) proposed a noisy-channel formulation of sentence compression. Their model consists of two components: a language model P(y) whose role is to guarantee that the compression output is grammatical and a channel model P(x|y) capturing the probability that the source sentence x is an expansion of the target compression y. Their decoding algorithm searches for the compression y which maximizes P(y)P(x|y). The channel model is a stochastic SCFG, the rules of which are extracted from a parsed parallel corpus and their weights estimated using maximum likelihood. Galley and McKeown (2007) show how to obtain improved SCFG probability estimates through Markovization. Turner and Charniak (2005) note that SCFG rules are not expressive enough to model structurally complicated compressions as they are restricted to trees of depth 1. They remedy this by supplying their synchronous grammar with a set of more general special rules. For example, they allow rules of the form ⟨NP, NP⟩ → ⟨[NP NP₁ CC NP₂], NP₁⟩ (the subscript indices are added to distinguish between the two NPs).
Our own work formulates sentence compression in the framework of synchronous tree-substitution grammar (STSG, Eisner, 2003). STSG can describe non-isomorphic tree pairs (the grammar rules can comprise trees of arbitrary depth) and is thus suited to text-rewriting tasks which typically involve a number of local modifications to the input text, especially if each modification can be described succinctly in terms of syntactic transformations, such as dropping an adjectival phrase or converting a passive verb phrase into active form. STSG is a restricted version of synchronous tree adjoining grammar (STAG, Shieber & Schabes, 1990) without an adjunction operation. STAG affords mild context sensitivity, however at an increased cost of inference. SCFG and STSG are weakly equivalent, that is, their string languages are identical but they do not produce equivalent tree pairs. For example, in Figure 2, rules (1)–(4) can be expressed as SCFG rules, but rule (5) cannot because both the source and target fragments are two-level trees. In fact it would be impossible to describe the trees in Figure 1 using a SCFG. Our grammar rules are therefore more general than those obtained by Knight and Marcu (2002) and can account for more elaborate tree divergences. Moreover, by adopting a more expressive grammar formalism, we can naturally model syntactically complex compressions without having to specify additional rules (as in Turner & Charniak, 2005).
A synchronous grammar will license a large number of compressions for a given source tree. Each grammar rule typically has a score from which the overall score of a compression y for sentence x can be derived. Previous work estimates these scores generatively as discussed above. We opt for a discriminative training procedure which allows for the incorporation of all manner of powerful features. We use the large margin technique proposed by Tsochantaridis et al. (2005). The framework is attractive in that it supports a configurable loss function, which describes the extent to which a predicted target tree differs from
the reference tree. By devising suitable loss functions the model can be straightforwardly adapted to text rewriting tasks besides sentence compression.
McDonald (2006) also presents a sentence compression model that uses a discriminative large margin algorithm. The model has a rich feature set defined over compression bigrams including parts of speech, parse trees, and dependency information, without however making explicit use of a synchronous grammar. Decoding in this model amounts to finding the combination of bigrams that maximizes a scoring function defined over adjacent words in the compression and the intervening words which were dropped. Our model differs from McDonald's in two important respects. First, we can capture more complex tree transformations that go beyond bigram deletion. Being tree-based, our decoding algorithm is better able to preserve the grammaticality of the compressed output. Second, the tree-based representation allows greater modeling flexibility, e.g., by defining a wide range of loss functions over the tree or its string yield. In contrast, McDonald can only define loss functions over the final compression.
Although the bulk of research on sentence compression relies on parallel corpora for modeling purposes, a few approaches use no training data at all, or only a small amount. An example is the work of Hori and Furui (2004), who propose a model for automatically transcribed spoken text. Their method scores candidate compressions using a language model combined with a significance score (indicating whether a word is topical or not), and a score representing the speech recognizer's confidence in transcribing a given word correctly. Despite being conceptually simple and knowledge lean, their model operates at the word level. Since it does not take syntax into account, it has no means of deleting constituents spanning several subtrees (e.g., relative clauses). Clarke and Lapata (2008) show that such unsupervised models can be greatly improved when linguistically motivated constraints are used during decoding.
3. Problem Formulation
As mentioned earlier, we formulate sentence compression as a tree-to-tree rewriting problem using a weighted synchronous grammar coupled with a large margin training process. Our model learns from a parallel corpus of input (uncompressed) and output (compressed) pairs (x₁, y₁), …, (xₙ, yₙ) to predict a target labeled tree y from a source labeled tree x. We capture the dependency between x and y as a weighted STSG which we define in the following section. Section 3.2 discusses how we extract such a grammar from a parallel corpus. Each rule has a score, as does each ngram in the output tree, from which the overall score of a compression y for sentence x can be derived. We introduce our scoring function in Section 3.3 and explain our training algorithm in Section 3.5. In this framework decoding amounts to finding the best target tree licensed by the grammar given a source tree. We present a chart-based decoding algorithm in Section 3.4.
3.1 Synchronous Grammar
A synchronous grammar defines a space of valid source and target tree pairs, much as a regular grammar defines a space of valid trees. Synchronous grammars can be treated as tree transducers by reasoning over the space of possible sister trees for a given tree, that is, all the trees which can be produced alongside the given tree. This is essentially a transducer
Algorithm 1 Generative process for creating a pair of trees.
initialize source tree, x = RS
initialize target tree, y = RT
initialize stack of frontier nodes, F = [(RS, RT)]
for all node pairs, (vS, vT) ∈ F do
  choose a rule ⟨vS, vT⟩ → ⟨α, β, γ⟩
  rewrite node vS in x as α
  rewrite node vT in y as β
  for all variables, u ∈ γ do
    find aligned child nodes, (cS, cT), under vS and vT corresponding to u
    push (cS, cT) on to F
  end for
end for
x and y are now complete
which takes a tree as input and produces a tree as output. The grammar rules specify the steps taken by the transducer in recursively mapping tree fragments of the input tree into fragments in the target tree. From the many families of synchronous grammars (see Section 2), we elect to use a synchronous tree-substitution grammar (STSG). This is one of the simpler formalisms, and consequently has efficient inference algorithms, while still being complex enough to model a rich suite of tree edit operations.
A STSG is a 7-tuple, G = (N_S, N_T, Σ_S, Σ_T, P, R_S, R_T), where N are the non-terminals and Σ are the terminals, with the subscripts S and T indicating source and target respectively, P are the productions and R_S ∈ N_S and R_T ∈ N_T are the distinguished root symbols. Each production is a rewrite rule for two aligned non-terminals X ∈ N_S and Y ∈ N_T in the source and target:

⟨X, Y⟩ → ⟨α, β, γ⟩    (1)

where α and β are elementary trees rooted with the symbols X and Y respectively. Note that a synchronous context free grammar (SCFG) limits α and β to one-level elementary trees, but is otherwise identical to a STSG, which imposes no such limits. Non-terminal leaves of the elementary trees are referred to as frontier nodes or variables. These are the points of recursion in the transductive process. A one-to-one alignment between the frontier nodes in α and β is specified by γ. The alignment can represent deletion (or insertion) by aligning a node with the special symbol ε, which indicates that the node is not present in the other tree. Only nodes in α can be aligned to ε, which allows for subtrees to be deleted during transduction. We disallow the converse, ε-aligned nodes in β, as these would license unlimited insertion in the target tree, independently of the source tree. This capability would be of limited use for sentence compression, while also increasing the complexity of inference.
The grammar productions can be used in a generative setting to produce pairs of trees, or in a transductive setting to produce a target tree when given a source tree. Algorithms 1 and 2 present pseudo-code for both processes. The generative process (Algorithm 1) starts with the two root symbols and applies a production which rewrites the symbols as the production's elementary trees. These elementary trees might contain frontier nodes, in
Algorithm 2 The transduction of a source tree into a target tree.
Require: complete source tree, x, with root node labeled RS
initialize target tree, y = RT
initialize stack of frontier nodes, F = [(root(x), RT)]
for all node pairs, (vS, vT) ∈ F do
  choose a rule ⟨vS, vT⟩ → ⟨α, β, γ⟩ where α matches the sub-tree rooted at vS in x
  rewrite vT as β in y
  for all variables, u ∈ γ do
    find aligned child nodes, (cS, cT), under vS and vT corresponding to u
    push (cS, cT) on to F
  end for
end for
y is now complete
which case the aligned pairs of frontier nodes are pushed on to the stack, and later rewritten using another production. The process continues in a recursive fashion until the stack is empty (there are no frontier nodes remaining), at which point the two trees are complete. The sequence of rewrite rules is referred to as a derivation, from which the source and target tree can be recovered deterministically.
Our model uses a STSG in a transductive setting, where the source tree is given and it is only the target tree that is generated. This necessitates a different rewriting process, as shown in Algorithm 2. We start with the source tree, and RT, the target root symbol, which is aligned to the root node of the source, denoted root(x). Then we choose a production to rewrite the pair of aligned non-terminals such that the production's source side, α, matches the source tree. The target symbol is then rewritten using β. For each variable in γ the matching node in the source and its corresponding leaf node in the target tree are pushed on to the stack for later processing.¹ The process repeats until the stack is empty, and therefore the source tree has been covered. We now have a complete target tree. As before we use the term derivation to refer to this sequence of production applications. The target string is the yield of the target tree, given by reading the terminals from the tree in a left to right manner.
Let us consider again the compression example from Figure 1. The tree editing rules from Figure 2 are encoded as STSG productions in Figure 3 (see rules (1)–(5)). Production (1) reproduces tree pair (1) from Figure 2, production (2) tree pair (2), and so on. The notation in Figure 3 (primarily for space reasons) uses brackets ([]) to indicate constituent boundaries. Brackets surround a constituent's non-terminal and its child nodes, which can each be terminals, non-terminals or bracketed subtrees. The subscript indices are short-hand notation for the alignment, γ. For example, in rule (1) they specify that the two WP non-terminals are aligned and the RB node occurs only in the source tree (i.e., heads a deleted sub-tree). The grammar rules allow for differences in non-terminal category between the source and target, as seen in rules (2)–(4). They also allow arbitrarily deep elementary trees,
1. Special care must be taken for ε-aligned variables. Nodes in α which are ε-aligned signify that the source sub-tree below this point can be deleted without affecting the target tree. For this reason we can safely ignore source nodes deleted in this manner.
Rules which perform major tree edits
(1) ⟨WHNP, WHNP⟩ → ⟨[WHNP RB WP₁], [WHNP WP₁]⟩
(2) ⟨S, NP⟩ → ⟨[S NP₁ VP], NP₁⟩
(3) ⟨S, VP⟩ → ⟨[S NP VP₁], VP₁⟩
(4) ⟨S, VP⟩ → ⟨[S WHNP S₁], VP₁⟩
(5) ⟨S, S⟩ → ⟨[S [S WHNP₁ S₂] [CC and] S₃], [S WHNP₁ [S NP₂ VP₃]]⟩

Rules which preserve the tree structure
(6) ⟨WP, WP⟩ → ⟨[WP what], [WP what]⟩
(7) ⟨NP, NP⟩ → ⟨[NP NNS₁], [NP NNS₁]⟩
(8) ⟨NNS, NNS⟩ → ⟨[NNS records], [NNS records]⟩
(9) ⟨VP, VP⟩ → ⟨[VP VBP₁ VP₂], [VP VBP₁ VP₂]⟩
(10) ⟨VBP, VBP⟩ → ⟨[VBP are], [VBP are]⟩
(11) ⟨VP, VP⟩ → ⟨[VP VBN₁], [VP VBN₁]⟩
(12) ⟨VBN, VBN⟩ → ⟨[VBN involved], [VBN involved]⟩
Figure 3: The rules in a Synchronous Tree Substitution Grammar (STSG) capable of generating the sentence pair from Figure 1. Equivalently, this grammar defines a transducer which can convert the source tree (Figure 1(a)) into the target tree (Figure 1(b)). Each rule rewrites a pair of non-terminals into a pair of subtrees, shown in bracketed notation.
as evidenced by rule (5) which has trees of depth two. Rules (6)–(12) complete the toy grammar which describes the tree pair from Figure 1. These rules copy parts of the source tree into the target, be they terminals (e.g., rule (6)) or internal nodes with children (e.g., rule (9)).
Figure 4 shows how this grammar can be used to transduce the source tree into the target tree from Figure 1. The first few steps of the derivation are also shown graphically in Figure 5. We start with the source tree, and seek to transduce its root symbol into the target root symbol, denoted S/S. The first rule to be applied is rule (5) in Figure 3; its source side, α = [S [S WHNP S] [CC and] S], matches the root of the source tree and it has the requisite target category, Y = S. The matching part of the source tree is rewritten using the rule's target elementary tree, β = [S WHNP [S NP VP]]. The three variables are now annotated to reflect the category transformations required for each node, WHNP/WHNP, S/NP and S/VP. The process now continues for the leftmost of these nodes, labeled WHNP/WHNP. Rule (1) (from Figure 3) is then applied, which deletes the node's left child, shown as RB/ε, and retains its right child. The subsequent rule completes the transduction of the WHNP node by matching the string "what". The algorithm continues to visit each variable node and finishes when there are no variable nodes remaining, resulting in the desired target tree.
3.2 Grammar
The previous section outlined the STSG formalism we employ in our sentence compression model, save one important detail: the grammar itself. For example, we could obtain a
[Derivation steps omitted: starting from the source tree [S/S [S [WHNP exactly what] [S [NP records] [VP made it]]] [CC and] [S [WHNP which] [S [NP ones] [VP are involved]]]], the rules (5), (1), (6), (2), (8), (4), (3), (9), (10), (11) and (12) are applied in turn, ending with the target tree [S [WHNP [WP what]] [S [NP [NNS records]] [VP [VBP are] [VP [VBN involved]]]]].]
Figure 4: Derivation of the example sentence pair from Figure 1. Each line shows a rewrite step, denoted →ᵢ where the subscript i identifies which rule was used. The frontier nodes are shown in bold with X/Y indicating that symbol X must be transduced into Y in subsequent steps. For the sake of clarity, some internal nodes have been omitted.
synchronous grammar by hand, automatically from a corpus, or by some combination. Our only requirement is that the grammar allows the source trees in the training set to be transduced into their corresponding target trees. For maximum generality, we devised an automatic method to extract a grammar from a parsed, word-aligned parallel compression corpus. The method maps the word alignment into a constituent level alignment between nodes in the source and target trees. Pairs of aligned subtrees are next generalized to create tree fragments (elementary trees) which form the rules of the grammar.
The first step of the algorithm is to find the constituent alignment, which we define as the set of source and target constituent pairs whose yields are aligned to one another under the word alignment. We base our approach on the alignment template method (Och & Ney, 2004), which uses word alignments to define alignments between ngrams (called phrases in the SMT literature). This method finds pairs of ngrams where at least one word in one of the ngrams is aligned to a word in the other, but no word in either ngram is aligned to a word outside the other ngram. In addition, we require that these ngrams are syntactic constituents. More formally, we define constituent alignment as:

C = {(vS, vT) | (∃(s, t) ∈ A : s ∈ Y(vS) ∧ t ∈ Y(vT)) ∧ (∄(s, t) ∈ A : s ∈ Y(vS) ⊻ t ∈ Y(vT))}    (2)

where vS and vT are source and target tree nodes (subtrees), A = {(s, t)} is the set of word alignments (pairs of word-indices), Y(·) returns the yield span for a subtree (the minimum and maximum word index in its yield) and ⊻ is the exclusive-or operator. Figure 6 shows
[Tree diagrams omitted: three stages showing the source tree on the left and the partial target tree being built on the right.]
Figure 5: Graphical depiction of the first two steps of the derivation in Figure 4. The source tree is shown on the left and the partial target tree on the right. Variable nodes are shown in bold face and dotted lines show their alignment.
the word alignment and the constituent alignments that are licensed for the sentence pair from Figure 1.
The next step is to generalize the aligned subtree pairs by replacing aligned child subtrees with variable nodes. For example, in Figure 6 when we consider the pair of aligned subtrees [S which ones are involved] and [VP are involved], we could extract the rule:

⟨S, VP⟩ → ⟨[S [WHNP [WP which]] [S [NP [NNS ones]] [VP [VBP are] [VP [VBN involved]]]]], [VP [VBP are] [VP [VBN involved]]]⟩    (3)
However, this rule is very specific and consequently will not be very useful in a transduction model. In order for it to be applied, we must see the full S subtree, which is highly unlikely to occur in another sentence. Ideally, we should generalize the rule so as to match many more source trees, and thereby allow transduction of previously unseen structures. In the example, the node pairs labeled (VP₁, VP₁), (VBP, VBP), (VP₂, VP₂) and (VBN, VBN) can all be generalized as these nodes are aligned constituents (subscripts added to distinguish
[Diagram omitted: the source tree for "exactly what records made it and which ones are involved" and the target tree for "what records are involved", with their word alignment shown as a binary matrix.]
Figure 6: Tree pair with word alignments shown as a binary matrix. A dark square indicates an alignment between the words on its row and column. The overlaid rectangles show constituent alignments which are inferred from the word alignment.
between the two VP nodes). In addition, the nodes WHNP, WP, NP and NNS in the source are unaligned, and therefore can be generalized using ε-alignment to signify deletion. If we were to perform all possible generalizations for the above example,² we would produce the rule:

⟨S, VP⟩ → ⟨[S WHNP S₁], VP₁⟩    (4)

There are many other possible rules which can be extracted by applying different legal combinations of the generalizations (there are 45 in total for this example).
Algorithm 3 shows how the minimal (most general) rules are extracted.³ This results in the minimal set of synchronous rules which can describe each tree pair.⁴ These rules are minimal in the sense that they cannot be made smaller (e.g., by replacing a subtree with a variable) while still honoring the word-alignment. Figure 7 shows the resulting minimal set of synchronous rules for the example from Figure 6. As can be seen from the example, many of the rules extracted are overly general. Ideally, we would extract every rule with every legal combination of generalizations, however this leads to a massive number of rules, exponential in the size of the source tree. We address this problem by allowing a limited number of generalizations to be skipped in the extraction process. This is equivalent to altering lines 4 and 7 in Algorithm 3 to first make a non-deterministic decision whether to match or ignore the match and continue descending the source tree. The recursion depth limits the number of matches that can be ignored in this way. For example, if we allow one
2. Where some generalizations are mutually exclusive, we take the highest match in the trees.
3. The non-deterministic matching step in line 8 allows the matching of all options individually. This is implemented as a mutually recursive function which replicates the algorithm state to process each different match.
4. Algorithm 3 is an extension of Galley, Hopkins, Knight, and Marcu's (2004) technique for extracting a SCFG from a word-aligned corpus consisting of (tree, string) pairs.
Algorithm 3 extract(x, y, A): extracts minimal rules from constituent-aligned trees
Require: source tree, x, target tree, y, and constituent-alignment, A
1: initialize source and target sides of rule, α = x, β = y
2: initialize frontier alignment, γ = ∅
3: for all nodes vS ∈ α, top-down do
4:   if vS is null-aligned then
5:     γ ← γ ∪ (vS, ε)
6:     delete children of vS
7:   else if vS is aligned to some target node(s) then
8:     choose target node, vT {non-deterministic choice}
9:     call extract(vS, vT, A)
10:    γ ← γ ∪ (vS, vT)
11:    delete children of vS
12:    delete children of vT
13:  end if
14: end for
15: emit rule ⟨root(α), root(β)⟩ → ⟨α, β, γ⟩
level of recursion when extracting rules from the (S, VP) pair from Figure 6, we get the additional rules:

⟨S, VP⟩ → ⟨[S [WHNP WP] S₁], VP₁⟩
⟨S, VP⟩ → ⟨[S WHNP [S NP VP₁]], VP₁⟩

while at two levels of recursion, we also get:

⟨S, VP⟩ → ⟨[S [WHNP [WP which]] S₁], VP₁⟩
⟨S, VP⟩ → ⟨[S [WHNP [WP which]] [S NP VP₁]], VP₁⟩
⟨S, VP⟩ → ⟨[S WHNP [S [NP NNS] VP₁]], VP₁⟩
⟨S, VP⟩ → ⟨[S WHNP [S NP [VP VBP₁ VP₂]]], [VP VBP₁ VP₂]⟩
Compared to rule (4) we can see that the specialized rules above add useful structure and lexicalisation, but are still sufficiently abstract to generalize to new sentences, unlike rule (3). The number of rules is exponential in the recursion depth, but with a fixed depth it is polynomial in the size of the source tree fragment. We set the recursion depth to a small number (one or two) in our experiments.
There is no guarantee that the induced rules will have good coverage on unseen trees. Tree fragments containing previously unseen terminals or non-terminals, or even an unseen sequence of children for a parent non-terminal, cannot be matched by any grammar productions. In this case the transduction algorithm (Algorithm 2) will fail as it has no way of covering the source tree. However, the problem can be easily remedied by adding new rules to the grammar to allow the source tree to be fully covered.⁵ For each node in the
5. There are alternative, equally valid, techniques for improving coverage which simplify the syntax trees. For example, this can be done explicitly by binarizing large productions (e.g., Petrov, Barrett, Thibaux, & Klein, 2006) or implicitly with a Markov grammar over grammar productions (e.g., Collins, 1999).
⟨S, S⟩ → ⟨[S [S WHNP₁ S₂] CC S₃], [S WHNP₁ [S NP₂ VP₃]]⟩
⟨WHNP, WHNP⟩ → ⟨[WHNP RB WP₁], [WHNP WP₁]⟩
⟨WP, WP⟩ → ⟨[WP what], [WP what]⟩
⟨S, NP⟩ → ⟨[S NP₁ VP], NP₁⟩
⟨NP, NP⟩ → ⟨[NP NNS₁], [NP NNS₁]⟩
⟨NNS, NNS⟩ → ⟨[NNS records], [NNS records]⟩
⟨S, VP⟩ → ⟨[S WHNP S₁], VP₁⟩
⟨S, VP⟩ → ⟨[S NP VP₁], VP₁⟩
⟨VP, VP⟩ → ⟨[VP VBP₁ VP₂], [VP VBP₁ VP₂]⟩
⟨VBP, VBP⟩ → ⟨[VBP are], [VBP are]⟩
⟨VP, VP⟩ → ⟨[VP VBN₁], [VP VBN₁]⟩
⟨VBN, VBN⟩ → ⟨[VBN involved], [VBN involved]⟩

Figure 7: The minimal set of STSG rules extracted from the aligned trees in Figure 6.
source tree, a rule is created to copy that node and its child nodes into the target tree. For example, if we see the fragment [NP DT JJ NN] in the source tree, we add the rule:

⟨NP, NP⟩ → ⟨[NP DT₁ JJ₂ NN₃], [NP DT₁ JJ₂ NN₃]⟩

With these rules, each source node is copied into the target tree, and therefore the transduction algorithm can trivially recreate the original tree. Of course, the other grammar rules can work in conjunction with the copying rules to produce other target trees.
While the copy rules solve the coverage problem on unseen data, they do not solve the related problem of under-compression. This occurs when there are unseen CFG productions in the source tree and therefore the only applicable grammar rules are copy rules, which copy all child nodes into the target. None of the child subtrees can be deleted unless the parent node can itself be deleted by a higher-level rule, in which case all the children are deleted. Clearly, it would add considerable modelling flexibility to be able to delete some, but not all, of the children. For this reason, we add explicit deletion rules for each source CFG production which allow subsets of the child nodes to be deleted in a linguistically plausible manner.
The deletion rules attempt to preserve the most important child nodes. We measure importance using the head-finding heuristic from Collins' parser (Appendix A, Collins, 1999). Collins' method finds the single head child of a CFG production using hand-coded tables for each non-terminal type. As we desire a set of child nodes, we run the algorithm to find all matches rather than stopping after the first match. The order in which each match is found is used as a ranking of the importance of each child. The ordered list of child nodes is then used to create synchronous rules which retain head 1, heads 1–2, . . . , all heads.
For the fragment [NP DT JJ NN], the heads are found in the following order (NN, DT, JJ). Therefore we create rules to retain children (NN); (DT, NN); and (DT, JJ, NN):

⟨NP, NP⟩ → ⟨[NP DT JJ NN₁], [NP NN₁]⟩
⟨NP, NN⟩ → ⟨[NP DT JJ NN₁], NN₁⟩
⟨NP, NP⟩ → ⟨[NP DT₁ JJ NN₂], [NP DT₁ NN₂]⟩
⟨NP, NP⟩ → ⟨[NP DT₁ JJ₂ NN₃], [NP DT₁ JJ₂ NN₃]⟩

Note that when only one child remains, the rule is also produced without the parent node, as seen in the second rule above.
3.3 Linear Model
While an STSG defines a transducer capable of mapping a source tree into many possible target trees, it is of little use without some kind of weighting towards grammatical trees which have been constructed using sensible STSG productions and which yield fluent compressed target sentences. Ideally the model would define a scoring function over target trees or strings, however we instead operate on derivations. In general, there may be many derivations which all produce the same target tree, a situation referred to as spurious ambiguity. To fully account for spurious ambiguity would require aggregating all derivations which produce the same target tree. This would break the polynomial-time dynamic program used for inference, rendering the inference problem NP-complete (Knight, 1999). To this end, we define a scoring function over derivations:

score(d; w) = ⟨Ψ(d), w⟩    (5)

where d is a derivation⁶ consisting of a sequence of rules, w are the model parameters, Ψ is a vector-valued feature function and the operator ⟨·, ·⟩ is the inner product. The parameters, w, are learned during training, described in Section 3.5.
The feature function, Ψ, is defined as:

Ψ(d) = Σ_{r∈d} ψ(r, source(d)) + Σ_{m∈ngrams(d)} ψ(m, source(d))    (6)

where r are the rules of a derivation, ngrams(d) are the ngrams in the yield of the target tree and ψ is a feature function returning a vector of feature values for each rule. Note that the feature function has access to not only the rule, r, but also the source tree, source(d); as this is a conditional model, doing so has no overhead in terms of modeling assumptions or the complexity of inference.
In the second summand in (6), m are the ngrams in the yield of the target tree and ψ is a feature function over these ngrams. Traditional (weighted) synchronous grammars only allow features which decompose with the derivation (i.e., can be expressed using the first summand in (6)). However, this is a very limiting requirement, as the ngram features allow the modeling of local coherence and are commonly used in the sentence compression literature (Knight & Marcu, 2002; Turner & Charniak, 2005; Galley & McKeown, 2007;
6. The derivation, d, fully specifies both the source, x = source(d), and the target tree, y = target(d).
Clarke & Lapata, 2008; Hori & Furui, 2004; McDonald, 2006). For instance, when deleting a sub-tree with left and right siblings, it is critical to know not only that the new siblings are in a grammatical configuration, but also that their yield still forms a coherent string. For this reason, we allow ngram features, specifically the conditional log-probability of an ngram language model. Unfortunately, this comes at a price as the ngram features significantly increase the complexity of inference used for training and decoding.
3.4 Decoding
Decoding aims to find the best target tree licensed by the grammar given a source tree. As mentioned above, we deal with derivations in place of target trees. Decoding finds the maximizing derivation, d*:

d* = argmax_{d : source(d)=x} score(d; w)    (7)

where x is the (given) source tree, source(d) extracts the source tree from the derivation d and score is defined in (5). The maximization is performed over the space of derivations for the given source tree, as defined by the transduction process shown in Algorithm 2.
The maximization problem in (7) is solved using the chart-based dynamic program shown in Algorithm 4. This extends earlier inference algorithms for weighted STSGs (Eisner, 2003) which assume that the scoring function must decompose with the derivation, i.e., features apply to rules but not to terminal ngrams. Relaxing this assumption leads to additional complications and increased time and space complexity. This is equivalent to using as our grammar the intersection between the original grammar and an ngram language model, as explained by Chiang (2007) in the context of string transduction with an SCFG.
The algorithm defines a chart, C, to record the best scoring (partial) target tree for each source node vS and with root non-terminal t. The back-pointers, B, record the maximizing rule and store pointers to the child chart cells filling each variable in the rule. The chart is also indexed by the n − 1 terminals at the left and right edges of the target tree's yield to allow scoring of ngram features.⁷ The terminal ngrams provide sufficient context to evaluate ngram features overlapping the cell's boundary when the chart cell is combined in another rule application (this is the operation performed by the boundary-ngrams function on line 15). This is best illustrated with an example. Using trigram features, n = 3, if a node were rewritten as [NP the fast car] then we must store the ngram context (the fast, fast car) in its chart entry. Similarly [VP skidded to a halt] would have ngram context (skidded to, a halt). When applying a parent rule [S NP VP] which rewrites these two trees as adjacent siblings we need to find the ngrams on the boundary between the NP and VP. These can be easily retrieved from the two chart cells' contexts. We combine the right edge of the NP context, "fast car", with the left edge of the VP context, "skidded to", to get the two trigrams "fast car skidded" and "car skidded to". The other trigrams "the fast car", "skidded to a" and "to a halt" will have already been evaluated in the child chart cells. The new combined S chart cell is now given the context (the fast, a halt) by taking the left and right
7. Strictly speaking, only the terminals on the right edge are required for a compression model which would create the target string in a left-to-right manner. However, our algorithm is more general in that it allows reordering rules such as ⟨PP, PP⟩ → ⟨[PP IN₁ NP₂], [PP NP₂ IN₁]⟩. Such rules are required for most other text-rewriting tasks besides sentence compression.
Algorithm 4 Exact chart based decoding algorithm.
Require: complete source tree, x, with root node labeled RS
1: let C[v, t, l] ∈ ℝ be a chart representing the score of the best derivation transducing the tree rooted at v to a tree with root category t and ngram context l
2: let B[v, t, l] ∈ (P, x × NT × L) be the corresponding back-pointers, each consisting of a production and the source node, target category and ngram context for each of the production's variables
3: initialize chart, C[·, ·, ·] = −∞
4: initialize back-pointers, B[·, ·, ·] = none
5: for all source nodes, vS ∈ x, bottom-up do
6:   for all rules, r = ⟨vS, Y⟩ → ⟨α, β, γ⟩, where α matches the sub-tree rooted at vS do
7:     let m be the target ngrams wholly contained in β
8:     let features vector, Ψ ← ψ(r, x) + ψ(m, x)
9:     let l be an empty ngram context
10:    let score, q ← 0
11:    for all variables, u ∈ γ do
12:      find source child node, cu, under vS corresponding to u
13:      let tu be the non-terminal for the target child node under β corresponding to u
14:      choose child chart entry, qu = C[cu, tu, lu] {non-deterministic choice of lu}
15:      let m ← boundary-ngrams(r, lu)
16:      update features, Ψ ← Ψ + ψ(m, x)
17:      update ngram context, l ← merge-ngram-context(l, lu)
18:      update score, q ← q + qu
19:    end for
20:    update score, q ← q + ⟨Ψ, w⟩
21:    if q > C[vS, Y, l] then
22:      update chart, C[vS, Y, l] ← q
23:      update back-pointers, B[vS, Y, l] ← (r, {(cu, tu, lu) ∀u})
24:    end if
25:  end for
26: end for
27: find best root chart entry, l* ← argmax_l C[root(x), RT, l]
28: create derivation, d*, by traversing back-pointers from B[root(x), RT, l*]
edges of the two child cells. This merging process is performed by the merge-ngram-context function on line 17. Finally we add an artificial root node to the target tree with n − 1 artificial start terminals and one end terminal. This allows the ngram features to be applied over boundary ngrams at the beginning and end of the target string.
The decoding algorithm processes the source tree in a post-order traversal, finding the set of possible trees and their ngram contexts for each source node and inserting these into the chart. The rules which match the node are processed in lines 6–24. The feature vector, Ψ, is calculated on the rule and the ngrams therein (line 8), and for ngrams bordering child cells filling the rule's variables (line 16). Note that the feature vector only includes those features specific to the rule and the boundary ngrams, but not those wholly contained in
the child cell. For this reason the score is the sum of the scores for each child cell (line 18) and the feature vector and the model weights (line 20). The new ngram context, l, is calculated by combining the rule's frontier and the ngram contexts of the child cells (line 17). Finally the chart entry for this node is updated if the score betters the previous value (lines 21–24).
When choosing the child chart cell entry in line 14, there can be many different entries each with a different ngram context, lu. This affects the ngram features, Ψ, and consequently the ngram context, l, and the score, q, for the rule. The non-determinism means that every combination of child chart entries is chosen for each variable, and these combinations are then evaluated and inserted into the chart. The number of combinations is the product of the number of child chart entries for each variable. This can be bounded by O(|T_T|^{2(n−1)V}) where |T_T| is the size of the target lexicon and V is the number of variables. Therefore the asymptotic time complexity of decoding is O(SR|T_T|^{2(n−1)V}) where S is the number of source nodes and R is the number of matching rules for each node. This high complexity clearly makes exact decoding infeasible, especially so when either n or V are large.
We adopt a popular approach in syntax-inspired machine translation to address this problem (Chiang, 2007). Firstly, we use a beam-search, which limits the number of different ngram contexts stored in each chart cell to a constant, W. This changes the base in the complexity term, leading to an improved O(SRW^V) which is still exponential in the number of variables. In addition, we use Chiang's cube-pruning heuristic to further limit the number of combinations. Cube-pruning uses a heuristic scoring function which approximates the conditional log-probability from an ngram language model with the log-probability from a unigram model.⁸ This allows us to visit the combinations in best-first order under the heuristic scoring function until the beam is filled. The beam is then rescored using the correct scoring function. This can be done cheaply in O(WV) time, leading to an overall time complexity of decoding of O(SRWV). We refer the interested reader to the work of Chiang (2007) for further details.
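A simplified sketch in the spirit of cube pruning: combinations of per-variable candidates are popped from a heap in best-first order under the (heuristic) scores until the beam width W is reached, instead of scoring the full cross-product. The popped beam would then be rescored with the true ngram features, which the heuristic scores ignore.

```python
import heapq

def best_first_combinations(cands, W):
    """Enumerate index tuples over the grid of per-variable candidates
    in best-first order of summed heuristic scores, stopping after W
    combinations. cands[v] is a list of (item, score) pairs sorted by
    descending score."""
    start = (0,) * len(cands)
    heap = [(-sum(c[0][1] for c in cands), start)]
    seen = {start}
    popped = 0
    while heap and popped < W:
        neg, idx = heapq.heappop(heap)
        yield [cands[v][i][0] for v, i in enumerate(idx)], -neg
        popped += 1
        for v in range(len(idx)):             # push the grid neighbours
            i = idx[v] + 1
            if i < len(cands[v]):
                nxt = idx[:v] + (i,) + idx[v + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    delta = cands[v][i][1] - cands[v][idx[v]][1]
                    heapq.heappush(heap, (neg - delta, nxt))

# e.g. two variables, two candidates each, beam of three:
for items, s in best_first_combinations(
        [[("the", -0.1), ("a", -0.5)], [("car", -0.2), ("cab", -1.0)]], 3):
    print(items, s)   # ['the','car'] -0.3; ['a','car'] -0.7; ['the','cab'] -1.1
```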
3.5 Training
We now turn to the problem of how derivations are scored in our model. For a given source tree, the space of sister target trees implied by the synchronous grammar is often very large, and the majority of these trees are ungrammatical or poor compressions. It is the job of the training algorithm to find weights such that the reference target trees have high scores and the many other target trees licensed by the grammar are given lower scores.
As explained in Section 3.3 we define a scoring function over derivations. This function was given in (5) and (7), and is reproduced below:

f(x; w) = argmax_{d : source(d)=x} ⟨w, Ψ(d)⟩    (8)

Equation (8) finds the best scoring derivation, d, for a given source, x, under a linear model. Recall that d is a derivation which generates the source tree x and a target tree. The goal
8. We use the conditional log-probability of an ngram language model as our only ngram feature. In order to use other ngram features, such as binary identity features for specific ngrams, it would first be advisable to construct an approximation which decomposes with the derivation for use in the cube-pruning heuristic.
of the training procedure is to find a parameter vector w which satisfies the condition:

∀i, ∀d : source(d) = xᵢ ∧ d ≠ dᵢ : ⟨w, Ψ(dᵢ) − Ψ(d)⟩ ≥ 0    (9)

where xᵢ, dᵢ are the ith training source tree and reference derivation. This condition states that for all training instances the reference derivation is at least as high scoring as any other derivation. Ideally, we would also like to know the extent to which a predicted target tree differs from the reference tree. For example, a compression that differs from the gold standard with respect to one or two words should be treated differently from a compression that bears no resemblance to it. Another important factor is the length of the compression. Compressions whose length is similar to the gold standard should be preferable to longer or shorter output. A loss function Δ(yᵢ, y) quantifies the accuracy of prediction y with respect to the true output value yᵢ.
There is a plethora of different discriminative training frameworks which can optimize a linear model. Possibilities include perceptron training (Collins, 2002), log-linear optimisation of the conditional log-likelihood (Berger, Pietra, & Pietra, 1996) and large margin methods. We base our training on Tsochantaridis et al.'s (2005) framework for learning Support Vector Machines (SVMs) over structured output spaces, using the SVMstruct implementation.⁹ The framework supports a configurable loss function which is particularly appealing in the context of sentence compression and more generally text-to-text generation. It also has an efficient training algorithm and powerful regularization. The latter is critical for discriminative models with large numbers of features, which would otherwise over-fit the training sample at the expense of generalization accuracy. We briefly summarize the approach below; for a more detailed description we refer the interested reader to the work of Tsochantaridis et al. (2005).
Traditionally SVMs learn a linear classifier that separates two or more classes with the largest possible margin. Analogously, structured SVMs attempt to separate the correct structure from all other structures with a large margin. The learning objective for the structured SVM uses the soft-margin formulation which allows for errors in the training set via the slack variables, ξᵢ:

min_{w,ξ} (1/2)||w||² + (C/n) Σⁿᵢ₌₁ ξᵢ,  ξᵢ ≥ 0    (10)
s.t. ∀i, ∀d : source(d) = xᵢ ∧ d ≠ dᵢ : ⟨w, Ψ(dᵢ) − Ψ(d)⟩ ≥ Δ(dᵢ, d) − ξᵢ

The slack variables, ξᵢ, are introduced here for each training example, xᵢ, and C is a constant that controls the trade-off between training error minimization and margin maximization. Note that slack variables are combined with the loss incurred in each of the linear constraints. This means that a high loss output must be separated by a larger margin than a low loss output, or have a much larger slack variable to satisfy the constraint. Alternatively, the loss function can be used to rescale the slack parameters, in which case the constraints in (10) are replaced with ⟨w, Ψ(dᵢ) − Ψ(d)⟩ ≥ 1 − ξᵢ/Δ(dᵢ, d). Margin rescaling is theoretically less desirable as it is not scale invariant, and therefore requires the tuning of an additional hyperparameter compared to slack rescaling. However, empirical results show
9. http://svmlight.joachims.org/svm_struct.html
little difference between the two rescaling methods (Tsochantaridis et al., 2005). We use margin rescaling for the practical reason that it can be approximated more accurately than can slack rescaling by our chart based inference method.
The optimization problem in (10) is approximated using an algorithm proposed by Tsochantaridis et al. (2005). The algorithm finds a small set of constraints from the full-sized optimization problem that ensures a sufficiently accurate solution. Specifically, it constructs a nested sequence of successively tighter relaxations of the original problem using a (polynomial time) cutting plane algorithm. For each training instance, the algorithm keeps track of the selected constraints defining the current relaxation. Iterating through the training examples, it proceeds by finding the output that most radically violates a constraint. In our case, the optimization crucially relies on finding the derivation which is both high scoring and has high loss compared to the gold standard. This requires finding the maximizer of:
H(d) = Δ(dᵢ, d) − ⟨w, Ψ(dᵢ) − Ψ(d)⟩    (11)
The search for the maximizer of H(d) in (11) can be performed by the decoding algorithm presented in Section 3.4 with some extensions. Firstly, by expanding (11) to H(d) = Δ(dᵢ, d) − ⟨Ψ(dᵢ), w⟩ + ⟨Ψ(d), w⟩ we can see that the second term is constant with respect to d, and thus does not influence the search. The decoding algorithm maximizes the last term, so all that remains is to include the loss function into the search process.
Loss functions which decompose with the rules or target ngrams in the derivation, Δ(dᵢ, d) = Σ_{r∈d} Δ_R(dᵢ, r) + Σ_{n∈ngrams(d)} Δ_N(dᵢ, n), can be easily integrated into the decoding algorithm. This is done by adding the partial loss, Δ_R(dᵢ, r) + Δ_N(dᵢ, n), to each rule's score in line 20 of Algorithm 4 (the ngrams are recovered from the ngram contexts in the same manner used to evaluate the ngram features).
However, many of our loss functions do not decompose with the rules or the ngrams. In order to calculate these losses the chart must be stratified by the loss function's arguments (Joachims, 2005). For example, unigram precision measures the ratio of correctly predicted tokens to total predicted tokens and therefore its loss arguments are the pair of counts, (TP, FP), for true and false positives. They are initialized to (0, 0) and are then updated for each rule used in a derivation. This equates to checking whether each target terminal is in the reference string and incrementing the relevant value. The chart is extended (stratified) to store the loss arguments in the same way that ngram contexts are stored for decoding. This means that a rule accessing a child chart cell can get multiple entries, each with different loss argument values as well as multiple ngram contexts (line 14 in Algorithm 4). The loss argument for a rule application is calculated from the rule itself and the loss arguments of its children. This is then stored in the chart and the back-pointer list (lines 22–23 in Algorithm 4). Although this loss can only be evaluated correctly for complete derivations, we also evaluate the loss on partial derivations as part of the cube-pruning heuristic. Losses with a large space of argument values will be more coarsely approximated by the beam search, which prunes the number of chart entries to a constant size. For this reason, we have focused mainly on simple loss functions which have a relatively small space of argument values, and also use a wide beam during the search (200 unique items or 500 items, whichever comes first).
Algorithm 5 Find the gold standard derivation for a pair of trees (i.e., alignment).
Require: source tree, x, and target tree, y
1: let C[vS, vT] ∈ ℝ be a chart representing the maximum number of rules used to align nodes vS ∈ x and vT ∈ y
2: let B[vS, vT] ∈ (P, x × y) be the corresponding back-pointers, consisting of a production and a pair of aligned nodes for each of the production's variables
3: initialize chart, C[·, ·] = −∞
4: initialize back-pointers, B[·, ·] = none
5: for all source nodes, vS ∈ x, bottom-up do
6:   for all rules, r = ⟨vS, Y⟩ → ⟨α, β, γ⟩, where α matches the sub-tree rooted at vS do
7:     for all target nodes, vT ∈ y, matching β do
8:       let rule count, j ← 1
9:       for all variables, u ∈ γ do
10:        find aligned child nodes, (cS, cT), under vS and vT corresponding to u
11:        update rule count, j ← j + C[cS, cT]
12:      end for
13:      if j greater than previous value in chart then
14:        update chart, C[vS, vT] ← j
15:        update back-pointers, B[vS, vT] ← (r, {(cS, cT) ∀u})
16:      end if
17:    end for
18:  end for
19: end for
20: if C[root(x), root(y)] ≠ −∞ then
21:   success; create derivation by traversing back-pointers from B[root(x), root(y)]
22: end if
In our discussion so far we have assumed that we are given a gold standard derivation, dᵢ, glossing over the issue of how to find it. Spurious ambiguity in the grammar means that there are often many derivations linking the source and target, none of which are clearly correct. We select the derivation using the maximum number of rules, each of which will be small, and therefore should provide maximum generality.¹⁰ This is found using Algorithm 5, a chart-based dynamic program similar to the alignment algorithm for inversion transduction grammars (Wu, 1997). The algorithm has time complexity O(S²R) where S is the size of the larger of the two trees and R is the number of rules which can match a node.
3.6 Loss Functions

The training algorithm described above is highly modular and in theory can support a wide range of loss functions. There is no widely accepted evaluation metric for text compression. A zero-one loss would be straightforward to define but inappropriate for our problem,
as it would always penalize target derivations that differ even slightly from the reference derivation. Ideally, we would like a loss with a wider scoring range that can discriminate between derivations that differ from the reference. Some of these may be good compressions whereas others may be entirely ungrammatical. For this reason we have developed a range of loss functions which draw inspiration from various metrics used for evaluating text-to-text rewriting tasks such as summarization and machine translation.
Loss functions are defined over derivations and can look at any accessible item, including tokens, ngrams and CFG rules. Our first class of loss functions calculates the Hamming distance between unordered bags of items. It measures the number of predicted items that did not appear in the reference, along with a penalty for short output:

    hamming(d, \hat{d}) = FP + \max(l - (TP + FP), 0)    (12)

where TP and FP are the number of true and false positives, respectively, when comparing the predicted target, target(d), with the reference, target(\hat{d}), and l is the length of the reference. We include the second term to penalize overly short output, as otherwise predicting very little or nothing would incur no penalty.
We have created three instantiations of the loss function in (12) over: 1) tokens, 2) ngrams (n ≤ 3), and 3) CFG productions. In each case, the loss argument space is quadratic in the size of the source tree. Our Hamming ngram loss is an attempt at defining a loss function similar to BLEU (Papineni, Roukos, Ward, & Zhu, 2002). The latter is defined over documents rather than individual sentences, and is thus not directly applicable to our problem. Now, since these losses all operate on unordered bags they may reward erroneous predictions; for example, a permutation of the reference tokens will have zero token-loss. This is less of a problem for the CFG and ngram losses whose items overlap, thereby encoding a partial order. Another problem with the loss functions just described is that they do not penalize multiply predicting an item that occurred only once in the reference. This could be a problem for function words which are common in most sentences.
Therefore we developed two additional loss functions which take multiple predictions into account. The first measures the edit distance, i.e., the number of insertions and deletions, between the predicted and the reference compressions, both treated as bags-of-tokens. In contrast to the previous loss functions, it requires the true positive counts to be clipped to the number of occurrences of each type in the reference. The edit distance is given by:

    edit(d, \hat{d}) = p + q - 2 \sum_i \min(p_i, q_i)    (13)

where p and q denote the number of target tokens in the predicted tree, target(d), and the reference, y = target(\hat{d}), respectively, and p_i and q_i are the counts for type i. The loss arguments for the edit distance consist of a vector of counts for each item type in the reference, {p_i, ∀i}. The space of possible values is exponential in the size of the source tree, compared to quadratic for the Hamming losses. Consequently, we expect beam search to result in many more search errors when using the edit distance loss.
Our last loss function is the F1 measure, a harmonic mean between precision and recall, measured over bags-of-tokens. As with the edit distance, its calculation requires the counts to be clipped to the number of occurrences of each terminal type in the reference.
Ref:  [S [WHNP [WP what]] [S [NP [NNS records]] [VP [VBP are] [VP [VBN involved]]]]]
Pred: [S [WHNP [WP what]] [S [NP [NNS ones]] [VP [VBP are] [VBN involved]]]]

Loss             Arguments               Value
Token Hamming    TP = 3, FP = 1          1/4
3-gram Hamming   TP = 8, FP = 5          5/14
CFG Hamming      TP = 8, FP = 1          1/9
Edit distance    p = (1, 0, 1, 1, 1)     2
F1               p = (1, 0, 1, 1, 1)     1/4

Table 1: Loss arguments and values for the example predicted and reference compressions. Note that loss values should not be compared between different loss functions; these values are purely illustrative.
We therefore use the same loss arguments for its calculation. The F1 loss is given by:

    F_1(d, \hat{d}) = 1 - \frac{2 \cdot precision \cdot recall}{precision + recall}    (14)

where precision = \sum_i \min(p_i, q_i) / p and recall = \sum_i \min(p_i, q_i) / q. As F1 shares the same arguments with the edit distance loss, it also has the same exponential space of loss argument values and will consequently be subject to severe pruning during the beam search used in training.
To illustrate the above loss functions, we present an example in Table 1. Here, the prediction (Pred) and reference (Ref) have the same length (4 tokens) and identical syntactic structure, but differ by one word (ones versus records). Correspondingly, there are three correct tokens and one incorrect, which forms the arguments for the token Hamming loss, resulting in a loss of 1/4. The ngram loss is measured for n ≤ 3, and the start and end of the string are padded with special symbols to allow evaluation of the boundary ngrams. The CFG loss records only one incorrect CFG production (the preterminal [NNS ones]) from the total of nine productions. The last two losses use the same arguments: a vector with values for the counts of each reference type. The first four cells correspond to what, records, are and involved; the last cell records all other types. For the example, the edit distance is two (one deletion and one insertion) while the F1 loss is 1/4 (precision and recall are both 3/4).
4. Features

Our feature space is defined over source trees, x, and target derivations, d. We devised two broad classes of features, applying to grammar rules and to ngrams of target terminals. We defined only a single ngram feature, the conditional log-probability of a trigram language model. This was trained on the BNC (100 million words) using the SRI Language Modeling toolkit (Stolcke, 2002), with modified Kneser-Ney smoothing.
For each rule ⟨X, Y⟩ → ⟨α, γ, ∼⟩, we extract features according to the templates detailed below. Our templates give rise to binary indicator features, except where explicitly stated. These features perform a boolean test, returning value 1 when the test succeeds and 0 otherwise. An example rule and its corresponding features are shown in Table 2.
Type: Whether the rule was extracted from the training set, created as a copy rule and/or created as a delete rule. This allows the model to learn a preference for each of the three sources of grammar rules (see row Type in Table 2).

Root: The root categories of the source, X, and target, Y, and their conjunction, X ∧ Y (see rows Root in Table 2).

Identity: The source side, α, target side, γ, and the full rule, ⟨α, γ, ∼⟩. This allows the model to learn weights on individual rules or those sharing an elementary tree. Another feature checks if the rule's source and target elementary trees are identical, α = γ (see rows Identity in Table 2).

Unlexicalised Identity: The identity feature templates above are replicated for unlexicalised elementary trees, i.e., with the terminals removed from their frontiers (see rows UnlexId in Table 2).

Rule count: This feature is always 1, allowing the model to count the number of rules used in a derivation (see row Rule count in Table 2).

Word count: Counts the number of terminals in γ, allowing a global preference for shorter or longer output. Additionally, we record the number of terminals in the source tree, which can be used with the target terminal count to find the number of deleted terminals (see rows Word count in Table 2).

Yield: These features compare the terminal yield of the source, Y(α), and target, Y(γ). The first feature checks the identity of the two sequences, Y(α) = Y(γ). We use identity features for each terminal in both yields, and for each terminal only in the source (see rows Yield in Table 2). We also replicate these feature templates for the sequence of non-terminals on the frontier (pre-terminals or variable non-terminals).

Length: Records the difference in the lengths of the frontiers of α and γ, and whether the target's frontier is shorter than that of the source (see rows Length in Table 2). A sketch of how a few of these templates might look in code is given below.
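As a rough illustration of how such templates could be instantiated, the sketch below maps one rule to a sparse feature dict. The rule object and its attributes (origin, src_root, src_tree, frontiers, yields) are hypothetical; binary templates become string-keyed indicators with value 1, while the count templates are real-valued.

    def rule_features(rule, num_source_terminals):
        """Instantiate a few of the templates above for one rule."""
        f = {}
        f["type=" + rule.origin] = 1                           # Type
        f["rootX=" + rule.src_root] = 1                        # Root
        f["rootY=" + rule.tgt_root] = 1
        f["rootXY=" + rule.src_root + "^" + rule.tgt_root] = 1
        f["rule=" + rule.src_tree + "->" + rule.tgt_tree] = 1  # Identity
        f["rule_count"] = 1                                    # Rule count
        f["target_terminals"] = len(rule.tgt_yield)            # Word count
        f["source_terminals"] = num_source_terminals
        f["length_diff"] = len(rule.src_frontier) - len(rule.tgt_frontier)
        if len(rule.tgt_frontier) < len(rule.src_frontier):    # Length
            f["target_shorter"] = 1
        return f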
The features listed above are defined for all the rules in the grammar. This includes the copy and delete rules, as described in Section 3.2, which were added to address the problem of unseen words or productions in the source trees at test time. Many of these rules cannot be applied to the training set, but will receive some weight because they share features with rules that can be used in training. However, in training the model learns to disprefer these coverage rules as they are unnecessary to model the training set, which can be described perfectly using the extracted transduction rules. Our dual use of the training set for grammar extraction and parameter estimation results in a bias against the coverage rules. The bias could be addressed by extracting the grammar from a separate corpus, in which case the coverage rules would then be useful in modeling both the training set and the testing sets. However, this solution has its own problems, namely that many of the target trees in the training set may no longer be reachable. This bias and its possible solutions form an interesting research problem and deserve further work.
Rule: ⟨NP, NNS⟩ → ⟨[NP CD ADJP [NNS activists]], [NNS activists]⟩

Type         type = training set                                       1
Root         X = NP                                                    1
Root         Y = NNS                                                   1
Root         X = NP ∧ Y = NNS                                          1
Identity     α = [NP CD ADJP [NNS activists]]                          1
Identity     γ = [NNS activists]                                       1
Identity     α = [NP CD ADJP [NNS activists]] ∧ γ = [NNS activists]    1
UnlexId.     unlex. α = [NP CD ADJP NNS]                               1
UnlexId.     unlex. γ = NNS                                            1
UnlexId.     unlex. α = [NP CD ADJP NNS] ∧ γ = NNS                     1
Rule count                                                             1
Word count   target terminals                                          1
Word count   source terminals                                          1
Yield        source = [activists] ∧ target = [activists]               1
Yield        terminal activists in both source and target              1
Yield        non-terms. source = [CD, ADJP, NNS] ∧ target = [NNS]      1
Yield        non-terminal CD in source and not target                  1
Yield        non-terminal ADJP in source and not target                1
Yield        non-terminal NNS in both source and target                1
Length       difference in length                                      2
Length       target shorter                                            1

Table 2: Features instantiated for the synchronous rule shown above. Only features with non-zero values are displayed. The number of source terminals is calculated using the source tree at the time the rule is applied.
5. Experimental Set-up

In this section we present our experimental set-up for assessing the performance of the sentence compression model described above. We give details of the corpora used, briefly introduce McDonald's (2006) model used for comparison with our approach, and explain how system output was evaluated.
5.1 Corpora

We evaluated our system on three publicly available corpora. The first is the Ziff-Davis corpus, a popular choice in the sentence compression literature. The corpus originates from a collection of news articles on computer products. It was created automatically by matching sentences that occur in an article with sentences that occur in an abstract (Knight & Marcu, 2002). The other two corpora11 were created manually; annotators were asked to produce target compressions by deleting extraneous words from the source without changing the word order (Clarke & Lapata, 2008). One corpus was sampled from written sources,
the British National Corpus (BNC) and the American News Text corpus, whereas the other was created from manually transcribed broadcast news stories. We will henceforth refer to these two corpora as CLwritten and CLspoken, respectively. The sizes of these three corpora are shown in Table 3.

Corpus       Articles   Sentences   Training   Development   Testing
CLspoken         50        1370        882          78          410
CLwritten        82        1433        908          63          462
Ziff-Davis       —         1084       1020          32           32

Table 3: Sizes of the various corpora, measured in articles or sentence pairs. The data split into training, development and testing sets is measured in sentence pairs.

11. Available from http://homepages.inf.ed.ac.uk/s0460084/data/.
These three corpora pose different challenges to a hypothetical sentence compression system. Firstly, they are representative of different domains and text genres. Secondly, they have different compression requirements. The Ziff-Davis corpus is more aggressively compressed in comparison to CLspoken and CLwritten (Clarke & Lapata, 2008). As CLspoken is a speech corpus, it often contains incomplete and ungrammatical utterances and speech artefacts such as disfluencies, false starts and hesitations. Its utterances have varying lengths; some are very wordy whereas others cannot be reduced any further. This means that a compression system should leave some sentences uncompressed. Finally, we should note that CLwritten has on average longer sentences than Ziff-Davis or CLspoken. Parsers are more likely to make mistakes on long sentences, which could potentially be problematic for syntax-based systems like the one presented here.
Although our model is capable of performing any editing operation, such as reordering or substitution, it will not learn to do so from the training corpora. These corpora contain only deletions, and therefore the model will not learn transduction rules encoding, e.g., reordering. Instead the rules encode only the deletion and insertion of terminals and the restructuring of internal nodes of the syntax tree. However, the model is capable of general text rewriting, and given the appropriate training set will learn to perform these additional edits. This is demonstrated by our recent results from adapting the model to abstractive compression (Cohn & Lapata, 2008), where any edit is permitted, not just deletion.
Our experiments on CLspoken and CLwritten followed Clarke and Lapata's (2008) partition of training, test, and development sets. The partition sizes are shown in Table 3. In the case of the Ziff-Davis corpus, Knight and Marcu (2002) had not defined a development set. Therefore we randomly selected (and held out) 32 sentence pairs from their training set to form our development set.
5.2 Comparison with State-of-the-Art

We evaluated our results against McDonald's (2006) discriminative model. In this approach, sentence compression is formalized as a classification task: pairs of words from the source sentence are classified as being adjacent or not in the target compression. Let x = x_1, . . . , x_N denote a source sentence with a target compression y = y_1, . . . , y_M, where each y_i occurs in x. The function L(y_i) ∈ {1 . . . N} maps word y_i in the target to the index of the word in
the source (subject to the constraint that L(y_i) < L(y_{i+1})). McDonald defines the score of a compression y for a sentence x as the dot product between a high dimensional feature representation, f, over bigrams and a corresponding weight vector, w:

    score(x, y; w) = \sum_{j=2}^{M} \langle w, f(x, L(y_{j-1}), L(y_j)) \rangle    (15)

Decoding in this framework amounts to finding the combination of bigrams that maximizes the scoring function in (15). The maximization is solved using a semi-Markov Viterbi algorithm (McDonald, 2006).
The model parameters are estimated using the Margin Infused Relaxed Algorithm (MIRA; Crammer & Singer, 2003), a discriminative large-margin online learning technique. McDonald (2006) uses a similar loss function to our Hamming loss (see (12)) but without an explicit length penalty. This loss function counts the number of words falsely retained or dropped in the predicted target relative to the reference. McDonald employs a rich feature set defined over words, parts of speech, phrase structure trees, and dependencies. These are gathered over adjacent words in the compression and the words which were dropped.
Clarke and Lapata (2008) reformulate McDonald's (2006) model in the context of integer linear programming (ILP) and augment it with constraints ensuring that the compressed output is grammatically and semantically well formed. For example, if the source sentence has negation, this must be included in the compression; if the source verb has a subject, this must also be retained in the compression. They generate and solve an ILP for every source sentence using the branch-and-bound algorithm. Since they obtain performance improvements over McDonald's model on several corpora, we also use it for comparison against our model.
To summarize, we believe that McDonald's (2006) model is a good basis for comparison for several reasons. First, it has good performance and can be treated as a state-of-the-art model. Secondly, it is similar to our model in many respects, namely its training algorithm and feature space, but differs in one very important respect: compression is performed on strings and not trees. McDonald's system does make use of syntax trees, but only peripherally via the feature set. In contrast, the syntax tree is an integral part of our model.
5.3 Evaluation

In line with previous work we assessed our model's output by eliciting human judgments. Following Knight and Marcu (2002), we conducted two separate experiments. In the first experiment participants were presented with a source sentence and its target compression and asked to rate how well the compression preserved the most important information from the source sentence. In the second experiment, they were asked to rate the grammaticality of the compressed outputs. In both cases they used a five point rating scale where a high number indicates better performance. We randomly selected 20 sentences from the test portion of each corpus. These sentences were compressed automatically by our system and McDonald's (2006) system. We also included gold standard compressions. Our materials thus consisted of 180 (20 × 3 × 3) source-target sentences. A Latin square design ensured that subjects did not see two different compressions of the same sentence. We collected
ratings from 30 unpaid volunteers, all self-reported native English speakers. Both studies were conducted over the Internet using WebExp,12 a software package for running Internet-based experiments.
We also report results using F1 computed over grammatical relations (Riezler et al., 2003). Although F1 conflates grammaticality and importance into a single score, it nevertheless has been shown to correlate reliably with human judgments (Clarke & Lapata, 2006). Furthermore, it can be usefully employed during development for feature engineering and parameter optimization experiments. We measured F1 over directed and labeled dependency relations. For all models the compressed output was parsed using the RASP dependency parser (Briscoe & Carroll, 2002). Note that we could extract dependencies directly from the output of our model since it generates trees in addition to strings. However, we refrained from doing this in order to compare all models on an equal footing.
6. Results

The framework presented in Section 3 is quite flexible. Depending on the grammar extraction strategy, choice of features, and loss function, different classes of models can be derived. Before presenting our results on the test set we discuss the specific model employed in our experiments and explain how its parameters were instantiated.
6.1 Model Selection

All our parameter tuning and model selection experiments were conducted on the development set of the CLspoken corpus. We obtained syntactic analyses for source and target sentences with Bikel's (2002) parser. The corpus was automatically aligned using an algorithm which finds the set of deletions which transform the source into the target. This is equivalent to the minimum edit distance script when only deletion operations are permitted.
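Because a valid target is a subsequence of the source in this setting, the alignment reduces to a left-to-right scan. The sketch below (a simplification, ignoring tokenisation details) returns the source index of each target token, or None if the pair is not a pure deletion pair.

    def align_deletions(source_tokens, target_tokens):
        alignment, i = [], 0
        for tok in target_tokens:
            while i < len(source_tokens) and source_tokens[i] != tok:
                i += 1                    # this source word was deleted
            if i == len(source_tokens):
                return None               # target is not a subsequence
            alignment.append(i)
            i += 1
        return alignment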
As expected, the predicted parse trees contained a number of errors, although we did not have gold standard trees with which to quantify this error or its effect on prediction output. We did notice, however, that errors in the source trees in the test set did not always negatively affect the performance of the model. In many instances the model was able to recover from these errors and still produce good output compressions. Of these recoveries, most cases involved either deleting the erroneous structure or entirely preserving it. While this often resulted in a poor output tree, the string yield was acceptable in most cases. Less commonly, the model corrected the errors in the source using tree transformation rules. These rules were acquired from the training set where there were errors in the source tree but not in the target tree. For example, one transformation allows a prepositional phrase to be moved from a high VP attachment to an object NP attachment.
We obtained a synchronous tree substitution grammar from the CLspoken corpus using the method described in Section 3.2. We extracted all maximally general synchronous rules. These were complemented with specified rules allowing recursion up to one ancestor for any given node.13 Grammar rules were represented by the features described in Section 4. An important parameter in our modeling framework is the choice of loss function.

12. See http://www.webexp.info/.
13. Rules were pruned so as to have no more than 5 variables and 15 nodes.
Losses              Rating   Std. dev
Hamming (tokens)     3.38      1.05
Hamming (ngram)      3.28      1.13
Hamming (CFG)        3.22      0.91
Edit Distance        3.30      1.20
F1                   3.15      1.13
Reference            4.28      0.70

Table 4: Mean ratings on system output (CLspoken development set) while using different loss functions.
We evaluated the loss functions presented in Section 3.6 as follows. We performed a grid search for the hyper-parameters (a regularization parameter and a feature scaling parameter, which balances the magnitude of the feature vectors with the scale of the loss function)14 which minimized the relevant loss on the development set, and used the corresponding system output. The gold standard derivation was selected using the maximum number of rules heuristic, as described in Section 3.5. The beam was limited to 100 unique items or 200 items in total. The grammar was filtered to allow no more than 50 target elementary trees for every source elementary tree.
We next asked two human judges to rate on a scale of 1 to 5 the system's compressions when optimized for the different loss functions. To get an idea of the quality of the output we also included human-authored reference compressions. Sentences given high numbers were both grammatical and preserved the most important information. The mean ratings are shown in Table 4. As can be seen, the differences among the losses are not very large, and the standard deviation is high. The Hamming loss over tokens performed best with a mean rating of 3.38, closely followed by the edit distance (3.30). We chose the former over the latter as it is less coarsely approximated during search. All subsequent experiments report results using the token-based Hamming loss.
We also wanted to investigate how the synchronous grammar influences performance. The default system described above used general rules together with specialized rules where the recursion depth was limited to one. We also experimented with a grammar that uses specialised rules with a maximum recursion depth of two and a grammar that uses solely the maximally general rules. In Table 5 we report the average compression rate, relations-based F1 and the Hamming loss over tokens for these different grammars. We see that adding the specified rules allows for better F1 (and loss) despite the fact that the search space remains the same. We observe a slight degradation in performance moving to depth 2 rules. This is probably due to the increase in spurious ambiguity affecting search quality, and also allowing greater overfitting of the training data. The number of transduction rules in the grammar also grows substantially with the increased depth, from 20,764 for the maximally general extraction technique to 33,430 and 62,116 for specified rules with depths 1 and 2, respectively.
14. We found that setting the regularization parameter C = 0.01 and the scaling parameter to 1 generally yields good performance across loss functions.
Model                      Compression rate   Relations F1   Loss
max general rules               80.79             65.04       341
depth 1-specified rules*        79.72             68.56       315
depth 2-specified rules         79.71             66.44       328

max rules*                      79.72             68.56       315
max scoring                     81.03             65.54       344

unigram LM                      76.83             59.05       336
bigram LM                       83.12             67.71       317
trigram LM*                     79.72             68.56       315

all features*                   79.72             68.56       315
only rule features              83.06             67.51       346
only token features             85.10             68.31       341

Table 5: Parameter exploration and feature ablation studies (CLspoken development set). The default system is shown with an asterisk.
The growth in grammar size is exponential in the specification depth and therefore only small values should be used.
We also inspected the rules obtained with the maximally general extraction technique to better assess how our rules differ from those obtained from a vanilla SCFG (see Knight & Marcu, 2002). Many of these rules (12%) have deeper structure and therefore would not be licensed by an SCFG. This is due to structural divergences between the source and target syntax trees in the training set. A further 13% of the rules describe a change of syntactic category (X ≠ Y), and therefore only the remaining 76% of the rules would be allowable in Knight and Marcu's transducer. The proportion of SCFG rules decreases substantially as the rule specification depth is increased.
Recall from Section 3.3 that our scoring function is defined over derivations rather than target trees or strings, and that we treat the derivation using the maximum number of rules as the gold standard derivation. As a sanity check, we also experimented with selecting the derivation with the maximum score under the model. The results in Table 5 indicate that the latter strategy is not as effective as selecting the derivation with the maximum number of rules. Again we conjecture this is due to overfitting. As the training data is used to extract the grammar, the derivations with the maximum score may consist of rules with rare features which model the data well but do not generalize to unseen instances.
Finally, we conducted a feature ablation study to assess which features are more useful to our task. We were particularly interested to see if the ngram features would bring any benefit, especially since they increase computational complexity during decoding and training. We experimented with a unigram, bigram, and trigram language model. Note that the unigram language model is not as computationally expensive as the other two models because there is no need to record ngram contexts in the chart. As shown in Table 5, the unigram language model is substantially worse than the bigram and trigram, which deliver similar performances. We also examined the impact of the other features by grouping them into two broad classes, those defined over rules and those defined over tokens. Our aim was to see whether the underlying grammar (represented by rule-based features) contributes
to better compression output. The results in Table 5 reveal that the two feature groups perform comparably. However, the model using only token-based features tends to compress less. These features are highly lexicalized, and the model is not able to generalize well on unseen data. In conclusion, the full feature set does better on all counts than the two ablation sets, with a better compression rate.
The results reported have all been measured over string output. This was done by first stripping the tree structure from the compression output, reparsing, extracting dependency relations and finally comparing to the dependency relations in the reference. However, we may wish to measure the quality of the trees themselves, not just their string yield. A simple way to measure this15 would be to extract dependency relations directly from the phrase-structure tree output.16 Compared to dependencies extracted from the predicted parses using Bikel's (2002) parser on the output string, we observe that the relation F1 score increases uniformly for all tasks, by between 2.50% and 4.15% absolute. Therefore the system's tree output better encodes the syntactic dependencies than the tree resulting from re-parsing the string output. If the system is part of an NLP pipeline, and its output is destined for down-stream processing, then having an accurate syntax tree is extremely important. This is also true for related tasks where the desired output is a tree, e.g., semantic parsing.
7. Model Comparison

In this section we present our results on the test set using the best performing model from the previous section. This model uses a grammar with unlexicalized and lexicalized rules (recursion depth 1), a Hamming loss based on tokens, and all the features from Section 4. The model was trained separately on each corpus (training portion). We first discuss our results using relations F1 and then move on to the human study.
Table 6 illustrates the performance of our model (Transducer1) on CLspoken, CLwritten, and Ziff-Davis. We also report results on the same corpora using McDonald's (2006) model (McDonald) and the improved version (Clarke ILP) put forward by Clarke and Lapata (2008). We also present the compression rate for each system and the reference gold standard. In all cases our tree transducer model outperforms McDonald's original model and the improved ILP-based version.
Nevertheless, it may be argued that our model has an unfair advantage here since it tends to compress less than the other models, and is therefore less likely to make many mistakes. To ensure that this is not the case, we created a version of our model with a c