
Discriminative Modeling of Extraction Sets for Machine Translation

John DeNero and Dan Klein
Computer Science Division
University of California, Berkeley
{denero,klein}@cs.berkeley.edu

Abstract

We present a discriminative model that directly predicts which set of phrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principal advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.

1 Introduction

In the last decade, the field of statistical machine translation has shifted from generating sentences word by word to systems that recycle whole fragments of training examples, expressed as translation rules. This general paradigm was first pursued using contiguous phrases (Och et al., 1999; Koehn et al., 2003), and has since been generalized to a wide variety of hierarchical and syntactic formalisms. The training stage of statistical systems focuses primarily on discovering translation rules in parallel corpora.

Most systems discover translation rules via a two-stage pipeline: a parallel corpus is aligned at the word level, and then a second procedure extracts fragment-level rules from word-aligned sentence pairs. This paper offers a model-based alternative to phrasal rule extraction, which merges this two-stage pipeline into a single step. We present a discriminative model that directly predicts which set of phrasal translation rules should be extracted from a sentence pair. Our model predicts extraction sets: combinatorial objects that include the set of all overlapping phrasal translation rules consistent with an underlying word-level alignment. This approach provides additional discriminative power relative to word aligners because extraction sets are scored based on the phrasal rules they contain in addition to word-to-word alignment links. Moreover, the structure of our model directly reflects the purpose of alignment models in general, which is to discover translation rules.

We address several challenges to training and applying an extraction set model. First, we would like to leverage existing word-level alignment resources. To do so, we define a deterministic mapping from word alignments to extraction sets, inspired by existing extraction procedures. In our mapping, possible alignment links have a precise interpretation that dictates what phrasal translation rules can be extracted from a sentence pair. This mapping allows us to train with existing annotated data sets and use the predictions from word-level aligners as features in our extraction set model.

Second, our model solves a structured prediction problem, and the choice of loss function during training affects model performance. We optimize for a phrase-level F-measure in order to focus learning on the task of predicting phrasal rules rather than word alignment links.

Third, our discriminative approach requires that we perform inference in the space of extraction sets. Our model does not factor over disjoint word-to-word links or minimal phrase pairs, and so existing inference procedures do not directly apply. However, we show that the dynamic program for a block ITG aligner can be augmented to score extraction sets that are indexed by underlying ITG word alignments (Wu, 1997). We also describe a coarse-to-fine inference approach that allows us to scale our method to long sentences.


Figure 1: A word alignment A (shaded grid cells) defines projections σ(ei) and σ(fj), shown as dotted lines for each word in each sentence. The extraction set R3(A) includes all bispans licensed by these projections, shown as rounded rectangles.


Our extraction set model outperforms both unsupervised and supervised word aligners at predicting word alignments and extraction sets. We also demonstrate that extraction sets are useful for end-to-end machine translation. Our model improves translation quality relative to state-of-the-art Chinese-to-English baselines across two publicly available systems, providing total BLEU improvements of 1.2 in Moses, a phrase-based system, and 1.4 in Joshua, a hierarchical system (Koehn et al., 2007; Li et al., 2009).

2 Extraction Set Models

The input to our model is an unaligned sentence pair, and the output is an extraction set of phrasal translation rules. Word-level alignments are generated as a byproduct of inference. We first specify the relationship between word alignments and extraction sets, then define our model.

2.1 Extraction Sets from Word Alignments

Rule extraction is a standard concept in machine translation: word alignment constellations license particular sets of overlapping rules, from which subsets are selected according to limits on phrase length (Koehn et al., 2003), number of gaps (Chiang, 2007), count of internal tree nodes (Galley et al., 2006), etc. In this paper, we focus on phrasal rule extraction (i.e., phrase pair extraction), upon which most other extraction procedures are based.

Given a sentence pair (e, f), phrasal rule extraction defines a mapping from a set of word-to-word alignment links A = {(i, j)} to an extraction set of bispans Rn(A) = {[g, h) ⇔ [k, ℓ)}, where each bispan links target span [g, h) to source span [k, ℓ).¹ The maximum phrase length n ensures that max(h − g, ℓ − k) ≤ n.

¹ We use the fencepost indexing scheme used commonly for parsing. Words are 0-indexed. Spans are inclusive on the lower bound and exclusive on the upper bound. For example, the span [0, 2) includes the first two words of a sentence.

Figure 2: Examples of two types of possible alignment links (striped). These types account for 96% of the possible alignment links in our data set.


We can describe this mapping via word-to-phrase projections, as illustrated in Figure 1. Let word ei project to the phrasal span σ(ei), where

    σ(ei) = [ min_{j ∈ Ji} j , max_{j ∈ Ji} j + 1 )        (1)
    Ji = {j : (i, j) ∈ A}

and likewise each word fj projects to a span of e. Then, Rn(A) includes a bispan [g, h) ⇔ [k, ℓ) iff

    σ(ei) ⊆ [k, ℓ)  ∀ i ∈ [g, h)
    σ(fj) ⊆ [g, h)  ∀ j ∈ [k, ℓ)

That is, every word in one of the phrasal spans must project within the other. This mapping is deterministic, and so we can interpret a word-level alignment A as also specifying the phrasal rules that should be extracted from a sentence pair.

2.2 Possible and Null Alignment Links

We have not yet accounted for two special cases in annotated corpora: possible alignments and null alignments. To analyze these annotations, we consider a particular data set: a hand-aligned portion of the NIST MT02 Chinese-to-English test set, which has been used in previous alignment experiments (Ayan et al., 2005; DeNero and Klein, 2007; Haghighi et al., 2009).

Possible links account for 22% of all alignment links in these data, and we found that most of these links fall into two categories. First, possible links are used to align function words that have no equivalent in the other language, but colocate with aligned content words, such as English determiners. Second, they are used to mark pairs of words or short phrases that are not lexical equivalents, but which play equivalent roles in each sentence. Figure 2 shows examples of these two use cases, along with their corpus frequencies.²

On the other hand, null alignments are used sparingly in our annotated data. More than 90% of words participate in some alignment link. The unaligned words typically express content in one sentence that is absent in its translation.

Figure 3 illustrates how we interpret possible and null links in our projection. Possible links are typically not included in extraction procedures because most aligners predict only sure links. However, we see a natural interpretation for possible links in rule extraction: they license phrasal rules that both include and exclude them. We exclude null alignments from extracted phrases because they often indicate a mismatch in content.

We achieve these effects by redefining the projection operator σ. Let A(s) be the subset of A that are sure links, then let the index set Ji used for projection σ in Equation 1 be

    Ji = {j : (i, j) ∈ A(s)}    if ∃ j : (i, j) ∈ A(s)
    Ji = {−1, |f|}              if ∄ j : (i, j) ∈ A
    Ji = {j : (i, j) ∈ A}       otherwise

Here, Ji is a set of integers, and σ(ei) for null-aligned ei will be [−1, |f| + 1) by Equation 1.
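For concreteness, here is a small sketch of the redefined index set and projection (our notation; `sure` and `possible` are sets of (i, j) links and f_len = |f|). Sure links take precedence over possible links, and a word with no links at all projects outside the sentence, which excludes it from every extracted phrase.

```python
def J(i, sure, possible, f_len):
    """Index set J_i for word e_i under the sure/possible/null interpretation."""
    sure_j = {j for (ii, j) in sure if ii == i}
    if sure_j:
        return sure_j                 # sure links override possible links
    poss_j = {j for (ii, j) in possible if ii == i}
    if not poss_j:
        return {-1, f_len}            # null-aligned: project out of bounds
    return poss_j                     # only possible links: use them all

def sigma(i, sure, possible, f_len):
    """Projection of e_i; a null-aligned word gets [-1, f_len + 1) by Equation 1."""
    Ji = J(i, sure, possible, f_len)
    return (min(Ji), max(Ji) + 1)
```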

Of course, the characteristics of our aligned corpus may not hold for other annotated corpora or other language pairs. However, we hope that the overall effectiveness of our modeling approach will influence future annotation efforts to build corpora that are consistent with this interpretation.

² We collected corpus frequencies of possible alignment link types ourselves on a sample of the hand-aligned data set.

Figure 3: Possible links constrain the word-to-phrase projection of otherwise unaligned words, which in turn license overlapping phrases. In this example, σ(f2) = [1, 2) does not include the possible link at (1, 0) because of the sure link at (1, 1), but σ(e1) = [1, 2) does use the possible link because it would otherwise be unaligned. The word "PDT" is null aligned, and so its projection σ(e4) = [−1, 4) extends beyond the bounds of the sentence, excluding "PDT" from all phrase pairs.

2.3 A Linear Model of Extraction Sets

We now define a linear model that scores extraction sets. We restrict our model to score only coherent extraction sets Rn(A), those that are licensed by an underlying word alignment A with sure alignments A(s) ⊆ A. Conditioned on a sentence pair (e, f) and maximum phrase length n, we score extraction sets via a feature vector φ(A(s), Rn(A)) that includes features on sure links (i, j) ∈ A(s) and features on the bispans in Rn(A) that link [g, h) in e to [k, ℓ) in f:

    φ(A(s), Rn(A)) = Σ_{(i,j) ∈ A(s)} φa(i, j) + Σ_{[g,h) ⇔ [k,ℓ) ∈ Rn(A)} φb(g, h, k, ℓ)

Because the projection operator Rn(·) is a deterministic function, we can abbreviate φ(A(s), Rn(A)) as φ(A) without loss of information, although we emphasize that A is a set of sure and possible alignments, and φ(A) does not decompose as a sum of vectors on individual word-level alignment links. Our model is parameterized by a weight vector θ, which scores an extraction set Rn(A) as θ · φ(A).
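Read operationally, the score is a sum of sparse feature vectors over sure links and over every bispan in the extraction set. A sketch (ours; phi_a and phi_b stand in for the feature functions of Section 3.2 and are assumed to return sparse dicts):

```python
def model_score(theta, sure_links, bispans, phi_a, phi_b):
    """theta . phi(A): sure-link features plus features on all licensed bispans."""
    total = 0.0
    for (i, j) in sure_links:                     # word-level alignment features
        for name, value in phi_a(i, j).items():
            total += theta.get(name, 0.0) * value
    for ((g, h), (k, l)) in bispans:              # overlapping phrasal-rule features
        for name, value in phi_b(g, h, k, l).items():
            total += theta.get(name, 0.0) * value
    return total
```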

To further limit the space of extraction sets we are willing to consider, we restrict A to block inversion transduction grammar (ITG) alignments, a space that allows many-to-many alignments through phrasal terminal productions, but otherwise enforces at-most-one-to-one phrase matchings with ITG reordering patterns (Cherry and Lin, 2007; Zhang et al., 2008). The ITG constraint is more computationally convenient than arbitrarily ordered phrase matchings (Wu, 1997; DeNero and Klein, 2008). However, the space of block ITG alignments is expressive enough to include the vast majority of patterns observed in hand-annotated parallel corpora (Haghighi et al., 2009).


Figure 4: Above, we show a representative subset of the block alignment patterns that serve as terminal productions of the ITG that restricts the output space of our model. These terminal productions cover up to n = 3 words in each sentence and include a mixture of sure (filled) and possible (striped) word-level alignment links.


In summary, our model scores all Rn(A) for A ∈ ITG(e, f), where A can include block terminals of size up to n. In our experiments, n = 3. Unlike previous work, we allow possible alignment links to appear in the block terminals, as depicted in Figure 4.

3 Model Estimation

We estimate the weights θ of our extraction set model discriminatively using the margin-infused relaxed algorithm (MIRA) of Crammer and Singer (2003)—a large-margin, perceptron-style, online learning algorithm. MIRA has been used successfully in MT to estimate both alignment models (Haghighi et al., 2009) and translation models (Chiang et al., 2008).

For each training example, MIRA requires that we find the alignment Am corresponding to the highest scoring extraction set Rn(Am) under the current model,

    Am = argmax_{A ∈ ITG(e,f)} θ · φ(A)        (2)

Section 4 describes our approach to solving this search problem for model inference.

MIRA updates away from Rn(Am) and toward a gold extraction set Rn(Ag). Some hand-annotated alignments are outside of the block ITG model class. Hence, we update toward the extraction set for a pseudo-gold alignment Ag ∈ ITG(e, f) with minimal distance from the true reference alignment At.

    Ag = argmin_{A ∈ ITG(e,f)} |A ∪ At − A ∩ At|        (3)

Inference details appear in Section 4.3.

Given Ag and Am, we update the model parameters away from Am and toward Ag:

    θ ← θ + τ · (φ(Ag) − φ(Am))

where τ is the minimal step size that will ensure we prefer Ag to Am by a margin greater than the loss L(Am; Ag), capped at some maximum update size C to provide regularization. We use C = 0.01 in experiments. The step size is a closed-form function of the loss and feature vectors:

    τ = min( C, [L(Am; Ag) − θ · (φ(Ag) − φ(Am))] / ||φ(Ag) − φ(Am)||₂² )

We train the model for 30 iterations over the training set, shuffling the order each time, and we average the weight vectors observed after each iteration to estimate our final model.
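The update itself is a small closed form once the two feature vectors and the loss are computed. A sketch with sparse dict feature vectors (an illustration under those assumptions, not the released training code):

```python
def mira_update(theta, phi_gold, phi_model, loss, C=0.01):
    """One MIRA step toward phi(A_g) and away from phi(A_m)."""
    keys = set(phi_gold) | set(phi_model)
    diff = {k: phi_gold.get(k, 0.0) - phi_model.get(k, 0.0) for k in keys}
    margin = sum(theta.get(k, 0.0) * v for k, v in diff.items())
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return theta                  # identical feature vectors: nothing to update
    tau = max(0.0, min(C, (loss - margin) / norm_sq))
    for k, v in diff.items():
        theta[k] = theta.get(k, 0.0) + tau * v
    return theta
```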

3.1 Extraction Set Loss Function

In order to focus learning on predicting the right bispans, we use an extraction-level loss L(Am; Ag): an F-measure of the overlap between bispans in Rn(Am) and Rn(Ag). This measure has been proposed previously to evaluate alignment systems (Ayan and Dorr, 2006). Based on preliminary translation results during development, we chose bispan F5 as our loss:

    Pr(Am) = |Rn(Am) ∩ Rn(Ag)| / |Rn(Am)|
    Rc(Am) = |Rn(Am) ∩ Rn(Ag)| / |Rn(Ag)|
    F5(Am; Ag) = (1 + 5²) · Pr(Am) · Rc(Am) / (5² · Pr(Am) + Rc(Am))
    L(Am; Ag) = 1 − F5(Am; Ag)

F5 favors recall over precision. Previous alignment work has shown improvements from adjusting the F-measure parameter (Fraser and Marcu, 2006). In particular, Lacoste-Julien et al. (2006) also chose a recall-biased objective.

Optimizing for a bispan F-measure penalizes alignment mistakes in proportion to their rule extraction consequences. That is, adding a word link that prevents the extraction of many correct phrasal rules, or which licenses many incorrect rules, is strongly discouraged by this loss.
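The loss is straightforward to compute from the two extraction sets. A sketch, with bispans represented as ((g, h), (k, l)) pairs:

```python
def bispan_f_loss(model_bispans, gold_bispans, beta=5.0):
    """L(A_m; A_g) = 1 - F_beta over bispans; beta = 5 weights recall heavily."""
    overlap = len(model_bispans & gold_bispans)
    if overlap == 0:
        return 1.0
    precision = overlap / len(model_bispans)
    recall = overlap / len(gold_bispans)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return 1.0 - f_beta
```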


3.2 Features on Extraction Sets

The discriminative power of our model is driven by the features on sure word alignment links φa(i, j) and bispans φb(g, h, k, ℓ). In both cases, the most important features come from the predictions of unsupervised models trained on large parallel corpora, which provide frequency and co-occurrence information.

To score word-to-word links, we use the posterior predictions of a jointly trained HMM alignment model (Liang et al., 2006). The remaining features include a dictionary feature, an identical word feature, an absolute position distortion feature, and features for numbers and punctuation.

To score phrasal translation rules in an extraction set, we use a mixture of feature types. Extraction set models allow us to incorporate the same phrasal relative frequency statistics that drive phrase-based translation performance (Koehn et al., 2003). To implement these frequency features, we extract a phrase table from the alignment predictions of a jointly trained unsupervised HMM model using Moses (Koehn et al., 2007), and score bispans using the resulting features. We also include indicator features on lexical templates for the 50 most common words in each language, as in Haghighi et al. (2009). We include indicators for the number of words and Chinese characters in rules. One useful indicator feature exploits the fact that capitalized terms in English tend to align to Chinese words with three or more characters. On 1-by-n or n-by-1 phrasal rules, we include indicator features of fertility for common words.³

We also include monolingual phrase features that expose useful information to the model. For instance, English bigrams beginning with "the" are often extractable phrases. English trigrams with a hyphen as the second word are typically extractable, meaning that the first and third words align to consecutive Chinese words. When any conjugation of the word "to be" is followed by a verb, indicating passive voice or progressive tense, the two words tend to align together.

Our feature set also includes bias features on phrasal rules and links, which control the number of null-aligned words and number of rules licensed. In total, our final model includes 4,249 individual features, dominated by various instantiations of lexical templates.

³ Limiting lexicalized features to common words helps prevent overfitting.

Figure 5: Both possible ITG decompositions of this example alignment will split one of the two highlighted bispans across constituents.

4 Model Inference

Equation 2 asks for the highest scoring extraction set under our model, Rn(Am), which we also require at test time. Although we have restricted Am ∈ ITG(e, f), our extraction set model does not factor over ITG productions, and so the dynamic program for a vanilla block ITG will not suffice to find Rn(Am). To see this, consider the extraction set in Figure 5. An ITG decomposition of the underlying alignment imposes a hierarchical bracketing on each sentence, and some bispan in the extraction set for this alignment will cross any such bracketing. Hence, the score of some licensed bispan will be non-local to the ITG decomposition.

4.1 A Dynamic Program for Extraction Sets

If we treat the maximum phrase length n as a fixed constant, then we can define a dynamic program to search the space of extraction sets. An ITG derivation for some alignment A decomposes into two sub-derivations for AL and AR.⁴ The model score of A, which scores extraction set Rn(A), decomposes over AL and AR, along with any phrasal bispans licensed by adjoining AL and AR:

    θ · φ(A) = θ · φ(AL) + θ · φ(AR) + I(AL, AR)

where I(AL, AR) is θ · Σ φb(g, h, k, ℓ) summed over licensed bispans [g, h) ⇔ [k, ℓ) that overlap the boundary between AL and AR.⁵

⁴ We abuse notation in conflating an alignment A with its derivation. All derivations of the same alignment receive the same score, and we only compute the max, not the sum.

⁵ We focus on the case of adjoining two aligned bispans. Our algorithm easily extends to include null alignments, but we focus on the non-null setting for simplicity.


Figure 6: Augmenting the ITG grammar states with the alignment configuration in an n − 1 deep perimeter of the bispan allows us to score all overlapping phrasal rules introduced by adjoining two bispans. The state must encode whether a sure link appears in each edge column or row, but the specific location of edge links is not required.

In order to compute I(AL, AR), we need certain information about the alignment configurations of AL and AR where they adjoin at a corner. The state must represent (a) the specific alignment links in the n − 1 deep corner of each A, and (b) whether any sure alignments appear in the rows or columns extending from those corners.⁶ With this information, we can infer the bispans licensed by adjoining AL and AR, as in Figure 6.

⁶ The number of configuration states does not depend on the size of A because corners have fixed size, and because the position of links within rows or columns is not needed.

Applying our score recurrence yields a polynomial-time dynamic program. This dynamic program is an instance of ITG bitext parsing, where the grammar uses symbols to encode the alignment contexts described above. This context-as-symbol augmentation of the grammar is similar in character to augmenting symbols with lexical items to score language models during hierarchical decoding (Chiang, 2007).
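To make the recurrence concrete, the sketch below combines two scored sub-derivations. For clarity it identifies the straddling bispans from full extraction sets; the actual dynamic program instead recovers them from the (n - 1)-deep corner configurations carried in the grammar symbols, so this is our simplification rather than the parser itself. theta and phi_b are the weight vector and bispan feature function assumed in the earlier sketches.

```python
def combine_scores(score_left, score_right, bispans_left, bispans_right,
                   bispans_combined, theta, phi_b):
    """theta.phi(A) = theta.phi(A_L) + theta.phi(A_R) + I(A_L, A_R)."""
    # Bispans licensed by the combination but local to neither child.
    straddling = bispans_combined - bispans_left - bispans_right
    interaction = sum(theta.get(name, 0.0) * value
                      for ((g, h), (k, l)) in straddling
                      for name, value in phi_b(g, h, k, l).items())
    return score_left + score_right + interaction
```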

4.2 Coarse-to-Fine Inference and Pruning

Exhaustive inference under an ITG requires O(k⁶) time in sentence length k, and is prohibitively slow when there is no sparsity in the grammar. Maintaining the context necessary to score non-local bispans further increases running time. That is, ITG inference is organized around search states associated with a grammar symbol and a bispan; augmenting grammar symbols also augments this state space.

To parse quickly, we prune away search states using predictions from the more efficient HMM alignment model (Ney and Vogel, 1996). We discard all states corresponding to bispans that are incompatible with 3 or more alignment links under an intersected HMM—a proven approach to pruning the space of ITG alignments (Zhang and Gildea, 2006; Haghighi et al., 2009). Pruning in this way reduces the search space dramatically, but only rarely prohibits correct alignments. The oracle alignment error rate for the block ITG model class is 1.4%; the oracle alignment error rate for this pruned subset of ITG is 2.0%.
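A sketch of this pruning test, under our reading that a bispan is incompatible with an HMM link when exactly one of the link's endpoints falls inside it:

```python
def keep_bispan(g, h, k, l, hmm_links, max_conflicts=3):
    """Discard search states whose bispan conflicts with 3 or more HMM links."""
    conflicts = 0
    for (i, j) in hmm_links:
        if (g <= i < h) != (k <= j < l):   # link straddles the bispan boundary
            conflicts += 1
            if conflicts >= max_conflicts:
                return False
    return True
```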

To take advantage of the sparsity that results from pruning, we use an agenda-based parser that orders search states from small to large, where we define the size of a bispan as the total number of words contained within it. For each size, we maintain a separate agenda. Only when the agenda for size k is exhausted does the parser proceed to process the agenda for size k + 1.

We also employ coarse-to-fine search to speed up inference (Charniak and Caraballo, 1998). In the coarse pass, we search over the space of ITG alignments, but score only features on alignment links and bispans that are local to terminal blocks. This simplification eliminates the need to augment grammar symbols, and so we can exhaustively explore the (pruned) space. We then compute outside scores for bispans under a max-sum semiring (Goodman, 1996). In the fine pass with the full extraction set model, we impose a maximum size of 10,000 for each agenda. We order states on agendas by the sum of their inside score under the full model and the outside score computed in the coarse pass, pruning all states not within the fixed agenda beam size.

Search states that are popped off agendas are indexed by their corner locations for fast look-up when constructing new states. For each corner and size combination, built states are maintained in sorted order according to their inside score. This ordering allows us to stop combining states early when the results are falling off the agenda beams. Similar search and beaming strategies appear in many decoders for machine translation (Huang and Chiang, 2007; Koehn and Haddow, 2009; Moore and Quirk, 2007).

4.3 Finding Pseudo-Gold ITG Alignments

Equation 3 asks for the block ITG alignment Ag that is closest to a reference alignment At, which may not lie in ITG(e, f). We search for Ag using A* bitext parsing (Klein and Manning, 2003). Search states, which correspond to bispans [g, h) ⇔ [k, ℓ), are scored by the number of errors within the bispan plus the number of (i, j) ∈ At such that j ∈ [k, ℓ) but i ∉ [g, h) (recall errors). As an admissible heuristic for the future cost of a bispan [g, h) ⇔ [k, ℓ), we count the number of (i, j) ∈ At such that i ∈ [g, h) but j ∉ [k, ℓ), as depicted in Figure 7. These links will become recall errors eventually. A* search with this heuristic makes no errors, and the time required to compute pseudo-gold alignments is negligible.


Figure 7: A* search for pseudo-gold ITG alignments uses an admissible heuristic for bispans that counts the number of gold links outside of [k, ℓ) but within [g, h). Above, the heuristic is 1, which is also the minimal number of alignment errors that an ITG alignment will incur using this bispan.

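The heuristic of Section 4.3 amounts to a single counting pass over the gold links. A sketch of our rendering:

```python
def future_cost(g, h, k, l, gold_links):
    """Admissible heuristic: gold links with i in [g, h) but j outside [k, l)
    must eventually become recall errors for any completion of this bispan."""
    return sum(1 for (i, j) in gold_links
               if g <= i < h and not (k <= j < l))
```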

5 Relationship to Previous Work

Our model is certainly not the first alignment approach to include structures larger than words. Model-based phrase-to-phrase alignment was proposed early in the history of phrase-based translation as a method for training translation models (Marcu and Wong, 2002). A variety of unsupervised models refined this initial work with priors (DeNero et al., 2008; Blunsom et al., 2009) and inference constraints (DeNero et al., 2006; Birch et al., 2006; Cherry and Lin, 2007; Zhang et al., 2008). These models fundamentally differ from ours in that they stipulate a segmentation of the sentence pair into phrases, and only align the minimal phrases in that segmentation. Our model scores the larger overlapping phrases that result from composing these minimal phrases.

Discriminative alignment is also a well-explored area. Most work has focused on predicting word alignments via partial matching inference algorithms (Melamed, 2000; Taskar et al., 2005; Moore, 2005; Lacoste-Julien et al., 2006). Work in semi-supervised estimation has also contributed evidence that hand-annotations are useful for training alignment models (Fraser and Marcu, 2006; Fraser and Marcu, 2007). The ITG grammar formalism, the corresponding word alignment class, and inference procedures for the class have also been explored extensively (Wu, 1997; Zhang and Gildea, 2005; Cherry and Lin, 2007; Zhang et al., 2008). At the intersection of these lines of work, discriminative ITG models have also been proposed, including one-to-one alignment models (Cherry and Lin, 2006) and block models (Haghighi et al., 2009). Our model directly extends this research agenda with first-class possible links, overlapping phrasal rule features, and an extraction-level loss function.

Kaariainen (2009) trains a translation model discriminatively using features on overlapping phrase pairs. That work differs from ours in that it uses fixed word alignments and focuses on translation model estimation, while we focus on alignment and translate using standard relative frequency estimators.

Deng and Zhou (2009) present an alignment combination technique that uses phrasal features. Our approach differs in two ways. First, their approach is tightly coupled to the input alignments, while we perform a full search over the space of ITG alignments. Also, their approach uses greedy search, while our search is optimal aside from pruning and beaming. Despite these differences, their strong results reinforce our claim that phrase-level information is useful for alignment.

6 Experiments

We evaluate our extraction set model by the bispans it predicts, the word alignments it generates, and the translations generated by two end-to-end systems. Table 1 compares the five systems described below, including three baselines. All supervised aligners were optimized for bispan F5.

Unsupervised Baseline: GIZA++. We trained GIZA++ (Och and Ney, 2003) using the default parameters included with the Moses training script (Koehn et al., 2007). The designated regimen concludes by Viterbi aligning under Model 4 in both directions. We combined these alignments with the grow-diag heuristic (Koehn et al., 2003).

Unsupervised Baseline: Joint HMM. We trained and combined two HMM alignment models (Ney and Vogel, 1996) using the Berkeley Aligner.⁷ We initialized the HMM model parameters with jointly trained Model 1 parameters (Liang et al., 2006), combined word-to-word posteriors by averaging (soft union), and decoded with the competitive thresholding heuristic of DeNero and Klein (2007), yielding a state-of-the-art unsupervised baseline.

Supervised Baseline: Block ITG. We discriminatively trained a block ITG aligner with only sure links, using block terminal productions up to 3 words by 3 words in size. This supervised baseline is a reimplementation of the MIRA-trained model of Haghighi et al. (2009). We use the same features and parser implementation for this model as we do for our extraction set model to ensure a clean comparison. To remain within the alignment class, MIRA updates this model toward a pseudo-gold alignment with only sure links. This model does not score any overlapping bispans.

Extraction Set Coarse Pass. We add possible links to the output of the block ITG model by adding the mixed terminal block productions described in Section 2.3. This model scores overlapping phrasal rules contained within terminal blocks that result from including or excluding possible links. However, this model does not score bispans that cross bracketing of ITG derivations.

Full Extraction Set Model. Our full model includes possible links and features on extraction sets for phrasal bispans with a maximum size of 3. Model inference is performed using the coarse-to-fine scheme described in Section 4.2.

6.1 Data

In this paper, we focus exclusively on Chinese-to-English translation. We performed our discriminative training and alignment evaluations using a hand-aligned portion of the NIST MT02 test set, which consists of 150 training and 191 test sentences (Ayan and Dorr, 2006). We trained the baseline HMM on 11.3 million words of FBIS newswire data, a comparable dataset to those used in previous alignment evaluations on our test set (DeNero and Klein, 2007; Haghighi et al., 2009).

⁷ http://code.google.com/p/berkeleyaligner

Our end-to-end translation experiments were tuned and evaluated on sentences up to length 40 from the NIST MT04 and MT05 test sets. For these experiments, we trained on a 22.1 million word parallel corpus consisting of sentences up to length 40 of newswire data from the GALE program, subsampled from a larger data set to promote overlap with the tune and test sets. This corpus also includes a bilingual dictionary. To improve performance, we retrained our aligner on a retokenized version of the hand-annotated data to match the tokenization of our corpus.⁸ We trained a language model with Kneser-Ney smoothing on 262 million words of newswire using SRILM (Stolcke, 2002).

6.2 Word and Phrase Alignment

The first panel of Table 1 gives a word-level evaluation of all five aligners. We use the alignment error rate (AER) measure: precision is the fraction of sure links in the system output that are sure or possible in the reference, and recall is the fraction of sure links in the reference that the system outputs as sure. For this evaluation, possible links produced by our extraction set models are ignored. The full extraction set model performs the best by a small margin, although it was not tuned for word alignment.
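Concretely, with predicted sure links A and reference sure and possible link sets S and P, the first panel's scores can be computed as in this sketch (the AER combination follows the usual convention of Och and Ney (2003)):

```python
def word_alignment_scores(predicted_sure, ref_sure, ref_possible):
    """Precision, recall, and AER for sure-link predictions."""
    ref_any = ref_sure | ref_possible          # sure links also count as possible
    hit_p = len(predicted_sure & ref_any)      # predictions that are sure or possible
    hit_r = len(predicted_sure & ref_sure)     # reference sure links recovered
    precision = hit_p / len(predicted_sure)
    recall = hit_r / len(ref_sure)
    aer = 1.0 - (hit_p + hit_r) / (len(predicted_sure) + len(ref_sure))
    return precision, recall, aer
```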

The second panel gives a phrasal rule-level evaluation, which measures the degree to which these aligners matched the extraction sets of hand-annotated alignments, R3(At).⁹ To compete fairly, all models were evaluated on the full extraction sets induced by the word alignments they predicted. Again, the extraction set model outperformed the baselines, particularly on the F5 measure for which these systems were trained.

Our coarse pass extraction set model performed nearly as well as the full model. We believe these models perform similarly for two reasons. First, most of the information needed to predict an extraction set can be inferred from word links and phrasal rules contained within ITG terminal productions. Second, the coarse-to-fine inference may be constraining the full phrasal model to predict similar output to the coarse model. This similarity persists in translation experiments.

⁸ All alignment results are reported under the annotated data set's original tokenization.

⁹ While pseudo-gold approximations to the annotation were used for training, the evaluation is always performed relative to the original human annotation.


                               Word                Bispan                     BLEU
                          Pr    Rc    AER    Pr    Rc    F1    F5      Joshua  Moses
Baseline     GIZA++       72.5  71.8  27.8   69.4  45.4  54.9  46.0    33.8    32.6
models       Joint HMM    84.0  76.9  19.6   69.5  59.5  64.1  59.9    34.5    33.2
             Block ITG    83.4  83.8  16.4   75.8  62.3  68.4  62.8    34.7    33.6
Extraction   Coarse Pass  82.2  84.2  16.9   70.0  72.9  71.4  72.8    35.7    34.2
set models   Full Model   84.7  84.0  15.6   69.0  74.2  71.6  74.0    35.9    34.4

Table 1: Experimental results demonstrate that the full extraction set model outperforms supervised and unsupervised baselines in evaluations of word alignment quality, extraction set quality, and translation. In word and bispan evaluations, GIZA++ did not have access to a dictionary while all other methods did. In the BLEU evaluation, all systems used a bilingual dictionary included in the training corpus. The BLEU evaluation of supervised systems also included rule counts from the Joint HMM to compensate for parse failures.

6.3 Translation Experiments

We evaluate the alignments predicted by our model using two publicly available, open-source, state-of-the-art translation systems. Moses is a phrase-based system with lexicalized reordering (Koehn et al., 2007). Joshua (Li et al., 2009) is an implementation of Hiero (Chiang, 2007) using a suffix-array-based grammar extraction approach (Lopez, 2007).

Both of these systems take word alignments as input, and neither of these systems accepts possible links in the alignments they consume. To interface with our extraction set models, we produced three sets of sure-only alignments from our model predictions: one that omitted possible links, one that converted all possible links to sure links, and one that included each possible link with 0.5 probability. These three sets were aggregated and rules were extracted from all three.
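A sketch of this interface; the sampled variant reflects our reading of "includes each possible link with 0.5 probability":

```python
import random

def sure_only_variants(sure, possible, seed=0):
    """Three sure-only alignments derived from a prediction with possible links."""
    rng = random.Random(seed)
    sampled = set(sure) | {link for link in possible if rng.random() < 0.5}
    return [set(sure),                  # possible links omitted
            set(sure) | set(possible),  # all possible links promoted to sure
            sampled]                    # each possible link kept with probability 0.5
```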

The training set we used for MT experiments is quite heterogeneous and noisy compared to our alignment test sets, and the supervised aligners did not handle certain sentence pairs in our parallel corpus well. In some cases, pruning based on consistency with the HMM caused parse failures, which in turn caused training sentences to be skipped. To account for these issues, we added counts of phrasal rules extracted from the baseline HMM to the counts produced by supervised aligners.

In Moses, our extraction set model predicts the set of phrases extracted by the system, and so the estimation techniques for the alignment model and translation model both share a common underlying representation: extraction sets. Empirically, we observe a BLEU score improvement of 1.2 over the best unsupervised baseline and 0.8 over the block ITG supervised baseline (Papineni et al., 2002).

In Joshua, hierarchical rule extraction is based upon phrasal rule extraction, but abstracts away sub-phrases to create a grammar. Hence, the extraction sets we predict are closely linked to the representation that this system uses to translate. The extraction model again outperformed both unsupervised and supervised baselines, by 1.4 BLEU and 1.2 BLEU respectively.

7 Conclusion

Our extraction set model serves to coordinate the alignment and translation model components of a statistical translation system by unifying their representations. Moreover, our model provides an effective alternative to phrase alignment models that choose a particular phrase segmentation; instead, we predict many overlapping phrases, both large and small, that are mutually consistent. In future work, we look forward to developing extraction set models for richer formalisms, including hierarchical grammars.

Acknowledgments

This project is funded in part by BBN under DARPA contract HR0011-06-C-0022 and by the NSF under grant 0643742. We thank the anonymous reviewers for their helpful comments.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: An extensive analysis of word alignments and their impact on MT. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. Neuralign: Combining word alignments using neural networks. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.

Alexandra Birch, Chris Callison-Burch, and Miles Osborne. 2006. Constraining the phrase-based, joint probability statistical translation model. In Proceedings of the Conference for the Association for Machine Translation in the Americas.

Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Eugene Charniak and Sharon Caraballo. 1998. New figures of merit for best-first probabilistic chart parsing. In Computational Linguistics.

Colin Cherry and Dekang Lin. 2006. Soft syntactic constraints for word alignment through discriminative training. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Colin Cherry and Dekang Lin. 2007. Inversion transduction grammar for joint phrasal translation modeling. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics Workshop on Syntax and Structure in Statistical Translation.

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics.

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.

John DeNero and Dan Klein. 2007. Tailoring word alignments to syntactic machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

John DeNero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of the Annual Conference of the Association for Computational Linguistics: Short Paper Track.

John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In Proceedings of the NAACL Workshop on Statistical Machine Translation.

John DeNero, Alexandre Bouchard-Cote, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Yonggang Deng and Bowen Zhou. 2009. Optimizing word alignment combination for phrase table training. In Proceedings of the Annual Conference of the Association for Computational Linguistics: Short Paper Track.

Alexander Fraser and Daniel Marcu. 2006. Semi-supervised training for statistical word alignment. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Alexander Fraser and Daniel Marcu. 2007. Getting the structure right for word alignment: LEAF. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Joshua Goodman. 1996. Parsing algorithms and metrics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Aria Haghighi, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Matti Kaariainen. 2009. Sinuhe—statistical machine translation using a globally trained conditional exponential family translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Dan Klein and Chris Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.

Philipp Koehn and Barry Haddow. 2009. Edinburgh's submission to all tracks of the WMT2009 shared task with reordering and speed improvements to Moses. In Proceedings of the Workshop on Statistical Machine Translation.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.


Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics: Demonstration Track.

Simon Lacoste-Julien, Ben Taskar, Dan Klein, and Michael I. Jordan. 2006. Word alignment via quadratic assignment. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Workshop on Statistical Machine Translation.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Adam Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Daniel Marcu and Daniel Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

I. Dan Melamed. 2000. Models of translational equivalence among words. Computational Linguistics.

Robert Moore and Chris Quirk. 2007. Faster beam-search decoding for phrasal statistical machine translation. In Proceedings of MT Summit XI.

Robert C. Moore. 2005. A discriminative framework for bilingual word alignment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Hermann Ney and Stephan Vogel. 1996. HMM-based word alignment in statistical translation. In Proceedings of the Conference on Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29:19–51.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Andreas Stolcke. 2002. SRILM: An extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A discriminative matching approach to word alignment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377–404.

Hao Zhang and Daniel Gildea. 2005. Stochastic lexicalized inversion transduction grammar for alignment. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Hao Zhang and Daniel Gildea. 2006. Efficient search for inversion transduction grammar. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea. 2008. Bayesian learning of non-compositional phrases with synchronous parsing. In Proceedings of the Annual Conference of the Association for Computational Linguistics.