Two Aspects of the Problem of Natural Language Inference

Bill MacCartney
NLP Group, Stanford University
8 October 2008

Two talks for the price of one!

Both concern the problem of natural language inference (NLI)

1. Modeling semantic containment and exclusion in NLI
   • Presented at Coling-08, won the best paper award
   • A computational model of natural logic for NLI
   • Doesn't solve all NLI problems, but handles an interesting subset
   • Depends on alignments from other sources

2. A phrase-based model of alignment for NLI
   • To be presented at EMNLP-08
   • Addresses the problem of alignment for NLI & relates it to MT
   • Made possible by annotated data produced here at MSR

Modeling Semantic Containment and Exclusion in Natural Language Inference

Bill MacCartney and Christopher D. Manning

NLP Group, Stanford University

8 October 2008

Natural language inference (NLI)

• Aka recognizing textual entailment (RTE)

• Does premise P justify an inference to hypothesis H?
• An informal, intuitive notion of inference: not strict logic
• Emphasis on variability of linguistic expression


• Necessary to the goal of natural language understanding (NLU)

• Can also enable semantic search, question answering, …

P Every firm polled saw costs grow more than expected, even after adjusting for inflation.
H Every big company in the poll reported cost increases. yes

(With Some in place of Every, the answer would be no.)

NLI: a spectrum of approaches

From robust but shallow to deep but brittle:

• lexical/semantic overlap [Jijkoun & de Rijke 2005] (robust, but shallow)
• patterned relation extraction [Romano et al. 2006]
• semantic graph matching [Hickl et al. 2006; MacCartney et al. 2006]
• FOL & theorem proving [Bos & Markert 2006] (deep, but brittle)

Problem at the shallow end: imprecise; easily confounded by negation, quantifiers, conditionals, factive & implicative verbs, etc.

Problem at the deep end: hard to translate NL to FOL: idioms, anaphora, ellipsis, intensionality, tense, aspect, vagueness, modals, indexicals, reciprocals, propositional attitudes, scope ambiguities, anaphoric adjectives, non-intersective adjectives, temporal & causal relations, unselective quantifiers, adverbs of quantification, donkey sentences, generic determiners, comparatives, phrasal verbs, …

Solution? Natural logic (this work)

Outline

• Introduction

• A Theory of Natural Logic

• The NatLog System

• Experiments with FraCaS

• Experiments with RTE

• Conclusion

What is natural logic? (≠ natural deduction)

• Characterizes valid patterns of inference via surface forms
  • precise, yet sidesteps the difficulties of translating to FOL
• A long history
  • traditional logic: Aristotle's syllogisms, scholastics, Leibniz, …
  • modern natural logic begins with Lakoff (1970)
  • van Benthem & Sánchez Valencia (1986–91): the monotonicity calculus
  • Nairn et al. (2006): an account of implicatives & factives
• We introduce a new theory of natural logic
  • extends the monotonicity calculus to account for negation & exclusion
  • incorporates elements of Nairn et al.'s model of implicatives

7 basic entailment relations

symbol  name                                    example
P = Q   equivalence                             couch = sofa
P ⊏ Q   forward entailment (strict)             crow ⊏ bird
P ⊐ Q   reverse entailment (strict)             European ⊐ French
P ^ Q   negation (exhaustive exclusion)         human ^ nonhuman
P | Q   alternation (non-exhaustive exclusion)  cat | dog
P _ Q   cover (exhaustive non-exclusion)        animal _ nonhuman
P # Q   independence                            hungry # hippo

(The original slide also shows a Venn diagram for each relation.)

Relations are defined for all semantic types: tiny ⊏ small, hover ⊏ fly, kick ⊏ strike, this morning ⊏ today, in Beijing ⊏ in China, everyone ⊏ someone, all ⊏ most ⊏ some

Entailment & semantic composition

• Ordinarily, semantic composition preserves entailment relations: eat pork ⊏ eat meat, big bird | big fish

• But many semantic functions behave differently:
  tango ⊏ dance, yet refuse to tango ⊐ refuse to dance
  French | German, yet not French _ not German

• We categorize functions by how they project entailment
  • a generalization of monotonicity classes and implication signatures
  • e.g., not has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:^, |:_, _:|, #:#}
  • e.g., refuse has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#}
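Because each projectivity signature is just a map from relations to relations, projection is easy to sketch in code. The following is a minimal illustration (not the NatLog implementation), using only the two signatures given above; functors without a listed signature are assumed to project every relation unchanged:

    # A minimal sketch of entailment projection (not the NatLog code).
    # The signatures for "not" and "refuse" are copied from the slide;
    # any functor without a signature is assumed to project unchanged.
    PROJECTIVITY = {
        "not":    {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "^", "|": "_", "_": "|", "#": "#"},
        "refuse": {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "|", "|": "#", "_": "#", "#": "#"},
    }

    def project(path, relation):
        """Project a lexical entailment relation upward through the
        functors on the path from the edited atom to the root."""
        for functor in path:
            signature = PROJECTIVITY.get(functor)
            if signature is not None:
                relation = signature[relation]
        return relation

    # tango ⊏ dance, but under "refuse" the relation is inverted:
    assert project(["refuse"], "⊏") == "⊐"
    # ...and wrapping that in "not" flips it back:
    assert project(["refuse", "not"], "⊏") == "⊏"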

Projecting entailment relations upward

• If two compound expressions differ by a single atom, their entailment relation can be determined compositionally
• Assume idealized semantic composition trees
• Propagate the entailment relation between the atoms upward, according to the projectivity class of each node on the path to the root

[Diagram: semantic composition trees for "nobody can enter without a shirt" and "nobody can enter without clothes"; the relation between the differing atoms (a shirt ⊏ clothes) is projected upward through each node on the path to the root.]

A (weak) inference procedure

1. Find a sequence of edits connecting P and H
   • insertions, deletions, substitutions, …
2. Determine the lexical entailment relation for each edit
   • Substitutions: depends on the meaning of the substituends: cat | dog
   • Deletions: ⊏ by default: red socks ⊏ socks
   • But some deletions are special: not ill ^ ill, refuse to go | go
   • Insertions are symmetric to deletions: ⊐ by default
3. Project upward to find the entailment relation across each edit
4. Join the entailment relations across the sequence of edits
   • à la Tarski's relation algebra

The NatLog system

The NatLog system processes an NLI problem in five stages, yielding a prediction:

1. linguistic analysis
2. alignment
3. lexical entailment classification
4. entailment projection
5. entailment joining

Running example

OK, the example is contrived, but it compactly exhibits containment, exclusion, and implicativity

P Jimmy Dean refused to move without blue jeans.
H James Dean didn't dance without pants. yes

Step 1: Linguistic analysis

• Tokenize & parse the input sentences (future: & NER & coref & …)
• Identify items with special projectivity & determine their scope
  • Problem: a PTB-style parse tree ≠ semantic structure!
• Solution: specify scope in PTB trees using Tregex [Levy & Andrew 06]

[Diagram: PTB parse of "Jimmy Dean refused to move without blue jeans", and the corresponding semantic structure with a polarity mark (+/–) on each constituent.]

An example projectivity entry:

  category: –/o implicatives
  examples: refuse, forbid, prohibit, …
  scope: S complement
  pattern: __ > (/VB.*/ > VP $. S=arg)
  projectivity: {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#}

Step 2: Alignment

P           Jimmy Dean  refused to  —    —    move   without  blue  jeans
H           James Dean  —           did  n't  dance  without  —     pants

edit index  1           2           3    4    5      6        7     8
edit type   SUB         DEL         INS  INS  SUB    MAT      DEL   SUB

• Alignment as a sequence of atomic phrase edits
  • The ordering of edits defines a path through intermediate forms
  • Need not correspond to sentence order
• Decomposes the problem into atomic inference problems
• We haven't (yet) invested much effort here
  • Experimental results use alignments from other sources

Step 3: Lexical entailment classification

• Goal: predict the entailment relation for each edit, based solely on lexical features, independent of context

• Approach: use lexical resources & machine learning
• Feature representation:
  • WordNet features: synonymy (=), hyponymy (⊏/⊐), antonymy (|)
  • Other relatedness features: Jiang-Conrath (WN-based), NomBank
  • Fallback: string similarity (based on Levenshtein edit distance)
  • Also lexical category, quantifier category, implication signature
• Decision tree classifier (a toy sketch follows this list)
  • Trained on 2,449 hand-annotated lexical entailment problems
  • E.g., SUB(gun, weapon): ⊏, SUB(big, small): |, DEL(often): ⊏
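As a toy illustration of this setup (not the NatLog code), here is a decision tree over a couple of WordNet features plus a string-similarity fallback; the training pairs are invented stand-ins for the 2,449 hand-annotated examples, and NLTK's WordNet corpus is assumed to be installed:

    # Toy sketch: WordNet + string-similarity features feeding a decision
    # tree. The training pairs are invented stand-ins for the annotated
    # data; requires nltk (with the WordNet corpus) and scikit-learn.
    from difflib import SequenceMatcher
    from nltk.corpus import wordnet as wn
    from sklearn.tree import DecisionTreeClassifier

    def is_synonym(a, b):
        # Synonyms share at least one WordNet synset.
        return any(s in wn.synsets(b) for s in wn.synsets(a))

    def is_hyponym(a, b):
        # a ⊏ b if some sense of b is reachable from a via hypernym links.
        targets = set(wn.synsets(b))
        return any(targets & set(sa.closure(lambda s: s.hypernyms()))
                   for sa in wn.synsets(a))

    def featurize(lhs, rhs):
        return [is_synonym(lhs, rhs),                     # evidence for =
                is_hyponym(lhs, rhs),                     # evidence for ⊏
                is_hyponym(rhs, lhs),                     # evidence for ⊐
                SequenceMatcher(None, lhs, rhs).ratio()]  # string-sim fallback

    # Invented training pairs, echoing the slide's examples:
    train = [("gun", "weapon", "⊏"), ("couch", "sofa", "="),
             ("weapon", "gun", "⊐"), ("cat", "dog", "|")]
    clf = DecisionTreeClassifier().fit(
        [featurize(a, b) for a, b, _ in train], [r for _, _, r in train])
    print(clf.predict([featurize("crow", "bird")]))  # expect ⊏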

Step 3: Lexical entailment classification (running example)

edit  type  P phrase    H phrase    lex feats    lex entrel
1     SUB   Jimmy Dean  James Dean  strsim=0.67  =
2     DEL   refused to  —           implic:–/o   |
3     INS   —           did         cat:aux      =
4     INS   —           n't         cat:neg      ^
5     SUB   move        dance       hypo         ⊐
6     MAT   without     without     —            =
7     DEL   blue        —           —            ⊏
8     SUB   jeans       pants       hyper        ⊏

Step 4: Entailment projection

edit  type  P phrase    H phrase    lex entrel  atomic entrel
1     SUB   Jimmy Dean  James Dean  =           =
2     DEL   refused to  —           |           |
3     INS   —           did         =           =
4     INS   —           n't         ^           ^
5     SUB   move        dance       ⊐           ⊏   ← inversion by projectivity
6     MAT   without     without     =           =
7     DEL   blue        —           ⊏           ⊏
8     SUB   jeans       pants       ⊏           ⊏

Step 5: Entailment joining

edit           1    2    3    4    5    6    7    8
edit type      SUB  DEL  INS  INS  SUB  MAT  DEL  SUB
atomic entrel  =    |    =    ^    ⊏    =    ⊏    ⊏
join so far    =    |    |    ⊏    ⊏    ⊏    ⊏    ⊏

Final answer: ⊏, so P entails H: yes.

For example, the join at edit 4 (| joined with ^ yields ⊏):

  fish | human
  human ^ nonhuman
  fish ⊏ nonhuman
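Joining is likewise a lookup over pairs of relations. The sketch below is deliberately partial: it fills in only the identity joins, transitivity for ⊏ and ⊐, and the exclusion entries matching the example above (the paper gives the complete join table), and it falls back to independence (#) for uncovered pairs:

    # A deliberately partial sketch of entailment joining. Only entries
    # grounded in the slides are included; e.g. | joined with ^ gives ⊏
    # (fish | human, human ^ nonhuman, hence fish ⊏ nonhuman). The entry
    # for ^ joined with | is the assumed mirror image of that case.
    JOIN = {
        ("|", "^"): "⊏",
        ("^", "|"): "⊐",
        ("⊏", "⊏"): "⊏",   # transitivity
        ("⊐", "⊐"): "⊐",
    }

    def join(r1, r2):
        if r1 == "=":
            return r2
        if r2 == "=":
            return r1
        return JOIN.get((r1, r2), "#")  # fall back to independence

    # Joining across the atomic relations of the running example:
    result = "="
    for r in ["=", "|", "=", "^", "⊏", "=", "⊏", "⊏"]:
        result = join(result, r)
    print(result)  # ⊏ → P entails H: yes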

The FraCaS test suite

• FraCaS: a project in computational semantics [Cooper et al. 96]

• 346 “textbook” examples of NLI problems

• 3 possible answers: yes, no, unknown (not balanced!)

• 55% single-premise, 45% multi-premise (excluded)


P At most ten commissioners spend time at home.
H At most ten commissioners spend a lot of time at home. yes

P Dumbo is a large animal.
H Dumbo is a small animal. no

P Smith believed that ITEL had won the contract in 1992.
H ITEL won the contract in 1992. unk

Results on FraCaS

System                   #    prec %  rec %   acc %
most common class        183  55.7    100.0   55.7
MacCartney & Manning 07  183  68.9    60.8    59.6
this work                183  89.3    65.7    70.5

(a 27% reduction in error relative to MacCartney & Manning 07)

Results on FraCaS, by section

System                   #    prec %  rec %   acc %
most common class        183  55.7    100.0   55.7
MacCartney & Manning 07  183  68.9    60.8    59.6
this work                183  89.3    65.7    70.5

§  Category       #    prec %  rec %   acc %
1  Quantifiers    44   95.2    100.0   97.7
2  Plurals        24   90.0    64.3    75.0
3  Anaphora       6    100.0   60.0    50.0
4  Ellipsis       25   100.0   5.3     24.0
5  Adjectives     15   71.4    83.3    80.0
6  Comparatives   16   88.9    88.9    81.3
7  Temporal       36   85.7    70.6    58.3
8  Verbs          8    80.0    66.7    62.5
9  Attitudes      9    100.0   83.3    88.9

1, 2, 5, 6, 9     108  90.4    85.5    87.0

Notes: high accuracy in the sections most amenable to natural logic (1, 2, 5, 6, 9); in the largest category (Quantifiers), all but one problem correct; high precision even outside the areas of expertise; a 27% error reduction overall.

The RTE3 test suite

P As leaders gather in Argentina ahead of this weekend's regional talks, Hugo Chávez, Venezuela's populist president, is using an energy windfall to win friends and promote his vision of 21st-century socialism.
H Hugo Chávez acts as Venezuela's president. yes

P Democrat members of the Ways and Means Committee, where tax bills are written and advanced, do not have strong small business voting records.
H Democrat members had strong small business voting records. no

• Somewhat more "natural", but not ideal for NatLog
• Many kinds of inference are not addressed by NatLog: paraphrase, temporal reasoning, relation extraction, …
• A big edit distance means propagation of errors from the atomic model

Results on RTE3: NatLog

System        Data  % Yes  Prec %  Rec %  Acc %
Stanford RTE  dev   50.2   68.7    67.0   67.2
              test  50.0   61.8    60.2   60.5
NatLog        dev   22.5   73.9    32.4   59.2
              test  26.4   70.1    36.1   59.4

(each data set contains 800 problems)

• Accuracy is unimpressive, but precision is relatively high
• Strategy: hybridize with the Stanford RTE system
  • As in Bos & Markert 2006
  • But NatLog makes a positive prediction far more often (~25% vs. 4%)

Results on RTE3: hybrid system

System        Data  % Yes  Prec %  Rec %  Acc %
Stanford RTE  dev   50.2   68.7    67.0   67.2
              test  50.0   61.8    60.2   60.5
NatLog        dev   22.5   73.9    32.4   59.2
              test  26.4   70.1    36.1   59.4
Hybrid        dev   56.0   69.2    75.2   70.0
              test  54.5   64.4    68.5   64.5

(each data set contains 800 problems; the hybrid's 4% accuracy gain on the test set is significant, p < 0.05)

Conclusion: what natural logic can't do

• Not a universal solution for NLI
• Many types of inference are not amenable to natural logic
  • Paraphrase: Eve was let go = Eve lost her job
  • Verb/frame alternation: he drained the oil ⊏ the oil drained
  • Relation extraction: Aho, a trader at UBS, … ⊏ Aho works for UBS
  • Common-sense reasoning: the sink overflowed ⊏ the floor got wet
  • etc.
• Also, it has a weaker proof theory than FOL
  • Can't explain, e.g., de Morgan's laws for quantifiers: Not all birds fly = Some birds don't fly

Conclusion: what natural logic can do

Natural logic enables precise reasoning about containment, exclusion, and implicativity, while sidestepping the difficulties of translating to FOL.

The NatLog system successfully handles a broad range of such inferences, as demonstrated on the FraCaS test suite.

Ultimately, open-domain NLI is likely to require combining disparate reasoners, and a facility for natural logic is a good candidate to be a component of such a system.

A Phrase-Based Model of Alignment for Natural Language Inference

Bill MacCartney, Michel Galley, and Christopher D. Manning

Stanford University, 8 October 2008

Natural language inference (NLI) (aka RTE)

• Does premise P justify an inference to hypothesis H?
  • An informal notion of inference; variability of linguistic expression
• Like MT, NLI depends on a facility for alignment
  • I.e., linking corresponding words/phrases in two related sentences

• Alignment is addressed variously by current NLI systems
  • Implicit alignment: NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • Implicit alignment: NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
  • Explicit alignment → entailment classification [Marsi & Kramer 05, MacCartney et al. 06]

P In 1963, JFK was assassinated during a visit to Dallas.
H Kennedy was killed in 1963. yes

Contributions of this paper

In this paper, we:

1. Undertake the first systematic study of alignment for NLI
   • Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
2. Propose a new model of alignment for NLI: MANLI
   • Uses a phrase-based alignment representation
   • Exploits external lexical resources
   • Capitalizes on new supervised training data
3. Examine the relation between alignment in NLI and in MT
   • Can existing MT aligners be applied in the NLI setting?

NLI alignment vs. MT alignment

• Alignment is familiar in MT, with an extensive literature
• Can the tools & techniques of MT alignment transfer to NLI?
• Doubtful — NLI alignment differs in several respects:
  1. Monolingual: can exploit resources like WordNet
  2. Asymmetric: P is often longer & has content unrelated to H
  3. Cannot assume semantic equivalence
     • An NLI aligner must accommodate frequent unaligned content
  4. Little training data available
     • MT aligners use unsupervised training on massive amounts of bitext
     • NLI aligners must rely on supervised training & much less data

The MSR RTE2 alignment data

• Previously, little supervised data
• Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
• Token-based, but many-to-many
  • allows implicit alignment of phrases
• 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • merged using majority rule

The MANLI aligner

A new model of alignment for natural language inference

1. Phrase-based representation

2. Feature-based scoring function

3. Decoding using simulated annealing

4. Perceptron learning

Phrase-based alignment representation

An example alignment as a sequence of phrase edits:

  DEL(In1)  DEL(most2)  DEL(Pacific3)  DEL(countries4)  DEL(there5)
  EQ(are6, are2)
  SUB(very7 few8, poorly3 represented4)
  EQ(women9, Women1)  EQ(in10, in5)  EQ(parliament11, parliament6)  EQ(.12, .7)

Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS

• One-to-one at the phrase level (but many-to-many at the token level)
• Avoids arbitrary alignment choices; can use phrase-based resources
• For training (only!), we converted the MSR data to this form
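A minimal sketch of this representation (hypothetical, not MANLI's data structures): each edit links a possibly-empty span of P tokens to a possibly-empty span of H tokens:

    # A minimal sketch of the phrase-edit representation. Spans are
    # 1-based token indices, matching the example above; an alignment is
    # a list of edits that together cover both sentences exactly once.
    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class PhraseEdit:
        type: str                  # "EQ", "SUB", "DEL", or "INS"
        p_span: Tuple[int, ...]    # indices into P (empty for INS)
        h_span: Tuple[int, ...]    # indices into H (empty for DEL)

    alignment = [
        PhraseEdit("DEL", (1,), ()),         # DEL(In1)
        PhraseEdit("EQ",  (6,), (2,)),       # EQ(are6, are2)
        PhraseEdit("SUB", (7, 8), (3, 4)),   # SUB(very7 few8, poorly3 represented4)
        PhraseEdit("EQ",  (9,), (1,)),       # EQ(women9, Women1)
    ]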

A feature-based scoring function

• Score each edit as a linear combination of features, then sum over all edits (see the sketch after the feature list)

• Edit type features: EQ, SUB, DEL, INS
• Phrase features: phrase sizes, non-constituents
• Lexical similarity feature: max over similarity scores
  • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  • Distributional similarity à la Dekang Lin
  • Various measures of string/lemma similarity
• Contextual features: distortion, matching neighbors
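A minimal sketch of that scoring scheme, reusing the PhraseEdit class from the sketch above; the feature function here is a hypothetical stub (edit-type indicators plus one lexical-similarity value), far smaller than MANLI's real feature set:

    # Sketch: score(alignment) = sum over edits of w · f(edit). The
    # feature function is a hypothetical stub; MANLI's is much richer.
    import numpy as np

    EDIT_TYPES = {"EQ": 0, "SUB": 1, "DEL": 2, "INS": 3}

    def edit_features(edit, lexsim):
        f = np.zeros(5)
        f[EDIT_TYPES[edit.type]] = 1.0   # edit-type indicator features
        f[4] = lexsim                    # e.g. max over lexical-similarity scores
        return f

    def score_alignment(edits, lexsims, w):
        # Linear score per edit, summed over the whole alignment.
        return sum(float(w @ edit_features(e, s)) for e, s in zip(edits, lexsims))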

Decoding using simulated annealing

1. Start with an initial alignment
2. Generate successors
3. Score them
4. Smooth/sharpen the distribution: P(A) ← P(A)^(1/T)
5. Sample
6. Lower the temperature: T ← 0.9 · T
7. Repeat… 100 times

(see the sketch below)
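A minimal sketch of this loop (not MANLI's implementation). The successor generator and scoring function are supplied by the caller, and the starting temperature t0 is an assumed value; note that with P(A) ∝ exp(score(A)), raising P(A) to the power 1/T is the same as dividing scores by T:

    # A minimal sketch of the simulated-annealing decoder. `successors`
    # is a caller-supplied stub; t0 is an assumed starting temperature,
    # since the slide specifies only the cooling factor and iteration count.
    import math
    import random

    def anneal(initial, successors, score, iterations=100, t0=10.0, cooling=0.9):
        current, temp = initial, t0
        for _ in range(iterations):
            candidates = successors(current)        # 2. generate successors
            weights = [math.exp(score(a) / temp)    # 3.-4. score, then sharpen
                       for a in candidates]         #       by the 1/T exponent
            current = random.choices(candidates, weights=weights, k=1)[0]  # 5. sample
            temp *= cooling                         # 6. lower the temperature
        return current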

Perceptron learning of feature weights

We use a variant of averaged perceptron [Collins 2002], where Φ(E) denotes the feature vector of alignment E:

  Initialize weight vector w = 0, learning rate R0 = 1
  For training epoch i = 1 to 50:
    For each problem ⟨Pj, Hj⟩ with gold alignment Ej:
      Set Êj = ALIGN(Pj, Hj, w)
      Set w = w + Ri (Φ(Ej) – Φ(Êj))
    Set w = w / ‖w‖2  (L2 normalization)
    Set w[i] = w  (store the weight vector for this epoch)
    Set Ri = 0.8 Ri–1  (reduce the learning rate)
  Throw away the weight vectors from the first 20% of epochs
  Return the average weight vector

Training runs take about 20 hours (for 800 problems)
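The same procedure as a runnable Python sketch; align (the decoder) and phi (the feature-vector function Φ) are stand-ins supplied by the caller:

    # A runnable sketch of the averaged-perceptron variant above (not the
    # MANLI code). `align` and `phi` are caller-supplied stand-ins for
    # decoding and the alignment feature vector Φ.
    import numpy as np

    def train_perceptron(problems, phi, align, dim, epochs=50, r0=1.0):
        w, rate, history = np.zeros(dim), r0, []
        for _ in range(epochs):
            for p, h, gold in problems:
                guess = align(p, h, w)                  # Ê = ALIGN(P, H, w)
                w = w + rate * (phi(gold) - phi(guess))
            norm = np.linalg.norm(w)
            if norm > 0:
                w = w / norm                            # L2 normalization
            history.append(w.copy())                    # store epoch weights
            rate *= 0.8                                 # reduce learning rate
        return np.mean(history[len(history) // 5:], axis=0)  # drop first 20%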

Evaluation on MSR data

• We evaluate several systems on the MSR data
  • Baseline, GIZA++ & Cross-EM, Stanford RTE, MANLI
  • How well do they recover the gold-standard alignments?
• We report per-link precision, recall, and F1
  • Note that AER = 1 – F1
• For MANLI, two tokens are considered aligned iff they are contained within phrases that are aligned
• We also report the exact match rate
  • What proportion of guessed alignments match the gold exactly?

Baseline: bag-of-words aligner

Match each H token to the most similar P token [cf. Glickman et al. 2005] (see the sketch below):

              ------ RTE2 dev ------   ------ RTE2 test -----
System        P %   R %   F1 %   E %   P %   R %   F1 %   E %
Bag-of-words  57.8  81.2  67.5   3.5   62.1  82.6  70.9   5.3

• Surprisingly good recall, despite extreme simplicity
• But very mediocre precision, F1, & exact match rate
• Main problem: it aligns every token in H
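A minimal sketch of this baseline; plain string similarity stands in for the token-similarity function (Glickman et al. use lexical statistics instead), so treat it as illustrative only:

    # A minimal sketch of the bag-of-words baseline: link every H token
    # to its most similar P token. String similarity is a stand-in for a
    # proper lexical similarity measure.
    from difflib import SequenceMatcher

    def sim(u, v):
        return SequenceMatcher(None, u.lower(), v.lower()).ratio()

    def bag_of_words_align(p_tokens, h_tokens):
        # Every H token gets a link, which is exactly the precision
        # problem noted above.
        return {j: max(range(len(p_tokens)), key=lambda i: sim(h, p_tokens[i]))
                for j, h in enumerate(h_tokens)}

    p = "In 1963 , JFK was assassinated during a visit to Dallas".split()
    h = "Kennedy was killed in 1963".split()
    print(bag_of_words_align(p, h))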

MT aligners: GIZA++ & Cross-EM

• Why not use an off-the-shelf MT aligner for NLI?
• Run GIZA++ via Moses, with default parameters
  • Asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic
• Initial results are very poor: 56% F1
  • It doesn't even align equal words
• Remedy: add a lexicon of equal words as extra training data
• We do similar experiments with the Berkeley Cross-EM aligner

Results: MT aligners

              ------ RTE2 dev ------   ------ RTE2 test -----
System        P %   R %   F1 %   E %   P %   R %   F1 %   E %
Bag-of-words  57.8  81.2  67.5   3.5   62.1  82.6  70.9   5.3
GIZA++        83.0  66.4  72.1   9.4   85.1  69.1  74.8   11.3
Cross-EM      67.6  80.1  72.1   1.3   70.3  81.0  74.1   0.8

Similar F1, but GIZA++ wins on precision, Cross-EM on recall

• Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
• Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources

The Stanford RTE aligner

• Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments are not directly representable
  • But named entities & collocations are collapsed in pre-processing
• Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string sim, …
• Syntax-based features promote aligning corresponding predicate-argument structures
• Decoding & learning are similar to MANLI

Results: Stanford RTE aligner

              ------ RTE2 dev -------   ------ RTE2 test ------
System        P %   R %    F1 %   E %   P %   R %    F1 %   E %
Bag-of-words  57.8  81.2   67.5   3.5   62.1  82.6   70.9   5.3
GIZA++        83.0  66.4   72.1   9.4   85.1  69.1   74.8   11.3
Cross-EM      67.6  80.1   72.1   1.3   70.3  81.0   74.1   0.8
Stanford RTE  81.1  75.8*  78.4*  0.5   82.7  75.8*  79.1*  0.3

* includes a (generous) correction for missed punctuation

• Better F1 than the MT aligners — but recall lags precision
• Stanford does a poor job of aligning function words
  • 13% of the links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI misses only 10%)
• Also, Stanford fails to align multi-word phrases: peace activists ~ protestors, hackers ~ non-authorized personnel

Results: MANLI aligner

              ------ RTE2 dev ------    ------ RTE2 test -----
System        P %   R %   F1 %   E %    P %   R %   F1 %   E %
Bag-of-words  57.8  81.2  67.5   3.5    62.1  82.6  70.9   5.3
GIZA++        83.0  66.4  72.1   9.4    85.1  69.1  74.8   11.3
Cross-EM      67.6  80.1  72.1   1.3    70.3  81.0  74.1   0.8
Stanford RTE  81.1  75.8  78.4   0.5    82.7  75.8  79.1   0.3
MANLI         83.4  85.5  84.4   21.7   85.4  85.3  85.3   21.3

• MANLI outperforms all the others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
• Good balance of precision & recall
• Matched >20% of the gold alignments exactly

MANLI results: discussion

• Three factors contribute to MANLI's success:
  1. Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  2. Contextual features enable matching function words
  3. Phrases: death penalty ~ capital punishment, abdicate ~ give up
• But phrases help less than expected!
  • If we set the max phrase size to 1, we lose just 0.2% in F1
• Recall errors: room to improve
  • 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
• Precision errors are harder to reduce
  • function words (49%), be (21%), punctuation (7%), equal lemmas (18%)

Can aligners predict RTE answers?

• We've been evaluating against gold-standard alignments
• But alignment is just one component of an NLI system
• Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, …
  • But the alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
• Using the alignment score to predict RTE answers (see the sketch below):
  • Predict YES if score > threshold
  • Tune the threshold on development data
  • Evaluate on test data
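A minimal sketch of the thresholding scheme, with invented toy scores and labels:

    # Tune the YES threshold on dev data (maximize accuracy), then apply
    # it unchanged to test data. Scores and labels here are invented.
    def tune_threshold(dev_scores, dev_labels):
        def accuracy(t):
            return sum((s > t) == l for s, l in zip(dev_scores, dev_labels)) \
                   / len(dev_labels)
        return max(sorted(set(dev_scores)), key=accuracy)

    dev_scores, dev_labels = [0.9, 0.4, 0.7, 0.2], [True, False, True, False]
    t = tune_threshold(dev_scores, dev_labels)
    print(t, [s > t for s in [0.8, 0.3]])  # predictions for toy test scores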

Results: predicting RTE answers

                         --- RTE2 dev ---   --- RTE2 test ---
System                   Acc %   AvgP %     Acc %   AvgP %
Bag-of-words             61.3    61.5       57.9    58.9
Stanford RTE             63.1    64.9       60.9    59.2
MANLI                    59.3    69.0       60.3    61.0
RTE2 entries (average)   —       —          58.5    59.1
LCC [Hickl et al. 2006]  —       —          75.4    80.8

• No NLI aligner rivals the top LCC system
• But Stanford & MANLI beat the average entry for RTE2
• Many NLI systems could benefit from better alignments!

Related work

• Lots of past work on phrase-based MT
• But most systems extract phrases from word-aligned data
  • Despite the assumption that many translations are non-compositional
• Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
• However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren't appropriate
  • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  • MT systems don't model word insertions & deletions

Conclusion

• MT aligners are not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
• MANLI succeeds by:
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
• The phrase-based representation shows potential
  • But it's not yet proven: we need better phrase-based lexical resources

:-) Thanks! Questions?


END (backup slides follow)


Outline

• Introduction

• NLI alignment vs. MT alignment

• The MSR data

• The MANLI aligner

• Evaluating aligners on the MSR data

• Using alignment to predict RTE answers

• Conclusion

The MSR RTE2 alignment data

• Previously, little supervised data
• Now, MSR gold alignments for RTE2
  • dev & test sets, each with 800 ⟨P, H⟩ pairs
• Token-based, but many-to-many
  • allows implicit alignment of phrases
• SURE vs. POSSIBLE links: we use only SURE
• 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • merged using majority rule

Baseline: bag-of-words aligner

Match each H token to the most similar P token [cf. Glickman et al. 2005]

Can also generate an alignment score:

Exceedingly simple, but surprisingly robust

Results: bag-of-words aligner

              ------ RTE2 dev ------   ------ RTE2 test -----
System        P %   R %   F1 %   E %   P %   R %   F1 %   E %
Bag-of-words  57.8  81.2  67.5   3.5   62.1  82.6  70.9   5.3

• Good recall, despite simplicity of model

• But very mediocre precision, F1, & exact match rate

• Main problem: aligns every token in H

Results: Stanford RTE aligner

              ------ RTE2 dev ------   ------ RTE2 test -----
System        P %   R %   F1 %   E %   P %   R %   F1 %   E %
Bag-of-words  57.8  81.2  67.5   3.5   62.1  82.6  70.9   5.3
GIZA++        83.0  66.4  72.1   9.4   85.1  69.1  74.8   11.3
Cross-EM      67.6  80.1  72.1   1.3   70.3  81.0  74.1   0.8
Stanford RTE  81.1  61.2  69.7   0.5   82.7  61.2  70.3   0.3


• Disappointing — especially the poor recall!

• Explanation: Stanford ignores punctuation (worth 15%)

• But punctuation matters little in inference

• So, let’s ignore these errors…

Results: Stanford RTE aligner (ignoring punctuation)

                 ------ RTE2 dev ------   ------ RTE2 test -----
System           P %   R %   F1 %   E %   P %   R %   F1 %   E %
Bag-of-words     57.8  81.2  67.5   3.5   62.1  82.6  70.9   5.3
GIZA++           83.0  66.4  72.1   9.4   85.1  69.1  74.8   11.3
Cross-EM         67.6  80.1  72.1   1.3   70.3  81.0  74.1   0.8
Stanford RTE     81.1  61.2  69.7   0.5   82.7  61.2  70.3   0.3
 ignoring punct. 81.1  75.8  78.4   —     82.7  75.8  79.1   —


• Better — but recall is still rather low
• Stanford does a poor job of aligning function words
  • 13% of the links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI misses only 10%)
• Also, Stanford fails to align multi-word phrases: peace activists ~ protestors, hackers ~ non-authorized personnel

Performance on MSR RTE2 data

System                        Data  P %   R %   F1 %  Exact %
Bag-of-words (baseline)       dev   57.8  81.2  67.5  3.5
                              test  62.1  82.6  70.9  5.3
GIZA++ (lexicon + ∩)          dev   83.0  66.4  72.1  —
                              test  85.1  69.1  74.8  —
Cross-EM (lexicon + ∩)        dev   67.6  80.1  72.1  —
                              test  70.3  81.0  74.1  —
Stanford RTE                  dev   81.1  61.2  69.7  0.5
                              test  82.7  61.2  70.3  0.3
Stanford RTE (punct. corr.)   dev   81.1  75.8  78.4  —
                              test  82.7  75.8  79.1  —
MANLI (this work)             dev   83.4  85.5  84.4  21.7
                              test  85.4  85.3  85.3  21.3

(each data set contains 800 problems)
