Crowdsourcing Inference-Rule Evaluation

Naomi Zeichner, Jonathan Berant, Ido Dagan

Bar Ilan University @ ACL 2012

Outline

1. We address: Inference-Rule Evaluation
2. By: Crowdsourcing Rule-Application Annotation
3. Allowing us to: Empirically Compare Different Resources

Inference Rules – an important component in semantic applications

Question Answering:
Q: Where was Reagan raised?
A: Reagan was brought up in Dixon.
Rule: X brought up in Y → X raised in Y

Information Extraction (Hiring event, slots PERSON and ROLE):
Text: Bob worked as an analyst for Dell
Rule: X work as Y → X hired as Y
Extracted: PERSON = Bob, ROLE = analyst
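
The examples above hinge on applying a template rule to text. Purely as an illustration (not from the original slides), here is a minimal Python sketch of that step; the template notation and the regex-based apply_rule helper are assumptions, not the authors' implementation.

```python
import re

def apply_rule(lhs_template, rhs_template, text):
    """Match the rule's LHS template against `text`; if it matches, return the
    RHS template instantiated with the captured X and Y arguments."""
    pattern = re.escape(lhs_template).replace("X", r"(?P<X>.+?)").replace("Y", r"(?P<Y>.+)")
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    return rhs_template.replace("X", match.group("X")).replace("Y", match.group("Y"))

# The QA example: the text matches the LHS of "X brought up in Y -> X raised in Y",
# so the rule derives a phrase that directly answers "Where was Reagan raised?".
print(apply_rule("X brought up in Y", "X raised in Y",
                 "Reagan was brought up in Dixon"))
# -> "Reagan was raised in Dixon"
```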

Evaluation - What are the options?

1. Impact on an end task (QA, IE, RTE)
   Pro: measures what interests an inference-system developer.
   Con: end-to-end systems have many components addressing multiple phenomena, so it is hard to assess the effect of a single resource.

2. Judge rule correctness directly
   Pro: theoretically the most intuitive.
   Con: hard to do in practice and often results in low inter-annotator agreement.
   Examples: X reside in Y → X live in Y; X reside in Y → X born in Y; X criticize Y → X attack Y

3. Instance-based evaluation (Szpektor et al., 2007; Bhagat et al., 2007)
   Pro: simulates the utility of rules in an application and yields high inter-annotator agreement.

Instance Based Evaluation – Decisions

Target: judge whether a rule application is valid or not.

Rule: X teach Y → X explain to Y
LHS: Steve teaches kids
RHS: Steve explains to kids

Rule: X resides in Y → X born in Y
LHS: He resides in Paris
RHS: He born in Paris

Rule: X turn in Y → X bring in Y
LHS: humans turn in bed
RHS: humans bring in bed

Our goal: an annotation process that is robust and replicable.

Crowdsourcing

• Recent trend of using crowdsourcing for annotation tasks
• Previous work (Snow et al., 2008; Wang and Callison-Burch, 2010; Mehdad et al., 2010; Negri et al., 2011) focused on RTE text-hypothesis pairs and did not address the annotation and evaluation of rules

Challenges:
• Simplify the process
• Communicate entailment

Simplify Process

Break each rule application down into two simple phrase-level tasks (sketched below):

1. Is a phrase meaningful?
   e.g., "Steve teaches kids", "Steve explains to kids", "He resides in Paris", "He born in Paris", "humans turn in bed", "humans bring in bed"

2. Judge if one phrase is true given another.
   e.g., given "Steve teaches kids", is "Steve explains to kids" true?
   given "He resides in Paris", is "He born in Paris" true?
   given "they observe holidays", is "they celebrate holidays" true?
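
A minimal sketch of how the two tasks compose into a single label per rule application. The crowd judgments are abstracted as callbacks, and the handling of a meaningless LHS is an assumption for illustration; this is not the authors' code.

```python
def annotate_rule_application(lhs_phrase, rhs_phrase, is_meaningful, is_entailed):
    """Combine the two crowd tasks into one label for a rule application.
    `is_meaningful(phrase)` and `is_entailed(premise, hypothesis)` stand in
    for the Turkers' answers to Task 1 and Task 2."""
    # Task 1: is each phrase meaningful on its own?
    if not is_meaningful(lhs_phrase):
        return "discard"          # a meaningless LHS says nothing about the rule (assumed handling)
    if not is_meaningful(rhs_phrase):
        return "non-entailment"   # meaningful LHS, meaningless RHS
    # Task 2: given the LHS phrase, is the RHS phrase true?
    return "entailment" if is_entailed(lhs_phrase, rhs_phrase) else "non-entailment"

# Example with hand-coded judgments in place of crowd answers:
print(annotate_rule_application(
    "He resides in Paris", "He born in Paris",
    is_meaningful=lambda p: True,
    is_entailed=lambda premise, hypothesis: False))  # -> "non-entailment"
```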

Communicate Entailment

Gold-standard examples are used in two ways:

1. Educating: "confusing" examples are used as gold, with feedback shown to Turkers who get them wrong.
2. Enforcing: unanimous examples are used as gold to estimate Turker reliability.
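
As a rough illustration of the "enforcing" use of gold examples, the sketch below scores each Turker by accuracy on the embedded gold items and keeps only sufficiently reliable workers. The function names and the 0.8 cut-off are assumptions, not values from the slides.

```python
def gold_accuracy(responses, gold):
    """Accuracy of one worker's responses on the embedded gold items.
    Both arguments map item ids to labels."""
    scored = [item for item in responses if item in gold]
    if not scored:
        return None  # this worker has not answered any gold items yet
    return sum(responses[item] == gold[item] for item in scored) / len(scored)

def keep_worker(responses, gold, threshold=0.8):
    """Keep a worker's annotations only if gold accuracy reaches `threshold`
    (an illustrative cut-off)."""
    accuracy = gold_accuracy(responses, gold)
    return accuracy is not None and accuracy >= threshold
```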

Communicate - Effect of Communication

                        Without   With
Agreement with gold     0.79      0.90
Kappa with gold         0.54      0.79
False-positive rate     18%       6%
False-negative rate     4%        5%

63% of the annotations were judged unanimously by the annotators and agreed with our own annotation.
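
For reference, a minimal sketch of how "agreement with gold" and "kappa with gold" in the table can be computed. This is standard observed agreement plus Cohen's kappa over the label lists, not code from the paper.

```python
from collections import Counter

def agreement_and_kappa(labels, gold):
    """Observed agreement and Cohen's kappa between crowd labels and gold labels,
    given as two equal-length lists."""
    n = len(gold)
    observed = sum(a == b for a, b in zip(labels, gold)) / n
    # Chance agreement from the two marginal label distributions.
    count_labels, count_gold = Counter(labels), Counter(gold)
    expected = sum((count_labels[c] / n) * (count_gold[c] / n)
                   for c in set(labels) | set(gold))
    if expected == 1.0:
        return observed, 1.0  # degenerate case: only one label present
    return observed, (observed - expected) / (1.0 - expected)

# e.g. agreement_and_kappa(turker_labels, gold_labels) would yield the pairs in the table above
```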

Case Study – Data Set

Executed four entailment-rule learning methods on a set of 1B extractions produced by ReVerb (Fader et al., 2011)

Applied the learned rules to randomly sampled extractions, yielding 20,000 rule applications

Annotated each rule application using our framework
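
To make "rule applications" concrete, here is a toy sketch of instantiating rules on randomly sampled binary extractions; the data structures (argument-relation-argument triples and a relation-to-relation rule map) are assumptions for illustration, not the ReVerb format or the authors' code.

```python
import random

def sample_rule_applications(extractions, rules, k, seed=0):
    """Instantiate rules on `k` randomly sampled extractions.
    `extractions` holds (arg1, relation, arg2) triples; `rules` maps an LHS
    relation to a list of RHS relations, e.g. {"teach": ["explain to"]}."""
    rng = random.Random(seed)
    pool = list(extractions)
    instances = []
    for arg1, relation, arg2 in rng.sample(pool, min(k, len(pool))):
        for rhs_relation in rules.get(relation, []):
            lhs_phrase = f"{arg1} {relation} {arg2}"      # e.g. "Steve teach kids"
            rhs_phrase = f"{arg1} {rhs_relation} {arg2}"  # e.g. "Steve explain to kids"
            instances.append((relation, rhs_relation, lhs_phrase, rhs_phrase))
    return instances
```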

Case Study – Algorithm Comparison

Algorithm                          AUC
DIRT (Lin and Pantel, 2001)        0.40
Cover (Weeds and Weir, 2003)       0.43
BInc (Szpektor and Dagan, 2008)    0.44
Berant (Berant et al., 2010)       0.52
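
The slide does not say which curve the AUC is taken over; as one common choice, the sketch below computes rank-based ROC AUC for an algorithm that scores rule applications, using the crowd labels as ground truth. It is an illustration, not the paper's evaluation script.

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: the probability that a randomly chosen valid rule
    application is scored higher than a randomly chosen invalid one
    (ties count as half). `labels` are the crowd judgments, True = valid."""
    pos = [s for s, valid in zip(scores, labels) if valid]
    neg = [s for s, valid in zip(scores, labels) if not valid]
    if not pos or not neg:
        raise ValueError("need both valid and invalid rule applications")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```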

Case Study – Output

Task 1
• 1,012 rule applications had a meaningful LHS but a meaningless RHS (marked as non-entailment)
• 8,264 had both sides judged meaningful (passed to Task 2)

Task 2
• 2,447 positive entailment
• 3,108 negative entailment

Overall
• 6,567 annotated rule applications (1,012 + 2,447 + 3,108)
• Annotated for $1,000
• About a week

Summary

A framework for crowdsourcing inference-rule evaluation:

• Simplifies instance-based evaluation
• Communicates the entailment decision to Turkers
• Can benefit both resource developers and inference-system developers

Crowdsourcing forms and annotated extractions can be found at BIU NLP downloads: http://www.cs.biu.ac.il/~nlp/downloads

Thank You