Crowdsourcing Inference-Rule Evaluation

Naomi Zeichner, Jonathan Berant, Ido Dagan

Bar Ilan University @ ACL 2012

Outline

1. We address: Inference-Rule Evaluation
2. By: Crowdsourcing Rule-Application Annotation
3. Allowing us to: Empirically Compare Different Resources

Inference Rules – an important component in semantic applications

Question Answering:
Q: Where was Reagan raised?
A: Reagan was brought up in Dixon.
Rule: X brought up in Y → X raised in Y

Information Extraction (Hiring event, slots PERSON and ROLE):
Text: Bob worked as an analyst for Dell
Rule: X work as Y → X hired as Y
Extracted: PERSON = Bob, ROLE = analyst
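
The examples above hinge on applying a template rule to text. Purely as an illustration (not from the original slides), here is a minimal Python sketch of that step; the template notation and the regex-based apply_rule helper are assumptions, not the authors' implementation.

```python
import re

def apply_rule(lhs_template, rhs_template, text):
    """Match the rule's LHS template against `text`; if it matches, return the
    RHS template instantiated with the captured X and Y arguments."""
    pattern = re.escape(lhs_template).replace("X", r"(?P<X>.+?)").replace("Y", r"(?P<Y>.+)")
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    return rhs_template.replace("X", match.group("X")).replace("Y", match.group("Y"))

# The QA example: the text matches the LHS of "X brought up in Y -> X raised in Y",
# so the rule derives a phrase that directly answers "Where was Reagan raised?".
print(apply_rule("X brought up in Y", "X raised in Y",
                 "Reagan was brought up in Dixon"))
# -> "Reagan was raised in Dixon"
```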

Evaluation - What are the options?

1. Impact on an end task (QA, IE, RTE)
   Pro: measures what interests an inference-system developer.
   Con: end-to-end systems have many components addressing multiple phenomena, so it is hard to assess the effect of a single resource.

2. Judge rule correctness directly
   Pro: theoretically the most intuitive.
   Con: hard to do in practice and often results in low inter-annotator agreement.
   Examples: X reside in Y → X live in Y; X reside in Y → X born in Y; X criticize Y → X attack Y

3. Instance-based evaluation (Szpektor et al., 2007; Bhagat et al., 2007)
   Pro: simulates the utility of rules in an application and yields high inter-annotator agreement.

Instance Based Evaluation – Decisions

Target: judge whether a rule application is valid or not.

Rule: X teach Y → X explain to Y
LHS: Steve teaches kids
RHS: Steve explains to kids

Rule: X resides in Y → X born in Y
LHS: He resides in Paris
RHS: He born in Paris

Rule: X turn in Y → X bring in Y
LHS: humans turn in bed
RHS: humans bring in bed

Our goal: an annotation process that is robust and replicable.

Crowdsourcing

• Recent trend of using crowdsourcing for annotation tasks
• Previous work (Snow et al., 2008; Wang and Callison-Burch, 2010; Mehdad et al., 2010; Negri et al., 2011) focused on RTE text-hypothesis pairs and did not address the annotation and evaluation of rules

Challenges:
• Simplify the process
• Communicate entailment

Simplify Process

Break each rule application down into two simple phrase-level tasks (sketched below):

1. Is a phrase meaningful?
   e.g., "Steve teaches kids", "Steve explains to kids", "He resides in Paris", "He born in Paris", "humans turn in bed", "humans bring in bed"

2. Judge if one phrase is true given another.
   e.g., given "Steve teaches kids", is "Steve explains to kids" true?
   given "He resides in Paris", is "He born in Paris" true?
   given "they observe holidays", is "they celebrate holidays" true?
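
A minimal sketch of how the two tasks compose into a single label per rule application. The crowd judgments are abstracted as callbacks, and the handling of a meaningless LHS is an assumption for illustration; this is not the authors' code.

```python
def annotate_rule_application(lhs_phrase, rhs_phrase, is_meaningful, is_entailed):
    """Combine the two crowd tasks into one label for a rule application.
    `is_meaningful(phrase)` and `is_entailed(premise, hypothesis)` stand in
    for the Turkers' answers to Task 1 and Task 2."""
    # Task 1: is each phrase meaningful on its own?
    if not is_meaningful(lhs_phrase):
        return "discard"          # a meaningless LHS says nothing about the rule (assumed handling)
    if not is_meaningful(rhs_phrase):
        return "non-entailment"   # meaningful LHS, meaningless RHS
    # Task 2: given the LHS phrase, is the RHS phrase true?
    return "entailment" if is_entailed(lhs_phrase, rhs_phrase) else "non-entailment"

# Example with hand-coded judgments in place of crowd answers:
print(annotate_rule_application(
    "He resides in Paris", "He born in Paris",
    is_meaningful=lambda p: True,
    is_entailed=lambda premise, hypothesis: False))  # -> "non-entailment"
```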

Communicate Entailment

Gold-standard examples are used in two ways:

1. Educating: "confusing" examples are used as gold, with feedback shown to Turkers who get them wrong.
2. Enforcing: unanimous examples are used as gold to estimate Turker reliability.
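
As a rough illustration of the "enforcing" use of gold examples, the sketch below scores each Turker by accuracy on the embedded gold items and keeps only sufficiently reliable workers. The function names and the 0.8 cut-off are assumptions, not values from the slides.

```python
def gold_accuracy(responses, gold):
    """Accuracy of one worker's responses on the embedded gold items.
    Both arguments map item ids to labels."""
    scored = [item for item in responses if item in gold]
    if not scored:
        return None  # this worker has not answered any gold items yet
    return sum(responses[item] == gold[item] for item in scored) / len(scored)

def keep_worker(responses, gold, threshold=0.8):
    """Keep a worker's annotations only if gold accuracy reaches `threshold`
    (an illustrative cut-off)."""
    accuracy = gold_accuracy(responses, gold)
    return accuracy is not None and accuracy >= threshold
```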

Communicate - Effect of Communication

                        Without   With
Agreement with gold     0.79      0.90
Kappa with gold         0.54      0.79
False-positive rate     18%       6%
False-negative rate     4%        5%

63% of the annotations were judged unanimously by the annotators and agreed with our own annotation.
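
For reference, a minimal sketch of how "agreement with gold" and "kappa with gold" in the table can be computed. This is standard observed agreement plus Cohen's kappa over the label lists, not code from the paper.

```python
from collections import Counter

def agreement_and_kappa(labels, gold):
    """Observed agreement and Cohen's kappa between crowd labels and gold labels,
    given as two equal-length lists."""
    n = len(gold)
    observed = sum(a == b for a, b in zip(labels, gold)) / n
    # Chance agreement from the two marginal label distributions.
    count_labels, count_gold = Counter(labels), Counter(gold)
    expected = sum((count_labels[c] / n) * (count_gold[c] / n)
                   for c in set(labels) | set(gold))
    if expected == 1.0:
        return observed, 1.0  # degenerate case: only one label present
    return observed, (observed - expected) / (1.0 - expected)

# e.g. agreement_and_kappa(turker_labels, gold_labels) would yield the pairs in the table above
```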

Case Study – Data Set

Executed four entailment-rule learning methods on a set of 1B extractions produced by ReVerb (Fader et al., 2011)

Applied the learned rules to randomly sampled extractions, yielding 20,000 rule applications

Annotated each rule application using our framework
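
To make "rule applications" concrete, here is a toy sketch of instantiating rules on randomly sampled binary extractions; the data structures (argument-relation-argument triples and a relation-to-relation rule map) are assumptions for illustration, not the ReVerb format or the authors' code.

```python
import random

def sample_rule_applications(extractions, rules, k, seed=0):
    """Instantiate rules on `k` randomly sampled extractions.
    `extractions` holds (arg1, relation, arg2) triples; `rules` maps an LHS
    relation to a list of RHS relations, e.g. {"teach": ["explain to"]}."""
    rng = random.Random(seed)
    pool = list(extractions)
    instances = []
    for arg1, relation, arg2 in rng.sample(pool, min(k, len(pool))):
        for rhs_relation in rules.get(relation, []):
            lhs_phrase = f"{arg1} {relation} {arg2}"      # e.g. "Steve teach kids"
            rhs_phrase = f"{arg1} {rhs_relation} {arg2}"  # e.g. "Steve explain to kids"
            instances.append((relation, rhs_relation, lhs_phrase, rhs_phrase))
    return instances
```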

Case Study – Algorithm Comparison

Algorithm                          AUC
DIRT (Lin and Pantel, 2001)        0.40
Cover (Weeds and Weir, 2003)       0.43
BInc (Szpektor and Dagan, 2008)    0.44
Berant (Berant et al., 2010)       0.52
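
The slide does not say which curve the AUC is taken over; as one common choice, the sketch below computes rank-based ROC AUC for an algorithm that scores rule applications, using the crowd labels as ground truth. It is an illustration, not the paper's evaluation script.

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: the probability that a randomly chosen valid rule
    application is scored higher than a randomly chosen invalid one
    (ties count as half). `labels` are the crowd judgments, True = valid."""
    pos = [s for s, valid in zip(scores, labels) if valid]
    neg = [s for s, valid in zip(scores, labels) if not valid]
    if not pos or not neg:
        raise ValueError("need both valid and invalid rule applications")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```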

Case Study – Output

Task 1
• 1,012 rule applications had a meaningful LHS but a meaningless RHS (marked as non-entailment)
• 8,264 had both sides judged meaningful (passed to Task 2)

Task 2
• 2,447 positive entailment
• 3,108 negative entailment

Overall
• 6,567 annotated rule applications (1,012 + 2,447 + 3,108)
• Annotated for $1,000
• About a week

Summary

A framework for crowdsourcing inference-rule evaluation:

• Simplifies instance-based evaluation
• Communicates the entailment decision to Turkers
• Can benefit both resource developers and inference-system developers

Crowdsourcing forms and annotated extractions can be found at BIU NLP downloads: http://www.cs.biu.ac.il/~nlp/downloads

Thank You