Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS The Significance of Result Differences

Mar 26, 2015

Page 1:

The Significance of Result Differences

Page 2:

Why Significance Tests?

• everybody knows we have to test the significance of our results
  • but do we really?

• evaluation results are valid for
  • data from specific corpus
  • extracted with specific methods
  • for a particular type of collocations
  • according to the intuitions of one particular annotator (or two)

Page 3:

Why Significance Tests?

• significance tests are about generalisations

• basic question: "If we repeated the evaluation experiment (on similar data), would we get the same results?"

• influence of source corpus, domain, collocation type and definition, annotation guidelines, ...

Page 4:

Evaluation of Association Measures

Page 5:

Evaluation of Association Measures

Page 6:

A Different Perspective

• pair types are described by tables (O11, O12, O21, O22) → coordinates in 4-D space

• O22 is redundant because O11 + O12 + O21 + O22 = N

• can also describe pair type by joint and marginal frequencies (f, f1, f2) → coordinates in 3-D space
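
The two coordinate systems on this slide can be converted into each other. The sketch below assumes the usual reading of the symbols, which the slide does not spell out: f is the joint frequency, f1 and f2 are the marginal frequencies of the two components, and N is the sample size. The names `Contingency`, `table_from_frequencies` and `expected_o11` are hypothetical helpers, and E11 = f1 · f2 / N is the standard expected frequency under independence.

```python
from typing import NamedTuple


class Contingency(NamedTuple):
    """Observed frequencies of a 2x2 contingency table for one pair type."""
    o11: int
    o12: int
    o21: int
    o22: int


def table_from_frequencies(f: int, f1: int, f2: int, n: int) -> Contingency:
    """Convert (f, f1, f2) plus sample size N into (O11, O12, O21, O22).

    Assumes f = joint frequency, f1 / f2 = marginal frequencies.
    """
    if not (0 <= f <= min(f1, f2) and f1 + f2 - f <= n):
        raise ValueError("inconsistent frequency signature")
    # O22 follows from the other three cells because all four sum to N.
    return Contingency(f, f1 - f, f2 - f, n - f1 - f2 + f)


def expected_o11(f1: int, f2: int, n: int) -> float:
    """Expected joint frequency E11 under independence of the components."""
    return f1 * f2 / n


# Hypothetical pair type, for illustration only.
table = table_from_frequencies(f=30, f1=100, f2=200, n=10_000)
e11 = expected_o11(100, 200, 10_000)
```

Because O22 is determined by the other cells and N, three numbers are enough to place a pair type in the parameter space.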

Page 7:

A Different Perspective

• data set = cloud of points in three-dimensional space

• visualisation is "challenging"

• many association measures depend on O11 and E11 only (MI, gmean, t-score, binomial)

• projection to (O11, E11) → coordinates in 2-D space (ignoring the ratio f1 / f2)
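
To make the dependence on O11 and E11 concrete, here is a minimal sketch of three of the measures named above. The formulas are the common textbook definitions and are an assumption here, since the slides do not state them: MI = log2(O11 / E11), t-score = (O11 − E11) / √O11, and gmean = O11 / √(f1 · f2), which equals O11 / √(N · E11) for a fixed sample size N.

```python
import math


def mutual_information(o11: float, e11: float) -> float:
    """Pointwise MI: log ratio of observed to expected joint frequency."""
    return math.log2(o11 / e11)


def t_score(o11: float, e11: float) -> float:
    """t-score: difference between O11 and E11, scaled by sqrt(O11)."""
    return (o11 - e11) / math.sqrt(o11)


def gmean(o11: float, e11: float, n: int) -> float:
    """Geometric-mean measure O11 / sqrt(f1 * f2), using f1 * f2 = N * E11."""
    return o11 / math.sqrt(n * e11)


# Hypothetical point (O11, E11) in the 2-D parameter space.
mi_value = mutual_information(8, 2)
t_value = t_score(16, 4)
```

Every pair type with the same (O11, E11) receives the same score from these measures, which is why the 2-D projection loses nothing for them.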

Page 8:

The Parameter Space of Collocation Candidates

Page 9:

The Parameter Space of Collocation Candidates

Page 10:

The Parameter Space of Collocation Candidates

Page 11:

The Parameter Space of Collocation Candidates

Page 12:

The Parameter Space of Collocation Candidates

Page 13:

N-best Lists in Parameter Space

• N-best list for an AM includes all pair types where score ≥ c (threshold c obtained from data)

• {score ≥ c} describes a subset of the parameter space

• for a sound association measure, the isoline {score = c} is the lower boundary (because scores should increase with O11 for a fixed value of E11)
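
The relation between an N-best list and its threshold can be sketched as follows. The function `n_best` and the candidate data are hypothetical; the point is only that the threshold c is read off the data as the score of the N-th ranked pair type.

```python
def n_best(scored_pairs, n):
    """Return the n highest-scoring pair types and the threshold c.

    scored_pairs: iterable of (pair_type, score) tuples.
    The threshold is the score of the last item that makes the list.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    if n <= 0 or n > len(ranked):
        raise ValueError("n must be between 1 and the number of candidates")
    best = ranked[:n]
    threshold = best[-1][1]
    return best, threshold


# Hypothetical candidates with made-up scores.
candidates = [("a", 3.2), ("b", 0.5), ("c", 7.1), ("d", 1.9)]
best, c = n_best(candidates, 2)

# Without ties at the threshold, the N-best list is exactly {score >= c}.
above_threshold = {pair for pair, score in candidates if score >= c}
```

If several pair types tie at the threshold score, {score ≥ c} can contain more than N items, so the two views agree only up to ties.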

Page 14:

N-Best Isolines in the Parameter Space

MI

Page 15:

N-Best Isolines in the Parameter Space

MI

Page 16:

N-Best Isolines in the Parameter Space

t-score

Page 17:

N-Best Isolines in the Parameter Space

t-score

Page 18:

95% Confidence Interval

Page 19:

99% Confidence Interval

Page 20:

95% Confidence Interval

Page 21:

Comparing Precision Values

• number of TPs and FPs for 1000-best lists

tbl     t-score   frequency
TPs       322        283
FPs       678        717
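
The precision values being compared follow directly from these counts. A minimal sketch, with `precision` as a hypothetical helper and the counts taken from the table above:

```python
def precision(tp: int, fp: int) -> float:
    """Proportion of true positives among all candidates in the list."""
    return tp / (tp + fp)


# (TPs, FPs) per association measure for the 1000-best lists.
counts = {"t-score": (322, 678), "frequency": (283, 717)}
precisions = {am: precision(tp, fp) for am, (tp, fp) in counts.items()}
```

This gives 32.2% for t-score and 28.3% for frequency, a difference of 3.9 percentage points whose significance is the question of the following slides.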

Page 22:

McNemar's Test

+ = in 1000-best list, – = not in 1000-best list

• ideally: all TPs in 1000-best list (possible!)

• H0: differences between AMs are random

tbl       – t-score   + t-score
– freq       610          46
+ freq         7         276

Page 23:

McNemar's Test

+ = in 1000-best list, – = not in 1000-best list

> mcnemar.test(tbl)

• p-value < 0.001 → highly significant

tbl       – t-score   + t-score
– freq       610          46
+ freq         7         276
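
For readers without R at hand, the test can be reproduced in a few lines of Python. This is a sketch under two assumptions: it uses the chi-squared form of McNemar's test with continuity correction (the default of R's `mcnemar.test`), and it obtains the p-value from the identity that the upper tail of a chi-squared distribution with one degree of freedom equals erfc(√(x/2)). The function name `mcnemar_test` is hypothetical.

```python
import math


def mcnemar_test(table, correct=True):
    """McNemar's chi-squared test for a 2x2 table of paired outcomes.

    Only the discordant cells (off-diagonal) enter the statistic.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (_, b), (c, _) = table
    discordant = b + c
    if discordant == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    diff = abs(b - c) - (1 if correct else 0)
    statistic = diff ** 2 / discordant
    # Upper tail of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: - freq, + freq; columns: - t-score, + t-score (from the slide).
tbl = [[610, 46], [7, 276]]
statistic, p_value = mcnemar_test(tbl)
```

Only the 46 TPs found by t-score alone and the 7 found by frequency alone matter; the 610 and 276 items on which both lists agree carry no information about the difference.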

Page 24:

Significant Differences

Page 25:

Significant Differences

Page 26:

Significant Differences

[figure legend: one marker = significant, one marker = relevant (2%)]

Page 27:

Lowest-Frequency Data: Samples

• Too much data for full manual evaluation → random samples

• AdjN data
  • 965 pairs with f = 1 (15% sample)
  • manually identified 31 TPs (3.2%)

• PNV data
  • 983 pairs with f < 3 (0.35% sample)
  • manually identified 6 TPs (0.6%)

Page 28:

Lowest-Frequency Data: Samples

• Estimate proportion p of TPs among all lowest-frequency data

• Confidence set from binomial test

• AdjN: 31 TPs among 965 items
  • p ≤ 5% with 99% confidence
  • at most 320 TPs

• PNV: 6 TPs among 983 items
  • p ≤ 1.5% with 99% confidence
  • there might still be 4200 TPs !!
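
The upper bounds on p can be approximated with a short sketch. It assumes a one-sided exact binomial bound: the largest p for which observing at most k TPs in n items still has probability of at least 1 − confidence. The slides only say "confidence set from binomial test", so the exact procedure used there may differ. The helper names `binom_cdf` and `upper_bound` are hypothetical.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return total


def upper_bound(k: int, n: int, confidence: float = 0.99) -> float:
    """One-sided upper confidence bound for the true proportion p."""
    if k >= n:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = k / n, 1.0
    # The CDF decreases in p, so bisect for the p where it drops to alpha.
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


adjn_bound = upper_bound(31, 965)  # AdjN sample from the slide
pnv_bound = upper_bound(6, 983)    # PNV sample from the slide
```

Under this assumption both bounds come out slightly below the rounded figures on the slide. The TP counts then follow by extrapolation: a 15% sample of 965 pairs implies about 6,400 AdjN pairs, 5% of which is roughly 320; a 0.35% sample of 983 pairs implies about 281,000 PNV pairs, 1.5% of which is roughly 4,200.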

Page 29:

N-best Lists for Lowest-Frequency Data

• evaluate 10,000-best lists
  • to reduce manual annotation work, take 10% sample from each list (i.e. 1,000 candidates for each AM)

• precision graphs for N-best lists
  • up to N = 10,000 for the PNV data

• 95% confidence estimates for precision of best-performing AM (from binomial test)
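
The sampling step can be sketched as follows. Two things here are assumptions rather than the authors' procedure: the annotated labels are synthetic placeholders, and the interval is a Wilson score interval, swapped in as a simple closed-form approximation where the slide refers to a binomial test. The helper names are hypothetical.

```python
import math
import random


def sample_counts(labels, fraction, seed=0):
    """Draw a random sample without replacement and count the TPs in it.

    labels: one boolean per candidate in the N-best list (True = TP).
    Returns (number of TPs in the sample, sample size).
    """
    rng = random.Random(seed)
    size = round(len(labels) * fraction)
    sample = rng.sample(labels, size)
    return sum(sample), size


def wilson_interval(tp: int, n: int, z: float = 1.96):
    """Approximate 95% interval for the precision of the full list."""
    p_hat = tp / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Synthetic stand-in for a manually annotated 10,000-best list.
labels = [i % 5 == 0 for i in range(10_000)]
tp, size = sample_counts(labels, fraction=0.10)
low, high = wilson_interval(tp, size)
```

Annotating 1,000 sampled candidates instead of 10,000 trades manual effort for an interval around the precision estimate rather than an exact value.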

Page 30:

Random Sample Evaluation

Page 31:

Random Sample Evaluation

Page 32:

Random Sample Evaluation