Evaluation Measures
C. Watters, CS 6403



Evaluation?

• Effectiveness?
• For whom?
• For what?
• Efficiency?
• Time?
• Computational cost?
• Cost of missed information? Too much info?


Studies of Retrieval Effectiveness

• The Cranfield experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957–1968 (hundreds of docs)

• SMART system, Gerald Salton, Cornell University, 1964–1988 (thousands of docs)

• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992– (millions of docs, 100k to 7.5M per set; training and test queries, 150 each)


What can we measure?

• Algorithm (efficiency)
  – Speed of the algorithm
  – Update potential of the indexing scheme
  – Size of storage required
  – Potential for distribution and parallelism

• User experience (effectiveness)
  – How many of all relevant docs were found
  – How many were missed
  – How many errors in selection
  – How many must be scanned before the good ones are reached


Measures based on relevance

The document set is partitioned by retrieval and by relevance:

                  relevant   not relevant
  retrieved          RR           RN
  not retrieved      NR           NN

  RR = retrieved and relevant         RN = retrieved but not relevant
  NR = not retrieved but relevant     NN = not retrieved and not relevant


Relevance

• System always correct!!
• Who judges relevance?
• Inter-rater reliability
• Early evaluations
  – Done by a panel of experts
  – 1–2000 abstracts of docs
• TREC experiments
  – Done automatically
  – Thousands of docs
  – Pooling + people


Defining the universe of relevant docs

• Manual inspection
• Manual exhaustive search
• Pooling (TREC)
  – Relevant set is the union of the sets found by multiple techniques
• Sampling
  – Take a random sample
  – Inspect it
  – Estimate from the sample to the whole set


Defining the relevant docs in a retrieved set (hit list)

• Panel of judges

• Individual users

• Automatic detection techniques
  – Vocabulary overlap with known relevant docs
  – Metadata


Estimates of Recall

The pooling approach used by TREC depends on the quality of the set of nominated documents. Are there relevant documents not in the pool?


Measures based on relevance

The retrieval/relevance contingency table again:

                  relevant   not relevant
  retrieved          RR           RN
  not retrieved      NR           NN


Measures based on relevance

  recall    = RR / (RR + NR)
  precision = RR / (RR + RN)
  fallout   = RN / (RN + NN)
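These ratios are mechanical once the four cell counts are known. A minimal Python sketch (the counts are illustrative, chosen to be consistent with the 200-document worked example later in the deck):

    # Recall, precision, and fallout from the contingency counts above.
    def measures(RR, RN, NR, NN):
        recall = RR / (RR + NR)      # share of relevant docs that were retrieved
        precision = RR / (RR + RN)   # share of retrieved docs that are relevant
        fallout = RN / (RN + NN)     # share of non-relevant docs that slipped in
        return recall, precision, fallout

    print(measures(RR=2, RN=3, NR=8, NN=187))  # (0.2, 0.4, ~0.016)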


Relevance Evaluation Measures

• Recall and Precision

• Single-valued measures
  – Macro and micro averages
  – R-precision
  – E-measure
  – Swets' measure


Recall and Precision

• Recall
  – Proportion of all relevant docs that were retrieved

• Precision
  – Proportion of retrieved docs that are relevant


Formula (what do we have to work with?)

• Rq = number of docs in the whole data set relevant to query q

• Rr = number of docs in the hit set (retrieved docs) that are relevant

• Rh = number of docs in the hit set

• Recall = Rr / Rq     Precision = Rr / Rh


• Recall = Rr / Rq     Precision = Rr / Rh

[Figure: Venn diagram of the collection; the relevant set Rq and the hit set Rh overlap, and their intersection is Rr.]


• Recall = Rr / Rq     Precision = Rr / Rh

• Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

• Rh = {d6, d8, d9, d84, d123}

• Rr = Rq ∩ Rh = {d9, d123}

• Recall = 2/10 = 0.2     Precision = 2/5 = 0.4

• What does that tell us?
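The same computation in a minimal Python sketch, using set intersection to find Rr (doc ids copied from the example above):

    # Recall and precision for one query, from document id sets.
    Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    Rh = {"d6", "d8", "d9", "d84", "d123"}

    Rr = Rq & Rh                    # retrieved relevant docs: {"d9", "d123"}
    recall = len(Rr) / len(Rq)      # 2 / 10 = 0.2
    precision = len(Rr) / len(Rh)   # 2 / 5  = 0.4
    print(Rr, recall, precision)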


Recall Precision Graphs

[Figure: empty recall-precision axes; recall on the x-axis (0–100%), precision on the y-axis (0–100%).]


Typical recall-precision graph

[Figure: typical recall-precision graph on 0–1.0 axes; two downward-sloping curves, one labeled "narrow, specific query" and the other "broad, general query".]


Recall-precision after retrieval of n documents

   n   relevant   recall   precision
   1   yes        0.2      1.0
   2   yes        0.4      1.0
   3   no         0.4      0.67
   4   yes        0.6      0.75
   5   no         0.6      0.60
   6   yes        0.8      0.67
   7   no         0.8      0.57
   8   no         0.8      0.50
   9   no         0.8      0.44
  10   no         0.8      0.40
  11   no         0.8      0.36
  12   no         0.8      0.33
  13   yes        1.0      0.38
  14   no         1.0      0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
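The table can be recomputed from the yes/no column alone, given that the collection holds 5 relevant documents. A minimal sketch:

    # Running recall and precision down a ranked hit list.
    judgments = [True, True, False, True, False, True, False,
                 False, False, False, False, False, True, False]
    total_relevant = 5   # relevant docs in the whole collection

    found = 0
    for n, rel in enumerate(judgments, start=1):
        found += int(rel)
        print(n, found / total_relevant, round(found / n, 2))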


Recall-precision graph

[Figure: the table above plotted as a recall-precision curve on 0–1.0 axes, with points labeled by rank n = 1, 2, 3, 4, 5, 6, 12, 13, 200.]


Recall-precision graph

[Figure: recall-precision curves for two systems, one red and one black, on 0–1.0 axes.]

The red system appears better than the black, but is the difference statistically significant?


Consider Rank

• Recall = Rr / Rq     Precision = Rr / Rh

• Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

• Rh = {d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3}, in rank order

• Rr = {d123, d56, d9, d25, d3}

• Recall = 5/10 = 0.5     Precision = 5/15 = 0.33

• What happens to recall and precision as we go through the hits in rank order?


Standard Recall Levels

• Plot precision at the standard recall levels: 0%, 10%, …, 100%

[Figure: empty axes for the plot; recall (R) on the x-axis to 100, precision (P) on the y-axis to 100.]
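The slide leaves open how to get a precision value at each standard level; the usual convention (TREC-style interpolation, an assumption here) takes the maximum precision at any recall at or above the level. A sketch, seeded with the relevant-document points from the table two slides back:

    # Interpolated precision at the 11 standard recall levels.
    def interpolated_precision(points, level):
        # points: (recall, precision) pairs; level: recall cutoff in [0, 1]
        return max((p for r, p in points if r >= level), default=0.0)

    points = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
    for i in range(11):
        print(i / 10, interpolated_precision(points, i / 10))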


Consider two queries

• Query 1:
  Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
  Rh = {d123, d84, d56, d8, d9, d511, d129, d25, d38, d3}

• Query 2:
  Rq = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
  Rh = {d12, d84, d56, d6, d8, d3, d511, d129, d44, d89}


Comparison of Query results

[Figure: chart comparing the recall and precision results of queries Q1 and Q2; y-axis 0–120, x-axis points 1–8.]


P-R for Multiple Queries

• For each recall level, average the precision across queries

• Average precision at recall level r:

  P(r) = (1 / Nq) · Σ_{i=1..Nq} Pi(r)

• Nq is the number of queries

• Pi(r) is the precision at recall level r for the i-th query
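A minimal sketch of the averaging step (self-contained; interp is the same interpolation helper sketched under Standard Recall Levels):

    # Mean precision at each standard recall level over a set of queries.
    def interp(points, level):
        # max precision at any recall >= level
        return max((p for r, p in points if r >= level), default=0.0)

    def average_precision_at_levels(per_query_points):
        n_q = len(per_query_points)
        return {i / 10: sum(interp(pts, i / 10) for pts in per_query_points) / n_q
                for i in range(11)}

    # usage: average_precision_at_levels([points_q1, points_q2, ...])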


Comparison of two systems

[Figure: averaged recall-precision curves for two systems (series "Avg" and "Avg2"); y-axis 0–120, x-axis recall levels 1–11.]


Macro and Micro Averaging

• Micro – pool the individual data points across queries and average over each point

• Macro – compute an average per query, then average those per-query averages

• Example: see the sketch below
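A small illustration of the difference, treating the two worked queries earlier in the deck as (relevant retrieved, retrieved) counts:

    # Macro vs. micro averaging of precision over two queries.
    queries = [(2, 5), (5, 15)]   # (relevant retrieved, retrieved) per query

    macro = sum(rr / rh for rr, rh in queries) / len(queries)            # ~0.367
    micro = sum(rr for rr, _ in queries) / sum(rh for _, rh in queries)  # 0.35
    print(macro, micro)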


Statistical tests

Suppose that a search is carried out on systems i and j. System i is superior to system j if

  recall(i) >= recall(j) and precision(i) >= precision(j)


Statistical tests

• The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data.

• The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.

• The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
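All three tests are available off the shelf; a sketch assuming SciPy, with made-up per-query scores for two systems:

    from scipy import stats

    sys_a = [0.40, 0.55, 0.31, 0.62, 0.48, 0.70, 0.22, 0.51]  # illustrative only
    sys_b = [0.35, 0.50, 0.30, 0.58, 0.49, 0.61, 0.20, 0.47]

    print(stats.ttest_rel(sys_a, sys_b))   # paired t-test
    print(stats.wilcoxon(sys_a, sys_b))    # Wilcoxon signed-rank test

    # Sign test: count positive differences against Binomial(n, 1/2).
    diffs = [a - b for a, b in zip(sys_a, sys_b) if a != b]
    print(stats.binomtest(sum(d > 0 for d in diffs), n=len(diffs), p=0.5))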


II Single Value Measures

• E-Measure & F1 measure

• Swets' Measure

• Expected Search Length

• etc


E Measure & F1 Measure

• A weighted combination of recall (R) and precision (P):

  E = 1 - (1 + β²)·P·R / (β²·P + R)

• Can increase the importance of recall or of precision by varying β

• F = 1 - E (bigger is better); with β = 1, F1 = 2PR / (P + R)

• β is often set to 1

• Increasing β gives more weight to recall
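A sketch of both measures as functions of P and R, checked against the worked example (P = 0.4, R = 0.2):

    # F and E measures; beta > 1 favors recall, beta < 1 favors precision.
    def f_measure(p, r, beta=1.0):
        if p == 0 or r == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * p * r / (b2 * p + r)

    def e_measure(p, r, beta=1.0):
        return 1.0 - f_measure(p, r, beta)

    print(f_measure(0.4, 0.2))           # F1 ~ 0.27
    print(f_measure(0.4, 0.2, beta=2))   # recall-weighted: ~0.22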


Normalized Recall

• Recall can also be normalized against the ranks of the relevant documents

• Suppose there are N documents in the collection, of which n are relevant, and the relevant documents appear at ranks i1, i2, …, in

• Normalized recall is calculated as:

  Rnorm = 1 - ( (i1 + i2 + … + in) - (1 + 2 + … + n) ) / ( n·(N - n) )


Normalized recall measure

[Figure: recall (y-axis) against ranks of retrieved documents (x-axis: 5, 10, 15, …, 195, 200), with step curves for the ideal ranks, the actual ranks, and the worst ranks.]


Normalized recall = (area between actual and worst) / (area between best and worst)

  Rnorm = 1 - ( Σ_{i=1..n} ri - Σ_{i=1..n} i ) / ( n·(N - n) )


Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14
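Working the example through the formula (arithmetic not shown on the original slide):

    # Normalized recall for N = 200, relevant docs at ranks 1, 3, 5, 10, 14.
    N, ranks = 200, [1, 3, 5, 10, 14]
    n = len(ranks)
    r_norm = 1 - (sum(ranks) - sum(range(1, n + 1))) / (n * (N - n))
    print(r_norm)   # 1 - (33 - 15) / 975 ~ 0.9815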


Expected Search Length

• The number of non-relevant docs that must be examined before the required relevant doc(s) are found

• Assumes a weak ordering (ranked levels with ties) or no ordering

• Lq = j + (i · s) / (r + 1)

• s is the number of relevant docs still required from the final level searched

• j is the number of non-relevant docs in the levels before that final level

• i is the number of non-relevant docs in the final level and r the number of relevant docs in it (Cooper's formulation)
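A sketch of the computation for a weakly ordered output, i.e. a list of levels whose members are tied in rank (doc ids are illustrative):

    # Expected search length over ranked levels of tied documents.
    def expected_search_length(levels, relevant, needed):
        j = 0                       # non-relevant docs in fully examined levels
        for level in levels:
            r = sum(1 for d in level if d in relevant)   # relevant in level
            i = len(level) - r                           # non-relevant in level
            if needed <= r:         # this level supplies the last docs needed
                return j + i * needed / (r + 1)
            needed -= r
            j += i
        return float("inf")         # output lacks enough relevant docs

    print(expected_search_length([["d1", "d2"], ["d3", "d4", "d5"]],
                                 {"d2", "d4"}, needed=2))   # 1 + 2*1/2 = 2.0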


Problems with testing

• Determining relevant docs

• Setting up test questions

• Comparing results

• Understanding relevance of the results