Evaluation Measures
C. Watters, CS 6403



Evaluation?

• Effectiveness?
• For whom?
• For what?
• Efficiency?
• Time?
• Computational cost?
• Cost of missed information? Too much info?


Studies of Retrieval Effectiveness

• The Cranfield experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957–1968 (hundreds of docs)

• SMART system, Gerald Salton, Cornell University, 1964–1988 (thousands of docs)

• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992– (millions of docs, 100k to 7.5M per set; training and test queries, 150 each)


What can we measure?

• Algorithm (efficiency)
  – Speed of the algorithm
  – Update potential of the indexing scheme
  – Size of storage required
  – Potential for distribution and parallelism

• User experience (effectiveness)
  – How many of all relevant docs were found
  – How many were missed
  – How many errors in selection
  – How many must be scanned before the good ones are reached


Measures based on relevance

The document set is partitioned by retrieval and by relevance:

                  relevant   not relevant
  retrieved          RR           RN
  not retrieved      NR           NN

  RR = retrieved and relevant         RN = retrieved but not relevant
  NR = not retrieved but relevant     NN = not retrieved and not relevant


Relevance

• System always correct!!
• Who judges relevance?
• Inter-rater reliability
• Early evaluations
  – Done by a panel of experts
  – 1–2000 abstracts of docs
• TREC experiments
  – Done automatically
  – Thousands of docs
  – Pooling + people


Defining the universe of relevant docs

• Manual inspection
• Manual exhaustive search
• Pooling (TREC)
  – Relevant set is the union of the sets found by multiple techniques
• Sampling
  – Take a random sample
  – Inspect it
  – Estimate from the sample to the whole set


Defining the relevant docs in a retrieved set (hit list)

• Panel of judges

• Individual users

• Automatic detection techniques
  – Vocabulary overlap with known relevant docs
  – Metadata


Estimates of Recall

The pooling approach used by TREC depends on the quality of the set of nominated documents. Are there relevant documents not in the pool?


Measures based on relevance

The retrieval/relevance contingency table again:

                  relevant   not relevant
  retrieved          RR           RN
  not retrieved      NR           NN


Measures based on relevance

  recall    = RR / (RR + NR)
  precision = RR / (RR + RN)
  fallout   = RN / (RN + NN)
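These ratios are mechanical once the four cell counts are known. A minimal Python sketch (the counts are illustrative, chosen to be consistent with the 200-document worked example later in the deck):

    # Recall, precision, and fallout from the contingency counts above.
    def measures(RR, RN, NR, NN):
        recall = RR / (RR + NR)      # share of relevant docs that were retrieved
        precision = RR / (RR + RN)   # share of retrieved docs that are relevant
        fallout = RN / (RN + NN)     # share of non-relevant docs that slipped in
        return recall, precision, fallout

    print(measures(RR=2, RN=3, NR=8, NN=187))  # (0.2, 0.4, ~0.016)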


Relevance Evaluation Measures

• Recall and Precision

• Single-valued measures
  – Macro and micro averages
  – R-precision
  – E-measure
  – Swets' measure


Recall and Precision

• Recall
  – Proportion of all relevant docs that were retrieved

• Precision
  – Proportion of retrieved docs that are relevant


Formula (what do we have to work with?)

• Rq = number of docs in the whole data set relevant to query q

• Rr = number of docs in the hit set (retrieved docs) that are relevant

• Rh = number of docs in the hit set

• Recall = Rr / Rq     Precision = Rr / Rh


• Recall = Rr / Rq     Precision = Rr / Rh

[Figure: Venn diagram of the collection; the relevant set Rq and the hit set Rh overlap, and their intersection is Rr.]


• Recall = Rr / Rq     Precision = Rr / Rh

• Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

• Rh = {d6, d8, d9, d84, d123}

• Rr = Rq ∩ Rh = {d9, d123}

• Recall = 2/10 = 0.2     Precision = 2/5 = 0.4

• What does that tell us?
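The same computation in a minimal Python sketch, using set intersection to find Rr (doc ids copied from the example above):

    # Recall and precision for one query, from document id sets.
    Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    Rh = {"d6", "d8", "d9", "d84", "d123"}

    Rr = Rq & Rh                    # retrieved relevant docs: {"d9", "d123"}
    recall = len(Rr) / len(Rq)      # 2 / 10 = 0.2
    precision = len(Rr) / len(Rh)   # 2 / 5  = 0.4
    print(Rr, recall, precision)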


Recall Precision Graphs

[Figure: empty recall-precision axes; recall on the x-axis (0–100%), precision on the y-axis (0–100%).]


Typical recall-precision graph

[Figure: typical recall-precision graph on 0–1.0 axes; two downward-sloping curves, one labeled "narrow, specific query" and the other "broad, general query".]


Recall-precision after retrieval of n documents

   n   relevant   recall   precision
   1   yes        0.2      1.0
   2   yes        0.4      1.0
   3   no         0.4      0.67
   4   yes        0.6      0.75
   5   no         0.6      0.60
   6   yes        0.8      0.67
   7   no         0.8      0.57
   8   no         0.8      0.50
   9   no         0.8      0.44
  10   no         0.8      0.40
  11   no         0.8      0.36
  12   no         0.8      0.33
  13   yes        1.0      0.38
  14   no         1.0      0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
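The table can be recomputed from the yes/no column alone, given that the collection holds 5 relevant documents. A minimal sketch:

    # Running recall and precision down a ranked hit list.
    judgments = [True, True, False, True, False, True, False,
                 False, False, False, False, False, True, False]
    total_relevant = 5   # relevant docs in the whole collection

    found = 0
    for n, rel in enumerate(judgments, start=1):
        found += int(rel)
        print(n, found / total_relevant, round(found / n, 2))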


Recall-precision graph

[Figure: the table above plotted as a recall-precision curve on 0–1.0 axes, with points labeled by rank n = 1, 2, 3, 4, 5, 6, 12, 13, 200.]


Recall-precision graph

[Figure: recall-precision curves for two systems, one red and one black, on 0–1.0 axes.]

The red system appears better than the black, but is the difference statistically significant?


Consider Rank

• Recall = Rr / Rq     Precision = Rr / Rh

• Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

• Rh = {d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3}, in rank order

• Rr = {d123, d56, d9, d25, d3}

• Recall = 5/10 = 0.5     Precision = 5/15 = 0.33

• What happens to recall and precision as we go through the hits in rank order?


Standard Recall Levels

• Plot precision at the standard recall levels: 0%, 10%, …, 100%

[Figure: empty axes for the plot; recall (R) on the x-axis to 100, precision (P) on the y-axis to 100.]
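The slide leaves open how to get a precision value at each standard level; the usual convention (TREC-style interpolation, an assumption here) takes the maximum precision at any recall at or above the level. A sketch, seeded with the relevant-document points from the table two slides back:

    # Interpolated precision at the 11 standard recall levels.
    def interpolated_precision(points, level):
        # points: (recall, precision) pairs; level: recall cutoff in [0, 1]
        return max((p for r, p in points if r >= level), default=0.0)

    points = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
    for i in range(11):
        print(i / 10, interpolated_precision(points, i / 10))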


Consider two queries

• Query 1:
  Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
  Rh = {d123, d84, d56, d8, d9, d511, d129, d25, d38, d3}

• Query 2:
  Rq = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
  Rh = {d12, d84, d56, d6, d8, d3, d511, d129, d44, d89}


Comparison of Query results

[Figure: chart comparing the recall and precision results of queries Q1 and Q2; y-axis 0–120, x-axis points 1–8.]


P-R for Multiple Queries

• For each recall level, average the precision across queries

• Average precision at recall level r:

  P(r) = (1 / Nq) · Σ_{i=1..Nq} Pi(r)

• Nq is the number of queries

• Pi(r) is the precision at recall level r for the i-th query
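A minimal sketch of the averaging step (self-contained; interp is the same interpolation helper sketched under Standard Recall Levels):

    # Mean precision at each standard recall level over a set of queries.
    def interp(points, level):
        # max precision at any recall >= level
        return max((p for r, p in points if r >= level), default=0.0)

    def average_precision_at_levels(per_query_points):
        n_q = len(per_query_points)
        return {i / 10: sum(interp(pts, i / 10) for pts in per_query_points) / n_q
                for i in range(11)}

    # usage: average_precision_at_levels([points_q1, points_q2, ...])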


Comparison of two systems

[Figure: averaged recall-precision curves for two systems (series "Avg" and "Avg2"); y-axis 0–120, x-axis recall levels 1–11.]


Macro and Micro Averaging

• Micro – pool the individual data points across queries and average over each point

• Macro – compute an average per query, then average those per-query averages

• Example: see the sketch below
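A small illustration of the difference, treating the two worked queries earlier in the deck as (relevant retrieved, retrieved) counts:

    # Macro vs. micro averaging of precision over two queries.
    queries = [(2, 5), (5, 15)]   # (relevant retrieved, retrieved) per query

    macro = sum(rr / rh for rr, rh in queries) / len(queries)            # ~0.367
    micro = sum(rr for rr, _ in queries) / sum(rh for _, rh in queries)  # 0.35
    print(macro, micro)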


Statistical tests

Suppose that a search is carried out on systems i and j. System i is superior to system j if

  recall(i) >= recall(j) and precision(i) >= precision(j)


Statistical tests

• The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data.

• The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.

• The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
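All three tests are available off the shelf; a sketch assuming SciPy, with made-up per-query scores for two systems:

    from scipy import stats

    sys_a = [0.40, 0.55, 0.31, 0.62, 0.48, 0.70, 0.22, 0.51]  # illustrative only
    sys_b = [0.35, 0.50, 0.30, 0.58, 0.49, 0.61, 0.20, 0.47]

    print(stats.ttest_rel(sys_a, sys_b))   # paired t-test
    print(stats.wilcoxon(sys_a, sys_b))    # Wilcoxon signed-rank test

    # Sign test: count positive differences against Binomial(n, 1/2).
    diffs = [a - b for a, b in zip(sys_a, sys_b) if a != b]
    print(stats.binomtest(sum(d > 0 for d in diffs), n=len(diffs), p=0.5))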


II Single Value Measures

• E-Measure & F1 measure

• Swets' Measure

• Expected Search Length

• etc


E Measure & F1 Measure

• A weighted combination of recall (R) and precision (P):

  E = 1 - (1 + β²)·P·R / (β²·P + R)

• Can increase the importance of recall or of precision by varying β

• F = 1 - E (bigger is better); with β = 1, F1 = 2PR / (P + R)

• β is often set to 1

• Increasing β gives more weight to recall
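A sketch of both measures as functions of P and R, checked against the worked example (P = 0.4, R = 0.2):

    # F and E measures; beta > 1 favors recall, beta < 1 favors precision.
    def f_measure(p, r, beta=1.0):
        if p == 0 or r == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * p * r / (b2 * p + r)

    def e_measure(p, r, beta=1.0):
        return 1.0 - f_measure(p, r, beta)

    print(f_measure(0.4, 0.2))           # F1 ~ 0.27
    print(f_measure(0.4, 0.2, beta=2))   # recall-weighted: ~0.22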


Normalized Recall

• Recall can also be normalized against the ranks of the relevant documents

• Suppose there are N documents in the collection, of which n are relevant, and the relevant documents appear at ranks i1, i2, …, in

• Normalized recall is calculated as:

  Rnorm = 1 - ( (i1 + i2 + … + in) - (1 + 2 + … + n) ) / ( n·(N - n) )


Normalized recall measure

[Figure: recall (y-axis) against ranks of retrieved documents (x-axis: 5, 10, 15, …, 195, 200), with step curves for the ideal ranks, the actual ranks, and the worst ranks.]


Normalized recall = (area between actual and worst) / (area between best and worst)

  Rnorm = 1 - ( Σ_{i=1..n} ri - Σ_{i=1..n} i ) / ( n·(N - n) )


Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14
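Working the example through the formula (arithmetic not shown on the original slide):

    # Normalized recall for N = 200, relevant docs at ranks 1, 3, 5, 10, 14.
    N, ranks = 200, [1, 3, 5, 10, 14]
    n = len(ranks)
    r_norm = 1 - (sum(ranks) - sum(range(1, n + 1))) / (n * (N - n))
    print(r_norm)   # 1 - (33 - 15) / 975 ~ 0.9815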


Expected Search Length

• The number of non-relevant docs that must be examined before the required relevant doc(s) are found

• Assumes a weak ordering (ranked levels with ties) or no ordering

• Lq = j + (i · s) / (r + 1)

• s is the number of relevant docs still required from the final level searched

• j is the number of non-relevant docs in the levels before that final level

• i is the number of non-relevant docs in the final level and r the number of relevant docs in it (Cooper's formulation)
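A sketch of the computation for a weakly ordered output, i.e. a list of levels whose members are tied in rank (doc ids are illustrative):

    # Expected search length over ranked levels of tied documents.
    def expected_search_length(levels, relevant, needed):
        j = 0                       # non-relevant docs in fully examined levels
        for level in levels:
            r = sum(1 for d in level if d in relevant)   # relevant in level
            i = len(level) - r                           # non-relevant in level
            if needed <= r:         # this level supplies the last docs needed
                return j + i * needed / (r + 1)
            needed -= r
            j += i
        return float("inf")         # output lacks enough relevant docs

    print(expected_search_length([["d1", "d2"], ["d3", "d4", "d5"]],
                                 {"d2", "d4"}, needed=2))   # 1 + 2*1/2 = 2.0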


Problems with testing

• Determining relevant docs

• Setting up test questions

• Comparing results

• Understanding relevance of the results