Information Retrieval Evaluation (Jan 21, 2017)

Transcript
Page 1: Information Retrieval Evaluation

1

IR Evaluation

Mihai Lupu, [email protected]

Chapter 8 of the Introduction to IR book
M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in IR, 2010

Page 2: Information Retrieval Evaluation

2

Outline

Introduction
– Introduction to IR
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion

Page 3: Information Retrieval Evaluation

3

Introduction

• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research

Objective measurements

Page 4: Information Retrieval Evaluation

Information Retrieval

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

General definition that can be applied to many types of information and search applications

Primary focus of IR since the 50s has been on text and documents

Page 5: Information Retrieval Evaluation

Information Retrieval

Page 6: Information Retrieval Evaluation

Information Retrieval

Page 7: Information Retrieval Evaluation

Information Retrieval

Key insights of/for information retrieval
– text has no meaning
  ฉันมีรถสีแดง (Thai: “I have a red car”)
– but it is still the most informative source
  ฉันมีรถสีฟ้า (“I have a blue car”) is more similar to the above than คุณมีรถไฟฟ้า (“you have an electric train/car”)
– text is not random
  “I drive a red car” is more probable than
  – “I drive a red horse”
  – “A red car I drive”
  – “Car red a drive I”
– meaning is defined by usage
  I drive a truck / I drive a car / I drive the bus ⇒ truck / car / bus are similar in meaning

Page 8: Information Retrieval Evaluation

Information Retrieval

Key insights of/for information retrieval
– text has no meaning
  ฉันมีรถสีแดง (Thai: “I have a red car”)
– but it is still the most informative source
  ฉันมีรถสีฟ้า (“I have a blue car”) is more similar to the above than คุณมีรถไฟฟ้า (“you have an electric train/car”)
– text is not random
  “I drive a red car” is more probable than
  – “I drive a red horse”
  – “A red car I drive”
  – “Car red a drive I”
– meaning is defined by usage
  I drive a truck / I drive a car / I drive the bus ⇒ truck / car / bus are similar in meaning

term frequency (TF), document frequency (DF), TF-IDF, BM25 (Best Match 25)

language models (uni-gram, bi-gram, n-gram)

statistical semantics (latent semantic analysis, random indexing, deep learning)

Page 9: Information Retrieval Evaluation

Big Issues in IR

Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
– Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything else)

Page 10: Information Retrieval Evaluation

Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based on retrieval models
– Most models describe statistical properties of text rather than linguistic ones, i.e. counting simple text features such as words instead of parsing and analyzing the sentences
  The statistical approach to text processing started with Luhn in the 50s
  Linguistic features can be part of a statistical model

Big Issues in IR

Page 11: Information Retrieval Evaluation

Big Issues in IR

Evaluation
– Experimental procedures and measures for comparing system output with user expectations
  Originated in the Cranfield experiments in the 60s
– IR evaluation methods are now used in many fields
– Typically use a test collection of documents, queries, and relevance judgments
  Most commonly used are the TREC collections
– Recall and precision are two examples of effectiveness measures

Page 12: Information Retrieval Evaluation

Big Issues in IR

Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking

Page 13: Information Retrieval Evaluation

13

Introduction

• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Why is this a research field in itself?
– Because there are many kinds of IR
  • With different evaluation criteria
– Because it’s difficult
  • Why?
  – Because it involves human subjectivity (document relevance)
  – Because of the amount of data involved (who can sit down and evaluate 1,750,000 documents returned by Google for ‘university vienna’?)

Page 14: Information Retrieval Evaluation

14

Kinds of evaluation

Page 15: Information Retrieval Evaluation

15

Kinds of evaluation

• “Efficient and effective system”
• Time and space: efficiency
– Generally constrained by pre-development specification
  • E.g. real-time answers vs. batch jobs
  • E.g. index-size constraints
– Easy to measure
• Good results: effectiveness
– Harder to define → more research into it
• And…

Page 16: Information Retrieval Evaluation

16

Kinds of evaluation (cont.)

• User studies
– Does a 2% increase in some retrieval performance measure actually make a user happier?
– Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method?
– Hard to do
– Mostly anecdotal examples
– Many IR people don’t like to do it (though it’s starting to change)

Page 17: Information Retrieval Evaluation

17

Kinds of evaluation (cont.)

Intrinsic
– “internal”
– the ultimate goal is the retrieved set
Extrinsic
– “external”
– in the context of the usage of the retrieval tool

Page 18: Information Retrieval Evaluation

18

What to measure in an IR system?

1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information need
5. recall
6. precision

Page 19: Information Retrieval Evaluation

19

What to measure in an IR system?

1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information need
5. recall
6. precision

Effectiveness

A desirable measure of retrieval performance would have the following properties: 1, it would be a measure of effectiveness. 2, it would not be confounded by the relative willingness of the system to emit items. 3, it would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers. 4, it would allow complete ordering of different performances, and assess the performance of any one system in absolute terms. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967)

Page 20: Information Retrieval Evaluation

20

Outline

• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures
– Test Collections
• User-based evaluation
• Discussion on Evaluation
• Conclusion

Page 21: Information Retrieval Evaluation

21

Efficiency Metrics

Page 22: Information Retrieval Evaluation

22

Retrieval Effectiveness

Precision
– How happy are we with what we’ve got
Recall
– How much more we could have had

Precision = (Number of relevant documents retrieved) / (Number of documents retrieved)

Recall = (Number of relevant documents retrieved) / (Number of relevant documents)
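A minimal sketch of these two set-based formulas in code (the function name and example documents are illustrative, not from the lecture):

import_note = "binary relevance assumed"

def precision_recall(retrieved, relevant):
    # retrieved: list of document ids returned by the system
    # relevant:  set of document ids judged relevant for the query
    retrieved_relevant = sum(1 for d in retrieved if d in relevant)
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                        {"d1", "d3", "d5", "d7", "d8", "d9"})
print(p, r)  # 0.6 0.5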

Page 23: Information Retrieval Evaluation

23

Retrieval Effectiveness

[Figure: Venn diagram of the retrieved documents and the relevant documents within the universe of documents]

Page 24: Information Retrieval Evaluation

24

Precision and Recall

Page 25: Information Retrieval Evaluation

25

Retrieval effectiveness

What if we don’t like this twin-measure approach? A solution:

– Van Rijsbergen’s E-Measure:

– With a special case: Harmonic mean

Page 26: Information Retrieval Evaluation

26

Retrieval effectiveness

What if we don’t like this twin-measure approach? A solution:

– Van Rijsbergen’s E-Measure:

– With a special case: Harmonic mean
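The formulas on these two slides did not survive the text extraction; as a reconstruction of the standard definitions (van Rijsbergen’s E with weighting parameter α, and the balanced F-measure as the special case mentioned above):

E = 1 - \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}}
\qquad
F_{\beta} = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
\qquad
F_1 = \frac{2\,P\,R}{P + R} \quad \text{(harmonic mean of precision and recall)}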

Page 27: Information Retrieval Evaluation

27

Retrieval effectiveness

Tools we need:
– A set of documents (the “dataset”)
– A set of questions/queries/topics
– For each topic, and for each document, a decision: relevant or not relevant
Let’s assume for the moment that’s all we need and that we have it

Page 28: Information Retrieval Evaluation

28

Retrieval Effectiveness

• Precision and Recall generally plotted as a “Precision-Recall curve”

[Figure: precision (y-axis, 0–1) vs. recall (x-axis, 0–1); precision drops as the size of the retrieved set increases]

• They do not play well together

Page 29: Information Retrieval Evaluation

29

Precision-Recall Curves

How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis

[Figure: precision vs. recall axes, both 0–1]

Page 30: Information Retrieval Evaluation

30

Precision-Recall Curves

How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis

[Figure: precision vs. recall axes, both 0–1]

Page 31: Information Retrieval Evaluation

31

Precision-Recall Curves

• How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
– Repeat for all queries

[Figure: precision vs. recall axes, both 0–1]

Page 32: Information Retrieval Evaluation

32

Precision-Recall Curves

• And the average is the system’s P-R curve

[Figure: averaged precision–recall curve; precision drops as the number of retrieved documents increases]

• We can compare systems by comparing the curves

Page 33: Information Retrieval Evaluation

33

Precision-Recall Graph – reality check

Page 34: Information Retrieval Evaluation

34

Interpolation

To average graphs, calculate precision at standard recall levels (formula reconstructed below):
– where S is the set of observed (R,P) points
Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
– produces a step function
– defines precision at recall 0.0
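The interpolation formula itself was an image on the slide; the standard definition it describes is:

P_{\text{interp}}(R) = \max \{\, P' : (R', P') \in S,\; R' \ge R \,\}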

Page 35: Information Retrieval Evaluation

35

Interpolation

Page 36: Information Retrieval Evaluation

36

Average Precision at Standard Recall Levels

• Recall-precision graph plotted by simply joining the average precision points at the standard recall levels

Page 37: Information Retrieval Evaluation

37

Average Recall-Precision Graph

Page 38: Information Retrieval Evaluation

38

Graph for 50 Queries

Page 39: Information Retrieval Evaluation

39

Retrieval Effectiveness

• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which not?

Page 40: Information Retrieval Evaluation

40

Single-value measures

• Fix a “reasonable” cutoff
– R-precision
  Precision at R, where R is the number of relevant documents
• Fix the number of desired documents
– Reciprocal rank (RR)
  1/rank of the first relevant document in the ranked list returned
• Make it less sensitive to the cutoff
– Average precision (see the sketch after this list)
  • For each query: R = # relevant documents, i = rank, k = # retrieved documents, P(i) = precision at rank i
  • rel(i) = 1 if the document at rank i is relevant, 0 otherwise
– For each system:
  • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures
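The AP formula was shown as an image; a minimal sketch of AP and MAP using the variables defined above (function names are illustrative):

def average_precision(ranking, relevant):
    # AP = (1/R) * sum over ranks i of P(i) * rel(i)
    R = len(relevant)
    if R == 0:
        return 0.0
    hits, ap = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / i       # P(i) at a rank where rel(i) = 1
    return ap / R

def mean_average_precision(queries):
    # queries: list of (ranking, relevant_set) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)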

Page 41: Information Retrieval Evaluation

41

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

R = # of relevant docs = 6

R-Precision = 4/6 = 0.67

Page 42: Information Retrieval Evaluation

42

Averaging Across Queries

Page 43: Information Retrieval Evaluation

43

Average Precision

Page 44: Information Retrieval Evaluation

44

MAP

Page 45: Information Retrieval Evaluation

45

Retrieval Effectiveness

• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which not?

Page 46: Information Retrieval Evaluation

46

Cumulative Gain

• For each document d and query q, define rel(d,q) >= 0
• The higher the value, the more relevant the document is to the query
• Pitfalls:
– Graded relevance introduces even more ambiguity in practice
– With great flexibility comes great responsibility to justify parameter values

Page 47: Information Retrieval Evaluation

47

Retrieval Effectiveness

• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which not?

Page 48: Information Retrieval Evaluation

48

Discounted Cumulative Gain

Popular measure for evaluating web search and related tasks

Two assumptions:
– Highly relevant documents are more useful than marginally relevant documents
– The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined

Page 49: Information Retrieval Evaluation

49

Discounted Cumulative Gain

Uses graded relevance as a measure of the usefulness, or gain, from examining a document

Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks

Typical discount is 1/log(rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3

Page 50: Information Retrieval Evaluation

50

Discounted Cumulative Gain

DCG is the total gain accumulated at a particular rank p:

Alternative formulation:

– used by some web search companies
– emphasis on retrieving highly relevant documents

[Jarvelin:2000]

[Burges:2005]
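Both formulas were images on the slide; the usual forms they refer to (Järvelin and Kekäläinen’s original, and the exponential-gain variant) are:

\mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}
\qquad\qquad
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2 (i+1)}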

Page 51: Information Retrieval Evaluation

51

Discounted Cumulative Gain

• Neither CG nor DCG can be used for comparison across topics!
– they depend on the # of relevant documents per topic

Page 52: Information Retrieval Evaluation

52

Normalised Discounted Cumulative Gain

Compute CG / DCG for the optimal return set
E.g.: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…) has the Ideal Discounted Cumulative Gain: IDCG

Normalise: nDCG = DCG / IDCG (a small computation sketch follows)
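A minimal sketch of the normalisation, using the first DCG formulation above (names are illustrative):

import math

def dcg(gains):
    # DCG over a list of graded relevance values, in ranked order
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains, all_gains):
    # Normalise by the DCG of the ideal ordering of all known gains
    ideal = dcg(sorted(all_gains, reverse=True)[:len(gains)])
    return dcg(gains) / ideal if ideal > 0 else 0.0

# "our rank" from the next slide, against the ideal list starting (5,5,5,4,4,3,...)
print(ndcg([5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4],
           [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))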

Page 53: Information Retrieval Evaluation

53

some more variations

E.g.: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…) has the Ideal Discounted Cumulative Gain: IDCG
“our rank”: (5,2,0,0,5,2,4,0,0,1,4,…) → two ranked lists
– rank correlation measures (a SciPy example follows)
  Kendall tau (similarity of orderings)
  Pearson correlation (linear correlation between variables)
  Spearman rho (Pearson applied to ranks)
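For these rank-correlation measures, SciPy provides ready-made implementations; a small illustrative example (the two score lists are made up):

from scipy.stats import kendalltau, pearsonr, spearmanr

# Hypothetical scores that two measures assign to the same five systems
map_scores = [0.31, 0.28, 0.25, 0.22, 0.18]
p10_scores = [0.55, 0.60, 0.48, 0.40, 0.35]

tau, _ = kendalltau(map_scores, p10_scores)   # similarity of orderings
r, _ = pearsonr(map_scores, p10_scores)       # linear correlation
rho, _ = spearmanr(map_scores, p10_scores)    # correlation of the ranks
print(tau, r, rho)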

Page 54: Information Retrieval Evaluation

54

some more variations

Rank-biased precision (RBP)
– “log-based discount is not a good model of users’ behaviour”
– imagine the probability p of the user moving on to the next document (formula below)
  p ~ 0.95 … p ~ 0.0
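The RBP formula itself is not in the transcript; Moffat and Zobel’s definition, with persistence parameter p and relevance r_i at rank i, is:

\mathrm{RBP} = (1 - p) \sum_{i=1}^{\infty} r_i \, p^{\,i-1}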

Page 55: Information Retrieval Evaluation

55

Time-based calibration

Assumption
– The objective of the search engine is to improve the efficiency of an information-seeking task
Extend nDCG to replace the discount with a time-based function (Smucker and Clarke, 2011)
– the measure combines a normalization term, a gain term, and a decay term that is a function of the time needed to reach item k in the ranked list

Page 56: Information Retrieval Evaluation

56

The water filling model (Luo et al, 2013)

and the corresponding Cube Test (CT)

– also for professional search
– to capture embedded subtopics
– no assumption of linear traversal of documents
– takes into account time
– potential cap on the amount of information taken into account
– high discriminative power

Page 57: Information Retrieval Evaluation

57

Other diversity metrics

several aspects of the topic might [need to] be covered
– Aspectual recall/precision
the discount may take into account previously seen aspects
– α-NDCG = NDCG where the gain of a document for an aspect is reduced for each time that aspect has already been seen higher in the ranking

Page 58: Information Retrieval Evaluation

58

Other measures

• There are many IR measures!
• trec_eval is a little program that computes many of them
– 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20, …)
• http://trec.nist.gov/trec_eval/
• “there is a measure to make anyone a winner”
– Not really true, but still…
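As a practical note (illustrative, not from the slides): trec_eval is typically invoked as "trec_eval -q qrels.txt run.txt", where -q additionally prints per-query scores, the qrels file contains lines of the form "topic-id 0 doc-id relevance", and the run file contains lines of the form "topic-id Q0 doc-id rank score run-tag".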

Page 59: Information Retrieval Evaluation

59

Other measures

• How about correlations between measures?

• Kendall tau values, from Voorhees and Harman, 2004

• Overall they correlate

            P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)       0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)              0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                     0.93  0.87     0.83       0.83     0.67
MAP                              0.88     0.85       0.85     0.64
.5 prec                                   0.77       0.78     0.63
R(1,1000)                                            0.92     0.67
Rel ret                                                       0.66

Page 60: Information Retrieval Evaluation

60

Topic sets

Topic selection
– In early TREC, candidates were rejected if ambiguous
Are all topics equal?
– Mean Average Precision uses the arithmetic mean
– Classical Test Theory experiments (Bodoff and Li, 2007) identified outliers that could change the rankings
MAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.3
GMAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.5
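A one-line check of the GMAP claim: GMAP is the geometric mean of AP, so equal differences in log AP have equal effect, and

\log 0.1 - \log 0.05 = \log 2 = \log 0.5 - \log 0.25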

Page 61: Information Retrieval Evaluation

61

Measure measures

What is the best measure?
– What makes a measure better?
Match to task
– E.g. known-item search: MRR
Something more quantitative?
– Correlations between measures
  Does the system ranking change when using different measures?
  Useful to group measures
– Ability to distinguish between runs
– Measure stability

Page 62: Information Retrieval Evaluation

62

Ad-hoc quiz

It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics

Page 63: Information Retrieval Evaluation

63

Ad-hoc quiz

It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics

Page 64: Information Retrieval Evaluation

64

Measure stability

Success criteria:
– A measure is good if it is able to predict differences between systems (on the average of future queries)
Method
– Split the collection in 2
  1. Use one half as a training collection to rank runs
  2. Use the other half as a test collection to compute how many pair-wise comparisons hold
Observations
– Cut-off measures are less stable than MAP

Page 65: Information Retrieval Evaluation

65

Measure stability

Success criteria:
– A measure is good if it is able to predict differences between systems (on the average of future queries)
Method
– Split the collection in 2
  1. Use one half as a training collection to rank runs
  2. Use the other half as a test collection to compute how many pair-wise comparisons hold
Observations
– Cut-off measures are less stable than MAP

Any other criteria for measure quality?

Page 66: Information Retrieval Evaluation

66

Measure measures

We started with opinions from the ’60s and have seen some measures – have the targets changed?

7 numeric properties of effectiveness metrics (Moffat 2013)

Page 67: Information Retrieval Evaluation

68

7 properties of effectiveness metrics

Boundedness – the set of scores attainable by the metric is bounded, usually in [0,1]

Monotonicity – if a ranking of length k is extended so that k+1 elements are included, the score never decreases

Convergence – if a document outside the top k is swapped with a less relevant document inside the top k, the score strictly increases

Top-weightedness – if a document within the top k is swapped with a less relevant one higher in the ranking, the score strictly increases

Localization – a score at depth k can be computed based solely on knowledge of the documents that appear in the top k

Completeness – a score can be calculated even if the query has no relevant documents

Realizability – provided that the collection has at least one relevant document, it is possible for the score at depth k to be maximal.

Page 68: Information Retrieval Evaluation

69

So far

– introduction
– metrics

we are now able to say “System A is better than System B” – a very strong statement!
or are we? Remember:
- we only have limited data
- potential future applications are unbounded

Page 69: Information Retrieval Evaluation

70

Statistical validity

Whatever evaluation metric is used, all experiments must be statistically valid
– i.e. differences must not be the result of chance

[Figure: bar chart of MAP values, y-axis from 0 to 0.2]

Page 70: Information Retrieval Evaluation

71

Statistical validity

• Ingredients of a significance test
– A test statistic (e.g. the differences between AP values)
– A null hypothesis (e.g. “there is no difference between the two systems”)
  This gives us a particular distribution of the test statistic
– An alternative hypothesis (one- or two-tailed tests)
  don’t change it after the test
– A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
  • P-value
• If the p-value is low, we can feel confident that we can reject the null hypothesis → the systems are different

Page 71: Information Retrieval Evaluation

72

Statistical validity

Common practice is to declare systems different when the p-value <= 0.05

A few tests
– Randomization tests
– Wilcoxon signed-rank test
– Sign test
– Bootstrap test
– Student’s paired t-test
See a recent discussion in the SIGIR Forum
– T. Sakai – Statistical Reform in Information Retrieval?
  effect sizes, confidence intervals

Page 72: Information Retrieval Evaluation

73

Statistical validity

How do we increase the statistical validity of an experiment?

By increasing the number of topics
– The more topics, the more confident we are that the difference between average scores will be significant
What’s the minimum number of topics?
42?
• Depends, but
• TREC started with 50
• Below 25 is generally considered not significant

Page 73: Information Retrieval Evaluation

74

Example Experimental Results

Page 74: Information Retrieval Evaluation

75

t-Test

Assumption is that the difference between the effectiveness values is a sample from a normal distribution

Null hypothesis is that the mean of the distribution of differences is zero

Test statistic: the mean of the per-topic differences divided by its standard error
– for the example, t = 2.33 (see the next slides; a SciPy sketch follows)
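A minimal sketch of the paired t-test on per-topic AP differences (the AP values below are made up; scipy.stats.ttest_rel implements the paired test statistic):

from scipy.stats import ttest_rel

# Hypothetical per-topic AP values for systems A and B on the same 10 topics
ap_a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
ap_b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]

t, p = ttest_rel(ap_b, ap_a)   # paired (matched-topic) t-test
print(t, p)                    # reject H0 "no difference" if p <= 0.05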

Page 75: Information Retrieval Evaluation

76

t-Test

t = 2.33

Page 76: Information Retrieval Evaluation

77

t-Test

t = 2.33

Page 77: Information Retrieval Evaluation

78

Statistical Validity - example

Page 78: Information Retrieval Evaluation

79

Page 79: Information Retrieval Evaluation

80

Page 80: Information Retrieval Evaluation

81

Page 81: Information Retrieval Evaluation

82

Page 82: Information Retrieval Evaluation

83

Summary

so far
– introduction
– metrics
next
– where to get ground truth
– some more metrics
– discussion

Page 83: Information Retrieval Evaluation

84

Retrieval Effectiveness

• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which not?

Page 84: Information Retrieval Evaluation

85

Relevance assessments

• Ideally
– Sit down and look at all documents
• Practically
– The ClueWeb09 collection has
  • 1,040,809,705 web pages, in 10 languages
  • 5 TB, compressed (25 TB, uncompressed)
– No way to do this exhaustively
– Look only at the set of returned documents
  • Assumption: if there are enough systems being tested and not one of them returned a document – the document is not relevant

Page 85: Information Retrieval Evaluation

86

Relevance assessments - Pooling
Combine the results retrieved by all systems
Choose a parameter k (typically 100)
Choose the top k documents as ranked in each submitted run
The pool is the union of these sets of docs
– Between k and (# submitted runs) × k documents in the pool
– The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
Give the pool to judges for relevance assessments (a small construction sketch follows)
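A minimal sketch of pool construction as described above (the data structures are illustrative):

def build_pool(runs, k=100):
    # runs: dict mapping run-id -> ranked list of doc ids (best first)
    # returns the union of the top-k documents of every submitted run
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Pool size is between k and (number of runs) * k documents
pool = build_pool({"runA": ["d3", "d7", "d1"], "runB": ["d7", "d9", "d2"]}, k=2)
print(sorted(pool))  # ['d3', 'd7', 'd9']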

Page 86: Information Retrieval Evaluation

87

From Donna Harman

Page 87: Information Retrieval Evaluation

88

Relevance assessments - Pooling
Conditions under which pooling works [Robertson]
– Range of different kinds of systems, including manual systems
– Reasonably deep pools (100+ from each system)
  But depends on collection size
– The collections cannot be too big. Big is so relative…

Page 88: Information Retrieval Evaluation

89

Relevance assessments - Pooling
Advantage of pooling:
– Fewer documents must be manually assessed for relevance
Disadvantages of pooling:

– Can’t be certain that all documents satisfying the query are found (recall values may not be accurate)

– Runs that did not participate in the pooling may be disadvantaged

– If only one run finds certain relevant documents, but ranks them lower than 100, it will not get credit for these

Page 89: Information Retrieval Evaluation

90

Relevance assessments

Pooling with randomized sampling
As the data collection grows, the top 100 may not be representative of the entire result set
– (i.e. the assumption that everything after it is not relevant does not hold anymore)
Add, to the pool, a set of documents randomly sampled from the entire retrieved set
– If the sampling is uniform, it is easy to reason about, but it may be too sparse as the collection grows
– Stratified sampling: get more from the top of the ranked list [Yilmaz et al.:2008]

Page 90: Information Retrieval Evaluation

91

Relevance assessments - incomplete

• The unavoidable conclusion is that we have to handle incomplete relevance assessments
– Consider unjudged = non-relevant
– Do not consider unjudged at all (i.e. compress the ranked lists)
• A new measure:
– BPref (binary preference) (formula reconstructed below)
  r = a relevant returned document
  R = # documents judged relevant
  N = # documents judged non-relevant
  n = a non-relevant document
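The formula itself was an image; the standard definition (Buckley and Voorhees), using the symbols above, is:

\mathrm{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{|\, n \text{ ranked higher than } r \,|}{\min(R, N)} \right)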

Page 91: Information Retrieval Evaluation

92

Relevance assessments - incomplete

• BPref was designed to mimic MAP
• Soon after, induced AP and inferred AP were proposed
– if the data is complete, they are equal to MAP
– inferred AP is based on the expectation of precision at rank k

Page 92: Information Retrieval Evaluation

93

not only are we incomplete, but we might also be inconsistent in our judgments

Page 93: Information Retrieval Evaluation

94

Relevance assessment - subjectivity
In TREC-CHEM’09 we had each topic evaluated by two students
– “conflicts” ranged between 2% and 33% (excluding a topic with 60% conflict)
– This all increased if we considered “strict disagreement”

In general, inter-evaluator agreement is rarely above 80%

There is little one can do about it

Page 94: Information Retrieval Evaluation

95

Relevance assessment - subjectivity
Good news:

– “idiosyncratic nature of relevance judgments does not affect comparative results” (E. Voorhees)

– Mean Kendall Tau between system rankings produced from different query relevance sets: 0.938

– Similar results held for:
  Different query sets
  Different evaluation measures
  Different assessor types
  Single-opinion vs. group-opinion judgments

Page 95: Information Retrieval Evaluation

96

No assessors

Pooling assumes all relevant documents are found by the systems
– Take this assumption further
Voting-based relevance assessments
– Consider top K only

[Soboroff et al.:2001]

Page 96: Information Retrieval Evaluation

97

Test Collections

Generally created as the result of an evaluation campaign
– TREC – Text Retrieval Conference (USA)
– CLEF – Cross Language Evaluation Forum (EU)
– NTCIR – NII Test Collection for IR Systems (JP)
– INEX – Initiative for the Evaluation of XML Retrieval
– …
First one and paradigm definer:
– The Cranfield Collection
  In the 1950s
  Aeronautics
  1400 queries, about 6000 documents
  Fully evaluated

Page 97: Information Retrieval Evaluation

98

TREC

Started in 1992
Always organised in the States, on the NIST campus
As leader, introduced most of the jargon used in IR Evaluation:
– Topic = query / request for information
– Run = a ranked list of results
– Qrel = relevance judgements

Page 98: Information Retrieval Evaluation

99

TREC

Organised as a set of tracks that focus on a particular sub-problem of IR
– E.g. Patient Records, Session, Chemical, Genome, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
– The set of tracks in a year depends on:
  Interest of participants
  Fit to TREC
  Needs of sponsors
  Resource constraints

Page 99: Information Retrieval Evaluation

100

TREC

[Figure: the TREC yearly cycle – Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Relevance assessments → Results evaluation → Results analysis → TREC conference → Proceedings publication]

Page 100: Information Retrieval Evaluation

101

TREC – Task definition

Each Track has a set of Tasks
Examples of tasks from the Blog track:
1. Finding blog posts that contain opinions about the topic
2. Ranking positive and negative blog posts
3. (A separate baseline task to just find blog posts relevant to the topic)
4. Finding blogs that have a principal, recurring interest in the topic

Page 101: Information Retrieval Evaluation

102

TREC - Topics

For TREC, topics generally have a specific format (not always though)
– <ID>
– <title>
  Very short
– <description>
  A brief statement of what would be a relevant document
– <narrative>
  A long description, meant also for the evaluator to understand how to judge the topic

Page 102: Information Retrieval Evaluation

103

TREC - Topics

Example:
– <ID> 312
– <title> Hydroponics
– <description> Document will discuss the science of growing plants in water or some substance other than soil
– <narrative> A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- …

Page 103: Information Retrieval Evaluation

104

CLEF

Cross Language Evaluation Forum
– From 2010: Conference on Multilingual and Multimodal Information Access Evaluation
– Supported by the PROMISE Network of Excellence
Started in 2000
Grand challenge:
– Fully multilingual, multimodal IR systems
  Capable of processing a query in any medium and any language
  Finding relevant information from a multilingual multimedia collection
  And presenting it in the style most likely to be useful for the user

Page 104: Information Retrieval Evaluation

105

CLEF

• Previous tracks:
  • Mono-, bi-, multilingual text retrieval
  • Interactive cross-language retrieval
  • Cross-language spoken document retrieval
  • QA in multiple languages
  • Cross-language retrieval in image collections
  • CL geographical retrieval
  • CL video retrieval
  • Multilingual information filtering
  • Intellectual property
  • Log file analysis
  • Large-scale grid experiments
• From 2010
– Organised as a series of “labs”

Page 105: Information Retrieval Evaluation

106

MediaEval

dedicated to evaluating new algorithms for multimedia access and retrieval.

emphasizes the 'multi' in multimedia
focuses on human and social aspects of multimedia tasks

– speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates.

http://www.multimediaeval.org/

Page 106: Information Retrieval Evaluation

107

Test collections - summary

it is important to design the right experiment for the right IR task
– Web retrieval is very different from legal retrieval
The example of Patent retrieval
– High Recall: a single missed document can invalidate a patent
– Session-based: single searches may involve days of cycles of result review and query reformulation
– Defendable: process and results may need to be defended in court

Page 107: Information Retrieval Evaluation

108

Outline

Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion

Page 108: Information Retrieval Evaluation

109

User-based evaluation

Different levels of user involvement
– Based on subjectivity levels
1. Relevant/non-relevant assessments
   Used largely in lab-like evaluation as described before
2. User satisfaction evaluation
Some work on 1., very little on 2.
– User satisfaction is very subjective
  UIs play a major role
  Search dissatisfaction can be a result of the non-existence of relevant documents

Page 109: Information Retrieval Evaluation

110

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair

Page 110: Information Retrieval Evaluation

111

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair

Page 111: Information Retrieval Evaluation

112

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document

Page 112: Information Retrieval Evaluation

113

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document

Relative judgements of documents
“Is document X more relevant than document Y for the given query?”
- Many more assessments needed
- Better inter-annotator agreement [Rees and Schultz, 1967]

Page 113: Information Retrieval Evaluation

114

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results

Page 114: Information Retrieval Evaluation

115

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results

Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006

Page 115: Information Retrieval Evaluation

116

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results

Some issues, alternatives
– Control for all sorts of user-based biases

Page 116: Information Retrieval Evaluation

117

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results

Some issues, alternatives
– Control for all sorts of user-based biases

Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007

Page 117: Information Retrieval Evaluation

118

User-based evaluation

User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results

Some issues, alternatives
– Control for all sorts of user-based biases
– Two-panel evaluation
  – limits the number of systems which can be evaluated
  – is unusable in real-life contexts

– Interspersed ranked list with click monitoring

Page 118: Information Retrieval Evaluation

119

Effectiveness evaluation: lab-like vs. user-focused

Results are mixed: some experiments show correlations, some do not

Do User Preferences and Evaluation Measures Line Up? SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas
– shows the existence of correlations
  User preference is inherently user dependent
  Domain-specific IR will be different

The Relationship between IR Effectiveness Measures and User Satisfaction, SIGIR 2007, Al-Maskari, Sanderson, Clough
– strong correlation between user satisfaction and DCG, which disappeared when normalized to NDCG

Page 119: Information Retrieval Evaluation

120

Predicting performance

Future data and queries
– not absolute, but relative performance
– ad-hoc evaluations suffer in particular
– no comparison between lab and operational settings
  for justified reasons, but still none
– how much better must a system be?
  generally, require statistical significance

[Trippe:2011]

Page 120: Information Retrieval Evaluation

121

Predictive performance

Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated

Page 121: Information Retrieval Evaluation

122

Predictive performance

Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated
– “retrofit” metrics that are not considered resilient to such evolution
  RBP [Webber:2009]
  Precision@n [Lipani:2014], Recall@n […]
Why do this?
- Precision@n and Recall@n are loved in industry
- Also in industry, technology migration steps are high (i.e. hold on to a system that ‘works’ until it is patently obvious it affects business performance)

Page 122: Information Retrieval Evaluation

123

Are Lab evals sufficient?

Patent search is an active process where the end-user engages in a process of understanding and interacting with the information

evaluation needs a definition of success
– success ~ lower risk
  partly precision and recall
  partly (some argue the most important part) the intellectual and interactive role of the patent search system as a whole
series of evaluation layers
– lab evals are now the lowest level
– to elevate them, they must measure risk and incentivize systems to provide estimates of confidence in the results they provide

[Trippe:2011]

Page 123: Information Retrieval Evaluation

124

Outline

Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion

Page 124: Information Retrieval Evaluation

125

Discussion on evaluation

Laboratory evaluation – good or bad?
– Rigorous testing
– Over-constrained
I usually make the comparison to a tennis racket:
– No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
– But the user will choose the device based on the lab evaluation

Page 125: Information Retrieval Evaluation

126

Discussion on evaluation

There is bias to account for
– E.g. the number of relevant documents per topic

Page 126: Information Retrieval Evaluation

127

Discussion on evaluation

Recall and recall-related measures are often contested [Cooper:73, p. 95]

– “The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search”

Clearly not true in the legal & patent domains

Page 127: Information Retrieval Evaluation

128

Discussion on Evaluation

Realistic tasks and user models
– Evaluation has to be based on the available data sets. This creates the user model
– Tasks need to correspond to available techniques
Much literature on generating tasks
– Experts describe typical tasks
– Use of log files of various sorts
IR research is decades behind sociology in terms of user modeling – there is a place to learn from

Page 128: Information Retrieval Evaluation

129

Discussion on Evaluation

Competitiveness
– Most campaigns take pains to explain “This is not a competition – this is an evaluation”
Competitions are stimulating, but
– Participants are wary of participating if they are not sure to win
  Particularly commercial vendors
– Without special care from organizers, it stifles creativity:
  The best way to win is to take last year’s method and improve it a bit
  Original approaches are risky

Page 129: Information Retrieval Evaluation

130

Discussion on Evaluation

Topical Relevance
What other kinds of relevance factors are there?
– diversity of information
– quality
– credibility
– ease of reading

Page 130: Information Retrieval Evaluation

131

Conclusion

• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
– IR Evaluation research included
• Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
– As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
– VideOlympics (2007-2009)
– PatOlympics (2010-2012)

Page 131: Information Retrieval Evaluation

132

Bibliography

M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems. 2010
E. Voorhees, D. Harman (eds.). TREC – Experiment and Evaluation in Information Retrieval
S. Robertson. On the history of evaluation in IR. Journal of Information Science, 2008
M. Smucker, J. Allan, B. Carterette. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. CIKM’07
E. Yilmaz, E. Kanoulas, J. Aslam. A Simple and Efficient Sampling Method for Estimating AP and NDCG. SIGIR’08

Page 132: Information Retrieval Evaluation

133

Bibliography

M. Sanderson, M. L. Paramita, P. Clough, E. Kanoulas. Do User Preferences and Evaluation Measures Line Up? 2010
A. Al-Maskari, M. Sanderson. A Review of Factors Influencing User Satisfaction in Information Retrieval. 2010
D. Hawking, T. Tang, R. Sankaranarayana, K. Griffiths, N. Craswell, P. Bailey. Towards higher quality health search results: Automated quality rating of depression websites. 2007
P. Thomas, D. Hawking. Evaluating Sampling Methods for Uncooperative Collections. 2007
F. Radlinski, N. Craswell. Comparing the Sensitivity of Information Retrieval Metrics. 2010
F. Radlinski, P. Bennett, B. Carterette, T. Joachims. Redundancy, Diversity and Interdependent Document Relevance. 2009
P. Bailey, P. Thomas, D. Hawking. Does Brandname Influence Perceived Search Result Quality? Yahoo!, Google, and WebKumara. 2007
D. Kelly. Methods for Evaluating Interactive Information Retrieval Systems with Users. 2009
D. Hawking, T. Rowlands, P. Thomas. C-TEST: Supporting Novelty and Diversity in Testfiles for Search Tuning. 2009
T. Jones, D. Hawking, R. Sankaranarayana. Live Web Search Experiments for the Rest of Us. 2010
T. Tang, N. Craswell, D. Hawking, K. Griffiths, H. Christensen. Quality and relevance of domain-specific search: A case study in mental health. 2006
D. Hawking, P. Thomas, T. Gedeon, T. Jones, T. Rowlands. New methods for creating testfiles: Tuning enterprise search with C-TEST. 2006
A. M. Rees, D. G. Schultz. A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching. Final Report to the National Science Foundation, Volume II, Appendices. Clearinghouse for Federal Scientific and Technical Information, October 1967
J. Luo, C. Wing, H. Yang, M. Hearst. The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search. CIKM 2013
S. Robertson. On sample sizes for non-matched-pair IR experiments. Information Processing & Management, 1990
A. Lipani, M. Lupu, A. Hanbury. Splitting Water: Precision and Anti-Precision to Reduce Pool Bias. SIGIR 2015
W. Webber, L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009