Evaluating the Effectiveness of Information Retrieval Systems
Jing He, Yue Lu
April 21, 2023
Outline
• System-Oriented Evaluation
  – Cranfield Paradigm
  – Measures
  – Test Collections
  – Incomplete Relevance Judgment Evaluation
• User-Oriented Evaluation
Cranfield Paradigm
• Established by [Cleverdon et al. 66]
• Test collection
  – Document collection
  – Topic set
  – Relevance judgments
• Measures
Measure: Binary Relevance
• Binary (set) retrieval
  – Precision and recall
• Ranked retrieval (code sketch below)
  – P-R curve: carries the full information; the measures below are summary statistics of it
  – P@N: insensitive, uses only local information, does not average well
  – Average Precision (AP)
    • Geometric interpretation (area under the P-R curve)
    • Utility interpretation
  – R-precision: the break-even point; approximates the area under the P-R curve
  – RR (reciprocal rank): appropriate for known-item search
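A minimal sketch of these ranked-retrieval measures, assuming the ranking is represented as a list of 0/1 relevance labels and the total number of relevant documents R is known; the function names are illustrative, not from any of the cited papers:

    def precision_at(rels, n):
        """P@n: fraction of the top n documents that are relevant."""
        return sum(rels[:n]) / n

    def average_precision(rels, R):
        """AP: mean of P@i over the ranks i of the relevant documents."""
        hits, total = 0, 0.0
        for i, r in enumerate(rels, start=1):
            if r:
                hits += 1
                total += hits / i          # precision at this relevant doc
        return total / R if R else 0.0

    def r_precision(rels, R):
        """R-precision: precision at rank R (the break-even point)."""
        return precision_at(rels, R) if R else 0.0

    def reciprocal_rank(rels):
        """RR: 1 / rank of the first relevant document (0 if none)."""
        for i, r in enumerate(rels, start=1):
            if r:
                return 1.0 / i
        return 0.0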
Measure: Graded Relevance
• Discounted Cumulated Gain (DCG)
  – discountFun is conventionally log_b (most often b = 2)
• RBP (Rank-Biased Precision) (sketch below)
  – assumes the user moves from each document to the next with persistence probability p, i.e., stops with probability 1 − p
DG_i = discountFun(relevanceFun(d_i), i)
DCG = Σ_i DG_i
RBP = (1 − p) · Σ_i r_i · p^(i−1)
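A minimal sketch of DCG and RBP under common parameter choices (the conventional log_b discount with b = 2, and an illustrative RBP persistence p = 0.8); `rels` holds graded gains for DCG and binary relevance for RBP:

    import math

    def dcg(rels, b=2):
        """DCG with the conventional log_b discount:
        DG_i = rel_i / max(1, log_b(i)), summed over ranks i."""
        return sum(r / max(1.0, math.log(i, b))
                   for i, r in enumerate(rels, start=1))

    def rbp(rels, p=0.8):
        """RBP = (1 - p) * sum_i r_i * p^(i-1), where p is the
        probability that the user persists to the next document."""
        return (1 - p) * sum(r * p ** (i - 1)
                             for i, r in enumerate(rels, start=1))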
Measure: Topic Set Integration
• Arithmetic average
  – MAP, MRR, averaged P-R curve
  – P@n and DCG: normalize before averaging (e.g., nDCG)
• Geometric average
  – GMAP: emphasizes difficult topics
• Standardization [Webber et al. SIGIR08] (combined sketch below)
  – z-standardize each topic's scores against a reference distribution of runs (treated as normal), then average
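A combined sketch of the three aggregation schemes over per-topic AP scores. The reference means and standard deviations used for standardization would come from a population of runs; here they are simply passed in as assumed arguments:

    import math
    import statistics

    def map_score(ap_per_topic):
        """Arithmetic mean of per-topic AP (MAP)."""
        return statistics.mean(ap_per_topic)

    def gmap_score(ap_per_topic, eps=1e-5):
        """Geometric mean of per-topic AP (GMAP); the log transform makes
        low-scoring (difficult) topics dominate, and eps smooths zeros."""
        return math.exp(statistics.mean(math.log(ap + eps)
                                        for ap in ap_per_topic))

    def standardized_score(ap_per_topic, topic_means, topic_stdevs):
        """Webber et al.-style standardization: z-score each topic against
        a reference distribution of runs, then take the arithmetic mean."""
        zs = [(ap - m) / s
              for ap, m, s in zip(ap_per_topic, topic_means, topic_stdevs)]
        return statistics.mean(zs)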
Measure: Compare Systems
• Difference in aggregated scores
  – depends on the number of topics, topic difficulty, etc.
• Significance tests: factors
  – null hypothesis: the two systems have identical performance
  – test statistic: the difference in aggregated scores
  – significance level
• The t-test, randomization test, and bootstrap test agree with one another and are more powerful than the sign and Wilcoxon tests [Smucker et al. CIKM07, Cormack et al. SIGIR07] (randomization-test sketch below)
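A sketch of the paired randomization (permutation) test: under the null hypothesis the two systems' per-topic scores are exchangeable, so the sign of each per-topic difference is flipped at random and we count how often the permuted mean difference is at least as extreme as the observed one. Names and the trial count are illustrative:

    import random
    import statistics

    def randomization_test(scores_a, scores_b, trials=10000, seed=0):
        """Two-sided paired randomization test on per-topic scores;
        returns an estimated p-value."""
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = abs(statistics.mean(diffs))
        extreme = 0
        for _ in range(trials):
            # Under H0, swapping a topic's two scores just flips the sign.
            permuted = [d if rng.random() < 0.5 else -d for d in diffs]
            if abs(statistics.mean(permuted)) >= observed:
                extreme += 1
        return extreme / trials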
Measure: How Good Are the Measures?
• Correlation
  – Correlation between the system rankings produced by different measures: use Kendall's τ or a variant [Yilmaz et al. SIGIR08] (sketch below)
  – All measures are highly correlated, especially AP, R-precision, and nDCG with reasonable weight settings [Voorhees TREC99, Kekalainen IPM05]
• Inference ability [Aslam et al. SIGIR05]
  – Can measure m1's score be inferred from measure m2's score?
  – AP and R-precision can infer P@n, but the converse does not hold
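A sketch of Kendall's τ over the system orderings induced by two measures, counting concordant versus discordant system pairs (tied pairs are simply ignored here, which the variants cited above handle more carefully):

    from itertools import combinations

    def kendall_tau(scores_m1, scores_m2):
        """Kendall's tau between two rankings of the same systems;
        each argument maps a system id to its score under one measure."""
        systems = list(scores_m1)
        concordant = discordant = 0
        for s, t in combinations(systems, 2):
            d1 = scores_m1[s] - scores_m1[t]
            d2 = scores_m2[s] - scores_m2[t]
            if d1 * d2 > 0:
                concordant += 1
            elif d1 * d2 < 0:
                discordant += 1
        n = len(systems)
        return (concordant - discordant) / (n * (n - 1) / 2)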
Test Collection: Documents and Topics
• Document collection
  – Newspaper, newswire, etc.; web page sets
  – 1 billion pages (25 TB) for the TREC 2009 Web track
  – But there is little research on how to construct a document collection for IR evaluation
• Topic set
  – Human-designed topics, or topics drawn from search engine query logs
  – How to select discriminative topics?
    • [Mizzaro and Robertson SIGIR07] propose a method, but it can only be applied a posteriori
Test Collection: Relevance Judgment Agreement
• Judgment agreement
  – The raw agreement rate between assessors is low
  – Nevertheless, the system performance ranking is stable between the topic originator and topic experts, though not for other assessors [Bailey et al. SIGIR08]; it is also stable across TREC assessors [Voorhees and Harman 05]
Test Collection: Selecting Documents to Judge
• How to select documents to judge
  – Pooling (introduced by [Jones et al. 76]) (sketch below)
  – Limitations of pooling
    • Biased toward the contributing systems
    • Biased toward documents containing title words
    • Not efficient enough
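A sketch of depth-k pooling, assuming each run is a ranked list of document ids:

    def depth_k_pool(runs, k=100):
        """Union of the top k documents from every contributing run;
        only these documents are sent to the assessors."""
        pool = set()
        for ranking in runs:
            pool.update(ranking[:k])
        return pool

The biases above follow directly: a document that no contributing run ranks in its top k can never enter the pool, and is conventionally treated as irrelevant.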
Incomplete Relevance Judgment Evaluation
• Motivation
  – dynamic, growing collections vs. constant human labor
• Problems
  – Do traditional measures still work?
  – How to select documents to judge?
Incomplete Problem: Measures
• Buckley and Voorhees's bpref [Buckley and Voorhees SIGIR04]
  – Penalizes each relevant document by the number of judged irrelevant documents ranked above it (sketch below)
• Sakai's condensed measures [Sakai SIGIR07]
  – Simply remove the unjudged documents from the ranking
• Yilmaz and Aslam's infAP [Yilmaz et al. CIKM06, SIGIR08]
  – Estimates average precision by treating the judged documents as a uniform random sample
• Results
  – infAP, the condensed measures, and nDCG are more robust than bpref when judgments are a random sample of the pool
  – infAP is more appropriate for estimating the absolute AP value
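Sketches of bpref (in one common formulation) and a condensed measure, where a ranking is a list with 1 = judged relevant, 0 = judged irrelevant, None = unjudged, and R is the number of judged relevant documents:

    def bpref(rels, R):
        """bpref: each relevant doc is penalized by the fraction of judged
        irrelevant docs (capped at R) ranked above it; unjudged docs are
        skipped entirely."""
        if R == 0:
            return 0.0
        nonrel_above = 0
        total = 0.0
        for r in rels:
            if r is None:
                continue                     # unjudged: ignored
            if r == 1:
                total += 1.0 - min(nonrel_above, R) / R
            else:
                nonrel_above += 1
        return total / R

    def condensed_ap(rels):
        """Condensed AP: delete unjudged docs, then compute ordinary AP
        over the remaining judged ranking."""
        judged = [r for r in rels if r is not None]
        R = sum(judged)
        hits, total = 0, 0.0
        for i, r in enumerate(judged, start=1):
            if r:
                hits += 1
                total += hits / i
        return total / R if R else 0.0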
Incomplete Problem: Selecting Documents to Judge (1)
• Aslam's statAP [Aslam et al. SIGIR06, Allan et al. TREC07, Yilmaz et al. SIGIR08]
  – Extension of infAP (which is based on uniform sampling)
  – Uniform sampling yields too few relevant documents
  – Stratified sampling (sketch below)
    • Higher sampling probability for documents ranked highly by more retrieval systems (like voting)
  – AP is then estimated from the sample with inverse-probability weighting
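A simplified sketch of the stratified-sampling idea behind statAP, not Aslam et al.'s exact estimator: each pooled document gets an inclusion probability proportional to a rank-based voting weight, sampled documents are judged, and totals are recovered by inverse-probability (Horvitz-Thompson style) weighting. All names, and the choice of reciprocal-rank weights, are illustrative:

    import random

    def inclusion_probs(runs, budget):
        """Weight each doc by its summed reciprocal ranks across runs
        (a voting score), scaled so the expected sample size is roughly
        the judging budget (probabilities are capped at 1)."""
        weight = {}
        for ranking in runs:
            for rank, doc in enumerate(ranking, start=1):
                weight[doc] = weight.get(doc, 0.0) + 1.0 / rank
        scale = budget / sum(weight.values())
        return {doc: min(1.0, w * scale) for doc, w in weight.items()}

    def estimate_num_relevant(runs, judge, budget, seed=0):
        """Sample docs by inclusion probability, judge them (judge(doc)
        returns 0 or 1), and estimate the total number of relevant docs."""
        rng = random.Random(seed)
        est = 0.0
        for doc, p in inclusion_probs(runs, budget).items():
            if rng.random() < p:
                est += judge(doc) / p        # inverse-probability weighting
        return est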
Incomplete Problem: Selecting Documents to Judge (2)
• Carterette's minimal test collections (MTC)
  – Select the most "discriminative" document to judge next
  – How to define "discriminative"?
    • By how much the bounds on the AP difference between systems change once this document's relevance is known (simplified sketch below)
  – AP is then estimated from the partial judgments
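A much-simplified, brute-force sketch of the selection criterion only, under the assumptions that unjudged documents are scored as irrelevant and the number of relevant documents is fixed at an estimate R_est; Carterette's actual algorithm instead maintains analytic bounds on the AP difference. Everything here is illustrative:

    def ap_assuming(ranking, judged, R_est):
        """AP over a ranking given partial judgments (dict doc -> 0/1),
        treating unjudged docs as irrelevant."""
        hits, total = 0, 0.0
        for i, doc in enumerate(ranking, start=1):
            if judged.get(doc) == 1:
                hits += 1
                total += hits / i
        return total / R_est if R_est else 0.0

    def most_discriminative(run_a, run_b, judged, unjudged, R_est):
        """Pick the unjudged doc whose two possible judgments move the
        estimated AP difference between the two runs furthest apart."""
        def delta(assumed):
            return (ap_assuming(run_a, assumed, R_est)
                    - ap_assuming(run_b, assumed, R_est))
        best_doc, best_spread = None, -1.0
        for doc in unjudged:
            spread = abs(delta({**judged, doc: 1}) - delta({**judged, doc: 0}))
            if spread > best_spread:
                best_doc, best_spread = doc, spread
        return best_doc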
Incomplete Problem
• With a fixed judging budget, it is more reliable to use more queries with fewer judgments each
• statAP is more appropriate for estimating the absolute AP value
• Minimal test collections are more appropriate for discriminating between systems
User-Oriented Evaluation
• An alternative to batch-mode evaluation: conduct user studies [Kagolovsky&Moehr 03]
  o Actual users use the system and assess the quality of the search process and results
• Advantage:
  o Lets us observe the actual utility of the system, and provides more interpretability in terms of its usefulness
• Deficiencies:
  o Difficult to compare two systems reliably in the same context
  o Expensive to invite many users to participate in the experiments
Criticism of Batch-Mode Evaluation [Kagolovsky&Moehr 03][Harter&Hert ARIST97]
• Expensive judgments
  o Obtaining relevance judgments is time consuming
  o How to overcome this? Predict relevance from implicit information, which is easy to collect with real systems
• Judgment = user need?
  o Judgments may not represent real users' information needs, so the evaluation results may not reflect the real utility of the system
  o Does batch evaluation correlate well with user evaluation?
Expensive judgments (1)
• [Carterette&Jones 07NIPS]
  o Predict the relevance score (nDCG) from clicks after an initial training phase
  o Can identify the better of two rankings 82% of the time with no relevance judgments, and 94% of the time with only two judgments per query
• [Joachims 03TextMining]
  o Compare two systems using click-through data on a combined list generated by interleaving the results of the two systems (sketch below)
  o The results closely followed the relevance judgments using P@n
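A sketch of the interleaving idea: merge the two rankings into one list shown to the user, remember which system contributed each document, and credit clicks accordingly. This is a simplified round-robin variant for illustration, not Joachims's exact balanced-interleaving algorithm:

    def interleave(run_a, run_b):
        """Alternate picks between the two rankings, skipping documents
        already placed; returns the merged list and each doc's source."""
        merged, source, seen = [], {}, set()
        ia, ib, turn_a = 0, 0, True
        while ia < len(run_a) or ib < len(run_b):
            run, i, label = (run_a, ia, 'A') if turn_a else (run_b, ib, 'B')
            while i < len(run) and run[i] in seen:
                i += 1                       # skip duplicates
            if i < len(run):
                merged.append(run[i])
                source[run[i]] = label
                seen.add(run[i])
                i += 1
            if turn_a:
                ia = i
            else:
                ib = i
            turn_a = not turn_a
        return merged, source

    def credit_clicks(clicked, source):
        """Count user clicks per contributing system."""
        wins = {'A': 0, 'B': 0}
        for doc in clicked:
            if doc in source:
                wins[source[doc]] += 1
        return wins

Across many impressions, the system whose contributed documents attract more clicks is inferred to be the better one.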
Expensive judgments (2)
• [Radlinski et al 08CIKM]
  o "Absolute usage metrics" (such as clicks per query or the frequency of query reformulation) fail to reflect retrieval quality
  o "Paired comparison tests" produce reliable predictions
• Summary
  o Reliable pairwise comparisons are available
  o Reliable absolute prediction of relevance scores is still an open research question
Judgment = user need? (1)
• Negative correlation
  o [Hersh et al 00SIGIR]: 24 users on 6 instance-recall tasks
  o [Turpin&Hersh 01SIGIR]: 24 users on 6 QA tasks
  o Neither study found a significant difference in user task effectiveness between systems with significantly different MAP
  o The small number of topics may explain why no correlation was detected
• Mixed correlation
  o [Turpin&Scholer 06SIGIR]: two experiments on 50 queries:
    o a precision-based user task (finding the first relevant document)
    o a recall-based user task (number of relevant documents found in five minutes)
  o Results: no significant relationship between system effectiveness and user effectiveness on the precision task, and a significant but weak relationship on the recall-based task
Judgment = user need? (2)
• Positive correlation
  o [Allan et al 05SIGIR]: 33 users, 45 topics; differences in bpref (0.5-0.98) could produce significant differences in user effectiveness at retrieving faceted document passages
  o [Huffman&Hochster 07SIGIR]: 7 participants, 200 Google queries; assessor satisfaction correlates fairly strongly with the relevance of the top three documents, measured using a version of nDCG
  o [Al-Maskari et al 08SIGIR]: 56 users performing a recall-based task on 56 queries over "good" and "bad" systems; the authors showed that user effectiveness (time consumed, relevant documents collected, queries issued, satisfaction, etc.) and system effectiveness (P@n, MAP) are highly correlated
Judgment = user need? (3)
• Summary
  o Although batch-mode relevance evaluation has its limitations, most recent studies show a high correlation between user evaluation and system evaluation using relevance measures