Evaluating the Effectiveness of Information Retrieval Systems
Jing He, Yue Lu
April 21, 2023
Outline
• System-Oriented Evaluation
  – Cranfield Paradigm
  – Measures
  – Test Collections
  – Incomplete Relevance Judgment Evaluation
• User-Oriented Evaluation
Cranfield Paradigm
• Established by [Cleverdon et al. 66]
• Test collection
  – Document collection
  – Topic set
  – Relevance judgments
• Measures
Measure: Binary Relevance
• Binary (set) retrieval
  – Precision and recall
• Ranked retrieval (code sketch below)
  – P-R curve: carries the full information; the measures below are summary statistics of it
  – P@N: insensitive, uses only local information, does not average well
  – Average Precision (AP)
    • Geometric interpretation (area under the P-R curve)
    • Utility interpretation
  – R-precision: the break-even point; approximates the area under the P-R curve
  – RR (reciprocal rank): appropriate for known-item search
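A minimal sketch of these ranked-retrieval measures, assuming the ranking is represented as a list of 0/1 relevance labels and the total number of relevant documents R is known; the function names are illustrative, not from any of the cited papers:

    def precision_at(rels, n):
        """P@n: fraction of the top n documents that are relevant."""
        return sum(rels[:n]) / n

    def average_precision(rels, R):
        """AP: mean of P@i over the ranks i of the relevant documents."""
        hits, total = 0, 0.0
        for i, r in enumerate(rels, start=1):
            if r:
                hits += 1
                total += hits / i          # precision at this relevant doc
        return total / R if R else 0.0

    def r_precision(rels, R):
        """R-precision: precision at rank R (the break-even point)."""
        return precision_at(rels, R) if R else 0.0

    def reciprocal_rank(rels):
        """RR: 1 / rank of the first relevant document (0 if none)."""
        for i, r in enumerate(rels, start=1):
            if r:
                return 1.0 / i
        return 0.0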
Measure: Graded Relevance
• Discounted Cumulated Gain (DCG)
  – discountFun is conventionally log_b (most often b = 2)
• RBP (Rank-Biased Precision) (sketch below)
  – assumes the user moves from each document to the next with persistence probability p, i.e., stops with probability 1 − p
DG_i = discountFun(relevanceFun(d_i), i)
DCG = Σ_i DG_i
RBP = (1 − p) · Σ_i r_i · p^(i−1)
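A minimal sketch of DCG and RBP under common parameter choices (the conventional log_b discount with b = 2, and an illustrative RBP persistence p = 0.8); `rels` holds graded gains for DCG and binary relevance for RBP:

    import math

    def dcg(rels, b=2):
        """DCG with the conventional log_b discount:
        DG_i = rel_i / max(1, log_b(i)), summed over ranks i."""
        return sum(r / max(1.0, math.log(i, b))
                   for i, r in enumerate(rels, start=1))

    def rbp(rels, p=0.8):
        """RBP = (1 - p) * sum_i r_i * p^(i-1), where p is the
        probability that the user persists to the next document."""
        return (1 - p) * sum(r * p ** (i - 1)
                             for i, r in enumerate(rels, start=1))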
Measure: Topic Set Integration
• Arithmetic average
  – MAP, MRR, averaged P-R curve
  – P@n and DCG: normalize before averaging (e.g., nDCG)
• Geometric average
  – GMAP: emphasizes difficult topics
• Standardization [Webber et al. SIGIR08] (combined sketch below)
  – z-standardize each topic's scores against a reference distribution of runs (treated as normal), then average
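A combined sketch of the three aggregation schemes over per-topic AP scores. The reference means and standard deviations used for standardization would come from a population of runs; here they are simply passed in as assumed arguments:

    import math
    import statistics

    def map_score(ap_per_topic):
        """Arithmetic mean of per-topic AP (MAP)."""
        return statistics.mean(ap_per_topic)

    def gmap_score(ap_per_topic, eps=1e-5):
        """Geometric mean of per-topic AP (GMAP); the log transform makes
        low-scoring (difficult) topics dominate, and eps smooths zeros."""
        return math.exp(statistics.mean(math.log(ap + eps)
                                        for ap in ap_per_topic))

    def standardized_score(ap_per_topic, topic_means, topic_stdevs):
        """Webber et al.-style standardization: z-score each topic against
        a reference distribution of runs, then take the arithmetic mean."""
        zs = [(ap - m) / s
              for ap, m, s in zip(ap_per_topic, topic_means, topic_stdevs)]
        return statistics.mean(zs)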
Measure: Compare Systems
• Difference in aggregated scores
  – depends on the number of topics, topic difficulty, etc.
• Significance tests: factors
  – null hypothesis: the two systems have identical performance
  – test statistic: the difference in aggregated scores
  – significance level
• The t-test, randomization test, and bootstrap test agree with one another and are more powerful than the sign and Wilcoxon tests [Smucker et al. CIKM07, Cormack et al. SIGIR07] (randomization-test sketch below)
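A sketch of the paired randomization (permutation) test: under the null hypothesis the two systems' per-topic scores are exchangeable, so the sign of each per-topic difference is flipped at random and we count how often the permuted mean difference is at least as extreme as the observed one. Names and the trial count are illustrative:

    import random
    import statistics

    def randomization_test(scores_a, scores_b, trials=10000, seed=0):
        """Two-sided paired randomization test on per-topic scores;
        returns an estimated p-value."""
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = abs(statistics.mean(diffs))
        extreme = 0
        for _ in range(trials):
            # Under H0, swapping a topic's two scores just flips the sign.
            permuted = [d if rng.random() < 0.5 else -d for d in diffs]
            if abs(statistics.mean(permuted)) >= observed:
                extreme += 1
        return extreme / trials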
Measure: How Good Are the Measures?
• Correlation
  – Correlation between the system rankings produced by different measures: use Kendall's τ or a variant [Yilmaz et al. SIGIR08] (sketch below)
  – All measures are highly correlated, especially AP, R-precision, and nDCG with reasonable weight settings [Voorhees TREC99, Kekalainen IPM05]
• Inference ability [Aslam et al. SIGIR05]
  – Can measure m1's score be inferred from measure m2's score?
  – AP and R-precision can infer P@n, but the converse does not hold
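A sketch of Kendall's τ over the system orderings induced by two measures, counting concordant versus discordant system pairs (tied pairs are simply ignored here, which the variants cited above handle more carefully):

    from itertools import combinations

    def kendall_tau(scores_m1, scores_m2):
        """Kendall's tau between two rankings of the same systems;
        each argument maps a system id to its score under one measure."""
        systems = list(scores_m1)
        concordant = discordant = 0
        for s, t in combinations(systems, 2):
            d1 = scores_m1[s] - scores_m1[t]
            d2 = scores_m2[s] - scores_m2[t]
            if d1 * d2 > 0:
                concordant += 1
            elif d1 * d2 < 0:
                discordant += 1
        n = len(systems)
        return (concordant - discordant) / (n * (n - 1) / 2)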
Test Collection: Documents and Topics
• Document collection
  – Newspaper, newswire, etc.; web page sets
  – 1 billion pages (25 TB) for the TREC 2009 Web track
  – But there is little research on how to construct a document collection for IR evaluation
• Topic set
  – Human-designed topics, or topics drawn from search engine query logs
  – How to select discriminative topics?
    • [Mizzaro and Robertson SIGIR07] propose a method, but it can only be applied a posteriori
Test Collection: Relevance Judgment Agreement
• Judgment agreement
  – The raw agreement rate between assessors is low
  – Nevertheless, the system performance ranking is stable between the topic originator and topic experts, though not for other assessors [Bailey et al. SIGIR08]; it is also stable across TREC assessors [Voorhees and Harman 05]
Test Collection: Selecting Documents to Judge
• How to select documents to judge
  – Pooling (introduced by [Jones et al. 76]) (sketch below)
  – Limitations of pooling
    • Biased toward the contributing systems
    • Biased toward documents containing title words
    • Not efficient enough
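A sketch of depth-k pooling, assuming each run is a ranked list of document ids:

    def depth_k_pool(runs, k=100):
        """Union of the top k documents from every contributing run;
        only these documents are sent to the assessors."""
        pool = set()
        for ranking in runs:
            pool.update(ranking[:k])
        return pool

The biases above follow directly: a document that no contributing run ranks in its top k can never enter the pool, and is conventionally treated as irrelevant.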
Incomplete Relevance Judgment Evaluation
• Motivation
  – dynamic, growing collections vs. constant human labor
• Problems
  – Do traditional measures still work?
  – How to select documents to judge?
Incomplete Problem: Measures
• Buckley and Voorhees's bpref [Buckley and Voorhees SIGIR04]
  – Penalizes each relevant document by the number of judged irrelevant documents ranked above it (sketch below)
• Sakai's condensed measures [Sakai SIGIR07]
  – Simply remove the unjudged documents from the ranking
• Yilmaz and Aslam's infAP [Yilmaz et al. CIKM06, SIGIR08]
  – Estimates average precision by treating the judged documents as a uniform random sample
• Results
  – infAP, the condensed measures, and nDCG are more robust than bpref when judgments are a random sample of the pool
  – infAP is more appropriate for estimating the absolute AP value
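Sketches of bpref (in one common formulation) and a condensed measure, where a ranking is a list with 1 = judged relevant, 0 = judged irrelevant, None = unjudged, and R is the number of judged relevant documents:

    def bpref(rels, R):
        """bpref: each relevant doc is penalized by the fraction of judged
        irrelevant docs (capped at R) ranked above it; unjudged docs are
        skipped entirely."""
        if R == 0:
            return 0.0
        nonrel_above = 0
        total = 0.0
        for r in rels:
            if r is None:
                continue                     # unjudged: ignored
            if r == 1:
                total += 1.0 - min(nonrel_above, R) / R
            else:
                nonrel_above += 1
        return total / R

    def condensed_ap(rels):
        """Condensed AP: delete unjudged docs, then compute ordinary AP
        over the remaining judged ranking."""
        judged = [r for r in rels if r is not None]
        R = sum(judged)
        hits, total = 0, 0.0
        for i, r in enumerate(judged, start=1):
            if r:
                hits += 1
                total += hits / i
        return total / R if R else 0.0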
Incomplete Problem: Selecting Documents to Judge (1)
• Aslam's statAP [Aslam et al. SIGIR06, Allan et al. TREC07, Yilmaz et al. SIGIR08]
  – Extension of infAP (which is based on uniform sampling)
  – Uniform sampling yields too few relevant documents
  – Stratified sampling (sketch below)
    • Higher sampling probability for documents ranked highly by more retrieval systems (like voting)
  – AP is then estimated from the sample with inverse-probability weighting
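A simplified sketch of the stratified-sampling idea behind statAP, not Aslam et al.'s exact estimator: each pooled document gets an inclusion probability proportional to a rank-based voting weight, sampled documents are judged, and totals are recovered by inverse-probability (Horvitz-Thompson style) weighting. All names, and the choice of reciprocal-rank weights, are illustrative:

    import random

    def inclusion_probs(runs, budget):
        """Weight each doc by its summed reciprocal ranks across runs
        (a voting score), scaled so the expected sample size is roughly
        the judging budget (probabilities are capped at 1)."""
        weight = {}
        for ranking in runs:
            for rank, doc in enumerate(ranking, start=1):
                weight[doc] = weight.get(doc, 0.0) + 1.0 / rank
        scale = budget / sum(weight.values())
        return {doc: min(1.0, w * scale) for doc, w in weight.items()}

    def estimate_num_relevant(runs, judge, budget, seed=0):
        """Sample docs by inclusion probability, judge them (judge(doc)
        returns 0 or 1), and estimate the total number of relevant docs."""
        rng = random.Random(seed)
        est = 0.0
        for doc, p in inclusion_probs(runs, budget).items():
            if rng.random() < p:
                est += judge(doc) / p        # inverse-probability weighting
        return est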
Incomplete Problem: Selecting Documents to Judge (2)
• Carterette's minimal test collections (MTC)
  – Select the most "discriminative" document to judge next
  – How to define "discriminative"?
    • By how much the bounds on the AP difference between systems change once this document's relevance is known (simplified sketch below)
  – AP is then estimated from the partial judgments
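A much-simplified, brute-force sketch of the selection criterion only, under the assumptions that unjudged documents are scored as irrelevant and the number of relevant documents is fixed at an estimate R_est; Carterette's actual algorithm instead maintains analytic bounds on the AP difference. Everything here is illustrative:

    def ap_assuming(ranking, judged, R_est):
        """AP over a ranking given partial judgments (dict doc -> 0/1),
        treating unjudged docs as irrelevant."""
        hits, total = 0, 0.0
        for i, doc in enumerate(ranking, start=1):
            if judged.get(doc) == 1:
                hits += 1
                total += hits / i
        return total / R_est if R_est else 0.0

    def most_discriminative(run_a, run_b, judged, unjudged, R_est):
        """Pick the unjudged doc whose two possible judgments move the
        estimated AP difference between the two runs furthest apart."""
        def delta(assumed):
            return (ap_assuming(run_a, assumed, R_est)
                    - ap_assuming(run_b, assumed, R_est))
        best_doc, best_spread = None, -1.0
        for doc in unjudged:
            spread = abs(delta({**judged, doc: 1}) - delta({**judged, doc: 0}))
            if spread > best_spread:
                best_doc, best_spread = doc, spread
        return best_doc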
Incomplete Problem
• With a fixed judging budget, it is more reliable to use more queries with fewer judgments each
• statAP is more appropriate for estimating the absolute AP value
• Minimal test collections are more appropriate for discriminating between systems
User-Oriented Evaluation
• An alternative to batch-mode evaluation: conduct user studies [Kagolovsky&Moehr 03]
  o Actual users use the system and assess the quality of the search process and results
• Advantage:
  o Lets us observe the actual utility of the system, and provides more interpretability in terms of its usefulness
• Deficiencies:
  o Difficult to compare two systems reliably in the same context
  o Expensive to invite many users to participate in the experiments
Criticism of Batch-Mode Evaluation [Kagolovsky&Moehr 03][Harter&Hert ARIST97]
• Expensive judgments
  o Obtaining relevance judgments is time consuming
  o How to overcome this? Predict relevance from implicit information, which is easy to collect with real systems
• Judgment = user need?
  o Judgments may not represent real users' information needs, so the evaluation results may not reflect the real utility of the system
  o Does batch evaluation correlate well with user evaluation?
Expensive judgments (1)
• [Carterette&Jones 07NIPS]
  o Predict the relevance score (nDCG) from clicks after an initial training phase
  o Can identify the better of two rankings 82% of the time with no relevance judgments, and 94% of the time with only two judgments per query
• [Joachims 03TextMining]
  o Compare two systems using click-through data on a combined list generated by interleaving the results of the two systems (sketch below)
  o The results closely followed the relevance judgments using P@n
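A sketch of the interleaving idea: merge the two rankings into one list shown to the user, remember which system contributed each document, and credit clicks accordingly. This is a simplified round-robin variant for illustration, not Joachims's exact balanced-interleaving algorithm:

    def interleave(run_a, run_b):
        """Alternate picks between the two rankings, skipping documents
        already placed; returns the merged list and each doc's source."""
        merged, source, seen = [], {}, set()
        ia, ib, turn_a = 0, 0, True
        while ia < len(run_a) or ib < len(run_b):
            run, i, label = (run_a, ia, 'A') if turn_a else (run_b, ib, 'B')
            while i < len(run) and run[i] in seen:
                i += 1                       # skip duplicates
            if i < len(run):
                merged.append(run[i])
                source[run[i]] = label
                seen.add(run[i])
                i += 1
            if turn_a:
                ia = i
            else:
                ib = i
            turn_a = not turn_a
        return merged, source

    def credit_clicks(clicked, source):
        """Count user clicks per contributing system."""
        wins = {'A': 0, 'B': 0}
        for doc in clicked:
            if doc in source:
                wins[source[doc]] += 1
        return wins

Across many impressions, the system whose contributed documents attract more clicks is inferred to be the better one.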
Expensive judgments (2)
• [Radlinski et al 08CIKM]
  o "Absolute usage metrics" (such as clicks per query or the frequency of query reformulation) fail to reflect retrieval quality
  o "Paired comparison tests" produce reliable predictions
• Summary
  o Reliable pairwise comparisons are available
  o Reliable absolute prediction of relevance scores is still an open research question
Judgment = user need? (1)
• Negative correlation
  o [Hersh et al 00SIGIR]: 24 users on 6 instance-recall tasks
  o [Turpin&Hersh 01SIGIR]: 24 users on 6 QA tasks
  o Neither study found a significant difference in user task effectiveness between systems with significantly different MAP
  o The small number of topics may explain why no correlation was detected
• Mixed correlation
  o [Turpin&Scholer 06SIGIR]: two experiments on 50 queries:
    o a precision-based user task (finding the first relevant document)
    o a recall-based user task (number of relevant documents found in five minutes)
  o Results: no significant relationship between system effectiveness and user effectiveness on the precision task, and a significant but weak relationship on the recall-based task
Judgment = user need? (2)
• Positive correlation
  o [Allan et al 05SIGIR]: 33 users, 45 topics; differences in bpref (0.5-0.98) could produce significant differences in user effectiveness at retrieving faceted document passages
  o [Huffman&Hochster 07SIGIR]: 7 participants, 200 Google queries; assessor satisfaction correlates fairly strongly with the relevance of the top three documents, measured using a version of nDCG
  o [Al-Maskari et al 08SIGIR]: 56 users performing a recall-based task on 56 queries over "good" and "bad" systems; the authors showed that user effectiveness (time consumed, relevant documents collected, queries issued, satisfaction, etc.) and system effectiveness (P@n, MAP) are highly correlated
Judgment = user need? (3)
• Summary
  o Although batch-mode relevance evaluation has its limitations, most recent studies show a high correlation between user evaluation and system evaluation using relevance measures