Current Developments in Information Retrieval Evaluation

Thomas Mandl
Information Science, University of Hildesheim
[email protected]
Tutorial @ ECIR, Toulouse, 6th Apr. 2009

Who am I?
• Assistant Professor at University of Hildesheim
• Studies at University of Regensburg, Germany, and University of Illinois at UC, USA
• PhD on Neural Networks in IR from University of Hildesheim
• Postdoc thesis (Habilitation) 2006 on Quality in Web IR from University of Hildesheim
• Research on IR
  – Participant at CLEF since 2002
  – Track Coordinator at CLEF since 2006

• Which system is better?
• Management approach?

Different Query Types – Different Evaluation
• Navigational
  – In search of the homepage of company X
• Informational
  – Yellow-Pages queries
  – Question answering
  – Ad hoc (searching everything concerning topic X)
"There must be some fundamental understanding of what it means to be good and what it means to be better" (Bollmann/Cherniavsky 1983, 3)
[Diagram: the IR process – an author creates documents (objects), which indexing turns into an object–attribute matrix (document corpus); an information seeker formulates an information need as a query, which indexing turns into a query representation; a similarity calculation between the two representations produces the result documents; evaluation assesses this matching step.]
Rough Outline
• Cranfield
• Metrics
• Topics
• Users
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation Initiatives
• Topic-Specific Analysis
  – Results
  – Optimization
• User Studies
• Bonus: Site Search Evaluation
• Hands-on Activities
PART 1
Perspectives on the
Cranfield paradigm
Why evaluation?
• IR systems: numerous components, models and approaches
• It is not possible to predict the effectiveness for a certain collection
• No general preference for a model or a certain component has been proven
• The evaluation of effectiveness is crucial
• A holistic evaluation of retrieval processes is difficult
• Success and satisfaction of the users should be the ideal benchmark
Why evaluation?
• User satisfaction
  – Retrieved documents help to satisfy the user's information need
  – User interface
  – System reaction time
  – Adaptivity
• User-oriented evaluation is very complex and difficult
  – individual and subjective impacts
• Mostly evaluation of retrieval systems
  – User treated as a "constant"
  – Replaced by prototypical users (experts)
  – Cranfield paradigm of evaluation
Recall and Precision

Recall = (number of relevant documents retrieved) / (number of relevant documents)

Precision = (number of relevant documents retrieved) / (number of documents retrieved)

• "The ability of the retrieval system to uncover relevant documents is known as the recall power of the system" (Lancaster 1968, 55)
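The two formulas can be sketched in a few lines of Python (an illustrative helper, not from the tutorial; the function name is my own):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single query.

    retrieved: ranked list of document ids returned by the system
    relevant:  set of document ids judged relevant
    """
    found = [d for d in retrieved if d in relevant]
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of 4 retrieved documents are relevant; 2 of 3 relevant documents were found
p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d7"})
# p = 0.5, r = 2/3
```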
• Which Retrieval model is the basis for Recall
and Precision?
Examples

CLEF year | Task type    | Topic language | Number of runs | Correlation
2000      | Multilingual | English        | 21             |  0.26
2001      | Bilingual    | German         |  9             |  0.44
2001      | Multilingual | German         |  5             |  0.19
2001      | Bilingual    | English        |  3             |  0.20
2001      | Multilingual | English        | 17             | -0.34
2002      | Bilingual    | German         |  4             |  0.33
2002      | Multilingual | German         |  4             |  0.43
2002      | Bilingual    | English        | 51             |  0.40
2002      | Monolingual  | German         | 21             |  0.45
2002      | Monolingual  | Spanish        | 28             |  0.21
2003      | Monolingual  | German         | 30             |  0.37
2003      | Monolingual  | Spanish        | 38             |  0.39
2003      | Monolingual  | English        | 11             |  0.16
2002      | Multilingual | English        | 32             |  0.29
2003      | Bilingual    | German         | 24             |  0.21
2003      | Bilingual    | English        |  8             |  0.41
2003      | Multilingual | English        | 74             |  0.31

[Scatter plot: y-axis 0–1.2, x-axis 0–0.35; no further labels recoverable.]
[Recall–precision graph: precision (y-axis, 0–1) plotted against recall (x-axis, 0–0.5).]

Determination of "measuring points":
mostly precision at recall levels of 0.1, 0.2, 0.3, ...
-> arithmetic mean -> Average Precision (AP)
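The "measuring point" construction can be sketched as follows — a minimal, illustrative implementation of interpolated precision at fixed recall levels, averaged arithmetically (function and variable names are my own):

```python
def interpolated_avg_precision(retrieved, relevant, levels=None):
    """Mean interpolated precision at fixed recall 'measuring points'.

    retrieved: ranked list of document ids
    relevant:  non-empty set of relevant document ids
    levels:    recall levels, default 0.1, 0.2, ..., 1.0
    """
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]
    # (recall, precision) after each rank in the result list
    points, hits = [], 0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    # interpolated precision at level r = max precision at any recall >= r
    interpolated = []
    for level in levels:
        at_or_above = [p for r, p in points if r >= level]
        interpolated.append(max(at_or_above) if at_or_above else 0.0)
    return sum(interpolated) / len(interpolated)
```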
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation initiatives
• View on single queries
  – Analysis
  – Topic-specific optimisation
• User studies
Relevance
• Situational relevance describes the (actual) utility of documents concerning the information need
  – hardly possible to capture in practice
  – rather a theoretical construct
• Pertinence is the utility perceived by the user concerning her/his information need
cf. Fuhr 2003
Relevance
• Objective relevance is the relation between the information need and the document, as judged by one or several neutral observers
  – Common basis of system evaluation!
  – How objective can this be?
• System relevance marks the relevance of the document concerning the formal query, as estimated by a system (= similarity)
  – commonly described as: retrieval value (English: Retrieval Status Value, RSV)
cf. Fuhr 2003
Estimation of the Recall
• Precision is directly evident for every user of an IR system
• Recall, however, is neither evident for the user nor is it possible to define it precisely with adequate effort
  – The number of relevant documents is unknown
  – This is especially problematic for information needs which aim at a high recall (e.g. patent novelty search)
cf. Fuhr 2003
Estimation of the Recall
• Pooling method (retrieval with several systems)
  – Apply several IR systems to the same set of documents and mix the results of the different systems
  – Mostly strong overlap in the answer sets of the different systems, so that the effort doesn't increase linearly with the number of analysed systems
cf. Fuhr 2003
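The pooling idea amounts to a set union of the top-ranked documents of every run; a minimal sketch (the function name and the depth parameter are my own, chosen to illustrate the overlap effect):

```python
def build_pool(runs, depth=1000):
    """Merge the top-`depth` documents of each system run into one
    judging pool.

    runs: list of ranked document-id lists, one per system.
    Because the runs overlap strongly, the pool grows much more
    slowly than len(runs) * depth.
    """
    pool = set()
    for run in runs:
        pool.update(run[:depth])
    return pool

# three systems, depth 2: overlap keeps the pool at 3 documents, not 6
pool = build_pool([["a", "b", "c"], ["b", "a", "d"], ["c", "b", "a"]], depth=2)
# pool == {"a", "b", "c"}
```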
Relevant / not relevant
• Binary relevance decisions are often criticised
• New metrics for multi-level relevance are being discussed
  – Binary judgments prevail
  – Often lead to similar results
  – More later
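One widely used family of metrics for multi-level (graded) relevance is nDCG; the slide does not name a specific metric, so this is an illustrative sketch with log-base-2 discounting:

```python
from math import log2

def ndcg(gains, k=None):
    """Normalised Discounted Cumulative Gain for one ranked list.

    gains: graded relevance values (e.g. 0-3) in ranked order.
    """
    if k is not None:
        gains = gains[:k]
    # discount each gain by the log of its rank position
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    # normalise by the DCG of the ideal (best possible) ordering
    ideal = sum(g / log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

# a perfectly ordered list scores 1.0; misordering lowers the score
ndcg([3, 2, 1])  # 1.0
ndcg([1, 2, 3])  # < 1.0
```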
Evaluation
Cranfield paradigm of evaluation in Information Retrieval
• Find objective evaluation standards for a comparison of systems
• Keep the conditions for comparison constant
• Systems work with the same document corpus, the same information needs and the same relevance judgments
• Abstraction from usage situation and context
Evaluation
Cranfield paradigm for evaluation in Information Retrieval
• Objective relevance is judged by a neutral juror
• Relation between the expressed information need and the document
• No individual and subjective relevance assessment in a situational context
• Currently the basis of all evaluation initiatives in Information Retrieval (TREC, CLEF, NTCIR, INEX, ...)
TREC: Text Retrieval Conference
• „TREC is a new ballgame for IR research and
development“ (Sparck Jones 1994)
• Evaluation initiative of the National Institute of
Standards and Technology (NIST) in the USA
• 1992: TREC-1 (Proceedings 1993)
Cross-Language Evaluation Forum
• EU funding: DELOS NoE for Digital Libraries
• Mandl et al. @ CLEF 2003–2006
• Research on evaluation
• System development
• Test environment
• Research on cross- and multilingual Information Retrieval systems
• Benchmarks
Example Topic

<top>
<num>10.2452/89-GC</num>
<title>Trade fairs in Lower Saxony</title>
<desc>Documents reporting about industrial or cultural fairs in Lower Saxony.</desc>
<narr>Relevant documents should contain information about trade or industrial fairs which take place in the German federal state of Lower Saxony, i.e. name, type and place of the fair. The capital of Lower Saxony is Hanover. Other cities include Braunschweig, Osnabrück, Oldenburg and Göttingen.</narr>
</top>
Objectives of Evaluation Initiatives
• To find consistent evaluation standards for
retrieval systems (Standardisation)
• To provide comparison between different
systems
• To advance further development of IR
systems
• To consider the needs of the community
• To advance the evaluation methodology
Procedure
• Test basis
  – objects (documents, ...)
  – queries (topics)
    • relevant information needs for potential users
    • consistent weighting
• Time frame
  – Release of topics
  – Submission of results
  – Publication of results
Document Collection
• Representative for a real-world task
  – Large
  – Diverse
• Often used: news agency and newspaper collections
Relevance Judgment
• Abstraction from the individual user and his context
• Consistent evaluation
• Objective jurors, who are not in the user's situation
• Objective conclusions about the content relation between topic and document
Pooling Method
1. Jurors create topics
2. Systems provide the top 1000 documents for every topic
3. Pooling of all documents found at least once
4. Relevance assessment by jurors
(Ellen Voorhees – CLEF 2001 Workshop)
Procedure
• Intellectual evaluation
  – relevant or not relevant
• Statistical analysis
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation Initiatives
• Topic-Specific Analysis
  – Results
  – Optimization
• User Studies
How reliable is the
evaluation according to
the Cranfield-Paradigm?
GeoCLEF Monolingual English
Bilingual 76% wrt Monolingual
Relevance Assessment
Indirect information
• "foreign aid in Sub-Saharan Africa"
  Is a document on the kidnapping of an aid worker relevant?
• "natural disasters in the Western USA"
  Is a document on the insurance costs caused by a natural disaster relevant?
Interrater Reliability
• Isn’t relevance a rather subjective concept?
• Is there actually a consistency/agreement, if
several jurors evaluate the same set of
documents?
• Wouldn’t this lead to totally different
results?
• Asian approach?
Comparison
• Several assessments could be created
  – by independent jurors
• Several rankings of systems are created
  – using alternative sets of relevance assessments
• How strongly do they differ/vary?
• How to compare rankings?
Comparison of Rankings
• Rank correlation coefficients
  – Number of position changes (swaps)
  – Kendall's tau
  – Spearman coefficient
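Kendall's tau over two system rankings can be sketched with a naive O(n²) pair count (illustrative only; library implementations such as scipy.stats.kendalltau handle ties and large n more carefully):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems.

    rank_a, rank_b: lists of system names in ranked order
                    (same set of names, no ties).
    tau = (concordant - discordant) / number of pairs
    """
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # same sign of the position difference in both rankings
        # means the pair is ordered the same way (concordant)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

kendall_tau(["s1", "s2", "s3"], ["s1", "s2", "s3"])  # 1.0 (identical)
kendall_tau(["s1", "s2", "s3"], ["s3", "s2", "s1"])  # -1.0 (reversed)
```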
Subjectivity of Jurors
• The topic developer is the primary juror
• For TREC-4, the documents in the pool were evaluated several times
  – Evaluation by primary jurors
  – 200 relevant (as far as available) + 200 randomly