Evaluating Cross-language Information Retrieval Systems
Carol Peters
IEI-CNR
SPINN Seminar, Copenhagen, 26-27 October 2001
Outline
Why IR System Evaluation is Important
Evaluation programs
An Example
What is an IR System Evaluation Campaign?
An activity which tests the performance of different systems on a given task (or set of tasks) under standard conditions
Permits contrastive analysis of approaches/technologies
How well does the system meet the information need?
System evaluation:
how good are document rankings?
User-based evaluation:
how satisfied is the user?
Why we need Evaluation
evaluation permits hypotheses to be validated and progress assessed
evaluation helps to identify areas where more R&D is needed
evaluation saves developers time and money
CLIR systems are still at the experimental stage, so evaluation is particularly important!
CLIR System Evaluation is Complex
CLIR systems consist of an integration of components and technologies:
need to evaluate single components
need to evaluate overall system performance
need to distinguish methodological aspects from linguistic knowledge
Technology vs. Usage Evaluation
Usage evaluation:
shows the value of a technology for the user
determines the technology thresholds that are indispensable for specific usage
provides directions for the choice of criteria in technology evaluation

The influence of language and culture on the usability of technology needs to be understood.
Organising an Evaluation Activity
select control task(s)
provide data to test and tune systems
define the protocol and metrics to be used in assessing results

The aim is an objective comparison between systems and approaches.
Test Collection
Set of documents - must be representative of the task of interest; must be large
Set of "topics" - statements of user needs from which the system data structure (the query) is extracted
Relevance judgments - judgments vary by assessor, but there is no evidence that the differences affect the comparative evaluation of systems
Using Pooling to Create Large Test Collections
Assessors create topics.
A variety of different systems retrieve the top 1000 documents for each topic.
Pools of the unique documents from all submissions are formed, and the assessors judge each pooled document for relevance.
Systems are evaluated using the relevance judgments. (Pool formation is sketched below.)
Ellen Voorhees – CLEF 2001 Workshop
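A minimal sketch of pool formation, assuming each run is represented as a dict mapping topic IDs to ranked lists of document IDs (the data layout and pool depth here are illustrative, not CLEF's actual tooling):

```python
def form_pools(runs, depth=1000):
    """Form judgment pools: per topic, the union of the unique
    documents in the top `depth` positions of every submitted run."""
    pools = {}
    for run in runs:  # run: {topic_id: ranked list of doc ids}
        for topic_id, ranking in run.items():
            pools.setdefault(topic_id, set()).update(ranking[:depth])
    return pools  # assessors then judge each pooled document for relevance
```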
Cross-language Test Collections
Consistency is harder to obtain than for monolingual collections:
parallel or comparable document collections
multiple assessors per topic creation and relevance assessment (for each language)
must take care when comparing evaluations across languages (e.g., a cross-language run to a monolingual baseline)

Pooling is harder to coordinate:
need large, diverse pools for all languages
retrieval results are not balanced across languages
Taken from Ellen Voorhees – CLEF 2001 Workshop
Evaluation Measures
Recall: measures the ability of the system to find all the relevant items

    recall = (no. of relevant items retrieved) / (no. of relevant items in the collection)

Precision: measures the ability of the system to find only relevant items

    precision = (no. of relevant items retrieved) / (total no. of items retrieved)
A recall-precision graph is used to compare systems (a minimal computation is sketched below).
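A minimal sketch of both measures for a single topic (the function name and data layout are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one topic.

    retrieved -- ranked list of document ids returned by the system
    relevant  -- set of document ids judged relevant in the collection
    """
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. 30 of 50 relevant documents among 1000 retrieved:
# precision = 30/1000 = 0.03, recall = 30/50 = 0.60
```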
Main CLIR Evaluation Programs
TIDES: sponsors TREC (Text REtrieval Conference) and TDT (Topic Detection and Tracking) - Chinese-English tracks in 2000; TREC focusing on English/French to Arabic in 2001
NTCIR: National Institute of Informatics, Tokyo - Chinese-English and Japanese-English cross-language tracks
AMARYLLIS: focused on French; the 1998-99 campaign included a cross-language track; the 3rd campaign begins Sept. 2001
CLEF: Cross-Language Evaluation Forum - cross-language evaluation for European languages
Cross-Language Evaluation Forum
Funded by the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (2000-2001)
Extension of the CLIR track at TREC (1997-1999)
Coordination is distributed - national sites for each language in multilingual collection
CLEF Partners (2000-2001)
Eurospider, Zurich, Switzerland (Peter Schäuble, Martin Braschler)
IEEC-UNED, Madrid, Spain (Felisa Verdejo, Julio Gonzalo)
IEI-CNR, Pisa, Italy (Carol Peters)
IZ Sozialwissenschaften, Bonn, Germany (Michael Kluck)
NIST, Gaithersburg MD, USA (Donna Harman, Ellen Voorhees)
University of Hildesheim, Germany (Christa Womser-Hacker)
University of Twente, The Netherlands (Djoerd Hiemstra)
CLEF - Main Goals
Promote research by providing an appropriate infrastructure for:
CLIR system evaluation, testing and tuning
comparison and discussion of results
building of test suites for system developers
CLEF 2001 Task Description
Four main evaluation tracks in CLEF 2001:
multilingual information retrieval
bilingual IR
monolingual (non-English) IR
domain-specific IR
plus an experimental track for interactive C-L systems
CLEF 2001 Data Collection
Multilingual comparable corpus of news agency and newspaper documents in six languages (DE, EN, FR, IT, NL, SP); nearly 1 million documents
Common set of 50 topics (from which queries are extracted) created in 9 European languages (DE, EN, FR, IT, NL, SP + FI, RU, SV) and 3 Asian languages (JP, TH, ZH)
CLEF 2001 Creating the Queries
Title: European Industry
Description: What factors damage the competitiveness of European industry on the world's markets?
Narrative: Relevant documents discuss factors that render European industry and manufactured goods less competitive with respect to the rest of the world, e.g. North America or Asia. Relevant documents must report data for Europe as a whole rather than for single European nations.
Queries are extracted from topics, using one or more fields (a sketch follows below).
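For illustration, a minimal sketch of query extraction, assuming topics are stored with TREC/CLEF-style field tags (the tag names and helper function are assumptions for this example, not CLEF's actual format):

```python
import re

def extract_query(topic, fields=("title", "desc")):
    """Concatenate the chosen topic fields into a query string;
    a 'TD' run, for instance, uses title + description."""
    parts = []
    for field in fields:
        match = re.search(rf"<{field}>(.*?)</{field}>", topic, re.DOTALL)
        if match:
            parts.append(match.group(1).strip())
    return " ".join(parts)

topic = ("<title>European Industry</title>"
         "<desc>What factors damage the competitiveness of "
         "European industry on the world's markets?</desc>")
print(extract_query(topic))
```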
CLEF 2001 Creating the Queries
Distributed activity (Bonn, Gaithersburg, Pisa, Hildesheim, Twente, Madrid)
Each group produced 13-15 queries (topics): 1/3 local, 1/3 European, 1/3 international
Topic selection at a meeting in Pisa (50 topics)
Topics were created in DE, EN, FR, IT, NL, SP and additionally translated into SV, RU, FI and TH, JP, ZH
Cleanup after topic translation
CLEF 2001 Multilingual IR
[Diagram: topics in any of DE, EN, FR, IT, FI, NL, SP, SV, RU, ZH, JP, TH are fed to the participant's cross-language information retrieval system, which searches the English, German, French, Italian and Spanish document collections and returns one result list of DE, EN, FR, IT and SP documents ranked in decreasing order of estimated relevance.]
CLEF 2001 Bilingual IR
Task: query the English or Dutch target document collections
Goal: retrieve documents in the target language, presenting the results in a ranked list
An easier task for beginners!
CLEF 2001 Monolingual IR
Task: querying document collections in FR|DE|IT|NL|SP
Goal: acquire a better understanding of language-dependent retrieval problems
different languages present different retrieval problems
issues involved include word order, morphology, diacritic characters, language variants
CLEF 2001 Domain-Specific IR
Task: querying a structured database from a vertical domain (social sciences) in German
German/English/Russian thesaurus and English translations of document titles
Monolingual or cross-language task
Goal: understand the implications of querying in a domain-specific context
CLEF 2001 Interactive C-L
Task: interactive document selection in an “unknown” target language
Goal: evaluation of results presentation rather than system performance
CLEF 2001: Participation
34 participants from 15 different countries (Europe, N. America, Asia)
Details of Experiments

Track             # Participants   # Runs/Experiments
Multilingual      8                26
Bilingual to EN   19               61
Bilingual to NL   3                3
Monolingual DE    12               25
Monolingual ES    10               22
Monolingual FR    9                18
Monolingual IT    8                14
Monolingual NL    9                19
Domain-specific   1                4
Interactive       3                6
Runs per Topic Language
[Chart: number of runs per topic language across Dutch, English, French, German, Italian, Spanish, Chinese, Finnish, Japanese, Russian, Swedish and Thai]
Topic Fields
[Chart: number of runs by topic-field combination (Title, Description, Narrative: e.g. TDN, TD, T)]
CLEF 2001 Participation
CMU, Eidetica, Eurospider *, Greenwich U, HKUST, Hummingbird, IAI *, IRIT *, ITC-irst *, JHU-APL *, Kasetsart U, KCSL Inc., Medialab, Nara Inst. of Tech., National Taiwan U, OCE Tech. BV, SICS/Conexor, SINAI/U Jaen, Thomson Legal *, TNO TPD *, U Alicante, U Amsterdam, U Exeter, U Glasgow *, U Maryland * (interactive only), U Montreal/RALI *, U Neuchâtel, U Salamanca *, U Sheffield * (interactive only), U Tampere *, U Twente (*), UC Berkeley (2 groups) *, UNED (interactive only)
(* = also participated in 2000)
CLEF 2001 Approaches
All the traditional approaches were used:
commercial MT systems (Systran, Babelfish, Globalink Power Translator, ...); both query and document translation were tried
bilingual dictionary look-up (on-line and in-house tools)
aligned parallel corpora (web-derived)
comparable corpora (similarity thesaurus)
conceptual networks (EuroWordNet, ZH-EN wordnet)
multilingual thesaurus (domain-specific task)
CLEF 2001 Techniques Tested
Text processing for multiple languages:
Porter stemmer, Inxight commercial stemmer, on-site tools
simple generic "quick & dirty" stemming
language-independent stemming
separate stopword lists vs. a single list
morphological analysis
n-gram indexing, word segmentation, decompounding (e.g. Chinese, German; see the sketch below)
use of NLP methods, e.g. phrase identification, morphosyntactic analysis
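Character n-grams are the simplest language-independent technique in this list; a minimal sketch (the tokenisation details are illustrative):

```python
def char_ngrams(text, n=4):
    """Overlapping character n-grams: a language-independent
    alternative to stemming and decompounding at indexing time."""
    text = text.lower().replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "Industriepolitik" shares many 4-grams with both "Industrie" and
# "Politik", approximating German compound splitting without a
# morphological analyser; the same trick indexes unsegmented Chinese.
print(char_ngrams("Industriepolitik"))
```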
CLEF 2001 Techniques Tested
Cross-language strategies included:
integration of methods (MT, corpora and MRDs)
pivot language to translate from L1 to L2 (DE to FR, SP, IT via EN)
n-gram based techniques to match untranslatable words
pre- and post-translation pseudo-relevance feedback (the query is expanded with frequently co-occurring terms; see the sketch below)
vector-based semantic analysis (the query is expanded with semantically similar terms)
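A minimal sketch of pseudo-relevance feedback, assuming documents are already tokenised and stopword-filtered (the names and cut-offs are illustrative):

```python
from collections import Counter

def expand_query(query_terms, top_docs, k=10):
    """Pseudo-relevance feedback: treat the top-ranked documents as
    relevant and add their k most frequent new terms to the query.
    Run before translation (source side) and/or after (target side)."""
    counts = Counter(term for doc in top_docs for term in doc)
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:k]
```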
CLEF 2001 Techniques Tested
Different strategies for merging the results from the separate collections were tried. This still remains an unsolved problem (one common baseline is sketched below).
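A minimal sketch of round-robin merging (the function is illustrative, not a method from the campaign); raw-score merging, i.e. sorting the union of the lists by retrieval score, is the other simple baseline, but scores from different collections and translations are rarely comparable:

```python
def merge_round_robin(result_lists, size=1000):
    """Interleave the per-collection rankings one document at a time
    to produce a single merged result list."""
    merged, rank = [], 0
    while len(merged) < size and any(rank < len(r) for r in result_lists):
        for ranking in result_lists:
            if rank < len(ranking) and len(merged) < size:
                merged.append(ranking[rank])
        rank += 1
    return merged
```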
CLEF 2001 Workshop
Results of the CLEF 2001 campaign were presented at a Workshop, 3-4 September 2001, Darmstadt, Germany
50 researchers and system developers from academia and industry participated.
Working Notes containing preliminary reports and statistics on the CLEF 2001 experiments were distributed.
CLEF-2001 vs. CLEF-2000
Most participants were back
Less MT, more corpus-based
People really started to try each other's ideas/methods:
corpus-based approaches (parallel web, alignments)
n-grams
combination approaches
“Effect” of CLEF
Many more European groups
Dramatic increase in work on stemming/decompounding (for languages other than English)
Work on mining the web for parallel texts
Work on merging (breakthrough still missing?)
Work on combination approaches
CLEF 2002
Accompanying Measure under the IST programme: Contract No. IST-2000-31002. October 2001.
CLEF Consortium: IEI-CNR, Pisa; ELRA/ELDA, Paris; Eurospider, Zurich; UNED, Madrid; NIST, USA; IZ Sozialwissenschaften, Bonn
Associated Members: University of Hildesheim, University of Twente, University of Tampere (?)
CLEF 2002 Task Description
Similar to CLEF 2001:
multilingual information retrieval
bilingual IR (not to English!)
monolingual (non-English) IR
domain-specific IR
interactive track
Plus feasibility study for spoken document track (within DELOS – results reported at CLEF)
Possible coordination with Amaryllis
CLEF 2002 Schedule
Call for Participation - November 2001
Document release - 1 February 2002
Topic release - 1 April 2002
Runs received - 15 June 2002
Results communicated - 1 August 2002
Papers for the Working Notes - 1 September 2002
Workshop - 19-20 September 2002
Evaluation - Summing up
system evaluation is not a competition to find the best system
evaluation provides an opportunity to test, tune, and compare approaches in order to improve system performance
an evaluation campaign creates a community interested in examining the same issues and comparing ideas and experiences
Cross-Language Evaluation Forum
For further information see:
http://www.clef-campaign.org
or contact:
Carol Peters - IEI-CNR
E-mail: [email protected]