Overview of the INFILE track at CLEF 2009: multilingual INformation FILtering Evaluation
Romaric Besançon (1), Djamel Mostefa, Olivier Hamon, Khalid Choukri (2), Stéphane Chaudiron, Ismaïl Timimi (3)
(1) CEA LIST  (2) ELDA  (3) Univ. Lille 3 - Geriico
01/10/09
Information Filtering Evaluation
Filter documents from a document stream according to long-term information needs (user profiles)
Second edition of the INFILE track in CLEF
1 participant in 2008
use same data in 2009
Presentation of the INFILE track
Multilingual
English, French, Arabic for both documents and topics
Two tasks
batch filtering: the whole corpus is given to the participants, who must return a list of filtered documents for each topic
adaptive filtering: documents are provided to the participants one at a time through an interactive procedure, with possible automated feedback to adapt the filtering system
closer to real usage in a context of competitive intelligence
Document Collection
Built from a corpus of news from the AFP (Agence France Presse)
almost 1.5 million news items in French, English and Arabic
For the information filtering task:
100 000 documents to filter, in each language
NewsML format: standard XML format for news (IPTC)
Document example
[Figure: NewsML document, with callouts for the document identifier, keywords and headline]
Document example (2)
[Figure: NewsML document, with callouts for the IPTC category, AFP category and content]
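As a rough illustration of how the fields named on these two slides might be read from a document, here is a minimal Python sketch. The XML fragment is a simplified, made-up stand-in: the element names are borrowed from NewsML 1.x, but the actual layout of the AFP documents is not shown in this transcript, so the paths, the identifier value and the text are all assumptions. The IPTC and AFP categories are left out because their encoding is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NewsML-like fragment (not a real AFP document).
SAMPLE = """<NewsML>
  <NewsItem>
    <Identification>
      <NewsIdentifier>
        <PublicIdentifier>urn:example:doc-0001</PublicIdentifier>
      </NewsIdentifier>
    </Identification>
    <NewsComponent>
      <NewsLines>
        <HeadLine>Example headline</HeadLine>
        <KeywordLine>example-keyword</KeywordLine>
      </NewsLines>
      <ContentItem>
        <DataContent><p>First paragraph.</p><p>Second paragraph.</p></DataContent>
      </ContentItem>
    </NewsComponent>
  </NewsItem>
</NewsML>"""


def parse_document(xml_text: str) -> dict:
    """Extract identifier, headline, keywords and content from one document."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext(".//PublicIdentifier", default=""),
        "headline": root.findtext(".//HeadLine", default=""),
        "keywords": [k.text for k in root.iter("KeywordLine") if k.text],
        # Join the text of all <p> elements, one paragraph per line.
        "content": "\n".join(p.text for p in root.iter("p") if p.text),
    }


doc = parse_document(SAMPLE)
```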
Topics
50 interest profiles
20 profiles in the domain of science and technology, developed by competitive intelligence (CI) professionals from French institutes INIST, ARIS, Oto Research, Digiport
30 profiles of general interest
Profiles developed in French/English
Translated into Arabic
Topics
Each profile contains 5 fields:
title: a description in a few words
description: a one-sentence description
narrative: a longer description of what is considered a relevant document
keywords: a set of key words, key phrases or named entities
sample: a sample relevant document (one paragraph)
Participants may use any subset of the fields for their filtering
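To make the five-field profile concrete, the following sketch models a profile as a small Python data class and builds a query from a chosen subset of fields. The class name `Profile`, the helper `build_query`, and the description, narrative and keyword values are all hypothetical; only the field names and the title of topic 101 come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """One interest profile with the five fields listed above."""
    topic_id: int
    title: str
    description: str
    narrative: str
    keywords: list[str] = field(default_factory=list)
    sample: str = ""


def build_query(profile: Profile, fields: tuple[str, ...]) -> str:
    """Concatenate the selected profile fields into one query string."""
    allowed = {"title", "description", "narrative", "keywords", "sample"}
    parts = []
    for name in fields:
        if name not in allowed:
            raise ValueError(f"unknown profile field: {name}")
        value = getattr(profile, name)
        # keywords is a list; every other field is plain text
        parts.append(" ".join(value) if isinstance(value, list) else value)
    return " ".join(p for p in parts if p)


profile = Profile(
    topic_id=101,
    title="Fight against doping in sport",
    description="(hypothetical description text)",
    narrative="(hypothetical narrative text)",
    keywords=["doping", "anti-doping agency"],  # invented for illustration
)
query = build_query(profile, ("title", "keywords"))
```

A system that uses only the title and keywords would call `build_query` as above; passing all five names would reproduce a "full profile" query.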
Topic Example
Some topic examples
in general domain:
101  Fight against doping in sport
102  sport economy
107  Electronic voting
113  Digital Divide
115  The free museums
118  Rising oil prices
119  the subprimes crisis
127  the crisis in Darfur
129  The FARC rebellion
in scientific information domain:
131  E-government stakes
132  Wireless network and health
136  Air pollution and air quality
137  Fight against climate change
138  Drugs and biotechnology
140  Fruits and vegetables intakes and cancer prevention
143  Avian influenza
144  Nanotechnologies and nanosciences
149  Scientific research in Arctic
Constitution of the corpus
Same corpus as INFILE@CLEF 2008
With simulated feedback, we need the ground truth before the campaign
To build the corpus of documents to filter:
find relevant documents for the profiles in the original corpus
use a pooling technique with results of IR tools
4 IR engines (Lucene, Indri, Zettair and CEA search engine), on several query field combinations
iterative pooling using Mixture-of-Experts model
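The basic idea of pooling can be sketched in a few lines of Python: the documents sent to the assessors are the union of the top-ranked results of each engine. This is plain fixed-depth pooling, not the iterative Mixture-of-Experts variant named on the slide, and the rankings, the document identifiers and the depth are invented for illustration.

```python
def pool(runs: dict[str, list[str]], depth: int) -> set[str]:
    """Return the union of the top-`depth` documents of every ranked run."""
    pooled: set[str] = set()
    for ranking in runs.values():
        pooled.update(ranking[:depth])
    return pooled


# Hypothetical ranked result lists for one profile, one per engine.
runs = {
    "lucene": ["d1", "d2", "d3"],
    "indri": ["d2", "d4", "d1"],
    "zettair": ["d5", "d1", "d6"],
}
to_assess = pool(runs, depth=2)
```

Only the pooled documents are judged by assessors; everything outside the pool stays unjudged.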
Constitution of the corpus (2)
keep all documents assessed
documents returned by IR systems but judged not relevant form a set of difficult documents
choose random documents (noise)
[Figure: diagram with the labels collection, retrieved, assessed, relevant, test collection, random]
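The composition described on this slide (all assessed documents plus randomly chosen noise) might be sketched as follows. The function name, the fixed seed, the noise size and the toy identifiers are assumptions; the slides do not say how the random documents were drawn or how many were added.

```python
import random


def build_test_collection(collection, assessed, n_noise, seed=0):
    """Keep every assessed document and add `n_noise` random unassessed ones.

    `assessed` maps a document id to its relevance judgement (True/False).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    candidates = sorted(set(collection) - set(assessed))
    noise = rng.sample(candidates, n_noise)
    return set(assessed) | set(noise)


collection = [f"doc{i}" for i in range(20)]
assessed = {"doc1": True, "doc2": False, "doc3": True}  # doc2: a "difficult" document
test_collection = build_test_collection(collection, assessed, n_noise=5)
```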
Corpus
[Figure: chart of the number of relevant documents for each topic (topics 101 to 150, y-axis from 0 to 200), with one series per language (eng, fre, ara)]
Number of relevant documents for each topic, in each language

                                                       eng       fre       ara
number of documents assessed                          7312      7886      5124
number of relevant documents                          1597      2421      1195
avg number of relevant docs / topic                  31.94     48.42      23.9
std deviation on number of relevant docs / topic     28.45     47.82     23.08
[min,max] number of relevant docs / topic          [0,107]   [0,202]   [0,101]
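The per-topic averages follow directly from the totals, since there are 50 profiles. A short Python check, using only the figures reported on this slide:

```python
N_TOPICS = 50  # number of interest profiles

relevant = {"eng": 1597, "fre": 2421, "ara": 1195}
assessed = {"eng": 7312, "fre": 7886, "ara": 5124}

# Average number of relevant documents per topic, per language.
avg_relevant = {lang: n / N_TOPICS for lang, n in relevant.items()}

# Share of assessed documents that were judged relevant, per language.
relevant_share = {lang: relevant[lang] / assessed[lang] for lang in relevant}

for lang in relevant:
    print(f"{lang}: avg {avg_relevant[lang]:.2f} relevant docs / topic, "
          f"{relevant_share[lang]:.1%} of assessed docs relevant")
```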
Tasks
Batch filtering (02/04 - 30/05):
documents and topics available to participants
return list of filtered documents per topic (unordered)
Adaptive filtering (03/06 - 10/07):
topics available to participants
documents available one at a time (one pass test)
interactive protocol using a client-server architecture (webservice communication)
new document available only if previous one has been filtered
simulated user feedback available for adaptation
limited number of feedbacks (200)
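The adaptive protocol can be pictured as a single pass over the stream with a feedback budget. The sketch below simulates it locally in Python for one profile. It does not reproduce the actual web service, whose interface is not shown here, and it assumes that feedback is requested only for documents the system keeps, which the slides do not state. The function name `run_adaptive`, the callbacks and the toy stream are hypothetical.

```python
from typing import Callable, Iterable


def run_adaptive(
    stream: Iterable[tuple[str, str]],
    relevant_ids: set[str],
    decide: Callable[[str], bool],
    update: Callable[[str, bool], None],
    max_feedback: int = 200,
) -> list[str]:
    """One-pass filtering: each document is decided before the next is seen."""
    kept: list[str] = []
    feedback_left = max_feedback
    for doc_id, text in stream:
        if not decide(text):
            continue  # document discarded, move on to the next one
        kept.append(doc_id)
        if feedback_left > 0:
            # Simulated user feedback, taken from the ground truth.
            feedback_left -= 1
            update(text, doc_id in relevant_ids)
    return kept


feedback_log: list[tuple[str, bool]] = []
stream = [
    ("d1", "doping case in cycling"),
    ("d2", "oil prices rise again"),
    ("d3", "new doping test announced"),
]
kept = run_adaptive(
    stream,
    relevant_ids={"d1", "d3"},
    decide=lambda text: "doping" in text,
    update=lambda text, rel: feedback_log.append((text, rel)),
    max_feedback=1,
)
```

With a budget of 1, only the first kept document produces feedback; the second is kept without any adaptation signal.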
Filtering:
adapted Information Retrieval tools (Lucene)
SVM classifier with external resources (GoogleNews)
textual similarity measures with thresholds
reasoning model (human plausible reasoning)
Adaptation:
adaptation of selection thresholds
user feedback as parameter in reasoning model
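As one possible reading of "textual similarity measures with thresholds" combined with "adaptation of selection thresholds", the sketch below keeps a document when its bag-of-words cosine similarity to the profile text reaches a threshold, and nudges the threshold after each feedback. This is an illustrative scheme, not the method of any participant; the class name, the initial threshold and the step size are assumptions.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


class ThresholdFilter:
    """Keep a document when its similarity to the profile reaches the threshold."""

    def __init__(self, profile_text: str, threshold: float = 0.2, step: float = 0.05):
        self.profile_text = profile_text
        self.threshold = threshold
        self.step = step

    def decide(self, text: str) -> bool:
        return cosine(self.profile_text, text) >= self.threshold

    def update(self, is_relevant: bool) -> None:
        # Relevant feedback loosens the filter, non-relevant feedback tightens it.
        if is_relevant:
            self.threshold = max(0.0, self.threshold - self.step)
        else:
            self.threshold = min(1.0, self.threshold + self.step)


flt = ThresholdFilter("fight against doping in sport")
keep = flt.decide("new doping case in sport")
flt.update(is_relevant=False)  # a false positive would raise the threshold
```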