Federated text retrieval from uncooperative overlapped collections Milad Shokouhi, RMIT University, Melbourne, Australia Justin Zobel, RMIT University, Melbourne, Australia SIGIR 2007(Collection representation in distributed IR) 2009-03-13 Presented by JongHeum Yeon, IDS Lab., Seoul National University
Federated text retrieval from uncooperative overlapped collections
Milad Shokouhi, RMIT University, Melbourne, Australia
Justin Zobel, RMIT University, Melbourne, Australia
SIGIR 2007 (Collection representation in distributed IR)
2009-03-13
Presented by JongHeum Yeon, IDS Lab., Seoul National University
Copyright 2008 by CEBT
Abstract
Federated information retrieval (FIR)
Send query to multiple collections
Central broker merges the results and ranks them
Duplicated documents in collections
Final results potentially contain a high number of duplicates
Authors propose a method for estimating the rate of overlap among collections based on sampling
Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results
[Figure: a User, a Broker, and three Collections]
Federated Information Retrieval (FIR)
Query is sent simultaneously to several collections
Each collection evaluates the query and returns the results to the broker
Advantage
No need to access the index of the collections
Search over the latest version of documents without crawling and indexing
Broker selects collections that are most likely to return relevant documents
Collection selection problem
Collection representation problem
Result merging problem
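The query flow on this slide can be sketched in a few lines. This is only an illustration, not the authors' system: the function name `federated_search`, the toy collections, and the idea of merging by raw score are all assumptions (real brokers cannot usually compare scores across collections, which is the result merging problem listed above).

```python
def federated_search(query, collections, top_k=10):
    """Send `query` to every collection and merge the answers.

    `collections` maps a collection name to a hypothetical search function
    that returns a list of (doc_id, score) pairs.
    Returns the merged top-k list and the number of duplicate entries in it.
    """
    merged = []
    for name, search in collections.items():
        for doc_id, score in search(query):
            merged.append((score, doc_id, name))

    # Assumption: scores are directly comparable across collections.
    merged.sort(key=lambda row: row[0], reverse=True)
    top = merged[:top_k]

    # Count entries whose document already appeared higher in the list.
    seen = set()
    duplicates = 0
    for _score, doc_id, _name in top:
        if doc_id in seen:
            duplicates += 1
        seen.add(doc_id)
    return top, duplicates
```

With two overlapping toy collections, a duplicate takes up one of the top-ranked slots, which is the waste the later slides try to avoid.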
Collection Selection Problem
FIR techniques assume that the degree of overlap among collections is either none or negligible
However, there are many collections that have a significant degree of overlap
Bibliographic databases
News resources
Selecting collections that are likely to return the same results introduces duplicate documents into the final results
Wastes costly resources
Degrades search effectiveness
Authors propose …
A method that estimates the degree of overlap among collections by sampling from each collection using random queries
Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results
Related Work
Cooperative collection selection techniques
Collections provide the broker with their index statistics and other useful information
CORI, GlOSS, CVV
Uncooperative collection selection techniques
Collections do not provide their index statistics to the broker
The broker samples documents from each collection
ReDDE uses sampled documents for …
– Estimates the number of relevant documents in collections
– Ranks collections according to the number of highly ranked sampled documents
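The two ReDDE bullets can be illustrated with a simplified, ReDDE-style scoring sketch. It is not the published algorithm in full: the fixed `top_n` cutoff, the function name `redde_style_rank`, and the input layout are assumptions made for illustration. The idea kept from ReDDE is that each highly ranked sampled document stands for roughly |C| / |S| documents of its collection.

```python
from collections import defaultdict

def redde_style_rank(ranked_sample_docs, sample_sizes, collection_sizes, top_n=100):
    """Rank collections from a ranking of sampled documents.

    ranked_sample_docs: list of (doc_id, collection_name), best first, as
        produced by running the query over an index of all sampled documents.
    sample_sizes / collection_sizes: dicts keyed by collection name.
    top_n: hypothetical cutoff for "highly ranked" sampled documents.
    """
    scores = defaultdict(float)
    for _doc_id, coll in ranked_sample_docs[:top_n]:
        # One sampled document represents |C| / |S| documents of its collection.
        scores[coll] += collection_sizes[coll] / sample_sizes[coll]
    # Highest estimated number of relevant documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A large collection with a single highly ranked sample can therefore outrank a small collection with several, because each of its samples carries more weight.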
Overlap Estimation
Uses the documents downloaded by query-based sampling to estimate the rate of overlap; requires no additional information
Subset of sample documents
Size of m
The probability that any given document from m1 is also in m2
[Figure: collections C1 and C2, samples S1 and S2, and K]
Expected number of documents
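The slide's formulas did not survive in this transcript, so the following is only a sketch of sample-based overlap estimation under idealized assumptions, not a reconstruction of the authors' equations. It assumes each sample is a uniform random subset of its collection and that collection sizes are known or estimated separately. Under those assumptions a shared document appears in both samples with probability (|S1| / |C1|) × (|S2| / |C2|), so the observed number of common sampled documents can be scaled back up. The names `estimate_overlap` and `simulate` are hypothetical.

```python
import random

def estimate_overlap(sample1, sample2, size1, size2):
    """Estimate how many documents two collections share, from samples only.

    Assumes uniform random samples and known collection sizes size1, size2.
    """
    common = len(set(sample1) & set(sample2))
    # Expected common count is K * (|S1|/|C1|) * (|S2|/|C2|); invert for K.
    return common * size1 * size2 / (len(sample1) * len(sample2))

def simulate(k=2000, size1=10000, size2=8000, n1=1000, n2=1000, seed=0):
    """Build two synthetic collections sharing k documents and estimate k."""
    rng = random.Random(seed)
    shared = [f"shared-{i}" for i in range(k)]
    c1 = shared + [f"c1-{i}" for i in range(size1 - k)]
    c2 = shared + [f"c2-{i}" for i in range(size2 - k)]
    s1 = rng.sample(c1, n1)
    s2 = rng.sample(c2, n2)
    return estimate_overlap(s1, s2, size1, size2)
```

A single run is noisy because only a couple of dozen documents are expected to fall in both samples; averaging `simulate(seed=s)` over many seeds should land near the true overlap of 2000. Query-based samples are not uniform in practice, so real estimates can be biased.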
Overlap Estimation (cont’d)
P(i) follows a binomial distribution
Overlap Estimation (cont’d)
Binomial theorem
Expected number of documents in m1 ∩ m2
The number of overlap documents is independent of the collection size
The ‘RELAX’ Selection Method
Graph G = {(u, v) | vertices u and v are collections; edges indicate overlapping documents between vertices}
Output: a final merged document list that minimizes duplicates
The ‘RELAX’ Selection Method (cont’d)
Overlap Filtering for ReDDE
F-ReDDE
1. The overlaps among collections are estimated as described for the Relax selection
2. Collections are ranked using a resource selection algorithm such as ReDDE
3. Each collection is compared with the previously selected collections. It is removed from the list if it has a high overlap (greater than γ) with any of the previously selected collections. We empirically choose γ = 30% and leave methods for finding the optimum value as future work
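Step 3 above can be sketched as a simple filter over an already ranked list. This is a minimal illustration, not the authors' code: the function name `filter_overlapping`, the `overlap` dictionary keyed by unordered collection pairs, and the optional `k` limit are assumptions, and how the overlap fraction is normalized is not stated on the slide. "Previously selected" is read here as the collections that were kept.

```python
def filter_overlapping(ranked, overlap, gamma=0.30, k=None):
    """Drop collections that overlap too much with ones already kept.

    ranked:  collection names, best first (e.g. from a ReDDE-style ranking).
    overlap: dict mapping frozenset({a, b}) to an estimated overlap fraction;
             missing pairs are treated as having no overlap.
    gamma:   overlap threshold; the slide reports gamma = 30%.
    k:       optional maximum number of collections to keep.
    """
    selected = []
    for coll in ranked:
        too_similar = any(
            overlap.get(frozenset((coll, kept)), 0.0) > gamma
            for kept in selected
        )
        if too_similar:
            continue  # removed from the list, as in step 3
        selected.append(coll)
        if k is not None and len(selected) == k:
            break
    return selected
```

For example, with ranking A, B, C, D and an estimated 60% overlap between A and B, B is dropped and C moves up into its place.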
Testbeds
Authors create three new testbeds with overlapping collections based on the documents available in the TREC GOV dataset
Qprobed-280
The 360 most frequent queries in a search engine in the .gov domain
For each, a random number of documents (between 5000 and 20000) is downloaded as a collection
Generates 280 collections with an average size of 12194 documents
Qprobed-300
Every twentieth collection is merged into a single large collection
Sliding-115
Using a sliding window of 30 000 documents
Generate 112 collections
Testbeds (cont’d)
Qprobed-280
74492 collection pairs < 10% overlap
79 pairs < 90%
1.1% of collection pairs > 50% overlap
Qprobed-300
1.9% of collection pairs > 50% overlap
Sliding-115
2.5% of collection pairs > 50% overlap
Results
The initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated
Document retrieval models are biased towards returning some popular documents for many queries
Samples produced by query-based sampling are not random
Results (cont’d)
Results (cont’d)
Conclusion & Discussion
Pros
Proposes an efficient algorithm for handling duplicates