Query-Driven Indexing for Peer-to-Peer Text Retrieval **
WWW 2007, Banff, Canada

G. Skobeltsyn, T. Luu, I. Podnar *, M. Rajman, K. Aberer
Contact: Gleb Skobeltsyn [email protected] http://

* I. Podnar is currently affiliated with University of Zagreb, Croatia
** The work presented in this paper was (partly) carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European projects BRICKS (507457) and ALVIS (002068).

Experiments: retrieval quality of the query-driven index when compared to Google
[Plot c) x-axis: QFmin / 3 months]

Our goal: Scalable full-text web retrieval in a structured P2P network.

Features:
- Low bandwidth during retrieval, as posting lists of bounded size are transmitted,
- The content of the index adapts to the current query popularity distribution,
- Tradeoff between retrieval quality and index size (i.e., indexing cost).
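To make the first two features concrete, here is a minimal Python sketch of how an index could follow query popularity while keeping posting lists bounded. It is an illustration under stated assumptions, not the authors' implementation: `ranked_results` is a hypothetical callable that returns ranked document ids for a key, `max_key_size` is an invented cap on key length, and the `>=` comparison against `qf_min` is an assumption.

```python
from collections import Counter
from itertools import combinations


def keys_of(query_terms, max_key_size=3):
    """All term combinations (keys) of a query, as sorted tuples."""
    terms = sorted(set(query_terms))
    return [combo
            for size in range(1, min(max_key_size, len(terms)) + 1)
            for combo in combinations(terms, size)]


def key_popularity(query_log, max_key_size=3):
    """Count, for every key, how many logged queries contain it."""
    counts = Counter()
    for query in query_log:
        counts.update(keys_of(query, max_key_size))
    return counts


def build_index(query_log, ranked_results, qf_min, df_max):
    """Index only popular keys; keep at most df_max postings per key.

    ranked_results is a hypothetical callable (key -> ranked doc ids).
    The >= threshold is an assumption made for this sketch.
    """
    counts = key_popularity(query_log)
    return {key: ranked_results(key)[:df_max]
            for key, qf in counts.items() if qf >= qf_min}
```

Re-running `build_index` on a more recent query log changes which keys are indexed, which is the sense in which the index content adapts; `df_max` caps what any single key can contribute to retrieval traffic.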
Processing the query abc with a query-driven index

More details in:
• Skobeltsyn et al.: “Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval”, in Infoscale’07, Suzhou, China, 2007
• Skobeltsyn et al.: “Web Text Retrieval with a P2P Query-Driven Index”, in SIGIR’07, Amsterdam, The Netherlands, 2007
• Alvis project web site: http://globalcomputing.epfl.ch/alvis

[Figure: quality of answer (%) and avg. overlap; answers are binned by overlap: 0%, (0%,25%), [25%,50%), [50%,75%), [75%,100%), 100%]
a) Log history size (days): overlap achieved for different sizes of the query log measured in number of days, with QFmin = 1, DFmax = 600
b) DFmax (documents): overlap achieved for different values of DFmax, with QFmin = 1
c) Overlap achieved for different values of QFmin / 3 months, with DFmax = 600

Example of resolving a query:
>id=481, q=“what did babe ruth do in the 1920”
  “1920 babe ruth”, qf=0 ----> Ov@100= 100%
  “1920 babe”, qf=0 ---------> Ov@100= 9%
+ “1920 ruth”, qf=1 ---------> Ov@100= 33%
+ “babe ruth”, qf=495 -------> Ov@100= 69%
- “1920”, qf=716 ------------> Ov@100= 1%
- “babe”, qf=3196 -----------> Ov@100= 2%
- “ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

Top-20 overlap measure:
• Use Google to answer a query and compare it to the union of top-DFmax Google results for each of its indexed keys,
• Keys are indexed if contained in more than QFmin queries in the global query history.

The naïve approach:
• Distributed single term index – maintains global posting lists for each single term in a DHT
• To process a multi-term query abc it intersects the full posting lists of a, b and c.
• Intersections lead to unscalable retrieval traffic

[Figure: the naïve approach. Labels: querying peer; posting lists P_a, P_b, P_c]

[Figure: term combinations a, b, c, ab, bc, ac, abc of the query abc, shown for three cases]
a) if the posting lists for b and c are truncated (only),
b) if the posting list for a is also truncated,
c) if the key bc is also indexed.

Legend:
- probed combination
- skipped combination
- popularity counter
- truncated posting list
- posting list is used to answer the query
Index items:
- no index item for the key
- candidate index item (only stat.)
- active index item (stat.+TPL)
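The contrast between intersecting full posting lists and combining bounded ones, together with the overlap measure used in the experiments, can be sketched as follows. This is a simplified illustration, not the authors' algorithm: all function names are hypothetical, the index is a plain dictionary standing in for the DHT, and every term combination of the query is probed, so the probe/skip optimisation shown in the figure is deliberately left out.

```python
from itertools import combinations


def naive_answer(full_lists, terms):
    """Intersect full single-term posting lists.

    Returns the matching documents and the number of postings shipped.
    """
    traffic = sum(len(full_lists[t]) for t in terms)
    result = set(full_lists[terms[0]])
    for t in terms[1:]:
        result &= set(full_lists[t])
    return result, traffic


def query_driven_answer(index, terms):
    """Union of the bounded posting lists of every indexed key of the query.

    index maps sorted term tuples to truncated posting lists (assumption).
    """
    terms = sorted(set(terms))
    result, traffic, used = set(), 0, []
    for size in range(len(terms), 0, -1):      # larger keys first
        for key in combinations(terms, size):
            postings = index.get(key)
            if postings is not None:
                result.update(postings)
                traffic += len(postings)
                used.append(key)
    return result, traffic, used


def overlap_at_k(reference_ranking, candidate_docs, k):
    """Fraction of the reference top-k that appears among the candidates."""
    top_k = reference_ranking[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in candidate_docs) / len(top_k)
```

In this sketch the traffic of `naive_answer` grows with the full length of every single-term list, while `query_driven_answer` ships at most DFmax postings per indexed key. `overlap_at_k` corresponds to the comparison described above, with a reference engine's top-k ranking playing the role of the Google answer.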