“Information Retrieval in Peer-to-Peer Systems”
Demetrios Zeinalipour-Yazti
http://www.cs.ucr.edu/~csyiazti/msc.html
M.Sc. Thesis Defense
Monday, May 5, 2003, Surge 349, 12:00-1:00 PM
Thesis Committee: Dr. Dimitrios Gunopulos (Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V. Ravishankar
Dept. of Computer Science & Engineering, University of California - Riverside
Search Techniques for P2P systems
1. Breadth-First Search (Gnutella)
• Idea: Each Query Message is propagated along all outgoing links of a peer using TTL (time-to-live).
• TTL is decremented on each forward until it becomes 0.
• Technique for I.R. in P2P systems such as Gnutella.
• Highlights
– The physical network comes to its knees.
– Long delays for search results.
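The TTL-bounded flooding described on this slide can be sketched as follows. This is a minimal illustration, not Gnutella's actual implementation: the function name `bfs_flood`, the adjacency-list overlay, and the choice to count a message for every forward (including forwards to peers that already saw the query) are assumptions made for the example.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    reached = {source}
    messages = 0
    frontier = deque([(source, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward
        for neighbor in graph[peer]:
            messages += 1  # each forward costs a message, even a duplicate
            if neighbor not in reached:
                reached.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return reached, messages

# Hypothetical 5-peer overlay (adjacency lists), not taken from the thesis.
overlay = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
reached, messages = bfs_flood(overlay, source=0, ttl=2)
print(sorted(reached), messages)
```

Even on this tiny overlay, several of the messages go to peers that already hold the query, which is the source of the traffic problem noted in the highlights.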
3. Searching Using Random Walkers [Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, ICS 2002]
• Idea: Each Query Message is forwarded to 1 neighbor
• With k walkers after T steps we reach roughly the same number of nodes as 1 walker after kT steps. (They use 16-64 walkers)
• Highlights
– Network traffic reduced (from BFS) by 2 orders of magnitude
– Increases the user-perceived delay (from 2-6 hops to 4-15 hops)
– This algorithm is probabilistic and the likelihood to locate the
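The walker idea on this slide can be sketched as below. This is a simplified illustration under stated assumptions: `random_walk_search`, the `has_doc` predicate, and the fixed seed are hypothetical names and choices, and refinements from the cited paper (such as walkers checking back with the requester) are left out.

```python
import random

def random_walk_search(graph, source, has_doc, walkers, max_steps, seed=0):
    """Run `walkers` random walkers in lockstep from `source`.

    Returns (step at which a document was found or None, messages sent).
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    messages = 0
    positions = [source] * walkers
    for step in range(1, max_steps + 1):
        for i, peer in enumerate(positions):
            nxt = rng.choice(graph[peer])  # forward to exactly 1 neighbor
            positions[i] = nxt
            messages += 1
            if has_doc(nxt):
                return step, messages
    return None, messages

# Hypothetical 3-peer line overlay: 0 - 1 - 2.
line = {0: [1], 1: [0, 2], 2: [1]}
found = random_walk_search(line, 0, lambda p: p == 1, walkers=4, max_steps=5)
missed = random_walk_search(line, 0, lambda p: False, walkers=4, max_steps=5)
print(found, missed)
```

The message count is bounded by walkers × max_steps regardless of node degree, which is where the traffic saving over flooding comes from; the price is that a search may end without a hit.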
Search Techniques for P2P systems
5. Searching using Local Indices [Arturo Crespo and Hector Garcia-Molina, ICDCS 2002]
• Idea: Create indices which contain “statistics” that reveal the “direction” towards the documents.
• Types of Proposed Indices
– Compound Routing Index (CRI): metric = number of documents
– Hop-Count Routing Index (HRI): maintain a CRI for k hops
– Exponentially Aggregated Index (ERI): apply some cost formula on HRI to shrink HRI’s size.
• Highlights
– Not scalable, expensive routing updates, but better than replicating data indexes.
– Assumes static environment, but no data replication required
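A compound routing index can be pictured as a per-neighbor table of document counts by topic. The sketch below ranks neighbors by a simple sum of counts over the query topics; that scoring rule, the function name `rank_neighbors`, and the sample table are assumptions made for illustration, not the estimator used in the cited paper.

```python
def rank_neighbors(cri, query_topics):
    """Order neighbors by how many matching documents lie in their direction.

    cri: {neighbor: {topic: number_of_documents_reachable_via_neighbor}}
    """
    def score(neighbor):
        # Assumed scoring rule: total documents over all query topics.
        return sum(cri[neighbor].get(topic, 0) for topic in query_topics)
    return sorted(cri, key=score, reverse=True)

# Hypothetical index held by one peer about its three neighbors.
cri = {
    "B": {"databases": 20, "networks": 0},
    "C": {"databases": 5, "networks": 80},
    "D": {"databases": 30, "networks": 10},
}
print(rank_neighbors(cri, ["databases"]))
print(rank_neighbors(cri, ["networks"]))
```

Every change in a peer's document collection has to be reflected in its neighbors' tables, which is the routing-update cost mentioned in the highlights.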
Search Techniques for P2P systems
7. Depth-First-Search and Freenet [I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong, LNCS 2009]
• Idea: Objects are hashed, and the hash of a query is routed based on “key closeness” in a DFS manner.
• Highlights
– Uses caching of key/object for future requests.
– Data replication along the QueryHit path provides availability.
– Anonymity of searcher and publisher.
– Drawbacks: i) Searches ONLY based on Object Identifier.
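Routing by key closeness with depth-first backtracking can be sketched as below. This is an assumption-laden toy: keys are small integers, closeness is absolute difference, and `freenet_lookup` with its `store`/`routes` node layout is a hypothetical structure for the example. Caching and replication along the return path are omitted.

```python
def freenet_lookup(nodes, start, key, hops_to_live):
    """Depth-first lookup; returns the path to a node holding `key`, or None."""
    visited = set()

    def visit(node, htl):
        if htl == 0 or node in visited:
            return None  # out of hops, or loop detected: backtrack
        visited.add(node)
        if key in nodes[node]["store"]:
            return [node]
        # Try neighbors in order of how close their known key is to the target.
        by_closeness = sorted(nodes[node]["routes"].items(),
                              key=lambda entry: abs(entry[0] - key))
        for _, neighbor in by_closeness:
            path = visit(neighbor, htl - 1)
            if path is not None:
                return [node] + path
        return None

    return visit(start, hops_to_live)

# Hypothetical 3-node network; "routes" maps a known key to the neighbor holding it.
nodes = {
    "A": {"store": set(), "routes": {10: "B", 90: "C"}},
    "B": {"store": {12}, "routes": {50: "A"}},
    "C": {"store": {88}, "routes": {10: "A"}},
}
print(freenet_lookup(nodes, "A", 88, hops_to_live=3))
print(freenet_lookup(nodes, "A", 55, hops_to_live=3))
```

The lookup only works when the exact key is known, which matches the drawback listed on the slide.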
Intelligent Search Mechanism (ISM)
Introduction
• Idea: Each Query Message is forwarded intelligently based on what queries a peer answered in the past.
• Components of ISM (for each node u)
a) Profile Mechanism, for each neighbor N(u).
b) Peer Ranking Mechanism, for ranking peers locally and sending a search query only to the ones most likely to answer.
c) Similarity Function, for finding similar search queries.
d) Search Mechanism, for propagating queries based on local
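The profile, similarity and ranking components listed above might fit together roughly as follows. The slide does not give the formulas, so everything concrete here is an assumption: cosine similarity over query terms as the similarity function, a score that sums similarity weighted by the number of results a neighbor returned, and the names `cosine_similarity`, `rank_peers` and `select_targets`.

```python
import math
from collections import Counter

def cosine_similarity(query_a, query_b):
    """Cosine similarity between two queries treated as bags of words."""
    a, b = Counter(query_a.lower().split()), Counter(query_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_peers(profiles, query):
    """profiles: {neighbor: [(past_query, results_returned), ...]}"""
    scores = {
        neighbor: sum(cosine_similarity(past, query) * hits
                      for past, hits in history)
        for neighbor, history in profiles.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def select_targets(profiles, query, m):
    """Forward only to the m best-ranked neighbors."""
    return rank_peers(profiles, query)[:m]

# Hypothetical profiles kept by node u about three neighbors.
profiles = {
    "P1": [("greece elections", 3)],
    "P2": [("stock market crash", 5)],
    "P3": [("greece olympics athens", 2)],
}
print(select_targets(profiles, "greece elections results", m=2))
```

A neighbor with no relevant history scores zero here, so a purely profile-driven choice would never discover it; adding some randomly chosen neighbor is one way around that.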
PeerWare Simulation Infrastructure
PeerWare Components
1. dataGen – The Dataset Generator
2. graphGen – The Network Graph Generator
3. dataPeer – The Data Node
4. searchPeer – The Search Node
Other Administrative Components
• netLaucher – Shell script that launches the network
• netStats – Shell script that provides statistics
• graphPlot – Shell script that plots Graphs based on
• We use a Random Network of 104 peers
– Each peer has documents for 1 country
– The average degree of a node is 7 ~= log2(100) (connected graph)
• We perform two series of experiments
1. 10x10 sequential queries with a delay of 4 sec.
2. 400 random queries with a delay of 4 sec.
• We compare Doc. Ratio (Recall Rate) vs. Num. of messages
– BFS (Gnutella Message Flooding) (forward to degree nodes).
– Random BFS (randomly forward to degree/2 nodes).
– Intelligent Search Mechanism (forward to M=(degree/2)-1 highest RelevanceRank nodes + 1 random).
– >RES Heuristic (forward to degree/2 nodes that answered >RES)
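The four forwarding rules compared above differ only in which neighbors receive the query. The sketch below makes that concrete under several assumptions: degree/2 is taken as integer division, ">RES" is read as "the neighbors that returned the most results so far", and `forward_targets` with its `rank` and `results` dictionaries is a hypothetical interface, not PeerWare code.

```python
import random

def forward_targets(strategy, neighbors, rng, rank=None, results=None):
    """Pick the neighbors a query is forwarded to under each strategy."""
    rank, results = rank or {}, results or {}
    half = len(neighbors) // 2  # assumed reading of degree/2
    if strategy == "bfs":   # flood every neighbor
        return list(neighbors)
    if strategy == "rbfs":  # random half of the neighbors
        return rng.sample(neighbors, half)
    if strategy == "res":   # >RES: neighbors with the most past results
        return sorted(neighbors, key=lambda n: results.get(n, 0),
                      reverse=True)[:half]
    if strategy == "ism":   # M = half - 1 best-ranked neighbors + 1 random
        top = sorted(neighbors, key=lambda n: rank.get(n, 0),
                     reverse=True)[:max(half - 1, 0)]
        rest = [n for n in neighbors if n not in top]
        return top + ([rng.choice(rest)] if rest else [])
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical peer with 7 neighbors, matching the average degree above.
neighbors = [f"n{i}" for i in range(7)]
rng = random.Random(1)
rank = {"n3": 0.9, "n5": 0.7, "n0": 0.1}
results = {"n1": 12, "n2": 8, "n6": 5}
for strategy in ("bfs", "rbfs", "res", "ism"):
    print(strategy, forward_targets(strategy, neighbors, rng, rank, results))
```

With degree 7, BFS forwards to 7 neighbors while the other three forward to 3, so their per-hop fan-out is identical and any difference in recall comes from which neighbors are chosen.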
Recall Rate vs. Num. of messages with TTL=4
• BFS uses ~1050 messages w/ recall rate 100%
• RBFS uses ~220 (20%) msgs w/ recall rate ~50%
• >RES uses ~400 (38%) msgs w/ recall rate ~70%
• ISM uses ~400 (38%) msgs w/ recall rate ~90%
• ISM improves over time since Peer Profiles get more knowledge.
• ISM and >RES start out slow since they use RBFS
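The percentages in parentheses are message counts relative to BFS. A quick check of that arithmetic, using only the approximate figures reported above (the dictionary layout and the helper name `traffic_fraction` are illustrative):

```python
# (messages, recall) per technique as reported for TTL=4; all values approximate.
reported = {
    "BFS": (1050, 1.00),
    "RBFS": (220, 0.50),
    ">RES": (400, 0.70),
    "ISM": (400, 0.90),
}

def traffic_fraction(technique, baseline="BFS"):
    """Messages used by `technique` as a fraction of the baseline's messages."""
    return reported[technique][0] / reported[baseline][0]

for name, (msgs, recall) in reported.items():
    print(f"{name:>5}: {traffic_fraction(name):.0%} of BFS traffic, "
          f"recall {recall:.0%}")
```

Since 220/1050 is about 0.21, the 20% quoted for RBFS is a rounded figure, while 400/1050 matches the quoted 38%.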
Experimental Evaluation
Improving Recall Rate over Time (400 Experiment)
• The 10x10 Queries Experiment suited ISM well
• In this experiment we perform 400 random queries
• BFS's overwhelming messages create two major outbreaks
• ISM improves over time achieving:
Future Work
• Probe different Network Topologies such as ASMap with PowerLaws.
• Deploy larger PeerWares with more queries.
• Probe different Peer-Profile maintenance policies.
• Use Stemming/Stop Words to answer more accurately.
• Compare the performance of our method with newly proposed techniques (random gossiping, random walkers, etc).
• 60% of Gnutella belongs to 20% ISPs. How to exploit that to provide