Efficient Peer to Peer Keyword Searching Nathan Gray
Transcript
Page 1: Efficient Peer to Peer Keyword Searching Nathan Gray.

Efficient Peer to Peer Keyword Searching

Nathan Gray

Page 2:

Introduction

• Current P2P applications (Chord, Freenet) don’t provide keyword search functionality

• The system developed uses a DHT that stores, for each keyword, the list of documents containing it

Page 3:

Introduction

• Topics to be covered:
– Search model and design
– Simulation

• Idea: the authors believe end-user latency is the most important measurement metric
– Most latency comes from network transfer time
– Goal: minimize the number of bytes sent

Page 4:

System Model

• Search
– Associating keywords with document IDs
– Retrieving document IDs matching keywords from the DHT

• Inverted index
– Maps each word found in a document to the list of documents where the word is found
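The inverted index described above can be sketched in a few lines. This is an illustrative toy (the document IDs, corpus, and function names are not from the paper), showing the word-to-document-list mapping the DHT would store:

```python
# Toy inverted index: each word maps to the set of document IDs containing it.
# In the real system, each word's entry would live at its DHT node.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> document text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "d1": "peer to peer keyword search",
    "d2": "distributed hash table keyword lookup",
}
index = build_inverted_index(docs)
print(sorted(index["keyword"]))  # → ['d1', 'd2']
```

A multi-keyword query then reduces to intersecting the per-keyword sets, which is the operation the rest of the talk optimizes.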

Page 5:
Page 6:

Partitioning

• Horizontal
– Requires all nodes to be contacted
– Queries are broadcast to all nodes

• Vertical
– Minimizes the cost of searches: ensures that no more than k servers participate in querying k keywords
– Most changes in a file system occur in bursts, so lazy updating can be used
– Queries are sent to only a small number of hosts
– Throughput grows linearly with system size
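Vertical partitioning can be sketched as hashing each keyword to the one node that owns its document list. The modulo mapping below is a stand-in for a real DHT lookup (e.g. Chord), and the network size is illustrative:

```python
# Sketch of vertical partitioning: each keyword hashes to a single
# responsible node, so a k-keyword query touches at most k nodes.
import hashlib

NUM_NODES = 8  # illustrative network size, not from the paper

def node_for_keyword(keyword: str) -> int:
    digest = hashlib.md5(keyword.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_NODES

query = ["efficient", "keyword", "search"]
nodes = {node_for_keyword(w) for w in query}
assert len(nodes) <= len(query)  # no more than k servers for k keywords
```

Contrast with horizontal partitioning, where every node holds a slice of every list and all `NUM_NODES` hosts would see every query.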

Page 7:

Partitioning

Page 8:

Why Distribute the Searching?

• Google succeeds and it uses centralized searching, but:
– It is a bad idea to concentrate both load and trust on a small number of hosts
– A distributed system would use voluntarily contributed end-user machines
– A distributed system benefits more from replication

• Less susceptible to correlated failures

Page 9:

Ranking

• Key idea
– The order of documents presented to the user
– Google PageRank uses the hyperlinked nature of the web
– P2P doesn’t necessarily have the hyperlinked infrastructure of the web, but can use word position and proximity
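Ranking by word position and proximity can be sketched as scoring a document by the closest distance between occurrences of two query terms. This is a hypothetical illustration of the idea, not the paper's actual ranking function:

```python
# Hypothetical proximity ranking: score a document by the minimum distance
# between occurrences of two query terms (smaller = better rank).
def proximity_score(words, term_a, term_b):
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return float("inf")  # a term is missing: worst possible score
    return min(abs(i - j) for i in pos_a for j in pos_b)

doc = "peer to peer systems support keyword search at scale".split()
print(proximity_score(doc, "keyword", "search"))  # adjacent terms → 1
```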

Page 10:

Update Discovery

• The search engine must discover new, removed, or modified documents

• Distributed environments benefit most from pushed updates as opposed to broadcast
– Efficiency
– Currency of the index

Page 11:

P2P search support

• Want to show that keyword search in P2P is feasible

• Remote servers are contacted
– To look up the mapping of words to documents

• Peers are contacted across the network
– Intersection sets are calculated; the user usually wants only a small subset of the matching documents

Page 12:

P2P Search

• Key challenge:
– Perform efficient searches
– Limit the amount of bandwidth used

• Server A sends its document list to server B, which calculates A ∩ B
• Server B discards most of server A’s information, because the result is smaller than A: only A’s documents that also match B’s keyword survive
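The waste in the naive protocol above can be made concrete with a toy example (the list sizes are illustrative; the 16-byte IDs match the 128-bit hashes mentioned later in the talk):

```python
# Naive AND query: server A ships its whole document list to server B,
# but only the small intersection is actually useful.
list_a = set(range(0, 1000))     # docs containing keyword A (hypothetical)
list_b = set(range(900, 2000))   # docs containing keyword B (hypothetical)

sent_bytes = len(list_a) * 16    # 128-bit (16-byte) document IDs
result = list_a & list_b         # computed at server B
useful_bytes = len(result) * 16

print(sent_bytes, useful_bytes)  # → 16000 1600: 90% of the bytes are discarded
```

This gap between bytes sent and bytes kept is what Bloom filters, caching, and incremental results each attack in the following slides.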

Page 13:
Page 14:

Bloom Filters (BF)

• Recall: a BF summarizes membership in a set

• In this paper:
– BFs act to compress the data sent between servers (the intersections)
– Reduce the amount of communication
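A minimal Bloom filter sketch makes the compression step concrete. This is an assumed implementation (the parameters and class are illustrative, not the paper's code): server A sends BF(A) instead of its full list, and server B tests its own documents against the filter:

```python
# Minimal Bloom filter: a bit array plus k hash functions. True members
# always pass the test; non-members pass only with small probability.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=7):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # bit array stored as one big integer

    def _positions(self, item):
        for i in range(self.k):  # k independent hashes via salted MD5
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

docs_a = {"d1", "d7", "d42"}           # server A's list for its keyword
bf = BloomFilter()
for d in docs_a:
    bf.add(d)

docs_b = {"d7", "d99"}                 # server B's list for its keyword
candidates = {d for d in docs_b if bf.might_contain(d)}
assert "d7" in candidates              # true members always survive
```

Note that `candidates` may contain false positives, which is why the protocol later needs a cleanup phase to remove them.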

Page 15:

BF

• Data is assumed to have 128-bit hashes

• BFs give roughly a 12:1 compression ratio (about 10–11 bits per entry instead of 128)
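The 12:1 figure can be sanity-checked with the standard Bloom filter sizing formula: an optimally sized filter needs about −ln(p)/(ln 2)² bits per entry for false-positive rate p. The rates below are assumed values chosen to bracket the slide's ratio, not numbers from the paper:

```python
# Back-of-the-envelope check of the ~12:1 compression claim against
# 128-bit document hashes, using the optimal Bloom filter bits-per-entry.
import math

def bits_per_entry(p):
    """Bits per entry for an optimally sized BF with false-positive rate p."""
    return -math.log(p) / (math.log(2) ** 2)

for p in (0.01, 0.005):
    bpe = bits_per_entry(p)
    print(f"p={p}: {bpe:.1f} bits/entry, {128 / bpe:.1f}:1 vs 128-bit hashes")
```

At roughly a 0.5–1% false-positive rate this lands between about 11:1 and 13:1, consistent with the slide's 12:1.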

Page 16:
Page 17:

Caches

• Goal:
– Want to store more keywords
– Cache the BF for the successor host
– Keyword popularity follows a Zipf distribution (heavy-tailed)
– Popular keywords dominate

• So caching the BF or the entire document list F(A) gives a high hit ratio
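The claim that a small cache yields a high hit ratio under Zipf-distributed requests can be sketched with a quick simulation. The vocabulary size, cache size, and exponent below are illustrative, not the paper's parameters:

```python
# Why caching works here: under a Zipf popularity distribution, a cache
# holding only the top few percent of keywords serves most requests.
import random

random.seed(0)
VOCAB, CACHE_SIZE, S = 10_000, 500, 1.0   # assumed parameters

# Zipf weights: popularity of rank r is proportional to 1 / r**S
weights = [1 / (r ** S) for r in range(1, VOCAB + 1)]
requests = random.choices(range(VOCAB), weights=weights, k=50_000)

# Cache the 500 most popular keywords (5% of the vocabulary)
hits = sum(1 for kw in requests if kw < CACHE_SIZE)
print(f"hit ratio: {hits / len(requests):.2f}")  # well over half the requests
```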

Page 18:

Cache

• A higher cache hit rate reduces the number of excess bits sent

• The compression ratio grows linearly with the bit reduction

• Consistency
– A TTL scheme is used
– Updates occur at a keyword’s primary location only
– Small staleness factor, expected given Web update patterns (assumption)

Page 19:

Incremental Results

• Look at scalability– Desired # of results wanted– Low cost O(n) with size of network– BF and Caching provide only constant O(1)

improvement in data sent• Chunks

– Partial cache hits for each keyword– Reduce amount of cache allotted to each keyword– Con: Large cpu overhead– Soln:

• Send contiguous chunks• Tell Server B which portion of hash to test
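The chunking idea above can be sketched as server A sending contiguous portions of its sorted document-ID list one round at a time, stopping once the user has enough results. This is a hypothetical simplification (function and parameter names are illustrative):

```python
# Incremental results via contiguous chunks: intersect one chunk of A's
# sorted list per round, and stop early once enough matches are found.
def incremental_intersect(list_a, set_b, wanted, chunk_size):
    results = []
    list_a = sorted(list_a)                    # contiguous ID ranges
    for start in range(0, len(list_a), chunk_size):
        chunk = list_a[start:start + chunk_size]   # one network round
        results.extend(d for d in chunk if d in set_b)
        if len(results) >= wanted:             # user has enough results
            break
    return results[:wanted]

a = range(0, 100)
b = set(range(0, 100, 2))                      # even IDs match keyword B
print(incremental_intersect(a, b, wanted=5, chunk_size=20))  # → [0, 2, 4, 6, 8]
```

Early termination is what turns the O(n) cost into a roughly constant one when the user wants only a fixed number of results.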

Page 20:
Page 21:

Discussion

• Two issues:
– End-to-end query latency
– Number of bytes sent

• BF gives a compression bonus, at the cost of:
– Latency
– Probability of false positives

• Caching
– Reduces the false-positive probability
– Reduces bandwidth costs

Page 22:

Discussion

• Incremental Results (IR)
– Assume the user wants only a limited number of results
• 1) Reduces the number of bytes sent
• 2) End-to-end query latency becomes constant with network growth
– Risk:
• Popular but uncorrelated results
– The entire search space needs to be checked
– Increases the number of bytes sent
– BF still gives a 10:1 compression ratio over the whole document list
– BF and IR complicate ranking schemes
– BFs do not allow:
• Ordering of set members
• Conveying metadata along with the result set

Page 23:

Discussion Cont’d

• If IR sends subsequent chunks in decreasing rank order, the earlier results are the better ones

• Risk: order within a chunk is lost
– Overall order is maintained, though

• Key:
– Ranking is more important than small bandwidth or latency benefits

Page 24:

Simulation

• Goals: test with realistic numbers of nodes in the network
– Bloom filter thresholds and sizes
– Caching
– Incremental results

Page 25:

Simulation Characteristics

• Document corpus: 1.85 GB of HTML
• 1.17 million unique words
• Three types of node distribution:
– Modems
– Backbone links
– Measurements of a Gnutella-like network

• Randomized latencies
– 2,500 square mile grid
– Packets assumed to travel 100K miles/sec (speed of light)

Page 26:

Simulation Cont’d

• Documents: identifiers of 128 bits
• Process:
– Simulate lookup of a keyword in the inverted index
– Map the index to M search results
– Compute node intersections
– Using a BF, send the intersection to another host (size-dependent; might be the whole document list)
– The host checks its cache for the next host’s document list
• Hit? Perform the intersection for that host and skip that communication phase

Page 27:

Experimental Results

• Goal: measure the performance effects of keyword search in a P2P network

Page 28:

Virtual Hosts

• Concept: varying the number of nodes/hosts per machine

• Result:
– Little effect on the amount of data sent over the network
– Network times cut by 60% for local nodes
– Reduced chance of load-balancing issues

Page 29:

BF and Caching

• BF drawback: increased number of network transactions (false-positive checking)
– Initial comparison
– Removal of false positives

Page 30:
Page 31:

BF and Caching Results

• BF
– BF threshold: 300
– For result lists smaller than the threshold, the entire result list is sent
– Why?
• The bandwidth benefit is far smaller than the latency introduced

• Caching
– Decreased the number of bytes sent
– Increased the optimal BF size (~24 bits/entry)
– 50% decrease in the number of bytes sent per query
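The size-dependent choice above can be sketched as a simple cost rule: below the threshold, sending the raw ID list in one round is cheaper overall than a Bloom filter plus a false-positive cleanup round. The helper below is a hypothetical illustration using the slide's threshold and bits-per-entry figures:

```python
# Sketch of the BF-threshold decision: raw 128-bit IDs for small lists,
# a Bloom filter (~24 bits/entry, per the slide) for large ones.
BF_THRESHOLD = 300      # entries; threshold reported on the slide
BITS_PER_ENTRY = 24     # optimal BF size with caching, per the slide

def bytes_to_send(num_entries):
    if num_entries < BF_THRESHOLD:
        return num_entries * 16                 # raw 16-byte document IDs
    return num_entries * BITS_PER_ENTRY // 8    # Bloom filter instead

print(bytes_to_send(100), bytes_to_send(10_000))  # → 1600 30000
```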

Page 32:

Conclusions

• Keyword searching in P2P networks is feasible

• Traffic grows linearly with the size of the network

• Improved completeness relative to crawling (centralized keyword search)

• BF / virtual hosts / caching / incremental results:
– Reduce the network resources consumed
– Decrease end-to-end client search latency