Page 1: Efficient P2P Searches Using Result-Caching From U. of Maryland. Presented by Lintao Liu 2/24/03.

Efficient P2P Searches Using Result-Caching

From U. of Maryland. Presented by Lintao Liu

2/24/03

Page 2:

Motivation & Query Model

Avoid duplicating work and data movement by caching previous query results.

Observation:

ai && aj && ak = (ai && aj) && ak

So we can keep the result for (ai && aj) as a materialized view.

Query model: (ai && aj && ak) || (bi && bj && bk) || …

Page 3:

View and View Tree

A view is the cached result of a previous query.

Where to store the views: use the underlying P2P system's own mechanism.

For example, in Chord, the view "a && b" is stored at the successor of Hash("a && b").

But this alone can't be used to efficiently answer view queries.

Why? For "a1 && a2 && .. && ak", there are 2^k possible views, and you don't know which ones exist.

Page 4:

View and View Tree (Cont.)

Another possible way: a centralized, consistent list of views.

Easy to locate, but with problems: frequent updates, storage requirements.

Proposed solution: the View Tree, implemented as a trie:

a tree for storing strings in which there is one node for every common prefix; the strings are stored in extra leaf nodes.

Scalable, stateless.
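As a concrete illustration, the View Tree as a trie over attribute sequences might look like the following minimal sketch (class and method names are my own, not the paper's):

```python
# Illustrative View Tree sketch: each trie edge is one attribute, and a view
# such as "a && b" is stored at the node reached along the path a -> b.
class ViewTreeNode:
    def __init__(self):
        self.children = {}   # attribute -> ViewTreeNode
        self.view = None     # cached result set, if a view ends here

class ViewTree:
    def __init__(self):
        self.root = ViewTreeNode()

    def insert(self, attrs, result):
        """Store a cached result under the canonical (sorted) attribute path."""
        node = self.root
        for a in sorted(attrs):          # alphabetical canonical order
            node = node.children.setdefault(a, ViewTreeNode())
        node.view = result

    def exact_match(self, attrs):
        """Return the cached result for exactly these attributes, or None."""
        node = self.root
        for a in sorted(attrs):
            node = node.children.get(a)
            if node is None:
                return None
        return node.view
```

Because insertion sorts the attributes first, "b && a" and "a && b" land at the same trie node, matching the canonical-order rule on the next slide.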

Page 5:

View Tree

Nodes at level 1 are single-attribute views.

A canonical order on the attributes is defined and used to uniquely identify equivalent conjunctive queries:

"a && b" and "b && a" are both recorded as "a && b" (alphabetical order).
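The canonicalization step can be sketched in a few lines (illustrative helper, not from the paper):

```python
# Illustrative: canonicalize a conjunctive query by sorting its attributes,
# so "b && a" and "a && b" map to the same view name.
def canonical(query):
    attrs = [t.strip() for t in query.split("&&")]
    return " && ".join(sorted(attrs))
```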

Page 6:

Answering Queries

Finding a smallest set of views to evaluate a query is NP-hard. Instead, the following method is used:

Exact match, if such a match exists.

Forward progress: for each node accessed, at least one attribute must be located which does not occur in the views located so far.
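A minimal sketch of this greedy rule, with the View Tree traversal simplified to a scan over a flat list of available views (names and structure are illustrative, not the paper's actual algorithm):

```python
# Greedy "forward progress" rule: each selected view must contribute at
# least one query attribute not covered by the views chosen so far.
def select_views(query_attrs, available_views):
    query = set(query_attrs)
    # Exact match first, if one exists.
    for v in available_views:
        if set(v) == query:
            return [v]
    covered, chosen = set(), []
    for v in available_views:
        new = (set(v) & query) - covered
        if new:                      # forward progress: something new covered
            chosen.append(v)
            covered |= new
        if covered == query:
            break
    return chosen
```

For the query {a, b, c} with views ("a","b"), ("b",) and ("c",) available, the view ("b",) is skipped because it makes no forward progress once ("a","b") is chosen.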

Page 7:

Example of answering queries

Query: "cbagekhilo"

Step 1: match prefix "cbag".

Step 2: "cbage" is not found, but "cbagh" exists and is useful for the query (forward progress)…

Page 8:

Algorithm: Search(n, q)
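The algorithm listing on this slide did not survive the transcript. The following is only a plausible reconstruction from the prefix-matching rules on the neighboring slides, not the paper's actual Search(n, q):

```python
# Plausible reconstruction (not the paper's listing): descend the trie from
# node n, following only edges whose attribute appears in the query, and
# collect every view found along the way.
class Node:
    def __init__(self, view=None):
        self.children = {}   # attribute -> Node
        self.view = view

def search(node, query_attrs, found=None):
    if found is None:
        found = []
    if node.view is not None:
        found.append(node.view)
    for attr, child in node.children.items():
        if attr in query_attrs:      # only follow edges useful to the query
            search(child, query_attrs, found)
    return found
```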

Page 9:

Creating a Balanced View Tree

For a query "a && b && c && .. && x" there are many equivalent views, each corresponding to a different position in the View Tree, and any of them can be used to represent the result of the query. Which one should be chosen, and how can the tree be kept balanced? Deterministically pick a permutation P, uniformly at random among all possible permutations.
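One way to realize "deterministic but uniformly random" is to seed a PRNG from the attribute set itself, so every peer computes the same permutation without coordination. A sketch of this assumed approach, not necessarily the paper's exact construction:

```python
import hashlib
import random

# Seed a PRNG with a hash of the sorted attribute set: every peer derives
# the same "random" permutation, and permutations are spread uniformly
# across different attribute sets.
def deterministic_permutation(attrs):
    attrs = sorted(attrs)
    seed = hashlib.sha1(" && ".join(attrs).encode()).hexdigest()
    rng = random.Random(seed)
    perm = attrs[:]
    rng.shuffle(perm)
    return perm
```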

Page 10:

Maintaining the View Tree

The owner of a new node needs to update all attribute indexes corresponding to the attributes of the new node. (Q: Across the whole network? Do all related views need to be updated? Isn't that too expensive?)

Heartbeats are used to check the presence of child nodes and the parent node in the view tree.

Insertion of a new view is less expensive: only some child pointers need to be reassigned.

Page 11:

Example of a new view join

Page 12:

Preliminary Results

Data source and methodology:

Documents: TREC-Web data set; HTML pages with keyword meta-tags; 64K different pages for each experiment.

Queries: generated using the statistical characteristics from search.com.

Page 13:

Preliminary results:

Caching benefit.

Query locality benefit.

Page 14:

On the Feasibility of P2P Web Indexing and Search

From UC Berkeley & MIT

Page 15:

Motivation:

Is P2P web search likely to work?

Two keyword-search techniques:

Flooding (Gnutella): not discussed in this paper.

Intersection of index lists.

This paper presents a feasibility analysis based on resource constraints and workload.

Page 16:

Introduction

Why be interested in P2P searching:

A good stress test for P2P architectures.

More resistant to censoring and manipulated ranking.

More robust against single-node failure.

550 billion documents on the web; Google indexes more than 2 billion of them.

Gnutella and KaZaA: flooding search over 500 million files, typically music files; search uses file meta-data such as titles and artists.

DHT-based keyword searching: good full-text search performance with about 100,000 files (Duke Univ.).

Page 17:

Fundamental Constraints of the Real World

We assume the following parameters:

Web documents: 3 billion.

Words per file: 1000.

An inverted index would therefore hold 3×10^9 × 1000 = 3×10^12 docID entries.

docID: 20 bytes (hash of the file content).

Inverted index size: 6×10^13 bytes.

Queries per second: 1000 (Google).
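A quick sanity check of the slide's arithmetic:

```python
# Verify the inverted-index size implied by the slide's parameters.
docs = 3_000_000_000          # 3 billion web documents
words_per_doc = 1_000         # words per file
docid_bytes = 20              # hash of the file content

entries = docs * words_per_doc          # 3e12 posting entries
index_bytes = entries * docid_bytes     # 6e13 bytes, i.e. 60 TB
```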

Page 18:

Fundamental Constraints (cont.)

Storage constraints: at 1 GB per PC, at least 60,000 PCs are needed.

Communication constraints:

Assume web search may consume 10% of backbone bandwidth (by comparison with the traffic for DNS).

In 1999, the US Internet backbone carried about 100 Gbit/s.

At 1000 queries/sec, 10 Mbit can be used for each query.
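The 10 Mbit-per-query budget follows directly from these numbers:

```python
# Per-query bandwidth budget implied by the slide's constraints.
backbone_bps = 100 * 10**9        # 100 Gbit/s US backbone (1999)
queries_per_sec = 1000

search_bps = backbone_bps // 10   # 10% of the backbone for web search
budget_bits = search_bps // queries_per_sec   # 10 Mbit per query
```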

Page 19:

Basic Cost Analysis

Assumptions: DHT-based P2P systems; two-term queries (each search has 2 keywords).

On an MIT trace of 1.7 million web pages and 81,000 queries, about 300,000 bytes are moved for each query. Scaled to the whole web (3 billion pages), this might require 530 MB for each query. (Q: Is that true?)
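Answering the question on the slide: under the (strong) assumption that bytes moved per query scale linearly with corpus size, the 530 MB figure does follow from the MIT-trace numbers:

```python
# Linear scaling of per-query traffic from the MIT trace to the whole web.
bytes_per_query_small = 300_000       # measured on the MIT trace
pages_small = 1_700_000               # pages in the MIT trace
pages_web = 3_000_000_000             # assumed web size

scaled = bytes_per_query_small * pages_web / pages_small
# scaled is about 5.3e8 bytes, i.e. roughly 530 MB per query
```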

Page 20:

Optimizations

All optimizations are evaluated against the 81,000 queries from mit.edu.

Caching and precomputation:

Caching received posting lists reduces communication cost by 38%.

Computing and storing the intersections of different posting lists in advance: when 3% of all possible term pairs are precomputed (the most popular words, per the Zipf distribution), communication cost is reduced by 50%.

Page 21:

Compression

Bloom filters:

Two-round Bloom intersection (one node sends the Bloom filter of its posting list to another node, which returns the result): compression ratio 13.

4-round Bloom intersection: compression ratio 40.

Compressed Bloom filters: a further 30% improvement.
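A toy two-round Bloom intersection, to make the protocol concrete (filter size and hash scheme are arbitrary choices here, not the paper's parameters):

```python
import hashlib

# Minimal Bloom filter: node A sends the filter of its posting list (much
# smaller than the list itself); node B returns only the docIDs that pass
# the filter. False positives are possible, which is why extra rounds help.
class Bloom:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# Round 1: A -> B sends Bloom(A's list). Round 2: B -> A returns candidates.
a_list = {10, 20, 30, 40}
b_list = {20, 40, 99}
f = Bloom()
for doc in a_list:
    f.add(doc)
candidates = {doc for doc in b_list if f.might_contain(doc)}
# candidates is a superset of the true intersection {20, 40}
```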

Gap compression: effective when the gaps between sorted docIDs are small, so fewer bits are required per docID.
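A sketch of gap compression with a simple 7-bit varint encoding (the varint scheme is one common choice, not necessarily the one evaluated in the paper):

```python
# Store sorted docIDs as differences (gaps), then varint-encode each gap,
# so a small gap takes one byte instead of a full multi-byte docID.
def encode_gaps(doc_ids):
    out = bytearray()
    prev = 0
    for d in sorted(doc_ids):
        gap = d - prev
        prev = d
        while gap >= 0x80:               # emit 7 bits, continuation bit set
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)                  # final byte, continuation bit clear
    return bytes(out)

def decode_gaps(data):
    ids, cur, shift, acc = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        if b & 0x80:                     # more bytes follow for this gap
            shift += 7
        else:                            # gap complete: rebuild the docID
            cur += acc
            ids.append(cur)
            acc, shift = 0, 0
    return ids
```

Four nearby docIDs such as [100, 103, 104, 200] compress to just 4 bytes, versus 20 bytes per docID in the raw index.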

Page 22:

Compression (cont.)

Adaptive set intersection: exploit structure in the posting lists to avoid transferring entire lists.

Example: {1, 3, 4, 7} && {8, 10, 20, 30} requires only one element exchange, since 7 < 8.
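A sketch of adaptive set intersection on sorted posting lists, using binary search to skip ahead rather than scanning both lists in full (illustrative, not the paper's exact algorithm):

```python
import bisect

# Skip ahead with binary search: disjoint ranges such as {1,3,4,7} and
# {8,10,20,30} are resolved after very few comparisons, since one skip
# past 8 in the first list exhausts it immediately.
def adaptive_intersect(a, b):
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect.bisect_left(a, b[j], i + 1)   # gallop ahead in a
        else:
            j = bisect.bisect_left(b, a[i], j + 1)   # gallop ahead in b
    return result
```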

Clustering: similar documents are grouped together based on their term occurrences and assigned adjacent docIDs, which improves the compression ratio of adaptive set intersection with gap compression to 75.

Page 23:

Optimization Techniques and Improvements

Even with these optimizations, the cost is still one order of magnitude higher than the budget.

Page 24:

Two Compromises and Conclusion

Compromising result quality. Compromising P2P structure.

Conclusion:

Naïve implementations of P2P Web search are not feasible.

The most effective optimizations bring the problem to within an order of magnitude of feasibility.

Two possible compromises are proposed; combined with the optimizations, they would bring P2P Web search within the feasibility range. (Q: Sure?)