May 15, 2002 Stanford Networking Seminar Associative Peer to Peer Networks: Harnessing Latent Semantics Edith Cohen AT&T Labs-research Amos Fiat Haim Kaplan Tel-Aviv University

May 15, 2002 Stanford Networking Seminar

Associative Peer to Peer Networks: Harnessing Latent Semantics

Edith Cohen, AT&T Labs-research

Amos Fiat, Haim Kaplan, Tel-Aviv University


“Traditional” Client-server Web

Web server


Peer-to-peer Networks

• Harness vast resources
• Scalability/Robustness to failures/shutdowns

Distributed network for sharing content (music, video, software, etc.), where each host acts as both a server and a client


P2P Search

• Scope: ability to locate “rare” items “Find the 10th episode of Star Trek Voyager”

• Partial-match/complex queries: “Find an Indiana Jones movie”

…Or “Indiana Joens” movie…..

Overall performance of a P2P network depends heavily on the efficiency and versatility of search

What features are important?


(search in) Basic P2P Architectures

Architectures are compared on two features: Partial-Matches and Scope.

Centralized (Napster): central index service.

Decentralized: peers are connected by low-degree overlay network.

• Blind Search (Gnutella, FastTrack – Morpheus/KaZaA, …): search via flooding, multi-head random walks...

• Distributed Hash Tables (Freenet, CAN, Chord, …): induced topology routed search.


Associative P2P networks

Retain Gnutella’s desirable properties:
• Distributed overlay network
• Peers store only what they need (“common good” at par with “own welfare”)
• No tight control of topology/content
• Support partial-match queries

AND
• Have search scope (orders of magnitude improvement over Gnutella)
• Make implicit use of latent semantics
– Provably good on a reasonable model
– Very good on simulations


P2P search framework

• Search queries are propagated on the overlay (from peer to a neighbor peer).

• When a peer receives a query, it checks if it can satisfy it; decreases hop count; and forwards it to a subset of its neighbors.

• Each search includes query and a “propagation rule”, which determines which neighbors the search is propagated to.

“DHTs” propagation rule = hash of query
“Gnutella” propagation rule independent of query
Associative propagation rules are predicates (guide rules)
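The framework above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Peer` class, the `propagate` function, and the representation of a propagation rule as a predicate on a neighbor are all hypothetical names chosen here, not part of the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Peer:
    """Hypothetical peer: a name, an index of items, and overlay neighbors."""
    name: str
    items: Set[str] = field(default_factory=set)
    neighbors: List["Peer"] = field(default_factory=list)


# Assumption: a propagation rule is a predicate saying whether to forward to a neighbor.
Rule = Callable[[Peer], bool]


def propagate(peer: Peer, query: str, rule: Rule, hops: int,
              visited: Optional[Set[str]] = None) -> List[str]:
    """Return the names of reached peers that can satisfy `query`."""
    visited = set() if visited is None else visited
    if peer.name in visited:
        return []
    visited.add(peer.name)
    # The peer first checks whether it can satisfy the query itself.
    hits = [peer.name] if query in peer.items else []
    if hops == 0:
        return hits
    # Then it decreases the hop count and forwards to the neighbors the rule selects.
    for nb in peer.neighbors:
        if rule(nb):
            hits += propagate(nb, query, rule, hops - 1, visited)
    return hits
```

A rule that always returns `True` gives a query-independent ("Gnutella"-style) flood, while `lambda p: "A" in p.items` restricts forwarding to neighbors that possess item A.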


Overview
• What do we mean by “latent semantics”?
• Challenges in using latent semantics in P2P setting
• Our proposal: search propagation via Possession rules
• Possession rules overlays
• Search strategies
– Possession rules search strategies: Rapier, GAS
– Models for “blind search” strategies (gnutella)
• Analysis in the Itemsets model
• Experimental evaluation
• More on GAS search strategy


View of P2P file sharing network


What is latent semantics?

• Peer/Item matrix is “Market Basket” dataset. Similar to buyers/items, Document/terms, Web-pages/hyperlinks, movies/viewers.

• Applications for extracting patterns from market basket data: Information Retrieval, Collaborative Filtering, Web search, Marketing, Recommendation Systems,…. (clustering, search, association rules)

Selections people make are dependent:
• If you buy baby formula, you are more likely to buy diapers.
• If two people loved a show, they are more likely to agree on other shows.

→ P2P search: direct queries to peers with interests that match yours


Challenges

• Overlay topology (“networking aspects”) must be coupled with search strategy (“Information Retrieval/Data-Mining”)

• “Traditional” IR and data-mining tools are not adapted to the highly distributed P2P setting.
– Similarity metrics/clustering/ranking involve matrix operations on the “market basket” data: principal component analysis (LSI), eigenvalue computations, association rules…


Possession Rules

• Rule(O): do you possess item O?
• Peer maintains a possession rule for each item in its index (subset if index is large)
• Search strategy: a sequence of possession rules (with “hop counts”/search size limit)

Making this work:

• “Network”: how to build an overlay that supports possession rules
• “IR/DM”: design search strategies that use possession rules (and work!)


Possession-rules overlays

Index of Peer26 (P26). Rules/Items: Rule(A), Rule(B), Rule(C), Rule(D)

item    Rule(item) neighbors
A       P11, P7, P3
B       P2, P6, P9
C       P13, P15, P1
D       P4, P5, P10

Example Search Strategy of P26:
2 hops in rule(A)
4 hops in rule(B)
6 hops in rule(C)
4 hops in rule(A)
3 hops in rule(D)
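The index and example strategy of P26 listed above map naturally onto two small data structures. A minimal sketch, assuming a dictionary from item to rule neighbors and a list of (rule, hop limit) pairs; the names `index_p26`, `strategy_p26`, `plan`, and `hop_budget` are hypothetical.

```python
from collections import Counter

# Possession-rule index of P26: item -> neighbors that also possess the item.
index_p26 = {
    "A": ["P11", "P7", "P3"],
    "B": ["P2", "P6", "P9"],
    "C": ["P13", "P15", "P1"],
    "D": ["P4", "P5", "P10"],
}

# The example search strategy: a sequence of possession rules with hop counts.
strategy_p26 = [("A", 2), ("B", 4), ("C", 6), ("A", 4), ("D", 3)]


def plan(index, strategy):
    """Expand a strategy into (rule item, hop limit, first-hop neighbors) steps."""
    return [(item, hops, index[item]) for item, hops in strategy]


def hop_budget(strategy):
    """Total hop count the strategy spends in each rule."""
    budget = Counter()
    for item, hops in strategy:
        budget[item] += hops
    return dict(budget)
```

Note that a rule may appear more than once in a strategy: here rule(A) is used for 2 hops and later for 4 more.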


Blind searching for O takes 13 probes
Searching with rule(O) takes 2 probes

Rules/Items: Rule(A), Rule(B), Rule(C), Rule(D)


Possession-rule overlay

• When you find O, you often discover multiple peers that have O; when you give O, the searcher informs you of other peers with O.

• Peers that have O can find other peers that have O

• Coverage: The induced overlay on peers that satisfy each rule consists of large connected components.

• Small degree: Each peer participates in a limited number of rules (yet, overall, there is a large number of rules). For each rule it “participates” in, the peer maintains several participating neighbors.

• Overlay and search boost each other (easy to find appropriate neighbors for each rule):

Network is “gnutella-like”, within each rule

(… can use “super-peer” overlay within each rule !!)


Search strategies
• To beat blind search, associative search should probe peers that are more likely to answer than “random peers”

Associative search:
• RAPIER: Random Possession Rule – crudest strategy
• GAS: Greedy Selection – refined strategy

Blind search:
• Urand: (“gnutella”) all peers have same likelihood of being probed in each query
• Prand: (“gnutella modified”) peers are probed proportionally to their index size (RAPIER has same bias)


RAPIER – Random Possession Rule

simplest possession-rule based strategy

RAPIER Search strategy:
• Repeat until found:
– Pick a random item O from your index
– Search peers that have this item (using rule(O))

Straightforward to implement on top of a possession-rule overlay network
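The RAPIER loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that "searching peers that have this item" is modeled as probing one uniformly random member of rule(O) per iteration, and the function name, its arguments, and the probe cap `max_probes` are all hypothetical.

```python
import random


def rapier_search(my_items, rule_members, holders, target, max_probes, rng):
    """RAPIER sketch.

    my_items:     list of items in the searcher's index
    rule_members: item -> list of peers that possess that item (members of rule(item))
    holders:      peer -> set of items that peer holds
    Returns (peer, probes) on success, or (None, probes) if the cap is reached.
    """
    probes = 0
    while probes < max_probes:
        rule_item = rng.choice(my_items)             # pick a random item O from your index
        peer = rng.choice(rule_members[rule_item])   # probe a peer reached via rule(O)
        probes += 1
        if target in holders[peer]:
            return peer, probes
    return None, probes
```

For example, `rapier_search(["A"], {"A": ["p1"]}, {"p1": {"A", "O"}}, "O", 10, random.Random(0))` finds the item at `p1` on the first probe.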


Analysis: Itemsets Model

• Items belong to “topics.” There are very many topics; but each peer can only select items from a fixed set of topics. Topic popularities can highly vary; but each peer has equal interest in each of “its” topics.

We show that
• RAPIER is at least as good as Prand
• RAPIER is better than Prand when peers have fewer topics

Simple model that hints at what is going on…


Experiments

• Data: used Client/Hostname matrix from proxy logs as peer/item matrix. Each entry, in turn, is treated as a search item.
– Similarly-structured “market basket” data
– Has rare items (which current P2P networks don’t support)
– No universal model for market basket data
– Can’t get a full index for many peers from current P2P networks… and these networks don’t reflect well on rare items.

• Metric: ESS (Expected Search Size – number of peers probed till search is resolved). CDF of fraction of “searches” that have ESS below “x”.


ESS – Expected Search Size

• ESS: 1/(success probability in each probe) (when probes are “independent”; not true for GAS)

• Probe success probability:
• Urand: fraction of peers that have the item in their index
• Prand: weight of each peer is its index size divided by sum of index sizes of all peers.
– Success prob: total weight of the peers that have the item (the weights of all peers sum to 1)
• RAPIER: the average, over the possession rules the peer participates in, of the fraction of peers in the rule that have the item.
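The three success probabilities can be written out explicitly. The notation below is introduced here for illustration and is not from the slides: $n$ is the number of peers other than the searcher, $I_p$ is the index of peer $p$, $H$ is the set of those peers that hold the searched item, $R$ is the set of possession rules the searcher participates in, and $M_r$ is the set of other peers in rule $r$. With weights normalized to sum to one, the Prand success probability is the total weight of the peers in $H$, which agrees with the worked example in the talk (Prand ESS = 29/9).

```latex
\[
\mathrm{ESS} = \frac{1}{P_s}, \qquad
P_s^{\mathrm{Urand}} = \frac{|H|}{n}, \qquad
P_s^{\mathrm{Prand}} = \frac{\sum_{p \in H} |I_p|}{\sum_{p} |I_p|}, \qquad
P_s^{\mathrm{RAPIER}} = \frac{1}{|R|} \sum_{r \in R} \frac{|H \cap M_r|}{|M_r|}
\]
```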


Peer-Item Matrix - Experiment

Peer/item matrix (rows: peers; columns: items):

0 0 1 1 1 0 0 0 0 0
0 0 0 0 0 1 0 0 1 1
1 1 0 0 0 0 1 0 0 0
0 0 1 0 1 0 0 0 1 0
0 0 0 0 0 0 1 1 1 0
1 1 0 0 0 0 0 0 1 0
0 0 0 1 1 0 0 1 1 1
0 0 1 1 0 0 0 0 1 0
1 1 0 0 0 1 0 0 0 0
0 1 0 0 1 0 0 0 1 0


Urand and Prand

[Figure: the peer/item matrix from the “Peer-Item Matrix - Experiment” slide, with the searched entry marked “?” and each of the other peers annotated with its probing probability.]

Urand: each of the 9 other peers is probed with probability 1/9. Ps=3/9, ESS=3

Prand: eight peers are probed with probability 3/29 each and one peer with probability 5/29. ESS=29/9
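The Urand and Prand figures can be checked numerically against the peer/item matrix. A minimal sketch under two assumptions that the transcript does not state: matrix rows are peers and columns are items, and the searching peer is row 9 looking for the item in column 2 (zero-based `SEARCHER = 8`, `TARGET = 1`). The extraction lost which cell the “?” marked; this choice is one reading that reproduces Ps=3/9 and ESS=29/9. The function names are hypothetical.

```python
from fractions import Fraction

MATRIX = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
]
SEARCHER, TARGET = 8, 1  # assumed reading of the "?" cell


def others(matrix, searcher):
    """Rows of all peers except the searcher."""
    return [row for i, row in enumerate(matrix) if i != searcher]


def urand_ps(matrix, searcher, target):
    """Urand: fraction of the other peers that hold the item."""
    rows = others(matrix, searcher)
    return Fraction(sum(row[target] for row in rows), len(rows))


def prand_ps(matrix, searcher, target):
    """Prand: index-size weight of the holders over the total index size."""
    rows = others(matrix, searcher)
    total = sum(sum(row) for row in rows)
    return Fraction(sum(sum(row) for row in rows if row[target]), total)


print(1 / urand_ps(MATRIX, SEARCHER, TARGET))  # 3
print(1 / prand_ps(MATRIX, SEARCHER, TARGET))  # 29/9
```

Exact fractions are used so the results can be compared directly with the slide values.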


RAPIER (Random Possession Rule)

[Figure: the peer/item matrix from the “Peer-Item Matrix - Experiment” slide, with the searched entry marked “?”, two possession rules highlighted (“rule”, “rule”), and the annotations 0.5, 0.25, 0.25 and 0.5, 0.5.]
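The RAPIER success probability can be computed on the same matrix. As before, this is a sketch under assumptions not stated in the transcript: rows are peers, columns are items, and the searcher is row 9 looking for the item in column 2 (zero-based `SEARCHER = 8`, `TARGET = 1`), so its possession rules are its two other items. Under this reading one rule has one other member and the other has two, which would fit the 0.5, 0.25, 0.25 annotations, but that is a guess. The resulting value is computed here and is not a figure from the slide.

```python
from fractions import Fraction

MATRIX = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
]
SEARCHER, TARGET = 8, 1  # assumed reading of the "?" cell


def rapier_ps(matrix, searcher, target):
    """Average, over the searcher's rules, of the fraction of rule members holding the target."""
    # The searcher's rules are the items it possesses, excluding the one it is looking for.
    rules = [j for j, has in enumerate(matrix[searcher]) if has and j != target]
    per_rule = []
    for j in rules:
        members = [row for i, row in enumerate(matrix) if i != searcher and row[j]]
        if members:  # skip rules with no other participants
            per_rule.append(Fraction(sum(row[target] for row in members), len(members)))
    return sum(per_rule) / len(per_rule)


print(rapier_ps(MATRIX, SEARCHER, TARGET))      # 1/2
print(1 / rapier_ps(MATRIX, SEARCHER, TARGET))  # 2
```

Under this reading RAPIER has ESS 2, compared with 29/9 for Prand on the same query.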


Caveat: comparing apples and oranges

• When searching by possession rules we have bias towards peers that participate in more rules/ have more items.

• But, with this bias, a strategy has better chance of finding what it is looking for! So…

• We show that the likelihood of being probed is proportional to number of rules you participate in.

• Prand “blind search” strategy has same bias.

• Thus, it is “fair” to compare Prand search with possession-rule based RAPIER


GAS – Greedy Strategy … Refining RAPIER

Ideas:
• Some rules are better than others (e.g., possession of a very popular item carries weaker information)
• Unsuccessful search carries information: suppose you lost something, and you think you lost it at home. You search your home, going through various closets and drawers, and don’t find it. Then you may decide to go search the office, even if you have not completed an exhaustive search at home. What happened? The posterior distribution on the item’s location has changed as a result of the search.


All Items

[Plot legend: Urand: blind search (Gnutella); Prand: Gnutella modified; Rapier, GAS: our algorithms]


Rare Items: present in 1% of peers


Rarer items: 0.1% of peers


Even Rarer Item: 0.01% of peers


GAS – Greedy Strategy
• Idea: use the search strategy that would have optimized your search on previous queries.
• Caveat: this is NP-Complete
• Can do: greedy approximation strategy: GAS

GAS:
• Initialize the “query vector” to a uniform distribution on previous selections.
• Iterate the following:
– Apply the possession rule that maximizes success probability with respect to the query posterior
– Update the query posterior.

Theorem: GAS is a constant factor approximation of the optimal strategy
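One way to write the GAS iteration is given below. The notation is introduced here and the exact update is an assumption, not quoted from the talk: $q_x$ is the current posterior weight of previously selected item $x$, and $p_{r,x}$ is the fraction of peers in rule $r$ that hold $x$ (taken as 0 where no value is known). This reading does reproduce the 21-probe sequence given in the GAS strategy example later in the talk.

```latex
\[
r^{*} = \arg\max_{r} \sum_{x} q_x \, p_{r,x},
\qquad
q_x \leftarrow \frac{q_x \,(1 - p_{r^{*},x})}{\sum_{y} q_y \,(1 - p_{r^{*},y})}
\]
```

The update is the usual Bayes step after one unsuccessful probe in rule $r^{*}$: items that the probed rule was likely to find lose weight, so later probes shift toward rules that cover the remaining items.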


Building GAS strategies
• GAS:
– Take a sample of items currently in your index: D, E, F, G.
– “Search” for these items in each possession rule you participate in: A, B, C.
– Obtain a matrix: fraction of peers with item x in rule(y)

Rule \ Item   D      E      F      G
rule(A)       0.03   *      0.2    *
rule(B)       *      0.04   *      0.1
rule(C)       0.1    0.2    0.03   *


GAS strategy (example)

Rule \ Item   D      E      F      G
rule(A)       0.03   *      0.2    *
rule(B)       *      0.04   *      0.1
rule(C)       0.1    0.2    0.03   *

GAS probe sequence: C,C,C,A,C,C,A,C,A,C,B,B,A,C,B,B,C,A,B,B,C

GAS search of size 21: 10 probes in rule(C), 6 probes in rule(B), 5 probes in rule(A)

RAPIER search of size 21: 7 probes in rule(C), 7 probes in rule(B), 7 probes in rule(A)
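The GAS probe sequence in this example can be regenerated from the rule/item matrix. A minimal sketch under assumptions that are not spelled out in the talk: entries shown as “*” are treated as 0, the query vector starts uniform over D, E, F, G, each step picks the rule with the highest expected success probability, and after an unsuccessful probe each item's weight is multiplied by (1 − fraction) and renormalized. The names `P`, `ITEMS`, and `gas_sequence` are hypothetical.

```python
from collections import Counter

# Fraction of peers in each rule that hold each sampled item ("*" entries omitted, read as 0).
P = {
    "A": {"D": 0.03, "F": 0.2},
    "B": {"E": 0.04, "G": 0.1},
    "C": {"D": 0.1, "E": 0.2, "F": 0.03},
}
ITEMS = ["D", "E", "F", "G"]


def gas_sequence(p, items, size):
    """Greedy rule selection with a posterior update after each unsuccessful probe."""
    q = {x: 1.0 / len(items) for x in items}  # uniform query vector
    seq = []
    for _ in range(size):
        # Expected success probability of one probe in each rule under the posterior q.
        score = {r: sum(q[x] * f for x, f in row.items()) for r, row in p.items()}
        best = max(score, key=score.get)
        seq.append(best)
        # The probe failed: discount items that this rule was likely to find.
        for x, f in p[best].items():
            q[x] *= 1.0 - f
        total = sum(q.values())
        q = {x: v / total for x, v in q.items()}
    return seq


seq = gas_sequence(P, ITEMS, 21)
print(",".join(seq))  # C,C,C,A,C,C,A,C,A,C,B,B,A,C,B,B,C,A,B,B,C
print(Counter(seq))   # 10 probes in C, 6 in B, 5 in A
```

Under these assumptions the sketch yields the same 21-probe order and the same 10/6/5 split as the slide, whereas RAPIER, which picks rules uniformly at random, spends about 7 probes in each rule.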


Summary
• We proposed a general framework for associative P2P search: exploit patterns inherent in human selections to boost search. Adapted to the P2P setting.

• Search strategies and the overlay structure are “symbiotic” and guided/boosted by previous selections/queries.

• “Common good” at par with “own welfare”: All data maintained by each peer has direct personal benefit (like gnutella). Helping others helps you…

• Possession rules:
– Strategies are “approximations” to “standard” similarity metrics… that work!
– Easy to find other sources of desired item (for alternative/parallel downloads)


Related work
• IR-DM: association rules/collaborative filtering/Web search
• P2P networks: unstructured networks; DHTs
– DHTs have “symbiotic” overlay/search strategy
– Caching at peers (Freenet) adapts overlay according to search

• Intersection:
– Crespo/Garcia-Molina 02 (CG02): routing indexes
System isolates “topics” and maps queries/items to topics. Peer knows “summary” of what can be reached through it/each neighbor. Query keywords are used to select a neighbor who is a best match.
Differences from our approach:
– No connection between search and overlay topology
– Uses only text/keywords. We use co-location associations between items.
CG02: tradeoff between topic divergence (all nodes ending up with similar index “summary”) and restricted coverage (number of peers included in each peer summary)

– neurogrid.net (Sam Joseph, U. Tokyo): “agent” text-based approach
• Peers learn and remember content of other peers


Future…
• Integrate text matching (of query keywords) in search strategy (use rule(O) if query keywords match O’s metadata)

• Select which possession rules to participate in (e.g., using item popularity heuristic or GAS-like selection)

• Search strategy gives more weight to more recent selections (which are more indicative of next query)

• Explore other types of propagation rules
• P2P “communities”?
• Integrate “Recommendation Systems” in P2P?
• Implementation …


Some Extra Comments…

• Issues with straightforward importing of IR techniques
– Vector space approach
– Similarity metrics

• Why do we need to use several propagation rules in a search? (when searching according to “examples” in the index)


“Straight” IR vector-space approach

• Peers are mapped to vectors, according to their index content. Queries are mapped to vectors in the same space.

• Overlay topology is correlated with distances in this vector space (bias towards closer peers)

• Search propagation targets regions of the space that are “closest” to the query.

Issues:
• #neighbors=O(dimension), so we want a small dimension
• Yet, matrix operations, e.g., principal component analysis (LSI), are hard in our distributed setting
• Yet, each peer should be able to compute the mapping for its queries and/or index
• Proximity metric alone is insufficient (need different propagation rules)


Why we need several propagation rules for the same query – “decision-tree like” search

• Propagation rule = approx. interest area
• Each peer covers several interest areas; peers have different sets of interest areas.
• Peer Query: 80% basketball, 20% polo
• “World” Index: 5% basketball, 0.1% polo
• All “basketball” lovers would be close matches, but we need to direct search to more “polo” lovers
• Multi-rule search strategy: “basketball” 200 peers; “polo” 200 peers