Caching and Data Consistency in P2P
Dai Bing Tian, Zeng Yiming
Feb 22, 2016
Caching and Data Consistency

Why Caching?
Caching helps use bandwidth more efficiently.
The data consistency in this topic is different from consistency in distributed databases: it refers to the consistency between a cached copy and the data on servers.
Introduction
Caching is built on top of existing P2P architectures like CAN, BestPeer, Pastry, etc.
The caching layer sits between the application layer and the P2P layer.
Every peer has its own cache control unit and local cache, and publishes its cache contents.
Presentation Order
We will present four papers:
- Squirrel
- PeerOLAP
- Caching for Range Queries, with CAN
- Caching for Range Queries, with a DAG
Overview

Paper         Based on       Caching   Consistency
Squirrel      Pastry         Yes       Yes
PeerOLAP      BestPeer       Yes       No
RQ with CAN   CAN            Yes       Yes
RQ with DAG   Not specified  Yes       Yes
Squirrel
Enables web browsers on desktop machines to share their local caches.
Uses a self-organizing peer-to-peer network, Pastry, as its object location service.
Pastry is fault resilient, and so is Squirrel.
Web Caching
Web browsers generate HTTP GET requests.
If the object is in the local cache, return it if "fresh" enough; freshness can be checked by submitting a conditional GET (cGET) request to the server.
If there is no such object, issue a GET request to the server.
For simplicity, we assume all objects are cacheable.
Home Node
As described in Pastry, every peer (node) has a nodeID.
objectID = SHA-1(object URL)
An object is assigned to the node whose ID is numerically nearest to the objectID.
The node that owns an object is called the home node of that object.
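The objectID-to-home-node mapping can be sketched as follows (a hypothetical toy example in Python with an 8-bit ID space and wraparound distance; Pastry itself uses 128-bit IDs and a more elaborate routing structure):

```python
import hashlib

def object_id(url: str, bits: int = 8) -> int:
    """Hash a URL into the circular ID space (SHA-1, as in Squirrel)."""
    digest = hashlib.sha1(url.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def home_node(obj_id: int, node_ids: list, bits: int = 8) -> int:
    """The home node is the peer whose nodeID is numerically nearest
    to the objectID (wraparound distance on the ID circle)."""
    ring = 1 << bits

    def circular_distance(n):
        d = abs(n - obj_id)
        return min(d, ring - d)

    return min(node_ids, key=circular_distance)

nodes = [12, 70, 130, 200]
oid = object_id("http://example.com/a.html")
print(oid, home_node(oid, nodes))
```

For instance, objectID 10 maps to node 12 (distance 2), and objectID 250 wraps around to node 12 as well (distance 18).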
Two Approaches
There are two approaches in Squirrel: home-store and directory.
Home-store stores the object directly in the cache of the home node.
Directory: the home node stores pointers to the nodes that have the object in their caches; these nodes are called delegates.
Home-store
[Figure: message flow among requester, home node (LAN), and origin server (WAN). The requester's "Request for A" is routed through Pastry to A's home node. If the home node has a copy, it may ask the origin server "Is my copy of A fresh?"; on "Yes, it is fresh", it sends A back to the requester.]
Directory
[Figure: message flow among requester, home node, delegate (LAN), and origin server (WAN). The requester's "Request for A" is routed through Pastry to the home node. If the home node has a directory for A, it replies "Get it from D"; the requester fetches A from delegate D, which may first revalidate its copy with the origin server ("Is my copy of A fresh?" / "Yes, it is fresh"), after which the home node updates its meta-information and keeps the directory ("Requester and I are your delegates"). If there is no directory, the home node replies "Get it from Server"; the requester fetches A from the origin server and announces "I'm your delegate", and the home node records it in a new directory.]
Conclusion
The home-store approach is less complicated, but involves no collaboration.
The directory approach is more collaborative: it can place more objects on peers with larger cache capacity by pointing directory entries at those peers.
PeerOLAP
An OnLine Analytical Processing (OLAP) query typically involves large amounts of data.
Each peer has a cache containing some results.
An OLAP query can be answered by combining partial results from many peers.
PeerOLAP acts as a large distributed cache.
Data Warehouse & Chunk
"A data warehouse is based on a multidimensional data model which views data in the form of a data cube." (Han & Kamber)
[Figure: a data cube with dimensions Product (TV, VCR, PC), Date (1Qtr-4Qtr), and Country (U.S.A, Canada, Mexico), with sum aggregations along each dimension; from http://www.cs.sfu.ca/~han/dmbook]
PeerOLAP Network
LIGLO servers provide global name lookup and maintain a list of active peers.
Except for the LIGLO servers, the network is fully distributed, without any centralized administration point.
[Figure: peers connected to each other, to LIGLO servers, and to the data warehouse.]
Query Processing
Assumption 1: only chunks at the same aggregation level as the query are considered.
Assumption 2: the selection predicates are a subset of the group-by predicates.
Cost Model
Every chunk is associated with a cost value, indicating how long it takes to obtain that chunk:

T(c, Q→P) = S(c, Q) + N(c, Q→P)

where S(c, Q) is the cost of producing chunk c at peer Q, and N(c, Q→P) is the network cost of transferring c from Q to P, which depends on size(c), the connections C(P, Q) between the peers, and the transfer rate T_r(Q→P).
Eager Query Processing (EQP)
Peer P sends requests for the missing chunks to all its neighbors Q1, Q2, ..., Qk.
Each Qi provides as many of the desired chunks as it can, returning them to P with a cost associated with each chunk.
Each Qi then propagates the request to all its neighbors recursively.
To avoid flooding, a limit hmax is set on the depth of the search.
EQP (Contd.)
P collects (chunk, cost) pairs from all its neighbors.
It randomly selects one chunk ci and finds the peer Qi that can provide it at the lowest cost.
For each subsequent chunk, it takes the minimum over two cases: the lowest-cost provider that is not yet connected, or an already-connected peer that can also provide the chunk.
It then asks these peers for the chunks, and fetches the remaining missing chunks from the warehouse.
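The selection step above can be sketched roughly as follows (a simplified Python model; the flat `connect_cost` surcharge for contacting a new peer and all names are illustrative, not the paper's exact cost model):

```python
def select_providers(offers, missing, connect_cost=5):
    """offers: dict chunk -> list of (peer, cost) pairs collected by P.
    Greedily assign each missing chunk to a provider, charging
    connect_cost the first time a peer is contacted; chunks nobody
    offers fall back to the warehouse."""
    connected, plan = set(), {}
    for chunk in missing:
        candidates = offers.get(chunk)
        if not candidates:
            plan[chunk] = "warehouse"  # fetch leftovers from the warehouse
            continue

        def effective(pc):
            peer, cost = pc
            # Already-connected peers avoid the connection surcharge.
            return cost if peer in connected else cost + connect_cost

        peer, _ = min(candidates, key=effective)
        connected.add(peer)
        plan[chunk] = peer
    return plan

offers = {"c1": [("Q1", 3), ("Q2", 2)], "c2": [("Q2", 4)]}
print(select_providers(offers, ["c1", "c2", "c3"]))
```

Here c1 goes to Q2 (cheapest even with the surcharge), c2 reuses the now-connected Q2, and c3 is fetched from the warehouse.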
Lazy Query Processing (LQP)
Instead of propagating the request from each Qi to all of its neighbors, each Qi selects its most beneficial neighbor and forwards the request there.
If each peer is expected to have k neighbors, EQP visits O(k^hmax) nodes, while LQP visits only O(k·hmax).
Chunk Replacement
Least Benefit First (LBF):

B(c, P) = T(c, Q→P) · a^H(Q,P) / size(c)

where a is an aging factor and H(Q, P) is the distance between Q and P.
Similar to LRU, every chunk has a weight.
Once a chunk is used by P, its weight is reset to its original benefit value.
Every time a new chunk comes in, the weights of the old chunks are reduced.
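A minimal sketch of LBF-style replacement (the multiplicative decay factor and the reset-on-hit policy are simplifications for illustration, not the paper's exact aging rule):

```python
class LBFCache:
    """Least Benefit First: evict the chunk with the lowest weight.
    Weights decay whenever a new chunk arrives; a hit restores the
    chunk's original benefit value."""

    def __init__(self, capacity, decay=0.9):
        self.capacity, self.decay = capacity, decay
        self.benefit, self.weight = {}, {}

    def hit(self, chunk):
        # Using a chunk resets its weight to the original benefit.
        self.weight[chunk] = self.benefit[chunk]

    def insert(self, chunk, benefit):
        for c in self.weight:            # age all existing chunks
            self.weight[c] *= self.decay
        if len(self.weight) >= self.capacity:
            victim = min(self.weight, key=self.weight.get)
            del self.weight[victim]
            del self.benefit[victim]
        self.benefit[chunk] = self.weight[chunk] = benefit
```

For example, with capacity 2, inserting chunks with benefits 10, 4, 6 in turn evicts the middle one: after aging, the benefit-4 chunk has the lowest weight.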
Collaboration
LBF gives the local chunk replacement algorithm; there are 3 variations of global behavior:
- Isolated Caching Policy: non-collaborative
- Hit Aware Caching Policy: collaborative
- Voluntary Caching: highly collaborative
Network Reorganization
Optimization can be done by creating virtual neighborhoods of peers with similar query patterns, so that there is a high probability that P obtains missing chunks directly from its neighbors.
Each connection is assigned a benefit value, and the most beneficial connections are selected as the peer's neighbors.
Conclusion
PeerOLAP is a distributed caching system for OLAP results
By sharing the contents of individual caches, PeerOLAP constructs a large virtual cache which can benefit all peers
PeerOLAP is fully distributed and highly scalable
Caching for Range Queries
Range query, e.g.:
  SELECT Student.name FROM Student WHERE 20 < Student.age < 30
Why cache?
- The data source may be too far away from the requesting node.
- The data source may be overloaded with queries.
- The data source is a single point of failure.
What to cache? All tuples falling in the range.
Who caches? The peers responsible for the range.
Problem Definition
Given a relation R, and a range attribute A, we assume that the results of prior range-selection queries of the form R.A(LOW, HIGH) are stored at the peers. When a query is issued at a peer which requires the retrieval of tuples from R in the range R.A(low, high), we want to locate a peer in the system which already stores tuples that can be accessed to compute the answer.
A P2P Framework for Caching Range Queries
Based on CAN. Data is mapped into a 2d-dimensional virtual space, where d is the number of dimensions of the relation.
Every dimension/attribute with domain [a, b] is mapped to a square virtual hash space whose corner coordinates are (a,a), (b,a), (b,b) and (a,b).
The virtual hash space is further partitioned into rectangular areas, each of which is called a zone.
Example
Virtual hash space for an attribute whose domain is [10,70]:
zone-1: <(10,56),(15,70)>
zone-5: <(10,48),(25,56)>
zone-8: <(47,10),(70,54)>
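Hashing a range to its point and locating the containing zone can be sketched as follows (a toy Python version using the example zones above; boundary ties between adjacent zones are ignored in this sketch):

```python
# A zone is a rectangle <(x1, y1), (x2, y2)> in the virtual hash
# space; a range [low, high] hashes to the target point (low, high).
def target_zone(low, high, zones):
    """Return the name of the zone containing the target point."""
    x, y = low, high
    for name, (x1, y1), (x2, y2) in zones:
        if x1 <= x <= x2 and y1 <= y <= y2:
            return name
    return None

zones = [("zone-1", (10, 56), (15, 70)),
         ("zone-5", (10, 48), (25, 56)),
         ("zone-8", (47, 10), (70, 54))]

print(target_zone(12, 60, zones))  # zone-1
```

Since low <= high, all target points lie on or above the diagonal of the square; e.g. the range [50, 52] hashes to (50, 52) inside zone-8.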
Terminology
Each zone is assigned to a peer.
Active peer: owns a zone.
Passive peer: does not participate in the partitioning; registers itself with an active peer.
Target point: a range [low, high] is hashed to the point with coordinates (low, high).
Target zone: the zone where the target point resides.
Target node: the peer that owns the target zone. It "stores" the tuples falling into any range mapped into its zone: either it caches the tuples in its local cache, or it stores a pointer to the peer that caches them.
Zone Maintenance
Initially, the data source is the only active node, and the entire virtual hash space is its zone.
A zone split happens under two conditions: heavy answering load, or heavy routing load.
Example of Zone Splits
If a zone has too many queries to answer, it finds the x-median and y-median of the stored results, then determines whether a split at the x-median or the y-median gives a more even distribution of stored answers and space.
If a zone is overloaded because of routing queries, it splits the zone at the midpoint of its longer side.
Answering a Range Query
If an active node poses the query, the query is initiated from the corresponding zone; if a passive node poses the query, it contacts any active node, from where the query starts routing.
Two steps are involved: query routing and query forwarding.
Query Routing
If the target point falls in this zone, return this zone.
Else, route the query to the neighbor that is closest to the target point.
Example target point: (26,30).
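The greedy routing loop can be sketched as follows (a toy Python model; measuring "closest neighbor" by Euclidean distance from zone centers is an illustrative assumption, and zone/neighbor tables are simplified):

```python
import math

def route(zones, neighbors, start, target):
    """Greedy CAN-style routing: at each zone, stop if the target
    point lies inside; otherwise hop to the neighbor whose zone
    center is closest to the target point."""
    def contains(zone, p):
        (x1, y1), (x2, y2) = zone
        return x1 <= p[0] <= x2 and y1 <= p[1] <= y2

    def center(zone):
        (x1, y1), (x2, y2) = zone
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    path, current = [start], start
    while not contains(zones[current], target):
        current = min(neighbors[current],
                      key=lambda z: math.dist(center(zones[z]), target))
        path.append(current)
    return path

zones = {"A": ((10, 40), (40, 70)), "B": ((10, 10), (40, 40))}
neighbors = {"A": ["B"], "B": ["A"]}
print(route(zones, neighbors, "A", (26, 30)))  # ['A', 'B']
```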
Forwarding
If the results are stored in the target node, they are sent back to the querying node.
Otherwise, zones lying in the upper-left area of the target point may still store the results, so the query needs to be forwarded to those zones too.
Example
If no results are found in zone-7, the shaded region may still contain the results.
Reason: any prior range query q whose range subsumes (x,y) must be hashed into the shaded region.
Forwarding (Cont.)
How far should forwarding go? For a range (low, high), we restrict to results falling in (low - offset, high + offset), where offset = AcceptableFit × |domain| and AcceptableFit ∈ [0,1].
The shaded square defined by the target point and the offset is called the acceptable region.
Forwarding (Cont.)
Flood forwarding: a naive approach; forward to the left and top neighbors if they fall in the acceptable region.
Directed forwarding: forward to the neighbor that maximally overlaps with the acceptable region. The number of forwards can be bounded by specifying a limit d, which is decremented on every forward.
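Directed forwarding's neighbor choice can be sketched by comparing rectangle overlaps (the function names and the forward-budget handling are illustrative, not from the paper):

```python
def overlap_area(z1, z2):
    """Area of intersection of two axis-aligned rectangles,
    each given as ((x1, y1), (x2, y2))."""
    (ax1, ay1), (ax2, ay2) = z1
    (bx1, by1), (bx2, by2) = z2
    w = min(ax2, bx2) - max(ax1, bx1)
    h = min(ay2, by2) - max(ay1, by1)
    return max(0, w) * max(0, h)

def directed_forward(neighbors, acceptable_region, d):
    """Forward to the neighbor that overlaps the acceptable region
    the most, decrementing the forward budget d."""
    if d == 0:
        return None
    best = max(neighbors,
               key=lambda n: overlap_area(neighbors[n], acceptable_region))
    if overlap_area(neighbors[best], acceptable_region) == 0:
        return None          # no neighbor touches the acceptable region
    return best, d - 1
```

For example, with neighbors covering ((0,0),(5,5)) and ((3,3),(10,10)) and acceptable region ((4,4),(8,8)), the second neighbor wins (overlap 16 vs. 1).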
Discussion
Improvements: lookup during routing; warm-up queries.
Peer soft-departure & failure events.
Update (cache consistency): if a tuple t with range attribute a = k is updated in the data source, then the target zone of point (k,k) and all zones lying in its upper-left region have to update their caches.
Range Addressable Network: A P2P Cache Architecture for Data Ranges
Assumption: tuples stored in the system are labeled 1, 2, ..., N according to the range attribute; a range [a,b] is a contiguous subset of {1, 2, ..., N}, where 1 <= a <= b <= N.
Objective: given a query range [a,b], the peers cooperatively find results falling in the shortest superset of [a,b], if they are cached somewhere.
Overview
Based on a Range Addressable DAG (Directed Acyclic Graph).
Every active node in the P2P system is mapped to a group of nodes in the DAG.
Each node is responsible for storing results and answering queries falling into a specific range.
Range Addressable DAG
The entire universe [1, N] is mapped to the root.
Each node is recursively divided into 3 overlapping intervals of equal length (each half the parent's length); because adjacent parents share children, the structure is a DAG rather than a tree.
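The recursive construction can be sketched as follows (a toy Python version; representing an interval as an integer pair (a, b) and stopping at a minimum length are illustrative choices, and endpoint arithmetic is simplified relative to the paper's 1..N labeling):

```python
def children(interval):
    """Split [a, b] into three overlapping subintervals of half the
    length: left, middle, and right."""
    a, b = interval
    length = b - a
    half, quarter = length // 2, length // 4
    return [(a, a + half), (a + quarter, b - quarter), (a + half, b)]

def build_dag(universe, min_len=1):
    """Level-by-level construction; deduplicating shared children
    is what makes this a DAG instead of a tree."""
    dag, level = {}, [universe]
    while level and (level[0][1] - level[0][0]) > min_len:
        nxt = []
        for node in level:
            kids = children(node)
            dag[node] = kids
            nxt.extend(k for k in kids if k not in nxt)
        level = nxt
    return dag

dag = build_dag((0, 16), min_len=4)
print(dag[(0, 16)])  # [(0, 8), (4, 12), (8, 16)]
```

Note that (4, 8) appears as a child of both (0, 8) and (4, 12): overlapping siblings share children, which is exactly why the structure is a DAG.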
Range Lookup
Input: a query range q = [a,b] and a node v in the DAG
Output: the shortest range in the DAG that contains q

boolean down = true;
search(q, v) {
    if q ⊄ i(v)
        search(q, parent(v));
    elseif q ⊆ i(child(v)) && down
        search(q, child(v));
    elseif some range stored at v is a superset of q
        return the shortest range containing q that is stored at v or parent(v);   (*)
    else {
        down = false;
        search(q, parent(v));
    }
}
Example: for query Q = [7,10], both cached ranges [5,12] and [7,13] contain Q; the shortest superset is [7,13].
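Ignoring the distributed DAG traversal, the lookup's goal can be stated as a flat search (a sketch only; the real algorithm locates this range by walking the DAG rather than scanning every cached range):

```python
def shortest_superset(q, cached_ranges):
    """Among cached ranges [a, b] that contain the query range q,
    return the shortest one, or None if no superset is cached."""
    lo, hi = q
    supersets = [(a, b) for (a, b) in cached_ranges if a <= lo and hi <= b]
    return min(supersets, key=lambda r: r[1] - r[0], default=None)

print(shortest_superset((7, 10), [(5, 12), (7, 13)]))  # (7, 13)
```

This reproduces the example above: [7,13] (length 6) beats [5,12] (length 7).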
Peer Protocol
Maps the logical DAG structure to physical peers.
Two components:
- Peer management: handles peer joining, leaving, and failure.
- Range management: deals with query routing and updates.
Peer Management
It ensures that, at any time, every node in the DAG is assigned to some peer, and that the nodes belonging to one peer, called a zone, form a connected component of the DAG.
This is done by handling join requests, leave requests, and failure events properly.
Join Request
The first peer joining the system takes over the entire DAG
A new peer joining the system contacts one of the peers in the system to take over one of its child zones. Default strategy: left child, then mid child, then right child.
Leave Request
When a peer wants to leave (soft departure), it hands over its zone to the smallest neighboring zone.
Neighboring zones: two zones are neighbors if a node in one zone has a parent-child relationship with a node in the other.
Failure Event
A zone maintains information on all its ancestors, so when it finds that one of its parents has failed, it contacts the nearest live ancestor for zone takeover.
Range Management
- Range lookup.
- Range update: when a tuple is updated in the data source, we locate the peer with the shortest range containing that tuple, then update this peer and all its ancestors.
Improvement
Cross pointers: if a node v is the left child of its parent, it keeps cross pointers to all the left children of nodes at its parent's level. Similarly for mid children.
Improvement (Cont.)
Load balancing by peer sampling.
Collapsed DAG: collapse each peer's zone to a single node. The system is balanced if the collapsed DAG is balanced.
Lookup time is O(h), where h is the height of the collapsed DAG, so a balanced system gives optimal performance.
When a new peer joins, it polls k peers at random and sends its join request to the one whose zone is rooted nearest to the root.
[Figure: zones of peers P1, P2, P3 in the DAG, and the corresponding collapsed DAG with P1 as root and P2, P3 as children.]
Conclusion
Caching range queries based on CAN:
- Maps every attribute into a 2D space.
- The space is divided into zones; peers manage their respective zones.
- A range [low, high] is mapped to the point (low, high) in the 2D space.
- Query routing & query forwarding.
Conclusion (Cont.)
Range Addressable Network:
- Models ranges as a DAG.
- Every peer takes responsibility for a group of nodes in the DAG.
- Querying involves traversal of the DAG.