A Scalable, Content-Addressable Network (CAN)
Sylvia Ratnasamy (ACIRI, U.C. Berkeley), Paul Francis (Tahoe Networks),
Mark Handley (ACIRI), Richard Karp (ACIRI, U.C. Berkeley), Scott Shenker (ACIRI)
Slide Credits (for UO CIS 410/510)
• Ratnasamy et al., SIGCOMM '01
• Ken Birman from CIS 514 Cornell
CAN: solution
• Virtual d-dimensional Cartesian coordinate space
• Entire space is partitioned among all the nodes
– every node "owns" a zone in the overall space
– a point maps to the node that owns the enclosing zone
• State per node is O(d)
• Routing between nodes is O(d · n^(1/d)) for d dimensions and n nodes, assuming evenly distributed nodes
CAN Virtual Space
• Virtual d-dimensional Cartesian coordinate system on a d-torus
– example: 2-d [0,1] x [0,1]
– note: coordinates can be rational numbers!
– infinitely expandable
• Subspaces (zones) dynamically partitioned among all nodes
CAN: simple example
[Figure sequence: node 1 owns the entire space; node 2 joins (vertical split first); node 3 joins (horizontal split next); node 4 joins; resulting partitioning]
Storing and Retrieving (K,V) in CAN
• A pair (K,V) is stored by mapping key K to a point P in the space with a uniform hash function and storing (K,V) at the node whose zone contains P
• To retrieve (K,V), apply the same hash function to map K to P and fetch the entry from the node whose zone contains P
– if P is not contained in the zone of the requesting node or its neighboring zones, route the request toward P via the neighbor whose zone is nearest P
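The store/retrieve path above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the per-axis hash construction (SHA-1 salted with the axis index) and the toy two-node zone layout are assumptions for the example.

```python
import hashlib

def key_to_point(key: str, dims: int = 2) -> tuple:
    """Map a key to a point in the unit d-cube using one uniform
    hash per coordinate (hx, hy, ... in the slides)."""
    coords = []
    for axis in range(dims):
        digest = hashlib.sha1(f"{axis}:{key}".encode()).digest()
        # Interpret the first 8 bytes as an integer, scale into [0, 1).
        coords.append(int.from_bytes(digest[:8], "big") / 2**64)
    return tuple(coords)

def zone_contains(zone, point):
    """A zone is an axis-aligned box: list of (lo, hi) per dimension."""
    return all(lo <= x < hi for (lo, hi), x in zip(zone, point))

# Toy layout: two nodes split the 2-d space along x.
zones = {"node1": [(0.0, 0.5), (0.0, 1.0)],
         "node2": [(0.5, 1.0), (0.0, 1.0)]}

# insert(K, V): hash K to P; the node whose zone contains P stores (K, V).
# retrieve(K) hashes K the same way and asks the same node.
p = key_to_point("some-key")
owner = next(n for n, z in zones.items() if zone_contains(z, p))
```

Because both insert and retrieve use the same deterministic hash, any node can recompute P and route toward the same owner.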
CAN: simple example
node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route (K,V) toward point (a,b)
(3) the node owning the zone containing (a,b) stores (K,V)
CAN: simple example
node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route "retrieve(K)" toward (a,b); the owner of (a,b) returns (K,V)
Routing in a CAN
• Each node maintains a table of the IP address and virtual coordinate zone of each local neighbor
• Follow path through the Cartesian space from source to destination coordinates
• Use greedy routing to neighbor closest to destination
• For a d-dimensional space partitioned into n equal zones, each node maintains 2d neighbors
– average routing path length: (d/4) · n^(1/d) hops
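Greedy routing as described above can be sketched as follows. For simplicity this sketch measures plain Euclidean distance to each neighbor's zone center and ignores the torus wraparound; the four-quadrant layout and neighbor table are illustrative assumptions.

```python
def greedy_route(zones, neighbors, start, dest):
    """Greedy CAN routing sketch: at each hop, forward to the neighbor
    whose zone center is closest to the destination point.
    (A real CAN would use torus distance; omitted here.)"""
    def contains(zone, p):
        return all(lo <= x < hi for (lo, hi), x in zip(zone, p))
    def center(zone):
        return tuple((lo + hi) / 2 for lo, hi in zone)
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    path = [start]
    node = start
    while not contains(zones[node], dest):
        node = min(neighbors[node],
                   key=lambda n: dist(center(zones[n]), dest))
        path.append(node)
    return path

# Four quadrant zones on [0,1]^2, each node knowing its 2d = 4 sides
# (collapsed to 2 distinct neighbors in this tiny example).
zones = {"A": [(0, .5), (0, .5)], "B": [(.5, 1), (0, .5)],
         "C": [(0, .5), (.5, 1)], "D": [(.5, 1), (.5, 1)]}
neighbors = {"A": ["B", "C"], "B": ["A", "D"],
             "C": ["A", "D"], "D": ["B", "C"]}
path = greedy_route(zones, neighbors, "A", (0.75, 0.75))
```

Each hop strictly reduces the distance to the destination in this layout, which is what yields the (d/4) · n^(1/d) average path length for evenly split zones.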
CAN: routing table
[Figure: a node's routing table lists each neighbor's IP address and zone]
CAN: routing
[Figure: routing a message from a node at (x,y) toward the point (a,b)]
CAN: node insertion
1) new node discovers some node "I" already in the CAN
2) new node picks a random point (p,q) in the space
3) I routes a join message to (p,q), discovering node J, the current owner of that zone
4) J's zone is split in half; the new node owns one half
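Step 4 above can be sketched as a zone split. The split rule here (halve along the longest dimension, new node takes the upper half) is an illustrative choice; the paper fixes the split axis by a convention over the dimensions.

```python
def split_zone(zone):
    """Node-join sketch: split a zone in half along its longest dimension.
    Returns (old_half, new_half); zone = [(lo, hi), ...] per dimension.
    Which half the new node receives is a convention, assumed here to be
    the upper half."""
    spans = [hi - lo for lo, hi in zone]
    axis = spans.index(max(spans))          # longest side (assumption)
    lo, hi = zone[axis]
    mid = (lo + hi) / 2
    old, new = list(zone), list(zone)
    old[axis] = (lo, mid)                   # J keeps the lower half
    new[axis] = (mid, hi)                   # new node owns the upper half
    return old, new

# J owned the whole 2-d space; after the join each owns half of it.
j_half, new_half = split_zone([(0.0, 1.0), (0.0, 1.0)])
```

After the split, the new node copies the (K,V) entries falling in its half and both nodes update their neighbor tables.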
CAN: node failures
• Need to repair the space
– recover the database
• soft-state updates
• use replication; rebuild the database from replicas
– repair routing
• takeover algorithm
CAN: takeover algorithm
• Simple failures
– know your neighbor's neighbors
– when a node fails, one of its neighbors takes over its zone
• More complex failure modes
– simultaneous failure of multiple adjacent nodes
– scoped flooding to discover neighbors
• Only the failed node's immediate neighbors are required for recovery
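For the simple-failure case, the paper's takeover rule has each neighbor start a timer proportional to its own zone volume, so the smallest-volume neighbor fires first and claims the zone. A sketch that just computes that winner (the `(name, zone)` list shape is an assumption for the example):

```python
def choose_takeover(neighbor_zones):
    """Takeover sketch: among the failed node's live neighbors, the one
    with the smallest zone volume takes over. In the real protocol this
    emerges from volume-proportional timers; here we compute the winner
    directly.

    neighbor_zones: list of (node_name, zone) pairs,
    where zone = [(lo, hi), ...] per dimension."""
    def volume(zone):
        v = 1.0
        for lo, hi in zone:
            v *= hi - lo
        return v
    name, _ = min(neighbor_zones, key=lambda nz: volume(nz[1]))
    return name

# Q's zone is smaller, so Q takes over the failed node's zone.
winner = choose_takeover([("P", [(0, .5), (0, .5)]),
                          ("Q", [(0, .25), (0, .25)])])
```

Preferring the smallest neighbor keeps zone sizes from drifting further apart after failures.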
Evaluation
• Scalability
• Low-latency
• Load balancing
• Robustness
CAN: scalability
• For a uniformly partitioned space with n nodes and d dimensions
– per node, the number of neighbors is 2d
– the average routing path is (d/4) · n^(1/d) hops
– simulations show that these results hold in practice
• Can scale the network without increasing per-node state; unlimited growth possible
• Chord/Plaxton/Tapestry/Buzz
– log(n) neighbors with log(n) hops
CAN: low-latency
• Problem
– latency stretch = (CAN routing delay) / (IP routing delay)
– application-level routing may lead to high stretch
• Solution
– increase the number of dimensions
– heuristics
• RTT-weighted routing
• multiple nodes per zone (peer nodes)
• deterministically replicate entries
CAN: low-latency
[Figure: latency stretch vs. number of nodes (16K, 32K, 65K, 131K), #dimensions = 2, with and without heuristics]
CAN: low-latency
[Figure: latency stretch vs. number of nodes (16K, 32K, 65K, 131K), #dimensions = 10, with and without heuristics]
CAN: load balancing
• Two pieces
– dealing with hot spots
• popular (key,value) pairs
• nodes cache recently requested entries
• an overloaded node replicates popular entries at its neighbors
– uniform coordinate space partitioning
• uniformly spreads (key,value) entries
• uniformly spreads out routing load
Uniform Partitioning
• Added check at join time
– pick a zone
– check the neighboring zones
– split the largest of these zones rather than the one initially picked
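The added check can be sketched as follows; the `zones`/`neighbors` dictionaries are illustrative assumptions about how a node knows its neighborhood.

```python
def pick_zone_to_split(landing_node, zones, neighbors):
    """Uniform-partitioning check sketch: instead of always splitting the
    zone the join point lands in, compare that zone with its neighbors'
    zones and split the largest, which evens out zone volumes.

    zones: node -> [(lo, hi), ...]; neighbors: node -> list of nodes."""
    def volume(zone):
        v = 1.0
        for lo, hi in zone:
            v *= hi - lo
        return v
    candidates = [landing_node] + list(neighbors[landing_node])
    return max(candidates, key=lambda n: volume(zones[n]))

# The join point lands in J's small zone, but neighbor N1's zone is
# larger, so N1's zone gets split instead.
zones = {"J": [(0, .25), (0, .25)], "N1": [(.25, 1), (0, 1)]}
neighbors = {"J": ["N1"]}
victim = pick_zone_to_split("J", zones, neighbors)
```

This local comparison is what narrows the zone-volume distribution in the figure below.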
Uniform Partitioning
[Figure: percentage of nodes vs. zone volume (V/16, V/8, V/4, V/2, V, 2V, 4V, 8V, where V = total volume / n), with and without the check; 65,000 nodes, 3 dimensions]
CAN: Robustness
• Completely distributed
– no single point of failure
• Database recovery is not explored here
• Resilience of routing
– can route around trouble
Routing resilience
[Figure sequence: a message routed from source to destination, detouring around failed zones]
Routing resilience
• Node X::route(D)
– if X cannot make progress toward D, check whether any neighbor of X can make progress
– if yes, forward the message to one such neighbor
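The rule above can be sketched as a next-hop function. Distances are plain Euclidean (torus wraparound ignored), and the quadrant layout with one failed node is an illustrative assumption.

```python
def next_hop(node, dest, zones, neighbors, alive):
    """Resilience sketch of node X::route(D): forward to the live neighbor
    closest to the destination; if no live neighbor makes progress, fall
    back to any live neighbor that itself has a live neighbor making
    progress (a one-hop detour). Returns None if routing fails."""
    def center(z):
        return tuple((lo + hi) / 2 for lo, hi in z)
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    here = dist(center(zones[node]), dest)
    live = [n for n in neighbors[node] if n in alive]
    progressing = [n for n in live if dist(center(zones[n]), dest) < here]
    if progressing:
        return min(progressing, key=lambda n: dist(center(zones[n]), dest))
    # No direct progress: detour via a live neighbor whose own neighbor
    # can make progress.
    for n in live:
        if any(m in alive and dist(center(zones[m]), dest) < here
               for m in neighbors[n]):
            return n
    return None

zones = {"A": [(0, .5), (0, .5)], "B": [(.5, 1), (0, .5)],
         "C": [(0, .5), (.5, 1)], "D": [(.5, 1), (.5, 1)]}
neighbors = {"A": ["B", "C"], "B": ["A", "D"],
             "C": ["A", "D"], "D": ["B", "C"]}
# B has failed; routing from A to a point in D detours through C.
hop = next_hop("A", (0.75, 0.75), zones, neighbors, {"A", "C", "D"})
```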
Routing resilience
[Figure: Pr(successful routing) vs. #dimensions (2, 4, 6, 8, 10); CAN size = 16K nodes, Pr(node failure) = 0.25]
Routing resilience
[Figure: Pr(successful routing) vs. Pr(node failure) (0, 0.25, 0.5, 0.75); CAN size = 16K nodes, #dimensions = 10]
Topologically Sensitive Overlay Construction for CAN: Distributed Binning
• Idea
– well-known set of landmark machines
– each CAN node measures its RTT to each landmark
– each node orders the landmarks by increasing RTT
– nodes are sorted into bins based on the landmark order
• CAN construction
– place nodes from the same bin close together in the CAN
Distributed Binning
Basic Binning Example
• 5 landmark machines: L1, L2, L3, L4, L5
• A CAN node pings the 5 landmarks and orders them from closest to farthest by RTT
– e.g. bin label "21453", one of 5! = 120 possible orderings
• The bin label is translated into CAN coordinates
– divide the space into 5! subspaces by cycling through the dimensions (X, Y, Z, X, Y), splitting into 5, 4, 3, 2, 1 portions in turn
– thus the X axis ends up divided into 5·2 = 10 portions, Y into 4·1 = 4, and Z into 3 (10·4·3 = 120)
Sophisticated Distributed Binning
• Divide RTTs into ranges
– level 0: [0,100) ms
– level 1: [100,200) ms, etc.
• A node's bin includes both the landmark ordering and the levels
– e.g. node A's RTTs to landmarks L1, L2, L3 are 232 ms, 51 ms, and 117 ms
– bin ordering is L2, L3, L1; level ordering is 012
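Both binning variants reduce to ordering landmarks by RTT and, in the sophisticated version, recording each landmark's RTT level. A sketch using the slide's numbers (the `rtts_ms` dictionary shape is an assumption for the example):

```python
def bin_label(rtts_ms, level_ms=100):
    """Distributed-binning sketch: order landmarks by increasing RTT, and
    record each ordered landmark's RTT level ([0,100) -> 0,
    [100,200) -> 1, ...).

    rtts_ms: landmark name -> measured RTT in milliseconds."""
    ordering = sorted(rtts_ms, key=rtts_ms.get)
    levels = [int(rtts_ms[lm] // level_ms) for lm in ordering]
    return ordering, levels

# Slide example: node A's RTTs to L1, L2, L3 are 232, 51, 117 ms.
order, levels = bin_label({"L1": 232, "L2": 51, "L3": 117})
# ordering L2, L3, L1 with level ordering 0, 1, 2
```

Nodes sharing the same (ordering, levels) bin are likely topologically close, so placing them in adjacent CAN zones lowers latency stretch.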
Distributed Binning
• 4 landmarks (placed 5 hops away from each other); naïve partitioning
[Figure: latency stretch vs. number of nodes (256, 1K, 4K), with and without binning; #dimensions = 2 and #dimensions = 4]
Fault Tolerance: Multiple Hash Functions
• Improve data availability by using k hash functions to map a single key to k points in the coordinate space
• Replicate (K,V) and store it at k distinct nodes
• (K,V) is unavailable only when all k replicas are simultaneously unavailable
• Authors suggest querying all k nodes in parallel to reduce average lookup latency
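The k hash functions can be derived from one base hash by salting with the replica index, giving k independent points for the same key. This construction (SHA-1 salted per replica and per axis) is an illustrative assumption, not the paper's specific choice:

```python
import hashlib

def replica_points(key: str, k: int = 3, dims: int = 2):
    """Multiple-hash-functions sketch: derive k independent hash
    functions by salting one hash with the replica index r, yielding
    k points (and hence k distinct storing nodes) for the same key."""
    points = []
    for r in range(k):
        coords = []
        for axis in range(dims):
            d = hashlib.sha1(f"{r}:{axis}:{key}".encode()).digest()
            coords.append(int.from_bytes(d[:8], "big") / 2**64)
        points.append(tuple(coords))
    return points

# (K,V) is stored at the owners of all k points; a query hashes the key
# the same way and can contact all k owners in parallel.
pts = replica_points("some-key", k=3)
```

A lookup succeeds as long as any one of the k owners is reachable, and parallel queries return the fastest replica's answer.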