Page 1
CS 268: Lecture 22 DHT Applications
Ion Stoica
Computer Science Division
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1776
(Presentation based on slides from Robert Morris and Sean Rhea)
Page 2
Outline
• Cooperative File System (CFS)
• Open DHT
Page 3
Target CFS Uses
• Serving data with inexpensive hosts:
- open-source distributions
- off-site backups
- tech report archive
- efficient sharing of music
[Figure: participating nodes connected to one another over the Internet]
Page 4
How to mirror open-source distributions?
• Multiple independent distributions
- Each has high peak load, low average
• Individual servers are wasteful
• Solution: aggregate
- Option 1: single powerful server
- Option 2: distributed service
• But how do you find the data?
Page 5
Design Challenges
• Avoid hot spots
• Spread storage burden evenly
• Tolerate unreliable participants
• Fetch speed comparable to whole-file TCP
• Avoid O(#participants) algorithms
- Centralized mechanisms [Napster], broadcasts [Gnutella]
CFS solves these challenges
Page 6
CFS Architecture
• Each node is a client and a server
• Clients can support different interfaces
- File system interface
- Music key-word search
[Figure: two nodes, each running both a client and a server, connected over the Internet]
Page 7
Client-server interface
• Files have unique names
• Files are read-only (single writer, many readers)
• Publishers split files into blocks
• Clients check files for authenticity
[Figure: the FS layer and client on one node send "insert file f" / "lookup file f" to the local server, which exchanges "insert block" / "lookup block" requests with servers on other nodes]
Page 8
Server Structure
• DHash stores, balances, replicates, caches blocks
• DHash uses Chord [SIGCOMM 2001] to locate blocks
[Figure: Node 1 and Node 2 each layer DHash on top of Chord]
Page 9
Chord Hashes a Block ID to its Successor
• Nodes and blocks have randomly distributed IDs
• Successor: node with next highest ID
[Figure: circular ID space with nodes N10, N32, N60, N80, N100; blocks B112, B120, ..., B10 map to N10; B11, B30 to N32; B33, B40, B52 to N60; B65, B70 to N80; B100 to N100]
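A minimal Python sketch of the successor rule above; the helper names, the 160-bit ID-space constant, and the tiny example IDs are illustrative rather than actual CFS code.

# A block is stored at the first node whose ID is greater than or equal to
# the block's ID, wrapping around the circular ID space.
import hashlib

ID_BITS = 160                  # Chord/CFS identifiers come from SHA-1
ID_SPACE = 2 ** ID_BITS

def block_id(data: bytes) -> int:
    # A block's ID is the SHA-1 hash of its contents.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def successor(bid: int, node_ids: list[int]) -> int:
    # Return the node with the next-highest ID, wrapping around the ring.
    bid %= ID_SPACE
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= bid:
            return nid
    return ring[0]             # wrap past the largest node ID back to the smallest

# Tiny example using the IDs from the figure:
nodes = [10, 32, 60, 80, 100]
assert successor(40, nodes) == 60    # B40 -> N60
assert successor(112, nodes) == 10   # B112 wraps around to N10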
Page 10
DHash/Chord Interface
• lookup() returns a list of node IDs closer in ID space to the block ID
- Sorted, closest first
[Figure: the server's DHash layer calls Lookup(blockID) on Chord; Chord routes using its finger table of <node ID, IP address> entries and returns a list of <node ID, IP address> pairs]
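A hedged sketch of this layering in Python: DHash asks Chord for candidate nodes and tries them closest-first until one returns the block. The class and method names here are assumptions for illustration, not the actual CFS interfaces.

from typing import Optional

class Chord:
    def lookup(self, block_id: int) -> list[tuple[int, str]]:
        # Returns <node ID, IP address> pairs closer in ID space to block_id,
        # sorted closest first (routed via the finger table).
        raise NotImplementedError

class DHash:
    def __init__(self, chord: Chord):
        self.chord = chord

    def get(self, block_id: int) -> Optional[bytes]:
        # Try candidate nodes in order of ID-space closeness to the block.
        for node_id, ip in self.chord.lookup(block_id):
            block = self.fetch_block(ip, block_id)   # RPC to the remote node
            if block is not None:
                return block
        return None

    def fetch_block(self, ip: str, block_id: int) -> Optional[bytes]:
        raise NotImplementedError                    # network fetch, omitted here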
Page 11
DHash Uses Other Nodes to Locate Blocks
[Figure: Lookup(BlockID=45) is forwarded in three routing steps around the ring (nodes N5 through N110) until it reaches N50, the block's successor]
Page 12
Storing Blocks
• Long-term blocks are stored for a fixed time
- Publishers need to refresh periodically
• Cache uses LRU
[Figure: a node's disk is split into an LRU cache and long-term block storage]
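A rough Python sketch of this split, assuming (for illustration only) a 24-hour lifetime that publishers must refresh; none of the names below come from CFS.

import time
from collections import OrderedDict

LONG_TERM_LIFETIME = 24 * 3600   # assumed refresh interval, for illustration

class BlockStore:
    def __init__(self, cache_capacity: int):
        self.long_term = {}              # block_id -> (data, expiry_time)
        self.cache = OrderedDict()       # block_id -> data, kept in LRU order
        self.cache_capacity = cache_capacity

    def insert_long_term(self, block_id, data):
        # Publishers must call this again periodically to refresh the expiry.
        self.long_term[block_id] = (data, time.time() + LONG_TERM_LIFETIME)

    def insert_cached(self, block_id, data):
        self.cache[block_id] = data
        self.cache.move_to_end(block_id)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)   # evict the least recently used block

    def lookup(self, block_id):
        entry = self.long_term.get(block_id)
        if entry and entry[1] > time.time():
            return entry[0]
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark as recently used
            return self.cache[block_id]
        return None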
Page 13
Replicate blocks at r successors
[Figure: Block 17 is stored at its successor on the ring and replicated at the next r successors]
• Node IDs are SHA-1 of IP address
• Ensures independent replica failure
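A small sketch of this placement rule, assuming node IDs are the SHA-1 of the IP address and the block goes to its successor plus the following nodes; whether the primary counts toward r is a convention choice made here for illustration.

import hashlib

def node_id(ip_address: str) -> int:
    # A node's ID is the SHA-1 hash of its IP address.
    return int.from_bytes(hashlib.sha1(ip_address.encode()).digest(), "big")

def replica_set(block_id: int, node_ids: list[int], r: int) -> list[int]:
    # The block's successor plus the next r - 1 nodes on the ring.
    ring = sorted(node_ids)
    start = next((i for i, nid in enumerate(ring) if nid >= block_id), 0)
    return [ring[(start + k) % len(ring)] for k in range(r)]

# Example with the figure's node IDs: block 17 lands on N20, then N40 and N50.
assert replica_set(17, [5, 10, 20, 40, 50, 60, 68, 80, 99, 110], 3) == [20, 40, 50]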
Page 14
Lookups find replicas
[Figure: Lookup(BlockID=17); when the fetch from one node fails, the block is fetched from a replica]
RPCs:
1. Lookup step
2. Get successor list
3. Failed block fetch
4. Block fetch
Page 15
First Live Successor Manages Replicas
[Figure: the first live successor of Block 17 holds a copy of 17 and manages its replicas]
• Node can locally determine that it is the first live successor
Page 16
DHash Copies to Caches Along Lookup Path
[Figure: during Lookup(BlockID=45), the fetched block is copied to a cache along the lookup path]
RPCs:
1. Chord lookup
2. Chord lookup
3. Block fetch
4. Send to cache
Page 17
Caching at Fingers Limits Load
• Only O(log N) nodes have fingers pointing to N32
• This limits the single-block load on N32
Page 18
Virtual Nodes Allow Heterogeneity
• Hosts may differ in disk/net capacity
• Hosts may advertise multiple IDs
- Chosen as SHA-1(IP Address, index)
- Each ID represents a “virtual node”
• Host load proportional to # v.n.’s
• Manually controlled
[Figure: Node A advertises virtual nodes N10, N60, N101; Node B advertises N5]
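A short sketch of deriving virtual-node IDs as SHA-1(IP address, index); the exact string encoding of the (IP, index) pair is an assumption made for illustration.

import hashlib

def virtual_node_ids(ip_address: str, num_virtual_nodes: int) -> list[int]:
    # One ID per (IP address, index) pair; the host joins the ring once per ID.
    ids = []
    for index in range(num_virtual_nodes):
        digest = hashlib.sha1(f"{ip_address},{index}".encode()).digest()
        ids.append(int.from_bytes(digest, "big"))
    return ids

# A well-provisioned host can advertise more virtual nodes than a weak one,
# so its load grows roughly in proportion to the number it advertises.
big_host = virtual_node_ids("10.0.0.1", 4)
small_host = virtual_node_ids("10.0.0.2", 1)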
Page 19
Why Blocks Instead of Files?
• Cost: one lookup per block
- Can tailor cost by choosing good block size
• Benefit: load balance is simple
- For large files
- Storage cost of large files is spread out
- Popular files are served in parallel
Page 20
Outline
• Cooperative File System (CFS)
• Open DHT
Page 21
Questions:
How many DHTs will there be?
Can all applications share one DHT?
Page 22
Benefits of Sharing a DHT
• Amortizes costs across applications
- Maintenance bandwidth, connection state, etc.
• Facilitates “bootstrapping” of new applications
- Working infrastructure already in place
• Allows for statistical multiplexing of resources
- Takes advantage of spare storage and bandwidth
• Facilitates upgrading existing applications
- “Share” DHT between application versions
Page 23
The DHT as a Service
[Figure: many nodes, each storing key-value (K V) pairs]
Page 24
The DHT as a Service
[Figure: the same key-value nodes, now collectively forming OpenDHT]
Page 25
The DHT as a Service
[Figure: OpenDHT and its clients]
Page 26
The DHT as a Service
Page 27
The DHT as a Service
What is this interface?
Page 28
It’s not lookup()
[Figure: a client issues lookup(k); the key k is routed to some node. What does this node do with it?]
Challenges:
1. Distribution
2. Security
Page 29
How are DHTs Used?
1. Storage
- CFS, UsenetDHT, PKI, etc.
2. Rendezvous
- Simple: Chat, Instant Messenger
- Load balanced: i3
- Multicast: RSS Aggregation, White Board
- Anycast: Tapestry, Coral
Page 30
What about put/get?
• Works easily for storage applications
• Easy to share
- No upcalls, so no code distribution or security complications
• But does it work for rendezvous?
- Chat? Sure: put(my-name, my-IP)
- What about the others?
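A hedged sketch of chat-style rendezvous over bare put/get; the dht object, its put(key, value, ttl)/get(key) methods, and the 60-second TTL are placeholders rather than the real OpenDHT client API.

def register(dht, my_name: str, my_ip: str, ttl: int = 60):
    # Re-put periodically (before the TTL expires) to stay reachable.
    dht.put(key=my_name, value=my_ip.encode(), ttl=ttl)

def find_peer(dht, name: str) -> list[str]:
    # get() may return several values if multiple clients put under this key.
    return [value.decode() for value in dht.get(key=name)]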
Page 31
Protecting Against Overuse
• Must protect system resources against overuse
- Resources include network, CPU, and disk
- Network and CPU straightforward
- Disk harder: usage persists long after requests
• Hard to distinguish malice from eager usage
- Don’t want to hurt eager users if utilization low
• Number of active users changes over time
- Quotas are inappropriate
Page 32
Fair Storage Allocation
• Our solution: give each client a fair share
- Will define “fairness” in a few slides
• Limits strength of malicious clients
- Only as powerful as they are numerous
• Protect storage on each DHT node separately
- Must protect each subrange of the key space
- Rewards clients that balance their key choices
Page 33
The Problem of Starvation
• Fair shares change over time
- Decrease as system load increases
[Timeline: Client 1 arrives and fills 50% of the disk; Client 2 arrives and fills 40%; Client 3 arrives and its max share is only 10%: starvation!]
Page 34
Preventing Starvation
• Simple fix: add time-to-live (TTL) to puts
- put(key, value) → put(key, value, ttl)
• Prevents long-term starvation
- Eventually all puts will expire
Page 35
Preventing Starvation
Simple fix: add time-to-live (TTL) to puts- put (key, value) put (key, value, ttl)
Prevents long-term starvation- Eventually all puts will expire
Can still get short term starvation
[Timeline: Client A arrives and fills the entire disk; Client B arrives and asks for space; B starves until Client A’s values start expiring]
Page 36
Preventing Starvation
• Stronger condition: be able to accept rmin bytes/sec of new data at all times
• This is non-trivial to arrange!
[Figure: space vs. time plot from now onward. Already-accepted puts occupy space that shrinks as they expire; a wedge of slope rmin is reserved for future puts; the candidate put occupies a size × TTL rectangle; the sum of all three must stay below the max capacity at all times]
Page 37
Preventing Starvation
• Stronger condition: be able to accept rmin bytes/sec of new data at all times
• This is non-trivial to arrange!
[Figure: two space-vs-time plots for a candidate put (size × TTL rectangle); in the second, the total would exceed the max capacity: violation!]
Page 38
Preventing Starvation
• Formalize graphical intuition:
f(τ) = B(tnow) − D(tnow, tnow+τ) + rmin · τ
- B(tnow): bytes currently stored
- D(tnow, tnow+τ): aggregate size of puts expiring in the interval (tnow, tnow+τ)
• To accept a put of size x and TTL l:
f(τ) + x < C for all 0 ≤ τ < l
• Can track the value of f efficiently with a tree
- Leaves represent inflection points of f
- Add put, shift time are O(log n), n = # of puts
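A simplified O(n) Python sketch of this admission test (the slides track f with a tree so that updates cost O(log n)); all names below are illustrative.

def can_accept(puts, stored_bytes, x, ttl, capacity, r_min):
    # puts: list of (expiry_offset_seconds, size_bytes) for data already
    # stored, with offsets measured from t_now.
    # Accept a put of size x and TTL ttl iff f(tau) + x < capacity for all
    # 0 <= tau < ttl, where f(tau) = stored_bytes - D(tau) + r_min * tau and
    # D(tau) is the total size of already-stored puts expiring within tau.
    #
    # f rises with slope r_min and drops at each expiry, so it is enough to
    # test it just before each drop inside [0, ttl) and just before ttl.
    ordered = sorted(puts)
    checkpoints = [off for off, _ in ordered if off < ttl] + [ttl]
    expired = 0.0
    i = 0
    for tau in checkpoints:
        while i < len(ordered) and ordered[i][0] < tau:
            expired += ordered[i][1]   # bytes already gone strictly before tau
            i += 1
        f = stored_bytes - expired + r_min * tau
        if f + x >= capacity:
            return False
    return True

# Example: 600 bytes stored, expiring 30 s and 60 s from now, 1000-byte capacity.
ok = can_accept([(30, 400), (60, 200)], 600, x=300, ttl=45, capacity=1000, r_min=1.0)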
Page 39
Fair Storage Allocation
[Figure: puts enter per-client put queues. Queue full: reject the put; not full: enqueue it. Select the most under-represented client, wait until its put can be accepted without violating rmin, then store the value and send an accept message to the client]
The Big Decision: Definition of “most under-represented”
Page 40
Defining “Most Under-Represented”
• Not just sharing disk, but disk over time
- A 1-byte put for 100 s is the same as a 100-byte put for 1 s
- So units are byte-seconds; call them commitments
• Equalize total commitments granted?
- No: leads to starvation
- A fills disk, B starts putting, A starves up to max TTL
[Timeline: Client A arrives and fills the entire disk; Client B arrives and asks for space; B catches up with A in total commitments; now A starves!]
Page 41
Defining “Most Under-Represented”
• Instead, equalize rate of commitments granted
- Service granted to one client depends only on others putting “at the same time”
[Timeline: Client A arrives and fills the entire disk; Client B arrives and asks for space; B catches up with A; A and B then share the available rate]
Page 42
Defining “Most Under-Represented”
• Instead, equalize rate of commitments granted
- Service granted to one client depends only on others putting “at the same time”
• Mechanism inspired by Start-time Fair Queuing
- Have virtual time, v(t)
- Each put p_c^i (client c’s i-th put) gets a start time S(p_c^i) and finish time F(p_c^i)
F(p_c^i) = S(p_c^i) + size(p_c^i) · ttl(p_c^i)
S(p_c^i) = max(v(A(p_c^i)) − δ, F(p_c^(i-1))), where A(p_c^i) is the put’s arrival time
v(t) = maximum start time of all accepted puts
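A sketch of these tags in Python. DELTA stands in for the constant subtracted from the virtual time, the class and method names are placeholders, and tags are assumed to be assigned at arrival so v(A(p)) is the current virtual time; the scheduler would then serve, among clients with queued puts, the one whose next put has the smallest start tag.

DELTA = 0.0   # burst allowance; its value is a policy choice not given here

class FairTagger:
    def __init__(self):
        self.virtual_time = 0.0    # v(t): max start time of accepted puts
        self.last_finish = {}      # F(p_c^{i-1}) for each client c

    def tag(self, client, size, ttl):
        # Start tag: the later of (virtual time at arrival - DELTA) and the
        # client's previous finish tag.
        start = max(self.virtual_time - DELTA, self.last_finish.get(client, 0.0))
        # Finish tag: start plus the commitment (size x ttl byte-seconds).
        finish = start + size * ttl
        self.last_finish[client] = finish
        self.virtual_time = max(self.virtual_time, start)
        return start, finish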
Page 43
FST Performance