ZHT: A Fast, Reliable and Scalable Zero-hop Distributed Hash Table
Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Zhao Zhang, Ioan Raicu
Illinois Institute of Technology, Chicago, U.S.A.
Feb 23, 2016
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
— Ken Batcher
Big problem: file system scalability
- Parallel file systems (GPFS, PVFS, Lustre)
  - Computing resources separated from storage
  - Centralized metadata management
- Distributed file systems (GFS, HDFS)
  - Special-purpose designs (MapReduce, etc.)
  - Centralized metadata management
The bottleneck of file systems: metadata
[Figure: concurrent file creates. Time per operation (ms, log scale 1 to 100,000) vs. scale (# of cores, 1 to 16,384), for File Create (GPFS Many Dir) and File Create (GPFS One Dir).]
Proposed work
- A distributed hash table (DHT) for HEC
- A building block for high-performance distributed systems
- Goals: performance (latency, throughput), scalability, reliability
Related work: distributed hash tables
Many DHTs exist: Chord, Kademlia, Pastry, Cassandra, C-MPI, Memcached, Dynamo, ...
Why another?

| Name      | Impl. | Routing Time | Persistence | Dynamic Membership | Append Operation |
|-----------|-------|--------------|-------------|--------------------|------------------|
| Cassandra | Java  | Log(N)       | Yes         | Yes                | No               |
| C-MPI     | C     | Log(N)       | No          | No                 | No               |
| Dynamo    | Java  | 0 to Log(N)  | Yes         | Yes                | No               |
| Memcached | C     | 0            | No          | No                 | No               |
| ZHT       | C++   | 0 to 2       | Yes         | Yes                | Yes              |
Zero-hop hash mapping
[Figure: clients 1 ... n hash a key directly to its home node (Node 1 ... Node n); each value (e.g., value j, value k) is stored on replicas 1, 2, and 3.]
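A minimal sketch of the zero-hop idea, assuming a fixed-size membership table known to every client (the names here are illustrative, not ZHT's actual API): the client hashes the key and indexes the table directly, so a request reaches the right server in zero routing hops, and replicas are simply the next entries after the primary.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch: with a static membership table, a client maps a key
// to its home node with a single hash -- no multi-hop routing needed.
struct Member { std::string ip; int port; };

struct ZeroHopTable {
    std::vector<Member> members;  // fixed at bootstrap (static membership)

    // Primary (home) node for a key: one hash, one modulo.
    std::size_t home(const std::string& key) const {
        return std::hash<std::string>{}(key) % members.size();
    }

    // Replicas are placed on the next r nodes after the primary.
    std::vector<std::size_t> replicas(const std::string& key, int r) const {
        std::vector<std::size_t> ids;
        std::size_t h = home(key);
        for (int i = 0; i < r; ++i)
            ids.push_back((h + i) % members.size());
        return ids;
    }
};
```

Because every client holds the full table, there is no lookup traffic at all before the request itself, which is what keeps per-operation latency flat as the system scales.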
2-layer hashing
[Figure: keys are first hashed to one of n fixed partitions; the membership table then maps each partition to its current physical node.]
Architecture and terms
- Name space: 2^64
- Hierarchy: physical node → manager → ZHT instance → partition
- Number of partitions n is fixed: n = max(k), the maximum number of nodes
- [Figure: on each physical node, a manager hosts ZHT instances, each serving partitions; managers respond to requests, apply updates, and broadcast membership changes.]
- Membership table entry: UUID (ZHT instance), key, IP, port, capacity, workload
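The two-layer mapping above can be sketched as follows (a simplified model, not ZHT's source code): the partition count n is fixed at the maximum expected scale, so a key always hashes to the same partition; only the partition-to-node layer changes as nodes join or leave.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Layer 1: key -> partition (fixed forever, n = max expected node count).
// Layer 2: partition -> physical node (updated as membership changes).
class TwoLayerMap {
public:
    TwoLayerMap(std::size_t n_partitions, std::size_t n_nodes)
        : owner_(n_partitions) {
        // Initially spread the fixed partitions evenly over current nodes.
        for (std::size_t p = 0; p < n_partitions; ++p)
            owner_[p] = p % n_nodes;
    }
    std::size_t partition(const std::string& key) const {
        return std::hash<std::string>{}(key) % owner_.size();
    }
    std::size_t node(const std::string& key) const {
        return owner_[partition(key)];  // zero-hop: one table lookup
    }
    // On a join/leave, whole partitions are reassigned to new owners;
    // keys never need rehashing because layer 1 never changes.
    void reassign(std::size_t part, std::size_t new_node) {
        owner_[part] = new_node;
    }
private:
    std::vector<std::size_t> owner_;  // partition id -> node id
};
```

This is why ZHT can migrate a whole partition instead of rehashing individual key-value pairs: membership changes only rewrite entries in `owner_`.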
How many partitions per node can we have?
[Figure: average latency (ms) vs. number of partitions per instance (1 to 1,000); latency stays between roughly 0.6 and 0.78 ms.]
Membership management
- Static membership: Memcached, ZHT
- Dynamic membership
  - Logarithmic routing: most DHTs
  - Constant routing: ZHT
Membership management: updating membership
- Incremental broadcasting
- Remapping k-v pairs
  - Traditional DHTs: rehash all affected pairs
  - ZHT: move the whole partition (HEC has a fast local network!)
Consistency
- Updating membership tables
  - Planned node joins and leaves: strong consistency
  - Node failures: eventual consistency
- Updating replicas: configurable
  - Strong consistency: consistent, reliable
  - Eventual consistency: fast, high availability
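The configurable replica consistency above can be illustrated with a simple write path (a model, not ZHT's API; `ack` is a hypothetical stand-in for the per-replica RPC): strong consistency returns only after every replica acknowledges, while eventual consistency returns as soon as the first replica acknowledges and lets the rest catch up asynchronously.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class Consistency { Strong, Eventual };

// Model of a replicated write. `ack` stands in for the per-replica RPC
// and returns true if that replica accepted the update.
// Strong:   the write succeeds only if every replica acknowledges.
// Eventual: the write succeeds once the first replica acknowledges;
//           the remaining replicas are updated in the background.
inline bool replicatedPut(const std::vector<std::string>& replicas,
                          Consistency mode,
                          const std::function<bool(const std::string&)>& ack) {
    int acked = 0;
    for (const auto& r : replicas) {
        if (ack(r)) ++acked;
        if (mode == Consistency::Eventual && acked >= 1)
            return true;  // fast path: don't wait for the other replicas
    }
    return acked == static_cast<int>(replicas.size());  // strong: all or fail
}
```

The trade-off on the slide falls directly out of this loop: the strong path pays the latency of the slowest replica, while the eventual path pays only the fastest one.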
Persistence: NoVoHT
- NoVoHT: a persistent in-memory hash map
- Append operation
- Live migration
[Figure: latency (microseconds, roughly 0 to 20) vs. scale (1, 10, 100 million key/value pairs) for NoVoHT, NoVoHT (no persistence), KyotoCabinet, BerkeleyDB, and unordered_map.]
Failure handling
- Insert and append: send the request to the next replica; mark that record as the primary copy
- Lookup: get from the next available replica
- Remove: mark the record on all replicas
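The write-side failover rule above can be sketched as follows (a sketch, not ZHT's actual code; `send` is a hypothetical stand-in for the real RPC): walk the replica list in order and promote the first replica that accepts the write to primary copy.

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of insert/append failure handling: try replicas in order and
// mark the first one that accepts the write as holding the primary copy.
struct Replica {
    std::string addr;
    bool primary = false;
};

// `send` returns false when the target node is unreachable.
inline bool writeWithFailover(
    std::vector<Replica>& replicas,
    const std::string& key, const std::string& val,
    const std::function<bool(const Replica&, const std::string&,
                             const std::string&)>& send) {
    for (auto& r : replicas) {
        if (send(r, key, val)) {
            r.primary = true;  // this copy now serves as the primary
            return true;
        }
        // node unreachable: fall through to the next replica
    }
    return false;  // all replicas failed
}
```

Lookups follow the same pattern in reverse: read from the first replica that answers, so a single failed node never blocks the operation.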
Evaluation: test beds
- IBM Blue Gene/P supercomputer: up to 8,192 nodes, 32,768 instances deployed
- Commodity cluster: up to 64 nodes
- Amazon EC2: m1.medium and cc2.8xlarge; 96 VMs, 768 ZHT instances deployed
Latency on BG/P
[Figure: latency (ms, 0 to 2.5) vs. number of nodes, for TCP without connection caching, TCP with connection caching, UDP, and Memcached.]
Latency distribution (microseconds)

| Scale | 75% | 90%  | 95%  | 99%  |
|-------|-----|------|------|------|
| 64    | 713 | 853  | 961  | 1259 |
| 256   | 755 | 933  | 1097 | 1848 |
| 1024  | 820 | 1053 | 1289 | 3105 |
Throughput on BG/P
[Figure: throughput (ops/s, log scale 1,000 to 10,000,000) vs. scale (# of nodes, 1 to 8,192), for TCP without connection caching, ZHT with TCP connection caching, UDP non-blocking, and Memcached.]
Aggregated throughput on BG/P
[Figure: throughput (ops/s, 0 to 18,000,000) vs. number of nodes (1 to 8,192), with 1, 2, 4, and 8 instances per node.]
Latency on commodity cluster
[Figure: latency (ms, 0 to 3) vs. scale (# of nodes, 1 to 64), for ZHT, Cassandra, and Memcached.]
ZHT on cloud: latency
[Figure: average latency (microseconds, 0 to 14,000) vs. node count (1 to 96), for ZHT on m1.medium instances (1/node), ZHT on cc2.8xlarge instances (8/node), and DynamoDB.]
ZHT on cloud: latency distribution (microseconds)

DynamoDB (8 clients/instance):

| Scale | 75%   | 90%   | 95%   | 99%   | Avg   | Throughput |
|-------|-------|-------|-------|-------|-------|------------|
| 8     | 11942 | 13794 | 20491 | 35358 | 12169 | 83.39      |
| 32    | 10081 | 11324 | 12448 | 34173 | 9515  | 3363.11    |
| 128   | 10735 | 12128 | 16091 | 37009 | 11104 | 115275     |
| 512   | 9942  | 13664 | 30960 | 38077 | 28488 | ERROR      |

ZHT on cc2.8xlarge instances (8 server-client pairs/instance):

| Scale | 75% | 90% | 95% | 99%  | Avg | Throughput |
|-------|-----|-----|-----|------|-----|------------|
| 8     | 186 | 199 | 214 | 260  | 172 | 46421      |
| 32    | 509 | 603 | 681 | 1114 | 426 | 75080      |
| 128   | 588 | 717 | 844 | 2071 | 542 | 236065     |
| 512   | 574 | 708 | 865 | 3568 | 608 | 841040     |

[Figure: latency CDFs (0.9 and 0.99 percentiles marked) for DynamoDB read, DynamoDB write, and ZHT at 4 to 64 nodes.]
ZHT on cloud: throughput
[Figure: aggregated throughput (ops/s, log scale 10 to 10,000,000) and hourly cost in US dollars (0 to 25) vs. node count (1 to 96), for ZHT cost on m1, ZHT cost on cc2, and DynamoDB cost (10K ops/s provision).]
Amortized cost
[Figure: hourly cost for 1K ops/s throughput, in US dollars (log scale 0.01 to 10), vs. node count (2 to 96), for ZHT on m1.medium instances (1/node).]
Applications
- FusionFS: a distributed file system; metadata is stored in ZHT
- IStore: an information dispersal storage system; metadata is stored in ZHT
- MATRIX: a distributed many-task computing execution framework; ZHT is used to submit tasks and monitor task execution status
FusionFS result: concurrent file creates
[Figure: time per operation (ms, log scale 1 to 1,000) vs. number of nodes (1 to 512), for FusionFS and GPFS.]
IStore results
[Figure: throughput (chunks/sec, 0 to 600) vs. scale (8, 16, 32 nodes), for chunk sizes 1GB, 100MB, 10MB, 1MB, 100KB, and 10KB.]
MATRIX results
[Figure: throughput (tasks/sec, 0 to 6,000) vs. number of processors (1 to 10,000), for Falkon (Linux cluster, C), Falkon (SiCortex), Falkon (BG/P), Falkon (Linux cluster, Java), and MATRIX (BG/P).]
Future work
- Larger scale
- Active failure detection and notification
- Spanning-tree communication
- Network topology-aware routing
- Fully synchronized replicas and membership: Paxos protocol
- Support for more protocols (UDT, MPI, ...)
- Many optimizations
Conclusion
ZHT: a distributed key-value store that is
- Light-weight
- High-performance
- Scalable
- Dynamic
- Fault tolerant
- Versatile: works on clusters, clouds, and supercomputers