ZHT: A Fast, Reliable and Scalable Zero-hop Distributed Hash Table
Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Zhao Zhang, Ioan Raicu
Illinois Institute of Technology, Chicago, U.S.A.
Feb 23, 2016
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
— Ken Batcher
Big problem: file system scalability
- Parallel file systems (GPFS, PVFS, Lustre)
  - Computing resources separated from storage
  - Centralized metadata management
- Distributed file systems (GFS, HDFS)
  - Special-purpose designs (MapReduce, etc.)
  - Centralized metadata management
The bottleneck of file systems: metadata
[Figure: concurrent file creates. Time per operation (ms, log scale 1 to 100,000) vs. scale (# of cores, 1 to 16,384), for File Create (GPFS Many Dir) and File Create (GPFS One Dir).]
Proposed work
- A distributed hash table (DHT) for HEC
- A building block for high-performance distributed systems
- Goals: performance (latency, throughput), scalability, reliability
Related work: distributed hash tables
Many DHTs exist: Chord, Kademlia, Pastry, Cassandra, C-MPI, Memcached, Dynamo, ...
Why another?

| Name      | Impl. | Routing Time | Persistence | Dynamic Membership | Append Operation |
|-----------|-------|--------------|-------------|--------------------|------------------|
| Cassandra | Java  | Log(N)       | Yes         | Yes                | No               |
| C-MPI     | C     | Log(N)       | No          | No                 | No               |
| Dynamo    | Java  | 0 to Log(N)  | Yes         | Yes                | No               |
| Memcached | C     | 0            | No          | No                 | No               |
| ZHT       | C++   | 0 to 2       | Yes         | Yes                | Yes              |
Zero-hop hash mapping
[Figure: clients 1 ... n hash a key directly to its home node (Node 1 ... Node n); each value (e.g., value j, value k) is stored on replicas 1, 2, and 3.]
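A minimal sketch of the zero-hop idea, assuming a fixed-size membership table known to every client (the names here are illustrative, not ZHT's actual API): the client hashes the key and indexes the table directly, so a request reaches the right server in zero routing hops, and replicas are simply the next entries after the primary.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch: with a static membership table, a client maps a key
// to its home node with a single hash -- no multi-hop routing needed.
struct Member { std::string ip; int port; };

struct ZeroHopTable {
    std::vector<Member> members;  // fixed at bootstrap (static membership)

    // Primary (home) node for a key: one hash, one modulo.
    std::size_t home(const std::string& key) const {
        return std::hash<std::string>{}(key) % members.size();
    }

    // Replicas are placed on the next r nodes after the primary.
    std::vector<std::size_t> replicas(const std::string& key, int r) const {
        std::vector<std::size_t> ids;
        std::size_t h = home(key);
        for (int i = 0; i < r; ++i)
            ids.push_back((h + i) % members.size());
        return ids;
    }
};
```

Because every client holds the full table, there is no lookup traffic at all before the request itself, which is what keeps per-operation latency flat as the system scales.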
2-layer hashing
[Figure: keys are first hashed to one of n fixed partitions; the membership table then maps each partition to its current physical node.]
Architecture and terms
- Name space: 2^64
- Hierarchy: physical node → manager → ZHT instance → partition
- Number of partitions n is fixed: n = max(k), the maximum number of nodes
- [Figure: on each physical node, a manager hosts ZHT instances, each serving partitions; managers respond to requests, apply updates, and broadcast membership changes.]
- Membership table entry: UUID (ZHT instance), key, IP, port, capacity, workload
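The two-layer mapping above can be sketched as follows (a simplified model, not ZHT's source code): the partition count n is fixed at the maximum expected scale, so a key always hashes to the same partition; only the partition-to-node layer changes as nodes join or leave.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Layer 1: key -> partition (fixed forever, n = max expected node count).
// Layer 2: partition -> physical node (updated as membership changes).
class TwoLayerMap {
public:
    TwoLayerMap(std::size_t n_partitions, std::size_t n_nodes)
        : owner_(n_partitions) {
        // Initially spread the fixed partitions evenly over current nodes.
        for (std::size_t p = 0; p < n_partitions; ++p)
            owner_[p] = p % n_nodes;
    }
    std::size_t partition(const std::string& key) const {
        return std::hash<std::string>{}(key) % owner_.size();
    }
    std::size_t node(const std::string& key) const {
        return owner_[partition(key)];  // zero-hop: one table lookup
    }
    // On a join/leave, whole partitions are reassigned to new owners;
    // keys never need rehashing because layer 1 never changes.
    void reassign(std::size_t part, std::size_t new_node) {
        owner_[part] = new_node;
    }
private:
    std::vector<std::size_t> owner_;  // partition id -> node id
};
```

This is why ZHT can migrate a whole partition instead of rehashing individual key-value pairs: membership changes only rewrite entries in `owner_`.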
How many partitions per node can we have?
[Figure: average latency (ms) vs. number of partitions per instance (1 to 1,000); latency stays between roughly 0.6 and 0.78 ms.]
Membership management
- Static membership: Memcached, ZHT
- Dynamic membership
  - Logarithmic routing: most DHTs
  - Constant routing: ZHT
Membership management: updating membership
- Incremental broadcasting
- Remapping k-v pairs
  - Traditional DHTs: rehash all affected pairs
  - ZHT: move the whole partition (HEC has a fast local network!)
Consistency
- Updating membership tables
  - Planned node joins and leaves: strong consistency
  - Node failures: eventual consistency
- Updating replicas: configurable
  - Strong consistency: consistent, reliable
  - Eventual consistency: fast, high availability
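The configurable replica consistency above can be illustrated with a simple write path (a model, not ZHT's API; `ack` is a hypothetical stand-in for the per-replica RPC): strong consistency returns only after every replica acknowledges, while eventual consistency returns as soon as the first replica acknowledges and lets the rest catch up asynchronously.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class Consistency { Strong, Eventual };

// Model of a replicated write. `ack` stands in for the per-replica RPC
// and returns true if that replica accepted the update.
// Strong:   the write succeeds only if every replica acknowledges.
// Eventual: the write succeeds once the first replica acknowledges;
//           the remaining replicas are updated in the background.
inline bool replicatedPut(const std::vector<std::string>& replicas,
                          Consistency mode,
                          const std::function<bool(const std::string&)>& ack) {
    int acked = 0;
    for (const auto& r : replicas) {
        if (ack(r)) ++acked;
        if (mode == Consistency::Eventual && acked >= 1)
            return true;  // fast path: don't wait for the other replicas
    }
    return acked == static_cast<int>(replicas.size());  // strong: all or fail
}
```

The trade-off on the slide falls directly out of this loop: the strong path pays the latency of the slowest replica, while the eventual path pays only the fastest one.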
Persistence: NoVoHT
- NoVoHT: a persistent in-memory hash map
- Append operation
- Live migration
[Figure: latency (microseconds, roughly 0 to 20) vs. scale (1, 10, 100 million key/value pairs) for NoVoHT, NoVoHT (no persistence), KyotoCabinet, BerkeleyDB, and unordered_map.]
Failure handling
- Insert and append: send the request to the next replica; mark that record as the primary copy
- Lookup: get from the next available replica
- Remove: mark the record on all replicas
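The write-side failover rule above can be sketched as follows (a sketch, not ZHT's actual code; `send` is a hypothetical stand-in for the real RPC): walk the replica list in order and promote the first replica that accepts the write to primary copy.

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of insert/append failure handling: try replicas in order and
// mark the first one that accepts the write as holding the primary copy.
struct Replica {
    std::string addr;
    bool primary = false;
};

// `send` returns false when the target node is unreachable.
inline bool writeWithFailover(
    std::vector<Replica>& replicas,
    const std::string& key, const std::string& val,
    const std::function<bool(const Replica&, const std::string&,
                             const std::string&)>& send) {
    for (auto& r : replicas) {
        if (send(r, key, val)) {
            r.primary = true;  // this copy now serves as the primary
            return true;
        }
        // node unreachable: fall through to the next replica
    }
    return false;  // all replicas failed
}
```

Lookups follow the same pattern in reverse: read from the first replica that answers, so a single failed node never blocks the operation.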
Evaluation: test beds
- IBM Blue Gene/P supercomputer: up to 8,192 nodes, 32,768 instances deployed
- Commodity cluster: up to 64 nodes
- Amazon EC2: m1.medium and cc2.8xlarge; 96 VMs, 768 ZHT instances deployed
Latency on BG/P
[Figure: latency (ms, 0 to 2.5) vs. number of nodes, for TCP without connection caching, TCP with connection caching, UDP, and Memcached.]
Latency distribution (microseconds)

| Scale | 75% | 90%  | 95%  | 99%  |
|-------|-----|------|------|------|
| 64    | 713 | 853  | 961  | 1259 |
| 256   | 755 | 933  | 1097 | 1848 |
| 1024  | 820 | 1053 | 1289 | 3105 |
Throughput on BG/P
[Figure: throughput (ops/s, log scale 1,000 to 10,000,000) vs. scale (# of nodes, 1 to 8,192), for TCP without connection caching, ZHT with TCP connection caching, UDP non-blocking, and Memcached.]
Aggregated throughput on BG/P
[Figure: throughput (ops/s, 0 to 18,000,000) vs. number of nodes (1 to 8,192), with 1, 2, 4, and 8 instances per node.]
Latency on commodity cluster
[Figure: latency (ms, 0 to 3) vs. scale (# of nodes, 1 to 64), for ZHT, Cassandra, and Memcached.]
ZHT on cloud: latency
[Figure: average latency (microseconds, 0 to 14,000) vs. node count (1 to 96), for ZHT on m1.medium instances (1/node), ZHT on cc2.8xlarge instances (8/node), and DynamoDB.]
ZHT on cloud: latency distribution (microseconds)

DynamoDB (8 clients/instance):

| Scale | 75%   | 90%   | 95%   | 99%   | Avg   | Throughput |
|-------|-------|-------|-------|-------|-------|------------|
| 8     | 11942 | 13794 | 20491 | 35358 | 12169 | 83.39      |
| 32    | 10081 | 11324 | 12448 | 34173 | 9515  | 3363.11    |
| 128   | 10735 | 12128 | 16091 | 37009 | 11104 | 115275     |
| 512   | 9942  | 13664 | 30960 | 38077 | 28488 | ERROR      |

ZHT on cc2.8xlarge instances (8 server-client pairs/instance):

| Scale | 75% | 90% | 95% | 99%  | Avg | Throughput |
|-------|-----|-----|-----|------|-----|------------|
| 8     | 186 | 199 | 214 | 260  | 172 | 46421      |
| 32    | 509 | 603 | 681 | 1114 | 426 | 75080      |
| 128   | 588 | 717 | 844 | 2071 | 542 | 236065     |
| 512   | 574 | 708 | 865 | 3568 | 608 | 841040     |

[Figure: latency CDFs (0.9 and 0.99 percentiles marked) for DynamoDB read, DynamoDB write, and ZHT at 4 to 64 nodes.]
ZHT on cloud: throughput
[Figure: aggregated throughput (ops/s, log scale 10 to 10,000,000) and hourly cost in US dollars (0 to 25) vs. node count (1 to 96), for ZHT cost on m1, ZHT cost on cc2, and DynamoDB cost (10K ops/s provision).]
Amortized cost
[Figure: hourly cost for 1K ops/s throughput, in US dollars (log scale 0.01 to 10), vs. node count (2 to 96), for ZHT on m1.medium instances (1/node).]
Applications
- FusionFS: a distributed file system; metadata is stored in ZHT
- IStore: an information dispersal storage system; metadata is stored in ZHT
- MATRIX: a distributed many-task computing execution framework; ZHT is used to submit tasks and monitor task execution status
FusionFS result: concurrent file creates
[Figure: time per operation (ms, log scale 1 to 1,000) vs. number of nodes (1 to 512), for FusionFS and GPFS.]
IStore results
[Figure: throughput (chunks/sec, 0 to 600) vs. scale (8, 16, 32 nodes), for chunk sizes 1GB, 100MB, 10MB, 1MB, 100KB, and 10KB.]
MATRIX results
[Figure: throughput (tasks/sec, 0 to 6,000) vs. number of processors (1 to 10,000), for Falkon (Linux cluster, C), Falkon (SiCortex), Falkon (BG/P), Falkon (Linux cluster, Java), and MATRIX (BG/P).]
Future work
- Larger scale
- Active failure detection and notification
- Spanning-tree communication
- Network topology-aware routing
- Fully synchronized replicas and membership: Paxos protocol
- Support for more protocols (UDT, MPI, ...)
- Many optimizations
Conclusion
ZHT: a distributed key-value store that is
- Light-weight
- High-performance
- Scalable
- Dynamic
- Fault tolerant
- Versatile: works on clusters, clouds, and supercomputers