Page 1: ZHT

1

Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Zhao Zhang, Ioan Raicu
Illinois Institute of Technology, Chicago, U.S.A.

ZHT: A Fast, Reliable and Scalable Zero-hop Distributed Hash Table

Page 2: ZHT

A supercomputer is a device for turning compute-bound problems into I/O-bound problems.

Ken Batcher

Page 3: ZHT

3

Big problem: file system scalability

- Parallel file systems (GPFS, PVFS, Lustre)
  - Computing resources separated from storage
  - Centralized metadata management
- Distributed file systems (GFS, HDFS)
  - Special-purpose designs (MapReduce, etc.)
  - Centralized metadata management

Page 4: ZHT

4

The bottleneck of file systems: metadata

Concurrent file creates
[Figure: time per operation (ms, 1 to 100,000, log scale) vs. scale (1 to 16,384 cores) for file creates on GPFS, many directories vs. one directory]

Page 5: ZHT

5

Proposed work

- A distributed hash table (DHT) for HEC
- A building block for high-performance distributed systems
- Performance: latency, throughput
- Scalability
- Reliability

Page 6: ZHT

6

Related work: distributed hash tables

- Many DHTs: Chord, Kademlia, Pastry, Cassandra, C-MPI, Memcached, Dynamo ...
- Why another?

Name        Impl.   Routing time   Persistence   Dynamic membership   Append operation
Cassandra   Java    Log(N)         Yes           Yes                  No
C-MPI       C       Log(N)         No            No                   No
Dynamo      Java    0 to Log(N)    Yes           Yes                  No
Memcached   C       0              No            No                   No
ZHT         C++     0 to 2         Yes           Yes                  Yes

Page 7: ZHT

7

Zero-hop hash mapping

[Diagram: clients 1 … n hash each key directly to one of nodes 1 … n; for example, key j maps to a node holding value j and key k to a node holding value k, with each value stored on replicas 1, 2, and 3]
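To make the picture concrete, here is a minimal C++ sketch of zero-hop key-to-node mapping. The names (hash64, primary_node, replica_nodes) and the FNV-1a hash are illustrative assumptions, not ZHT's actual API: the point is that a client hashes the key once and contacts the owning node directly, with the replica set taken from the following nodes.

```cpp
// Minimal sketch of zero-hop mapping (hypothetical names, not ZHT's API):
// one hash, no routing hops; replicas are the next nodes after the owner.
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a, standing in for whatever hash function ZHT uses.
uint64_t hash64(const std::string& key) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Primary owner of a key: one hash, zero hops.
size_t primary_node(const std::string& key, size_t num_nodes) {
    return static_cast<size_t>(hash64(key) % num_nodes);
}

// Replica set: the primary plus the following (replication - 1) nodes.
std::vector<size_t> replica_nodes(const std::string& key,
                                  size_t num_nodes, size_t replication) {
    std::vector<size_t> out;
    const size_t p = primary_node(key, num_nodes);
    for (size_t i = 0; i < replication && i < num_nodes; ++i)
        out.push_back((p + i) % num_nodes);
    return out;
}
```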

Page 8: ZHT

8

2-layer hashing

Page 9: ZHT

9

Architecture and terms

- Name space: 2^64
- Physical node
- Manager
- ZHT instance
- Partitions: n (fixed), where n = max(k), the largest number of nodes the system is expected to reach

[Diagram: a physical node runs a manager and one or more ZHT instances; each instance serves one or more partitions and responds to requests; the manager handles membership updates and broadcasts]

Membership table columns: UUID (ZHT instance), key, IP, port, capacity, workload
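The 2-layer hashing of the previous page combined with the membership table above could be read as follows; the sketch uses hypothetical type and field names, not ZHT's actual code. Layer 1 hashes a key from the 2^64 name space into one of n fixed partitions (so keys never need rehashing); layer 2 maps the partition to the instance currently serving it via the membership table.

```cpp
// Possible reading of the 2-layer lookup (hypothetical types, not ZHT's
// actual code). Layer 1: key -> one of n fixed partitions. Layer 2:
// partition -> serving instance, through the membership table.
#include <cstdint>
#include <string>
#include <vector>

struct MemberEntry {         // one row of the membership table
    std::string uuid;        // UUID of the ZHT instance
    uint64_t    key = 0;     // hash-space key associated with this row
    std::string ip;
    uint16_t    port = 0;
    uint64_t    capacity = 0;
    uint64_t    workload = 0;
};

struct Membership {
    std::vector<MemberEntry> instances;  // live ZHT instances
    std::vector<size_t>      owner;      // partition index -> instance index
};

// Layer 1: hash from the 2^64 name space -> partition index (n is fixed).
size_t key_to_partition(uint64_t key_hash, size_t n_partitions) {
    return static_cast<size_t>(key_hash % n_partitions);
}

// Layer 2: partition index -> the instance that currently serves it.
const MemberEntry& locate(uint64_t key_hash, const Membership& m) {
    const size_t part = key_to_partition(key_hash, m.owner.size());
    return m.instances[m.owner[part]];
}
```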

Page 10: ZHT

10

How many partitions per node can we do?

[Figure: average latency (ms, roughly 0.6 to 0.78) vs. number of partitions per instance (1 to 1,000)]

Page 11: ZHT

11

Membership management

- Static membership: Memcached, ZHT
- Dynamic membership
  - Logarithmic routing: most DHTs
  - Constant routing: ZHT

Page 12: ZHT

12

Membership management

- Update membership: incremental broadcasting
- Remap key-value pairs
  - Traditional DHTs: rehash all affected pairs
  - ZHT: move the whole partition (HEC has a fast local network!); see the sketch below
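A rough illustration of the "move the whole partition" point, with illustrative names rather than ZHT's code: a membership change only reassigns partition ownership in the table and ships partition data in bulk; no individual key is rehashed.

```cpp
// Illustrative sketch of whole-partition migration on a membership change
// (not ZHT's actual code): ownership is reassigned in the partition-to-node
// table and data moves in bulk; key-value pairs are never rehashed.
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

using Partition = std::map<std::string, std::string>;      // k-v pairs
using NodeStore = std::unordered_map<size_t, Partition>;   // partition id -> data

// When a node leaves, ship each of its partitions, as a whole, to the
// successor and record the new owner; the table change is then broadcast
// incrementally to all members.
void migrate_partitions(size_t leaving, size_t successor,
                        std::vector<NodeStore>& nodes,
                        std::unordered_map<size_t, size_t>& owner) {
    for (auto& [part_id, data] : nodes[leaving]) {
        nodes[successor][part_id] = std::move(data);  // bulk transfer over the fast network
        owner[part_id] = successor;                   // ownership update in the table
    }
    nodes[leaving].clear();
}
```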

Page 13: ZHT

13

Consistency

- Updating membership tables
  - Planned node joins and leaves: strong consistency
  - Node failures: eventual consistency
- Updating replicas: configurable (see the sketch below)
  - Strong consistency: consistent, reliable
  - Eventual consistency: fast, available
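The configurable replica consistency can be pictured with the generic sketch below; send_to_replica() is a hypothetical stand-in for the per-replica RPC, and this illustrates the strong-vs-eventual tradeoff rather than ZHT's actual replication protocol.

```cpp
// Generic illustration of the two configurable modes (not ZHT's actual
// protocol): strong consistency blocks until every replica acknowledges;
// eventual consistency returns once the primary has the update and lets
// the other replicas catch up in the background.
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

enum class Consistency { Strong, Eventual };

// Hypothetical stand-in for the RPC that updates one replica; returns
// true when that replica acknowledges.
bool send_to_replica(size_t replica, const std::string& key,
                     const std::string& value) {
    (void)replica; (void)key; (void)value;
    return true;
}

bool replicated_put(const std::vector<size_t>& replicas,
                    const std::string& key, const std::string& value,
                    Consistency mode) {
    if (replicas.empty() || !send_to_replica(replicas[0], key, value))
        return false;                               // primary must succeed
    if (mode == Consistency::Eventual) {
        for (size_t i = 1; i < replicas.size(); ++i)
            std::thread(send_to_replica, replicas[i], key, value).detach();
        return true;                                // fast: don't wait for the rest
    }
    bool ok = true;                                 // strong: wait for every replica
    for (size_t i = 1; i < replicas.size(); ++i)
        ok = send_to_replica(replicas[i], key, value) && ok;
    return ok;
}
```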

Page 14: ZHT

14

Persistence: NoVoHT

- NoVoHT: a persistent in-memory hash map
- Append operation
- Live migration

[Figure: latency (microseconds, 0 to 20) vs. scale (1 million, 10 million, 100 million key/value pairs) for NoVoHT, NoVoHT without persistence, KyotoCabinet, BerkeleyDB, and unordered_map]
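The bullets above describe NoVoHT as a persistent in-memory hash map with an append operation. The sketch below shows what such a structure can look like in C++; it is an assumption-laden illustration (TinyKV is a made-up name), not NoVoHT's actual implementation: values live in memory, and every mutation is also written to an append-only log so the map can be rebuilt after a restart.

```cpp
// Sketch of a persistent in-memory hash map with an append operation,
// in the spirit of NoVoHT (not its actual implementation).
#include <fstream>
#include <string>
#include <unordered_map>

class TinyKV {
public:
    explicit TinyKV(const std::string& log_path)
        : log_(log_path, std::ios::app) {}   // (log replay on startup omitted)

    void put(const std::string& k, const std::string& v) {
        map_[k] = v;
        log_ << "PUT\t" << k << '\t' << v << '\n';
        log_.flush();
    }

    // Append concatenates to the existing value instead of replacing it,
    // which spares the client a read-modify-write round trip.
    void append(const std::string& k, const std::string& v) {
        map_[k] += v;
        log_ << "APP\t" << k << '\t' << v << '\n';
        log_.flush();
    }

    bool get(const std::string& k, std::string& out) const {
        auto it = map_.find(k);
        if (it == map_.end()) return false;
        out = it->second;
        return true;
    }

private:
    std::unordered_map<std::string, std::string> map_;
    std::ofstream log_;
};
```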

Page 15: ZHT

15

Failure handling

- Insert and append: send to the next replica; mark this record as the primary copy
- Lookup: get from the next available replica
- Remove: mark the record on all replicas
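A client-side sketch of these rules follows; try_put and try_get are hypothetical stubs standing in for the per-replica RPCs, not ZHT's actual code. Writes fall through to the next replica (and flag the record stored there as the primary copy); lookups read from the first replica that answers.

```cpp
// Client-side sketch of the failure-handling rules above (hypothetical
// helpers, not ZHT's actual code). The stubs stand in for per-replica RPCs.
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

bool try_put(size_t /*replica*/, const std::string& /*key*/,
             const std::string& /*value*/, bool /*mark_primary*/) {
    return true;                                 // pretend the replica is up
}
std::optional<std::string> try_get(size_t /*replica*/, const std::string& /*key*/) {
    return std::nullopt;                         // pretend the key is absent
}

// Insert/append: fall through to the next replica on failure; a record
// stored on a later replica is marked as the primary copy.
bool fault_tolerant_put(const std::vector<size_t>& replicas,
                        const std::string& key, const std::string& value) {
    for (size_t i = 0; i < replicas.size(); ++i) {
        const bool mark_primary = (i > 0);
        if (try_put(replicas[i], key, value, mark_primary))
            return true;
    }
    return false;                                // every replica was unreachable
}

// Lookup: return the value from the next available replica.
std::optional<std::string> fault_tolerant_get(const std::vector<size_t>& replicas,
                                              const std::string& key) {
    for (size_t r : replicas)
        if (auto v = try_get(r, key))
            return v;
    return std::nullopt;
}
```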

Page 16: ZHT

16

Evaluation: test beds

- IBM Blue Gene/P supercomputer
  - Up to 8,192 nodes, 32,768 instances deployed
- Commodity cluster
  - Up to 64 nodes
- Amazon EC2
  - m1.medium and cc2.8xlarge instances
  - 96 VMs, 768 ZHT instances deployed

Page 17: ZHT

17

Latency on BG/P

[Figure: latency (ms, 0 to 2.5) vs. number of nodes for TCP without connection caching, TCP with connection caching, UDP, and Memcached]

Page 18: ZHT

18

Latency distribution (values in microseconds)

Scale   75%    90%    95%    99%
64      713    853    961    1259
256     755    933    1097   1848
1024    820    1053   1289   3105

Page 19: ZHT

19

Throughput on BG/P

[Figure: throughput (ops/s, 1,000 to 10,000,000, log scale) vs. scale (1 to 8,192 nodes) for TCP with no connection caching, ZHT with TCP connection caching, non-blocking UDP, and Memcached]

Page 20: ZHT

20

Aggregated throughput on BG/P

[Figure: aggregated throughput (ops/s, 0 to 18,000,000) vs. number of nodes (1 to 8,192) with 1, 2, 4, and 8 instances per node]

Page 21: ZHT

21

Latency on commodity cluster

[Figure: latency (ms, 0 to 3) vs. scale (1 to 64 nodes) for ZHT, Cassandra, and Memcached]

Page 22: ZHT

22

ZHT on cloud: latency

[Figure: average latency (microseconds, 0 to 14,000) vs. number of nodes (1 to 96) for ZHT on m1.medium instances (1/node), ZHT on cc2.8xlarge instances (8/node), and DynamoDB]

Page 23: ZHT

ZHT on cloud: latency distribution

DynamoDB (8 clients per instance), latency in microseconds:

Scale   75%     90%     95%     99%     Avg     Throughput
8       11942   13794   20491   35358   12169   83.39
32      10081   11324   12448   34173   9515    3363.11
128     10735   12128   16091   37009   11104   11527
512     9942    13664   30960   38077   28488   ERROR

ZHT on cc2.8xlarge instances (8 server-client pairs per instance), latency in microseconds:

Scale   75%     90%     95%     99%     Avg     Throughput
8       186     199     214     260     172     46421
32      509     603     681     1114    426     75080
128     588     717     844     2071    542     236065
512     574     708     865     3568    608     841040

[Figure: latency CDF (90th and 99th percentiles marked) for DynamoDB reads, DynamoDB writes, and ZHT at 4 to 64 nodes]

Page 24: ZHT

24

ZHT on cloud: throughput

[Figure: aggregated throughput (ops/s, 10 to 10,000,000, log scale) and hourly cost (US dollars, 0 to 25) vs. number of nodes (1 to 96); cost series: ZHT on m1, ZHT on cc2, and DynamoDB provisioned at 10K ops/s]

Page 25: ZHT

25

Amortized cost

[Figure: hourly cost for 1K ops/s of throughput (US dollars, 0.01 to 10, log scale) vs. number of nodes (2 to 96); series include ZHT on m1.medium instances (1/node)]

Page 26: ZHT

26

Applications

- FusionFS: a distributed file system
  - Metadata stored in ZHT
- IStore: an information dispersal storage system
  - Metadata stored in ZHT
- MATRIX: a distributed many-task computing execution framework
  - ZHT is used to submit tasks and monitor task execution status

Page 27: ZHT

27

FusionFS result: Concurrent File Creates

[Figure: time per operation (ms, 1 to 1,000, log scale) vs. number of nodes (1 to 512) for FusionFS and GPFS]

Page 28: ZHT

28

IStore results

[Figure: throughput (chunks/sec, 0 to 600) vs. scale (8, 16, 32 nodes) for sizes of 10 KB, 100 KB, 1 MB, 10 MB, 100 MB, and 1 GB]

Page 29: ZHT

29

MATRIX results

[Figure: throughput (tasks/sec, 0 to 6,000) vs. number of processors (1 to 10,000) for Falkon (Linux cluster, C), Falkon (SiCortex), Falkon (BG/P), Falkon (Linux cluster, Java), and MATRIX (BG/P)]

Page 30: ZHT

30

Future work

- Larger scale
- Active failure detection and notification
- Spanning tree communication
- Network topology-aware routing
- Fully synchronized replicas and membership: Paxos protocol
- Support for more protocols (UDT, MPI, …)
- Many optimizations

Page 31: ZHT

31

Conclusion

ZHT: a distributed key-value store
- Lightweight
- High performance
- Scalable
- Dynamic
- Fault tolerant
- Versatile: works on clusters, clouds, and supercomputers

Page 32: ZHT

32

Questions?

Tonglin Li
[email protected]
http://datasys.cs.iit.edu/projects/ZHT/