High Performance Ceph Jun Park (Adobe) and Dan Ferber (Intel)

Transcript
Page 1:

High Performance Ceph
Jun Park (Adobe) and Dan Ferber (Intel)

Page 2:

Caveats

§ Relative meaning of “high performance”
§ Still on a long journey; evolving
§ Not large scale, unlike CERN
§ Possibly opinionated in some cases


Page 3:

OpenStack Core Services

[Diagram: OpenStack core services with Ceph as the storage backend: Neutron (networking), Compute, Murano (App Catalog), Heat (Orchestration)]

Page 4:

How To Evaluate Storage?

[Diagram: four axes for evaluating storage: Capacity, IOPS (Bandwidth), Durability, Availability]

Typically, IOPS is the first bottleneck you hit with HDDs.

* IOPS: I/O operations per second, regardless of block size

Page 5:

Path For IOPS

[Diagram: I/O path from VM to Compute Host, over the Network, to the Physical Store (HDD or SSD)]

IOPS (almost interchangeable with bandwidth)

Bandwidth (Throughput) = IOPS x Block Size
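A quick worked example of that formula (illustrative numbers, not from the deck): at 8,000 IOPS with a 128 KB block size, throughput = 8,000 x 128 KB = 1,024,000 KB/s, i.e. about 1,000 MB/s.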

Page 6:

Relationship Between IOPS and Bandwidth

[Charts: IOPS vs. block size and Bandwidth vs. block size, for block sizes from 4 KB and 8 KB up to 4096 KB]

Bandwidth (Throughput) = IOPS x Block Size

In some cases the IOPS curve stays flat as block size grows; then doubling the block size doubles the bandwidth.
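To make the doubling concrete (illustrative numbers, not from the deck): if the cluster sustains a flat 20,000 IOPS at both 4 KB and 8 KB block sizes, write bandwidth goes from 20,000 x 4 KB ≈ 78 MB/s to 20,000 x 8 KB ≈ 156 MB/s.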

Page 7:

Ceph With Replication

§ Distribution algorithm (CRUSH bucket algorithm): Straw -> Tree (see the sketch below)

[Diagram: 3 replicas placed on data nodes in separate racks (Rack1 through Rack4)]

16 OSDs x 4 racks x 4 data nodes x 2TB / 3 replicas ≈ 170 TB (effective disk capacity for users)
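The deck does not show the CRUSH map itself; as a rough sketch of the two points above (tree buckets rather than straw, and replicas separated by rack), a decompiled CRUSH map fragment could look roughly like this, where the bucket names, IDs, and weights are illustrative assumptions:

rack rack1 {
    id -2
    alg tree                             # bucket algorithm changed from straw to tree
    hash 0
    item datanode1 weight 32.000         # 16 OSDs x 2TB per data node
    item datanode2 weight 32.000
    item datanode3 weight 32.000
    item datanode4 weight 32.000
}

rule replicated_by_rack {
    id 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step chooseleaf firstn 0 type rack   # each of the 3 replicas lands in a different rack
    step emit
}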

Page 8:

Ceph Architecture

[Diagram: Compute nodes, Ceph Monitor nodes, and Ceph Data nodes, each with 2 x 10G links; VLAN100 carries the Ceph public network, VLAN200 the Ceph cluster network]
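The two VLANs map onto Ceph's public and cluster networks, which are set in ceph.conf. A minimal sketch, with the subnets as made-up placeholders for the VLAN100/VLAN200 ranges:

[global]
public network  = 10.0.100.0/24    # VLAN100: client and monitor traffic
cluster network = 10.0.200.0/24    # VLAN200: replication, recovery, and backfill traffic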

Page 9:

Ceph Clients


Page 10:

Write Operation With Journaling

[Diagram: write path with journaling: journals on SSDs in front of the data drives, e.g. SAS 2TB drives]

<Ceph1>:~# lscpu | egrep 'Thread|Core|Socket|^CPU\('
CPU(s):               48
Thread(s) per core:   2
Core(s) per socket:   12
Socket(s):            2
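The journal placement shown in the diagram is a FileStore-era setup; the journal size (and, per OSD, its location) is configured in ceph.conf. A minimal sketch, where the 10 GB size and partition path are illustrative assumptions:

[osd]
osd journal size = 10240           # journal size in MB, carved out of an SSD partition
# per OSD, the journal is a symlink to that partition, e.g.:
# /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1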

Page 11:

Consistent Hashing In Ceph

§ Advantages
  § No need to store placement metadata explicitly
  § Fast
§ Disadvantages
  § Overhead of rebalancing
  § Operational difficulties in dealing with edge cases
§ Also used by, e.g., Swift, Cassandra, Amazon Dynamo, and so on


Page 12:

Network Bandwidth Impact (rados bench write)

[Chart: rados bench write bandwidth (MB/s): 10 Gbps interface 1165, 20 Gbps interface 1953]

Same as in our lab with 20G (due to the smaller number of data nodes)
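The deck names the tool but not the invocation; a typical write run of the kind charted above looks like the following, where the pool name, duration, and thread count are assumptions:

<Ceph1>:~# rados bench -p testpool 60 write -t 16 --no-cleanup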

Page 13:

Different I/O Patterns With 128K Block Size

[Chart: IOPS and bandwidth (MB/s) for RandWrite, SeqWrite, SeqRead, and W25R75 (25% write / 75% read mix) at a 128 KB block size. Value pairs shown, in the order listed on the slide: 604 MB/s / 4828 IOPS, 364 MB/s / 2913 IOPS, 978 MB/s / 7827 IOPS, 233 MB/s / 1864 IOPS.]

Block size: 128 KB
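The deck does not say which load generator produced these patterns; a sketch of reproducing two of them with fio inside a VM follows, where the target file, queue depth, and runtime are assumptions:

# 128 KB random write (swap --rw for write or read to get the sequential patterns)
fio --name=randwrite --filename=/mnt/test/fio.dat --size=10g --direct=1 \
    --ioengine=libaio --bs=128k --rw=randwrite --iodepth=32 --runtime=60 --time_based
# W25R75: 25% write / 75% read mix
fio --name=w25r75 --filename=/mnt/test/fio.dat --size=10g --direct=1 \
    --ioengine=libaio --bs=128k --rw=randrw --rwmixread=75 --iodepth=32 --runtime=60 --time_based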

Page 14:

Random Writes of 3 VMs on the Same Compute Host

[Chart: bandwidth (MB/s): VM1 318, VM2 313, VM3 325]
[Chart: IOPS: VM1 2547, VM2 2505, VM3 2600]

• Block size: 64 KB
• System-wide max performance with other traffic: 40,000 IOPS max, 1367 MB/s write

Page 15:

Pain Points In Production

Computes vs. data nodes ratio?

Upgrading? E.g., DecaPod, separated from OpenStack

Operational overheads? E.g., adding more data nodes -> creating internal (rebalance) traffic; deep scrubbing

Pinpointing bottlenecks?

QoS?

Page 16:

Pleasure Points

§ Generic architecture
  § With new hardware such as NVMe SSDs, performance improves immediately
§ Various use cases
§ Good community
  § Open & stable
§ Works well with OpenStack
§ Truly scale-out
  § High performance at low cost

Page 17:

Future Of Ceph


Page 18:

Next Generation Ceph: BlueStore

[Diagram: BlueStore writing directly to the raw devices: SSDs and data drives, e.g. SAS 2TB drives]

BlueStore: an optimized key-value store for metadata, no POSIX filesystem layer, etc.
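A minimal sketch of selecting BlueStore for newly created OSDs via ceph.conf (Luminous and later; the option itself is a standard Ceph setting, everything else about the deployment is assumed):

[osd]
osd objectstore = bluestore        # new OSDs are created with BlueStore instead of FileStore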

Page 19:

Next Steps?

§ NVMe (Non-Volatile Memory Express) SSD
§ Ceph cache tiering (see the command sketch below)
§ RDMA (Remote Direct Memory Access)
§ BlueStore
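The deck lists cache tiering only as a candidate next step; for reference, attaching a cache tier in front of a backing pool uses the standard ceph CLI, roughly as follows (pool names are placeholders):

ceph osd tier add <backing-pool> <cache-pool>
ceph osd tier cache-mode <cache-pool> writeback
ceph osd tier set-overlay <backing-pool> <cache-pool>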


Page 20:

ONS ‘15, Amin Vahdat at Google
