
Deploying flash storage for Ceph without compromising performance

Aug 10, 2015

Transcript
Page 1: Deploying flash storage for Ceph without compromising performance

Ceph Day LA – July 16, 2015

Deploying Flash Storage For Ceph

Page 2: Deploying flash storage for Ceph without compromising performance


Leading Supplier of End-to-End Interconnect Solutions

[Portfolio diagram: Virtual Protocol Interconnect spanning storage front/back-end, server/compute, switch/gateway, and Metro/WAN roles – 56G InfiniBand & FCoIB, 10/40/56GbE & FCoE; product lines include ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software.]

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Page 3: Deploying flash storage for Ceph without compromising performance


Scale-Out Architecture Requires A Fast Network

§ Scale-out grows capacity and performance in parallel
§ Requires a fast network for replication, sharing, and metadata (file)
  •  Throughput requires bandwidth
  •  IOPS requires low latency

§ Proven in HPC, storage appliances, cloud, and now… Ceph

Interconnect Capabilities Determine Scale Out Performance

Page 4: Deploying flash storage for Ceph without compromising performance


Solid State Storage Technology Evolution – Lower Latency

Advanced Networking and Protocol Offloads Required to Match Storage Media Performance

[Chart: access time (micro-sec, roughly 0.1 to 1000) by storage media technology – hard drives, NAND flash, next-gen NVM – and the share of networked-storage latency contributed by the storage media versus the network and the storage protocol (software); as media latency drops, network hardware and software become the dominant component.]

Page 5: Deploying flash storage for Ceph without compromising performance


Ceph and Networks

§ High-performance networks enable maximum cluster availability
  •  Clients, OSDs, monitors, and metadata servers communicate over multiple network layers
  •  Real-time requirements for heartbeat, replication, recovery, and rebalancing

§ Cluster ("backend") network performance dictates the cluster's performance and scalability (a minimal ceph.conf sketch of the two networks follows below)
  •  "Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster" (Ceph documentation)
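
To make the two network layers concrete, here is a minimal ceph.conf sketch separating client-facing and replication traffic; the subnets are placeholder values, not taken from the presentation.

    # ceph.conf (sketch) – split client traffic from replication traffic
    [global]
    public network  = 192.168.10.0/24   # clients, monitors, MDS traffic (placeholder subnet)
    cluster network = 192.168.20.0/24   # OSD replication, recovery, heartbeats (placeholder subnet)

With this split, the heavy OSD-to-OSD traffic described above stays on the backend fabric while client I/O uses the public network.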

Page 6: Deploying flash storage for Ceph without compromising performance


Ceph Deployment Using 10GbE and 40GbE

§ Cluster (private) network @ 40/56GbE
  •  Smooth HA, unblocked heartbeats, efficient data balancing

§ Throughput clients @ 40/56GbE
  •  Guarantees line rate for high ingress/egress clients

§ IOPS clients @ 10GbE or 40/56GbE
  •  100K+ IOPS per client @ 4K blocks

2.5x higher throughput and 15% higher IOPS with 40Gb Ethernet vs. 10GbE! (http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)

Throughput testing: fio benchmark, 8MB block, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. IOPS testing: fio benchmark, 4KB block, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. (A hedged fio job sketch follows below.)
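
As a rough illustration of the test parameters above, the fio job below approximates the throughput run (8MB blocks, 20GB per job, 128 parallel jobs against a mapped kernel RBD device); the device path, I/O engine, and read/write mix are assumptions, since the slide does not specify them.

    # ceph-throughput.fio – sketch of the 8MB-block throughput test
    [global]
    ioengine=libaio        # assumption: async I/O against the kernel RBD block device
    direct=1
    bs=8m                  # 8MB blocks (use bs=4k for the IOPS test)
    size=20g               # 20GB per job
    numjobs=128            # 128 parallel jobs
    rw=read                # assumption: read workload for the throughput numbers
    group_reporting=1

    [rbd-throughput]
    filename=/dev/rbd0     # placeholder: device created with 'rbd map'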

[Diagram: test topology – Ceph nodes (monitors, OSDs, MDS) and an admin node on a 40GbE cluster network; client nodes on a 10GbE/40GbE public network.]

Page 7: Deploying flash storage for Ceph without compromising performance


Ceph Is Accelerated by A Faster Network – Optimized at 56GbE

Random read throughput with fio_rbd:

                                   40 Gb/s (MB/s)   56 Gb/s (MB/s)
  Ceph fio_rbd 64K random read         4,300            5,475
  Ceph fio_rbd 256K random read        4,350            5,495

27% More Throughput On Random Reads

Page 8: Deploying flash storage for Ceph without compromising performance


Ceph Reference Architectures Using Disk

Page 9: Deploying flash storage for Ceph without compromising performance


Optimizing Ceph For Throughput and Price/Throughput

§ Red Hat, Supermicro, Seagate, Mellanox, Intel
§ Extensive performance testing: disk, flash, network, CPU, OS, Ceph
§ Reference architecture published soon

[Diagrams: 10GbE network setup and 40GbE network setup]

Page 10: Deploying flash storage for Ceph without compromising performance


Testing 12 to 72 Disks Per Node, 2x10GbE vs. 1x40GbE

§ Key Test Results
  •  More disks = more MB/s per server, less per OSD
  •  More flash is faster (usually)
  •  All-flash: 2 SSDs as fast as many disks

§ 40GbE Advantages
  •  Up to 2x read throughput per server
  •  Up to 50% decrease in latency
  •  Easier than bonding multiple 10GbE links

Page 11: Deploying flash storage for Ceph without compromising performance


Cisco UCS Reference Architecture for Ceph

§ Cisco Test Setup
  •  UCS C3160 servers, Nexus 9396PX switch
  •  28 or 56 6TB SAS disks; replication or erasure coding (EC)
  •  4x 10GbE per server, bonded

§ Results
  •  One node read: 3,700 MB/s replicated, 1,100 MB/s EC
  •  One node write: 860 MB/s replicated, 1,050 MB/s EC
  •  3 nodes read: 9,700 MB/s replicated, 7,700 MB/s EC
  •  8 nodes read: 20,000 MB/s replicated, 10,000 MB/s EC

Page 12: Deploying flash storage for Ceph without compromising performance


Optimizing Ceph for Flash

Page 13: Deploying flash storage for Ceph without compromising performance


Ceph Flash Optimization

Highlights Compared to Stock Ceph
  •  Read performance up to 8x better
  •  Write performance up to 2x better with tuning

Optimizations
  •  All-flash storage for OSDs
  •  Enhanced parallelism and lock optimization
  •  Optimization for reads from flash
  •  Improvements to Ceph messenger

Test Configuration
  •  InfiniFlash storage with IFOS 1.0 EAP3
  •  Up to 4 RBDs
  •  2 Ceph OSD nodes, connected to InfiniFlash
  •  40GbE NICs from Mellanox

SanDisk InfiniFlash

Page 14: Deploying flash storage for Ceph without compromising performance


SanDisk InfiniFlash, Maximizing Ceph Random Read IOPS

[Charts: 8KB random read, QD=16, at 25%/50%/75%/100% read mixes – random read IOPS (scale 0–200,000) and random read latency in ms (scale 0–14), Stock Ceph vs. IF-500.]

Page 15: Deploying flash storage for Ceph without compromising performance


SanDisk Ceph Optimizations for Flash

§ SanDisk InfiniFlash setup: Dell R720 OSD servers, 2 OSD nodes, 1 InfiniFlash (64x 8TB = 512TB), 40GbE cluster network, 71.6 Gb/s total read throughput
§ Scalable Informatics setup: SI Unison OSD servers, 2 OSD nodes, 24 SATA SSDs per node, 100GbE cluster network, 70 Gb/s total read throughput
§ Supermicro setup: Supermicro OSD servers, 3 OSD nodes, 2x PCIe SSDs per node, 40GbE cluster network, 43 Gb/s total read throughput
§ Mellanox setup: Supermicro OSD servers, 2 OSD nodes, 12x SAS SSDs per node, 56GbE cluster network, 44 Gb/s total read throughput

Page 16: Deploying flash storage for Ceph without compromising performance


XioMessenger

Adding RDMA To Ceph

Page 17: Deploying flash storage for Ceph without compromising performance


RDMA Enables Efficient Data Movement

§ Hardware network acceleration → higher bandwidth, lower latency
§ Highest CPU efficiency → more CPU power to run applications

Efficient Data Movement With RDMA

Higher Bandwidth

Lower Latency

More CPU Power For Applications

Page 18: Deploying flash storage for Ceph without compromising performance


RDMA Enables Efficient Data Movement At 100Gb/s

§ Without RDMA
  •  5.7 GB/s throughput
  •  20–26% CPU utilization
  •  4 cores 100% consumed by moving data

§ With hardware RDMA
  •  11.1 GB/s throughput at half the latency
  •  13–14% CPU utilization
  •  More CPU power for applications, better ROI

[Chart: 100GbE with CPU onload vs. 100GbE with network offload]

CPU Onload Penalties
  •  Half the throughput
  •  Twice the latency
  •  Higher CPU consumption

2X Better Bandwidth

Half the Latency

33% Lower CPU

See the demo: https://www.youtube.com/watch?v=u8ZYhUjSUoI

Page 19: Deploying flash storage for Ceph without compromising performance


Adding RDMA to Ceph

§ RDMA beta included in Hammer
  •  Mellanox, Red Hat, CohortFS, and community collaboration
  •  Full RDMA expected in Infernalis

§ Refactoring of the Ceph messaging layer
  •  New RDMA messenger layer called XioMessenger
  •  New class hierarchy allowing multiple transports (the simple one is TCP)
  •  Async design that leverages Accelio
  •  Reduced locks; reduced number of threads

§ XioMessenger built on top of Accelio (RDMA abstraction layer)
  •  Integrated into all Ceph user-space components: daemons and clients
  •  Covers both the "public network" and the "cluster network" (a hedged ceph.conf sketch follows below)
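
Purely as a hedged sketch of how the experimental messenger was selected in XIO-enabled builds of that era (option names and defaults varied by build, so treat this as an assumption rather than a documented recipe):

    # ceph.conf (sketch) – experimental XioMessenger, Hammer-era build compiled with Accelio/XIO support
    [global]
    ms type = xio          # assumption: replaces the default 'simple' (TCP) messenger with XioMessenger

Stock packages of the time did not enable this path; it required a build with XIO support and RDMA-capable NICs on both ends.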

Page 20: Deploying flash storage for Ceph without compromising performance


Accelio, High-Performance Reliable Messaging and RPC Library

§ Open source!
  •  https://github.com/accelio/accelio/ && www.accelio.org

§ Faster RDMA integration into applications

§ Asynchronous

§ Maximizes message and CPU parallelism
  •  Enables >10GB/s from a single node
  •  Enables <10usec latency under load

§ In Giant and Hammer
  •  http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger

Page 21: Deploying flash storage for Ceph without compromising performance


Ceph 4KB Read IOPS: 40Gb TCP vs. 40Gb RDMA

[Chart: 4KB random read IOPS (in thousands, scale 0–450) for 2 OSDs/4 clients and 4 OSDs/4 clients, comparing 40Gb TCP and 40Gb RDMA; annotations record CPU usage per run of roughly 34–38 cores in the OSD nodes and 24–30 cores in the clients.]

Page 22: Deploying flash storage for Ceph without compromising performance


Ceph-Powered Solutions

Deployment Examples

Page 23: Deploying flash storage for Ceph without compromising performance


Ceph For Large Scale Storage – Fujitsu Eternus CD10000

§ Hyperscale Storage
  •  4 to 224 nodes
  •  Up to 56 PB raw capacity

§ Runs Ceph with Enhancements
  •  3 different storage nodes
  •  Object, block, and file storage

§ Mellanox InfiniBand Cluster Network
  •  40Gb InfiniBand cluster network
  •  10Gb Ethernet front-end network

Page 24: Deploying flash storage for Ceph without compromising performance


Media & Entertainment Storage – StorageFoundry Nautilus

§ Turnkey Object Storage
  •  Built on Ceph
  •  Pre-configured for rapid deployment
  •  Mellanox 10/40GbE networking

§ High-Capacity Configuration
  •  6–8TB helium-filled drives
  •  Up to 2PB in 18U

§ High-Performance Configuration
  •  Single-client read of 2.2 GB/s
  •  SSD caching + hard drives
  •  Supports Ethernet, IB, FC, FCoE front-end ports

§ More information: www.storagefoundry.net

Page 25: Deploying flash storage for Ceph without compromising performance


SanDisk InfiniFlash

§ Flash Storage System
  •  Announced March 2015
  •  512 TB (raw) in one 3U enclosure
  •  Tested with 40GbE networking

§ High Throughput
  •  8 SAS ports, up to 7GB/s
  •  Connects to 2 or 4 OSD nodes
  •  Up to 1M IOPS with two nodes

§ More information:
  •  http://bigdataflash.sandisk.com/infiniflash

Page 26: Deploying flash storage for Ceph without compromising performance


More Ceph Solutions

§ Cloud – OnyxCCS ElectraStack
  •  Turnkey IaaS
  •  Multi-tenant computing system
  •  5x faster node/data restoration
  •  https://www.onyxccs.com/products/8-series

§ Flextronics CloudLabs
  •  OpenStack on CloudX design
  •  2 SSDs + 20 HDDs per node
  •  Mix of 1Gb/40GbE network
  •  http://www.flextronics.com/

§ ISS Storage Supercore
  •  Healthcare solution
  •  82,000 IOPS on 512B reads
  •  74,000 IOPS on 4KB reads
  •  1.1GB/s on 256KB reads
  •  http://www.iss-integration.com/supercore.html

§ Scalable Informatics Unison
  •  High-availability cluster
  •  60 HDDs in 4U
  •  Tier 1 performance at archive cost
  •  https://scalableinformatics.com/unison.html

Page 27: Deploying flash storage for Ceph without compromising performance


Even More Ceph Solutions

§ Keeper Technology – keeperSAFE
  •  Ceph appliance
  •  For US Government
  •  File gateway for NFS, SMB, & StorNext
  •  Mellanox switches

§ Monash University – Melbourne, Australia
  •  3 Ceph clusters, >6PB total storage
  •  8, 17 (27), and 37 nodes
  •  OpenStack Cinder and S3/Swift object storage
  •  Mellanox networking, 10GbE nodes, 56GbE ISLs

Page 28: Deploying flash storage for Ceph without compromising performance


Summary

§ Ceph scalability and performance benefit from high-performance networks
  •  Especially with lots of disks

§ Ceph is being optimized for flash storage

§ End-to-end 40/56 Gb/s transport accelerates Ceph today
  •  100Gb/s testing has begun!
  •  Available in various Ceph solutions and appliances

§ RDMA is next to optimize flash performance (beta in Hammer)

Page 29: Deploying flash storage for Ceph without compromising performance

Thank You

Page 30: Deploying flash storage for Ceph without compromising performance


SanDisk IF-500 topology on a single 512 TB IF-100


•  IF-100 bandwidth is ~8.5 GB/s (with 6Gb SAS; 12Gb SAS is coming at end of year) and ~1.5M 4K IOPS
•  Ceph is very resource hungry, so at least 2 physical nodes are needed on top of the IF-100
•  All 8 ports of an HBA must be connected to saturate the IF-100 at larger block sizes

Page 31: Deploying flash storage for Ceph without compromising performance


SanDisk Ceph-InfiniFlash Setup Details


Performance config – IF-500 2-node cluster (32 drives shared to each OSD node)

OSD Node: 2 servers (Dell R720); 2x E5-2680 12C 2.8GHz; 4x 16GB RDIMM, dual rank x4 (64GB); 1x Mellanox ConnectX-3 dual-port 40GbE; 1x LSI 9207 HBA card
RBD Client: 4 servers (Dell R620); 1x E5-2680 10C 2.8GHz; 2x 16GB RDIMM, dual rank x4 (32GB); 1x Mellanox ConnectX-3 dual-port 40GbE
Storage: IF-100 connected to 64x 1YX2 Icechips in A2 topology; total storage = 64 x 8TB = 512TB
Network: 40G switch – N/A

OS Details
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card / driver: SAS2308 (9207), mpt2sas
Mellanox 40Gb/s NIC / driver: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)

Cluster Configuration
Ceph version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
Replication (default): 2 (host-level replication)
Pools, PGs & RBDs: 4 pools; 2,048 PGs per pool; 2 RBDs from each pool
RBD size: 2TB
Monitors: 1
OSD nodes: 2
OSDs per node: 32 (total OSDs = 32 x 2 = 64)

(An illustrative ceph.conf sketch of the pool defaults follows below.)
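
To show how the replication and placement-group settings in the table map onto standard Ceph options, here is an illustrative ceph.conf fragment using stock option names; it mirrors the table above rather than SanDisk's actual configuration files.

    # ceph.conf (sketch) – pool defaults matching the cluster configuration above
    [global]
    osd pool default size     = 2      # 2x replication (host-level failure domain)
    osd pool default pg num   = 2048   # 2,048 placement groups per pool
    osd pool default pgp num  = 2048   # keep pgp_num in step with pg_num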

Page 32: Deploying flash storage for Ceph without compromising performance


SanDisk: 8K Random, 2 RBD/Client with File System

[Charts: IOPS (scale 0–300,000) and latency in ms (scale 0–120) for 2 LUNs per client (4 clients total), at queue depths 1–32 and read percentages 0/25/50/75/100, Stock Ceph vs. IFOS 1.0.]

Page 33: Deploying flash storage for Ceph without compromising performance


SanDisk: 64K Random, 2 RBD/Client with File System

[Charts: IOPS (scale 0–160,000) and latency in ms (scale 0–180) for 2 LUNs per client (4 clients total), at queue depths 1–32 and read percentages 0/25/50/75/100, Stock Ceph vs. IFOS 1.0.]