Ceph Day LA – July 16, 2015
Deploying Flash Storage For Ceph
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: comprehensive end-to-end InfiniBand and Ethernet portfolio (ICs, adapter cards, switches/gateways, cables/modules, host/fabric software) spanning storage front/back-end (56G IB & FCoIB, 10/40/56GbE & FCoE), server/compute (56G InfiniBand, 10/40/56GbE), switch/gateway, and Metro/WAN, with Virtual Protocol Interconnect on both the storage and compute sides.]
Scale-Out Architecture Requires A Fast Network
§ Scale-out grows capacity and performance in parallel
§ Requires fast network for replication, sharing, and metadata (file)
• Throughput requires bandwidth
• IOPS requires low latency
§ Proven in HPC, storage appliances, cloud, and now… Ceph
Interconnect Capabilities Determine Scale Out Performance
Solid State Storage Technology Evolution – Lower Latency
Advanced Networking and Protocol Offloads Required to Match Storage Media Performance
[Chart: storage media access time in microseconds (log scale, roughly 0.1 to 1,000) for hard drives, NAND flash SSDs, and next-generation NVM; as media latency falls, the share of total networked-storage latency contributed by the storage protocol (software) and the network grows, so network hardware and software must keep pace with the media.]
Ceph and Networks
§ High performance networks enable maximum cluster availability
• Clients, OSDs, Monitors, and Metadata servers communicate over multiple network layers
• Real-time requirements for heartbeat, replication, recovery, and re-balancing
§ Cluster ("backend") network performance dictates the cluster's performance and scalability
• "Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster" (Ceph documentation); a minimal configuration sketch of the public/cluster split follows below
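A minimal ceph.conf sketch of that public/cluster split. The option names "public network" and "cluster network" are the standard Ceph settings; the subnets are placeholders, not values from the slides:

```
# /etc/ceph/ceph.conf -- illustrative excerpt only; subnets are placeholders
[global]
# client <-> monitor/OSD traffic
public network  = 192.168.10.0/24
# OSD <-> OSD traffic: replication, recovery, re-balancing
cluster network = 192.168.20.0/24
```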
Ceph Deployment Using 10GbE and 40GbE
§ Cluster (Private) Network @ 40/56GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
§ Throughput Clients @ 40/56GbE
• Guarantees line rate for high ingress/egress clients
§ IOPs Clients @ 10GbE or 40/56GbE
• 100K+ IOPs/Client @ 4K blocks
2.5x Higher Throughput, 15% Higher IOPs with 40Gb Ethernet vs. 10GbE! (http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
Throughput testing: fio benchmark, 8MB block, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. IOPS testing: fio benchmark, 4KB block, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. (A hedged fio sketch of this workload follows.)
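A hedged fio sketch approximating the cited tests. The slides give only the block size, file size, job count, and RBD kernel driver; the image name, device path, I/O engine, and read direction below are assumptions:

```
# Map an image with the RBD kernel driver, then run the throughput workload (names assumed)
rbd map rbd/perftest                       # exposes the image as a block device, e.g. /dev/rbd0
fio --name=ceph-throughput --filename=/dev/rbd0 --rw=read \
    --bs=8m --size=20g --numjobs=128 --ioengine=libaio --direct=1 --group_reporting
# IOPS variant from the same slide: identical except --bs=4k
```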
[Diagram: Admin Node; Ceph Nodes (Monitors, OSDs, MDS) connected by a 40GbE cluster network; Client Nodes reaching them over a 10GbE/40GbE public network.]
Ceph Is Accelerated by A Faster Network – Optimized at 56GbE
Workload                      | 40 Gb/s (MB/s) | 56 Gb/s (MB/s)
Ceph fio_rbd 64K random read  | 4,300          | 5,475
Ceph fio_rbd 256K random read | 4,350          | 5,495
27% More Throughput On Random Reads
Ceph Reference Architectures Using Disk
Optimizing Ceph For Throughput and Price/Throughput
§ Red Hat, Supermicro, Seagate, Mellanox, Intel
§ Extensive performance testing: disk, flash, network, CPU, OS, Ceph
§ Reference architecture published soon
[Diagrams: 10GbE network setup and 40GbE network setup]
Testing 12 to 72 Disks Per Node, 2x10GbE vs. 1x40GbE
§ Key Test Results
• More disks = more MB/s per server, less per OSD
• More flash is faster (usually)
• All-flash with 2 SSDs is as fast as many disks
§ 40GbE Advantages
• Up to 2x read throughput per server
• Up to 50% decrease in latency
• Easier than bonding multiple 10GbE links (see the bonding sketch below)
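For context on the last point, this is roughly what aggregating two 10GbE links involves; a hedged iproute2 sketch with assumed interface names and LACP mode, not a configuration from the slides:

```
# LACP (802.3ad) bond of two 10GbE ports; the switch must be configured to match
ip link add bond0 type bond mode 802.3ad
ip link set eth2 down; ip link set eth2 master bond0
ip link set eth3 down; ip link set eth3 master bond0
ip link set bond0 up
ip addr add 192.168.20.11/24 dev bond0     # cluster-network address (placeholder)
# A single 40GbE port needs none of this and is not limited by per-flow hashing
```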
Cisco UCS Reference Architecture for Ceph
§ Cisco Test Setup
• UCS C3160 servers, Nexus 9396PX switch
• 28 or 56 6TB SAS disks; replication or erasure coding (EC)
• 4x 10GbE per server, bonded
§ Results
• One node read: 3,700 MB/s rep, 1,100 MB/s EC
• One node write: 860 MB/s rep, 1,050 MB/s EC
• 3 nodes read: 9,700 MB/s rep, 7,700 MB/s EC
• 8 nodes read: 20,000 MB/s rep, 10,000 MB/s EC
Optimizing Ceph for Flash
Ceph Flash Optimization
Highlights Compared to Stock Ceph
• Read performance up to 8x better
• Write performance up to 2x better with tuning
Optimizations
• All-flash storage for OSDs
• Enhanced parallelism and lock optimization
• Optimization for reads from flash
• Improvements to Ceph messenger
Test Configuration
• InfiniFlash storage with IFOS 1.0 EAP3
• Up to 4 RBDs
• 2 Ceph OSD nodes, connected to InfiniFlash
• 40GbE NICs from Mellanox
SanDisk InfiniFlash
SanDisk InfiniFlash, Maximizing Ceph Random Read IOPS
[Chart: random read IOPS, 8KB random read at QD=16, for 25/50/75/100% read mixes, Stock Ceph vs. IF-500]
[Chart: random read latency (ms), 8KB random read at QD=16, for 25/50/75/100% read mixes, Stock Ceph vs. IF-500]
SanDisk Ceph Optimizations for Flash
Setup                 | SanDisk InfiniFlash           | Scalable Informatics  | Supermicro            | Mellanox
OSD Servers           | Dell R720                     | SI Unison             | Supermicro            | Supermicro
OSD Nodes             | 2                             | 2                     | 3                     | 2
Flash                 | 1 InfiniFlash, 64x8TB = 512TB | 24 SATA SSDs per node | 2x PCIe SSDs per node | 12x SAS SSDs per node
Cluster Network       | 40GbE                         | 100GbE                | 40GbE                 | 56GbE
Total Read Throughput | 71.6 Gb/s                     | 70 Gb/s               | 43 Gb/s               | 44 Gb/s
XioMessenger
Adding RDMA To Ceph
RDMA Enables Efficient Data Movement
§ Hardware network acceleration → higher bandwidth, lower latency
§ Highest CPU efficiency → more CPU power to run applications
Efficient Data Movement With RDMA
Higher Bandwidth
Lower Latency
More CPU Power For Applications
RDMA Enables Efficient Data Movement At 100Gb/s
§ Without RDMA
• 5.7 GB/s throughput
• 20-26% CPU utilization
• 4 cores 100% consumed by moving data
§ With Hardware RDMA
• 11.1 GB/s throughput at half the latency
• 13-14% CPU utilization
• More CPU power for applications, better ROI
[Chart: 100GbE with CPU onload vs. 100GbE with network offload]
CPU Onload Penalties
• Half the throughput
• Twice the latency
• Higher CPU consumption
2X Better Bandwidth
Half the Latency
33% Lower CPU
See the demo: https://www.youtube.com/watch?v=u8ZYhUjSUoI
Adding RDMA to Ceph
§ RDMA beta included in Hammer
• Mellanox, Red Hat, CohortFS, and community collaboration
• Full RDMA support expected in Infernalis
§ Refactoring of the Ceph messaging layer
• New RDMA messenger layer called XioMessenger
• New class hierarchy allowing multiple transports (the simple one is TCP)
• Async design that leverages Accelio
• Reduced locks; reduced number of threads
§ XioMessenger is built on top of Accelio (an RDMA abstraction layer)
• Integrated into all Ceph user-space components: daemons and clients
• Used on both the "public network" and the "cluster network" (a hedged configuration sketch follows)
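A hedged sketch of what selecting the XioMessenger looked like in that development work; the option value below is an assumption based on the XioMessenger branch, not a supported stock setting:

```
# /etc/ceph/ceph.conf -- experimental, illustrative only
[global]
# assumed option from the XioMessenger/Accelio work: switch the messenger from
# the default TCP SimpleMessenger to the Accelio (RDMA) transport
ms type = xio
```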
Accelio, High-Performance Reliable Messaging and RPC Library
§ Open source!
• https://github.com/accelio/accelio/ && www.accelio.org
§ Faster RDMA integration for applications
§ Asynchronous
§ Maximizes message and CPU parallelism
• Enables >10GB/s from a single node
• Enables <10usec latency under load
§ In Giant and Hammer
• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger
Ceph 4KB Read IOPS: 40Gb TCP vs. 40Gb RDMA
[Chart: thousands of IOPS for "2 OSDs, 4 clients" and "4 OSDs, 4 clients", 40Gb TCP vs. 40Gb RDMA; callouts show CPU consumption of 34-38 cores in the OSD nodes and 24-30 cores in the clients for each configuration.]
Ceph-Powered Solutions
Deployment Examples
Ceph For Large Scale Storage – Fujitsu Eternus CD10000
§ Hyperscale Storage
• 4 to 224 nodes
• Up to 56 PB raw capacity
§ Runs Ceph with Enhancements
• 3 different storage nodes
• Object, block, and file storage
§ Mellanox InfiniBand Cluster Network
• 40Gb InfiniBand cluster network
• 10Gb Ethernet front-end network
Media & Entertainment Storage – StorageFoundry Nautilus
§ Turnkey Object Storage
• Built on Ceph
• Pre-configured for rapid deployment
• Mellanox 10/40GbE networking
§ High-Capacity Configuration
• 6-8TB Helium-filled drives
• Up to 2PB in 18U
§ High-Performance Configuration
• Single-client read of 2.2 GB/s
• SSD caching + hard drives
• Supports Ethernet, IB, FC, and FCoE front-end ports
§ More information: www.storagefoundry.net
SanDisk InfiniFlash
§ Flash Storage System
• Announced March 2015
• 512 TB (raw) in one 3U enclosure
• Tested with 40GbE networking
§ High Throughput
• 8 SAS ports, up to 7GB/s
• Connects to 2 or 4 OSD nodes
• Up to 1M IOPS with two nodes
§ More information: • http://bigdataflash.sandisk.com/infiniflash
More Ceph Solutions
§ Cloud – OnyxCCS ElectraStack
• Turnkey IaaS
• Multi-tenant computing system
• 5x faster node/data restoration
• https://www.onyxccs.com/products/8-series
§ Flextronics CloudLabs
• OpenStack on CloudX design
• 2 SSDs + 20 HDDs per node
• Mix of 1Gb/40GbE network
• http://www.flextronics.com/
§ ISS Storage Supercore
• Healthcare solution
• 82,000 IOPS on 512B reads
• 74,000 IOPS on 4KB reads
• 1.1GB/s on 256KB reads
• http://www.iss-integration.com/supercore.html
§ Scalable Informatics Unison
• High-availability cluster
• 60 HDDs in 4U
• Tier 1 performance at archive cost
• https://scalableinformatics.com/unison.html
Even More Ceph Solutions
§ Keeper Technology – keeperSAFE
• Ceph appliance
• For US Government
• File gateway for NFS, SMB, & StorNext
• Mellanox switches
§ Monash University – Melbourne, Australia
• 3 Ceph clusters, >6PB total storage
• 8, 17 (27), and 37 nodes
• OpenStack Cinder and S3/Swift object storage
• Mellanox networking, 10GbE nodes, 56GbE ISLs
Summary
§ Ceph scalability and performance benefit from high-performance networks
• Especially with lots of disks
§ Ceph is being optimized for flash storage
§ End-to-end 40/56 Gb/s transport accelerates Ceph today
• 100Gb/s testing has begun!
• Available in various Ceph solutions and appliances
§ RDMA is next to optimize flash performance (beta in Hammer)
Thank You
SanDisk IF-500 topology on a single 512 TB IF-100
• IF-100 bandwidth is ~8.5GB/s (with 6Gb SAS; 12Gb SAS is coming by end of year) and ~1.5M 4K IOPS
• Ceph is very resource hungry, so at least 2 physical nodes are needed on top of an IF-100
• All 8 ports of an HBA must be connected to saturate the IF-100 at larger block sizes
SanDisk Ceph-InfiniFlash Setup Details
Performance config: IF-500 2-node cluster (32 drives shared to each OSD node)

OSD Node   | 2 servers (Dell R720): 2x E5-2680 12C 2.8GHz, 4x 16GB RDIMM dual rank x4 (64GB), 1x Mellanox ConnectX-3 dual-port 40GbE, 1x LSI 9207 HBA card
RBD Client | 4 servers (Dell R620): 1x E5-2680 10C 2.8GHz, 2x 16GB RDIMM dual rank x4 (32GB), 1x Mellanox ConnectX-3 dual-port 40GbE
Storage    | IF-100 with 64 Icechips in A2 config: 64x 1YX2 Icechips in A2 topology, total storage = 64 x 8TB = 512TB
Network    | 40G switch: NA

OS details
OS                  | Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card / driver   | SAS2308 (9207), mpt2sas
Mellanox 40Gb/s NIC | MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)

Cluster configuration
Ceph version          | sndk-ifos-1.0.0.04 (0.86.rc.eap2)
Replication (default) | 2 [host]; note: host-level replication
Pools, PGs & RBDs     | 4 pools, 2048 PGs per pool, 2 RBDs from each pool
RBD size              | 2TB
Monitors              | 1
OSD nodes             | 2
OSDs per node         | 32 (total OSDs = 32 x 2 = 64)
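An illustrative shell sketch of how the pool/RBD layout above could be created; pool and image names are hypothetical, and --size is given in MB as on Ceph of this vintage:

```
# 4 pools with 2048 PGs each, and two 2 TB RBD images per pool (names hypothetical)
for p in ifpool1 ifpool2 ifpool3 ifpool4; do
    ceph osd pool create "$p" 2048 2048     # pg_num and pgp_num
    rbd create "$p"/img1 --size 2097152     # 2 TB, size in MB
    rbd create "$p"/img2 --size 2097152
done
```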
SanDisk: 8K Random, 2 RBD/Client with File System
[Chart: IOPS, 2 LUNs/client (total 4 clients), queue depths 1-32 at 0/25/50/75/100% read, Stock Ceph vs. IFOS 1.0]
[Chart: latency (ms), 2 LUNs/client (total 4 clients), queue depths 1-32 at 0/25/50/75/100% read, Stock Ceph vs. IFOS 1.0]
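A hedged fio sketch of one cell of this sweep (75% read at queue depth 16); the file path, runtime, and I/O engine are assumptions beyond what the slide states:

```
# One point of the 8K random sweep: 75% read, iodepth 16, on a file-system-backed RBD
fio --name=rand8k-75r-qd16 --filename=/mnt/rbd1/fio.dat --rw=randrw --rwmixread=75 \
    --bs=8k --iodepth=16 --ioengine=libaio --direct=1 --size=20g \
    --runtime=300 --time_based --group_reporting
```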
SanDisk: 64K Random, 2 RBD/Client with File System
[Chart: IOPS, 2 LUNs/client (total 4 clients), queue depths 1-32 at 0/25/50/75/100% read, Stock Ceph vs. IFOS 1.0]
[Chart: latency (ms), 2 LUNs/client (total 4 clients), queue depths 1-32 at 0/25/50/75/100% read, Stock Ceph vs. IFOS 1.0]