Transcript
Increasing Ceph Performance Cost-Effectively with New Non-Volatile Memory Technologies
Santa Clara, CA, August 2017
Agenda:
• Ceph* with Intel® Non-Volatile Memory Technologies
• 2.8M IOPS Ceph* cluster with Intel® Optane™ SSDs + Intel® 3D NAND TLC SSDs
• Ceph* performance analysis on an Intel® Optane™ SSD based all-flash array
• Summary
Intel® 3D NAND SSDs and Optane™ SSDs Transform Storage
(Diagram: Intel® 3D NAND SSDs provide capacity for less, Intel® Optane™ SSDs provide performance for less; together they enable optimized storage. Refer to appendix for footnotes.)
A Brief Ceph Introduction
RADOS: a software-based, reliable, autonomous, distributed object store comprising self-healing, self-managing, intelligent storage nodes and lightweight monitors
LIBRADOS: a library allowing apps to access RADOS directly (a small example of direct RADOS access follows this list)
• Open-source, object-based scale-out storage
• Object, block and file in a single unified storage cluster
• Highly durable and available – replication, erasure coding
• Runs on economical commodity hardware
• 10 years of hardening, vibrant community
• Scalability – CRUSH data placement, no single point of failure
• Replicates and re-balances dynamically
• Enterprise features – snapshots, cloning, mirroring
• Most popular block storage for OpenStack use cases
• Commercial support from Red Hat
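To make the LIBRADOS/RADOS layering concrete, here is a minimal sketch of direct object access with the stock rados CLI; the pool name testpool and object name greeting are placeholders chosen for illustration, not part of the presented cluster.

    # Minimal sketch of direct RADOS object access (assumed pool/object names).
    ceph osd pool create testpool 128                  # create a pool with 128 placement groups
    echo "hello rados" > /tmp/greeting.txt
    rados -p testpool put greeting /tmp/greeting.txt   # store the file as an object
    rados -p testpool ls                               # list objects in the pool
    rados -p testpool get greeting /tmp/readback.txt   # read the object back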
Innovation for Cloud Storage: Intel® Optane™ + Intel® 3D NAND SSDs
• New storage infrastructure: enables high-performance, cost-effective storage – Intel® Optane™ for journal/log/cache, Intel® 3D NAND for data
• OpenStack/Ceph (a provisioning sketch follows below):
  ‒ Intel® Optane™ as Journal/Metadata/WAL (best write performance, lowest latency and best QoS)
  ‒ Intel® 3D NAND TLC SSD as data store (cost-effective storage)
  ‒ Best IOPS/$, IOPS/TB and TB/rack
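A minimal provisioning sketch of this split, assuming a BlueStore OSD created with ceph-volume; the device names (/dev/nvme0n1 for the Optane SSD, /dev/nvme1n1 for a 3D NAND data SSD) and LV sizes are placeholders for illustration only.

    # Assumed devices: /dev/nvme0n1 = Intel Optane SSD, /dev/nvme1n1 = 3D NAND TLC SSD.
    vgcreate optane /dev/nvme0n1                 # volume group on the Optane drive
    lvcreate -L 30G -n db-0  optane              # RocksDB metadata LV for this OSD
    lvcreate -L 2G  -n wal-0 optane              # write-ahead log LV for this OSD
    # Data on the 3D NAND drive, metadata/WAL on Optane:
    ceph-volume lvm create --bluestore \
        --data /dev/nvme1n1 \
        --block.db optane/db-0 \
        --block.wal optane/wal-0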
Ceph Node (Yesterday): 1x Intel® SSD DC P3700 800GB (U.2) for journal + 4x Intel® SSD DC P3520 2TB for data
Transition to
Ceph Node (Today): 1x Intel® Optane™ SSD DC P4800X 375GB (3D XPoint™) for WAL/DB + 8x Intel® SSD DC P4500 4TB (3D NAND) for data
Suggested Configurations for Ceph Storage Node
• Standard/Good (baseline)
  Use cases/applications that need high-capacity storage with high throughput performance
  § NVMe*/PCIe* SSD for journal + caching, HDDs as OSD data drives
• Better IOPS
  Use cases/applications that need higher performance, especially for throughput, IOPS and SLAs, with medium storage capacity requirements
  § NVMe/PCIe SSD as journal, high-capacity SATA SSDs as data drives
• Best Performance
  Use cases/applications that need the highest performance (throughput and IOPS) and low latency/QoS (Quality of Service)
  § All NVMe/PCIe SSDs
More information at Ceph.com (new RAs to be updated soon!): http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments (an illustrative tuning snippet follows below)
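In the spirit of the all-flash tuning guide linked above, a small illustrative ceph.conf fragment is shown below; the option values are examples to convey the kind of knobs involved, not the settings from the reference architectures.

    # Illustrative all-flash tuning fragment (example values, not the RA settings).
    [global]
    debug_ms = 0/0                       # silence messenger debug logging on fast media
    [osd]
    osd_op_num_shards = 8                # spread OSD op processing across more shards
    osd_op_num_threads_per_shard = 2
    osd_enable_op_tracker = false        # drop per-op tracking overhead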
*Other names and brands may be claimed as the property of others.
Ceph* Storage Node – Good
CPU: Intel® Xeon® processor E5-2650 v4
Memory: 64 GB
NIC: 10GbE
Disks: 1x 1.6TB P3700 + 12x 4TB HDDs (1:12 ratio); P3700 as journal and caching
Caching software: Intel® CAS 3.0; option: Intel® RSTe/MD 4.3
Ceph* Storage Node – Better
CPU: Intel® Xeon® processor E5-2690 v4
Memory: 128 GB
NIC: Dual 10GbE
Disks: 1x Intel® SSD DC P3700 (800GB) + 4x Intel® SSD DC S3510 1.6TB, or 1x Intel® Optane™ SSD DC P4800X (375GB) + 8x Intel® SSD DC S3520 1.6TB
Ceph* Storage Node – Best
CPU: Intel® Xeon® processor E5-2699 v4
Memory: >= 128 GB
NIC: 2x 40GbE, 4x dual 10GbE
Disks: 1x Intel® Optane™ SSD DC P4800X (375GB) + 6x Intel® SSD DC P4500 4TB
Drivers for Ceph on All-Flash Arrays
• Storage providers are struggling to achieve the required high performance
• There is a growing trend for cloud providers to adopt SSDs
• CSPs want to build Amazon EBS-like services for their OpenStack*-based public/private clouds
• Strong demand to run enterprise applications
  ‒ For OLTP workloads running on Ceph, tail latency is critical
  ‒ A high-performance, multi-purpose Ceph cluster is a key advantage
• Performance is still an important factor
  ‒ SSD performance continues to increase while prices continue to decrease
Ceph Performance Trends with SSD
• 18x performance improvement in Ceph on all-flash arrays!
(Chart: improvement factors of 1.98x, 3.7x, 1.66x, 1.23x and 1.19x between successive configurations.)
*Refer to backup section for detailed cluster configurations
Test Environment
8x Storage Nodes (CEPH1 … CEPH8; MON on CEPH1), each with:
• Intel® Xeon® processor E5-2699 v4 @ 2.3 GHz, 256GB memory
• 2x 40Gb NIC
• 1x 400GB SSD for OS
• 1x Intel® Optane™ SSD DC P4800X 375GB as WAL and DB
• 8x 2.0TB Intel® SSD DC P4500 as data drives, 2 OSD instances on each P4500 SSD (OSD1 … OSD16)
• Ceph 12.0.0 on Ubuntu 16.10
8x Client Nodes (CLIENT 1 … CLIENT 8), each with 1x 40Gb NIC, running fio
Workloads (an illustrative fio job follows the list)
• fio with librbd
• 20x 30 GB volumes per client
• 4 test cases: 4K random read & write; 64K sequential read & write
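For reference, the 4K random write case could be driven through the librbd engine as sketched below; the pool, image and user names, queue depth and runtime are illustrative placeholders rather than the exact parameters behind these results.

    # Illustrative fio invocation using the librbd engine (placeholder pool/image/user names).
    fio --name=4k-randwrite \
        --ioengine=rbd --clientname=admin --pool=rbd --rbdname=volume-0 \
        --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
        --time_based --runtime=300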
Ceph Optane Performance Overview
• Excellent performance on the Optane-based cluster
• Random read & write hit a CPU bottleneck

Workload               Throughput    Latency (avg.)   99.99% latency (ms)
4K Random Read         2876K IOPS    0.9 ms           2.25
4K Random Write        610K IOPS     4.0 ms           25.435
64K Sequential Read    27.5 GB/s     7.6 ms           13.744
64K Sequential Write   13.2 GB/s     11.9 ms          215
Ceph Optane Performance Improvement
• The breakthrough performance of Optane eliminated the WAL & RocksDB bottleneck
• One P4800X or P3700 covers up to 8x P4500 data drives, serving as both the WAL and RocksDB device (see the partitioning sketch below)
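Serving several data drives from one Optane device simply means carving one small DB and one WAL logical volume per OSD on it. A hedged sketch follows, with the device name, LV sizes and the count of 16 OSDs (2 per P4500 x 8 drives, as in the test cluster) chosen for illustration only.

    # Assumed: /dev/nvme0n1 is the shared 375GB Optane P4800X.
    vgcreate optane /dev/nvme0n1
    for i in $(seq 0 15); do
        lvcreate -L 20G -n db-$i  optane    # RocksDB metadata LV for OSD $i
        lvcreate -L 1G  -n wal-$i optane    # WAL LV for OSD $i
    done
    # Each OSD is then created with --block.db optane/db-$i --block.wal optane/wal-$i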
Ceph Optane Latency Improvement
• Significant tail latency improvement with Optane
• 20x latency reduction for the 99.99% latency
(Chart data points: 5.7 ms, 14.7 ms, 18.3 ms and 317.8 ms.)
Ceph Performance Optimization on Optane
• Optane's performance advantage over P3700 increased from 7% to 33% with tunings (buffered IO)
• Optane optimizations with a separate kv_sync_thread:
  ‒ Separate the thread that feeds the KV store as much as possible (PR #13943, merged)
  ‒ 1.77x performance boost with the OSD-side optimization on Optane, 1.3x through the librbd interface
  ‒ Need to further optimize the OSD layer
OSD optimization: separate kv_sync_thread (fio + objectstore, one OSD, qd=16) – IOPS vs. numjobs
numjobs      1       8       16      32      48      64
One part     23600   65100   71800   62500   62100   63600
Two parts    34900   97800   104000  109000  110000  113000

OSD optimization: separate kv_sync_thread (fio + librbd, one OSD, qd=16) – IOPS vs. number of volumes
volumes      1       8       16      32      48      64
One part     21900   36339   36169   33728   30370   23004
Two parts    25100   48916   47581   43084   37136   24479
Summary
• Ceph* is awesome!
• Strong demand for all-flash array Ceph* solutions
• An Optane-based all-flash Ceph* cluster is capable of delivering over 2.8M IOPS with very low latency!
• Let's work together to make Ceph* more efficient on all-flash arrays!
Next
• Improving Ceph network messenger performance with RDMA
• Ceph NVMe-oF solutions
• Client-side cache on Optane with SQL workloads!
Acknowledgements
• This is a joint effort
• Thanks to Haodong Tang and Jianpeng Ma of our Intel Shanghai development team for their contributions
Ceph* All-Flash SATA Configuration – IVB (E5-2680 v2) + 6x S3700
Compute nodes:
• 2 nodes with Intel® Xeon® processor X5570 @ 2.93 GHz, 128GB memory
• 1 node with Intel® Xeon® processor E5-2680 @ 2.8 GHz, 56GB memory
Storage nodes (CEPH1 … CEPH4; MON on CEPH1), each with:
• Intel® Xeon® processor E5-2680 v2, 32GB memory
• 2x 10Gb NIC
• 1x SSD for OS
• 6x 200GB Intel® SSD DC S3700, 2 OSD instances per drive (OSD1 … OSD12)
Clients (CLIENT 1 … CLIENT 4), each with 1x 10Gb NIC, running fio
Workloads
• fio with librbd
• 20x 30 GB volumes per client
• 4 test cases: 4K random read & write; 64K sequential read & write
Ceph* All-Flash SATA Configuration – HSW (E5-2699 v3) + P3700 + S3510
5x Storage Nodes, each with:
• Intel® Xeon® processor E5-2699 v3 @ 2.3 GHz, 64GB memory
• 1x Intel® SSD DC P3700 800GB for journal (U.2)
• 4x 1.6TB Intel® SSD DC S3510 as data drives, 2 OSDs per S3510 SSD
Workloads
• fio with librbd
• 20x 30 GB volumes per client
• 4 test cases: 4K random read & write; 64K sequential read & write
Ceph* All-Flash 3D NAND Configuration – HSW (E5-2699 v3) + P3700 + P3520
5x Client Nodes: Intel® Xeon® processor E5-2699 v3 @ 2.3 GHz, 64GB memory, 1x 10Gb NIC, running fio (CLIENT 1 … CLIENT 5)
5x Storage Nodes (CEPH1 … CEPH5; MON on CEPH1), each with:
• Intel® Xeon® processor E5-2699 v3 @ 2.3 GHz, 128GB memory
• 1x 400GB SSD for OS
• 1x Intel® SSD DC P3700 800GB for journal (U.2)
• 4x 2.0TB Intel® SSD DC P3520 as data drives, 2 OSD instances on each P3520 SSD (OSD1 … OSD8)
Workloads
• fio with librbd
• 20x 30 GB volumes per client
• 4 test cases: 4K random read & write; 64K sequential read & write