Page 1:

Reddy Chagam – Principal Engineer & Chief SDS Architect
Tushar Gohad – Senior Staff Engineer

Intel Corporation, April 19, 2016

Acknowledgements: Orlando Moreno, Dan Ferber (Intel)

Accelerating Ceph for Database Workloads with an All PCIe SSD Cluster

Page 2:

Legal Disclaimer

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: Ceph v0.94.3 Hammer, v10.1.2 Jewel Release, CentOS 7.2, 3.10-327 Kernel, CBT used for testing and data acquisition. OSD System Config: Intel Xeon E5-2699 v4 2x@ 2.20 GHz, 44 cores w/ HT, Cache 46080KB, 128GB DDR4, Each system with 4x P3700 800GB NVMe SSDs, partitioned into 4 OSD’s each, 16 OSD’s total per node. FIO Client Systems: Intel Xeon E5-2699 v3 2x@ 2.30 GHz, 36 cores w/ HT, 96GB, Cache 46080KB, 128GB DDR4. Ceph public and cluster networks 2x 10GbE each. FIO 2.2.8 with LibRBD engine. Sysbench 0.5 for MySQL testing. Tests run by Intel DCG Storage Group in Intel lab. Ceph configuration and CBT YAML file provided in backup slides.

For more information go to http://www.intel.com/performance.

Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Page 3:

Agenda

• Transition to NVMe flash

• NVMe architecture with Ceph

• Database & Ceph – leading flash use case

• The “All NVMe” high-density Ceph Cluster

• MySQL workload performance results

• Summary and next steps

Page 4:

Storage Evolution

• Yesterday: NAND-based Intel PCIe SSDs for NVMe
• Today: 3D NAND-based Intel PCIe SSDs for NVMe, ramping in 2016
• Near term (memory & storage): 3D XPoint™ Technology-based Optane™ SSDs for NVMe (the world's fastest NVMe SSD) and 3D XPoint™ Technology-based Apache Pass (AEP) for DDR4 (revolutionary storage class memory)

Next-gen NVM enables the world's fastest NVMe SSD and revolutionary storage class memory.

NVMe SSDs accelerate performance for latency-sensitive workloads on Ceph.

3D XPoint™ memory media – latency: ~100X; size of data: ~1,000X.

Page 5:

Data Center Form Factors for NVMe

• M.2: 80 and 110mm lengths; smallest footprint of PCIe; use for boot or for max storage density
• U.2 2.5in (SFF-8639), 7mm and 15mm heights: makes up the majority of SSDs sold today because of ease of deployment, hotplug, serviceability, and small form factor
• Add-in-card (AIC): maximum system compatibility with existing servers and the most reliable compliance program; higher power envelope, and options for height and length

Page 6:

Intel Platforms – Tick-Tock Development Model

• Thurley platform (Tylersburg PCH): Nehalem, 45nm, new microarchitecture (Tock); Westmere, 32nm, new process technology (Tick)
• Romley platform (Patsburg PCH): Sandy Bridge, 32nm, new microarchitecture (Tock); Ivy Bridge, 22nm, new process technology (Tick)
• Grantley platform, today (Wellsburg PCH): Haswell, 22nm, new microarchitecture (Tock); Broadwell, 14nm, new process technology (Tick)

Xeon E5 v4 is socket compatible with the v3 series – improves Ceph performance

Page 7:

Ceph Workloads

[Figure: Ceph workloads plotted by storage performance (IOPS, throughput) against storage capacity (PB), spanning block and object use cases – boot volumes, remote disks, VDI, test & dev, app storage, databases, CDN, enterprise dropbox, backup/archive, big data, mobile content depot, cloud DVR, HPC. Databases are the NVM focus.]

Page 8:

Ceph - NVM Usages

[Diagram: where NVM fits across bare-metal, virtual-machine, and container clients and the RADOS OSD node, connected over 10-25 GbE via the RADOS protocol.]

• Client side: NVM caching with write-through – bare-metal and container applications use kernel RBD; guest VMs use librbd through Qemu/Virtio on the hypervisor
• OSD node with FileStore (production): NVM for journaling, read cache, and OSD data beneath the file system
• OSD node with BlueStore (tech preview): NVM for data and metadata, with RocksDB over BlueRocksEnv/BlueFS

Page 9:

Ceph and Percona Server MySQL Integration

[Diagram: MySQL on Ceph – guest VMs (MySQL over librbd/RADOS via Qemu/Virtio on the hypervisor) and Linux containers (MySQL over kernel RBD/RADOS on the host) connect across an IP fabric to a Ceph storage cluster of OSDs on SSDs plus MONs.]

Deployment considerations:
• Bootable Ceph volumes (OS & MySQL data)
• MySQL RBD volumes (all in one, or separate)

Configurations (a provisioning sketch follows below):
• Good: NVMe SSD for journal/cache, HDDs as OSD data drives
• Better: NVMe SSD as journal, high-capacity SATA or 3D-NAND NVMe SSD as the data drive
• Best: all NVMe SSD
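As an illustration of the "Better" tier above, here is a minimal provisioning sketch, assuming the Jewel-era ceph-disk tool: the OSD data goes on a high-capacity drive and its FileStore journal on an NVMe device. The device paths (/dev/sdb, /dev/nvme0n1) are hypothetical and not taken from this deck.

# Hypothetical devices: /dev/sdb = high-capacity data drive, /dev/nvme0n1 = NVMe SSD for the journal
# ceph-disk carves a journal partition out of the NVMe device and an XFS data partition out of /dev/sdb
ceph-disk prepare /dev/sdb /dev/nvme0n1
ceph-disk activate /dev/sdb1    # data partition created by the prepare step

The "Best" (all-NVMe) tier is what the rest of this deck measures, using multiple OSD partitions per NVMe device as described on the following slides.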

Page 10:

An “All-NVMe” high-density Ceph Cluster Configuration

[Diagram: test-bed topology – each Supermicro 1028U-TN10RT+ storage node hosts NVMe1-NVMe4, carrying Ceph OSD1-OSD16; FIO and Docker/Sysbench clients connect over the public network, and OSD nodes interconnect over the cluster network.]

• 5-node all-NVMe Ceph cluster: dual-socket Xeon E5-2699 v4 @ 2.2GHz, 44 cores w/ HT, 128GB DDR4; CentOS 7.2, kernel 3.10-327, Ceph v10.1.2 with BlueStore and async messenger; cluster network 2x 10GbE
• 10x client systems + 1x Ceph MON: dual-socket Xeon E5-2699 v3 @ 2.3GHz, 36 cores w/ HT, 128GB DDR4; public network 2x 10GbE
• Test-set 1: FIO (librbd) clients
• Test-set 2: Docker containers over krbd – Docker1 and Docker2 run MySQL DB servers, Docker3 and Docker4 run Sysbench clients
• DB containers: 16 vCPUs, 32GB mem, 200GB RBD volume, 100GB MySQL dataset, InnoDB buffer cache 25GB (25%)
• Client containers: 16 vCPUs, 32GB RAM; FIO 2.8, Sysbench 0.5

Page 11:

Multi-partitioning flash devices

• High performance NVMe devices are capable of high parallelism at low latency

• DC P3700 800GB Raw Performance: 460K read IOPS & 90K Write IOPS at QD=128

• High Resiliency of “Data Center” Class NVMe devices

• At least 10 Drive writes per day

• Power loss protection, full data path protection, device level telemetry

• By using multiple OSD partitions per device, Ceph performance scales linearly

• Reduces lock contention within a single OSD process

• Lower latency at all queue depths, with the biggest impact on random reads

• Introduces the concept of multiple OSDs on the same physical device

• CRUSH map data placement rules are conceptually similar to managing disks in an enclosure (see the partitioning sketch below)
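A minimal sketch of the 4-OSDs-per-NVMe layout, assuming GPT partitioning with parted and Jewel-era ceph-disk provisioning; the device path and the equal 25% splits are illustrative, and the deck's own provisioning scripts are not shown.

# Split one NVMe device into four equal data partitions
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart osd1 0% 25%
parted -s /dev/nvme0n1 mkpart osd2 25% 50%
parted -s /dev/nvme0n1 mkpart osd3 50% 75%
parted -s /dev/nvme0n1 mkpart osd4 75% 100%

# Create and start one Ceph OSD per partition
for part in /dev/nvme0n1p1 /dev/nvme0n1p2 /dev/nvme0n1p3 /dev/nvme0n1p4; do
    ceph-disk prepare "$part"
    ceph-disk activate "$part"
done

Each partition then appears as its own OSD in the CRUSH map, so placement rules can treat the four partitions much like four disks in an enclosure.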

[Diagram: a single NVMe SSD partitioned into four partitions, one per Ceph OSD (OSD1-OSD4).]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark parameters.

Page 12:

Partitioning multiple OSDs per NVMe

[Chart: average latency (ms, 0-12) vs IOPS (up to ~1,200,000) for 4K random reads, comparing 1, 2, and 4 OSDs per NVMe device. 5 nodes, 20/40/80 OSDs, Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark parameters.

[Chart: single-node CPU utilization (%, 0-90) for 4K random reads at QD=32 with 4/8/16 OSDs per node (single, double, and quad OSD per NVMe). Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark parameters.

Multiple OSDs per NVMe result in higher performance, lower latency, and better CPU utilization.

Page 13:

4K Random Read/Write Performance and Latency (Baseline FIO Test)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark parameters.

[Chart: IO-depth scaling – average latency (ms) vs IOPS for 100% random read, 100% random write, and a 70/30 random mix at 4K. 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v4, 128GB RAM, 2x10GbE, Ceph 10.1.2 w/ BlueStore and async messenger, 6 RBD FIO clients.]

Key results:
• ~1.4M 100% 4K random read IOPS @ ~1 ms average latency
• ~1.6M 100% 4K random read IOPS @ ~2.2 ms average latency
• ~220K 100% 4K random write IOPS @ ~5 ms average latency
• ~560K 70/30% (OLTP mix) random IOPS @ ~3 ms average latency
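For reference, a single data point of this baseline could be reproduced with fio's librbd engine roughly as sketched below; the pool and image names are placeholders, and the full sweep of queue depths and read/write mixes is driven by the CBT YAML file shown in the backup slides.

# 4K random read at queue depth 32 against an existing RBD image (librbd, no kernel mapping needed)
fio --name=rbd-4k-randread \
    --ioengine=rbd --clientname=admin --pool=cbt-librbdfio --rbdname=fio-vol0 \
    --rw=randread --bs=4k --iodepth=32 \
    --time_based --runtime=300 --ramp_time=300 --norandommap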

Page 14:

Sysbench MySQL OLTP Performance (100% SELECT)

[Chart: Sysbench thread scaling – average latency (ms) vs aggregate queries per second (QPS) for 100% reads (point SELECTs). 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v4, 128GB RAM, 2x10GbE, Ceph 10.1.2 w/ BlueStore and async messenger, 20 Docker krbd Sysbench clients (16 vCPUs, 32GB each).]

Key results:
• ~55,000 QPS per client with 2 Sysbench threads
• 1 million QPS aggregate (20 clients) @ ~11 ms average latency
• ~1.3 million QPS aggregate (20 clients) with 8 Sysbench threads

InnoDB buffer pool = 25%, SQL dataset = 100GB

Page 15:

Sysbench MySQL OLTP Performance (100% UPDATE, 70/30% SELECT/UPDATE)

[Chart: Sysbench thread scaling – average latency (ms) vs aggregate QPS for 100% writes (index UPDATEs) and a 70/30% read/write OLTP mix. Same setup as the read test: 5 nodes, 80 OSDs, Ceph 10.1.2 w/ BlueStore and async messenger, 20 Docker krbd Sysbench clients (16 vCPUs, 32GB each).]

Key results:
• ~5,500 QPS with 1 Sysbench client (2-4 threads), 100% write
• ~100K write QPS aggregate (20 clients) @ ~200 ms average latency
• ~25,000 QPS with 1 Sysbench client (4-8 threads), 70/30% OLTP
• ~400K 70/30% OLTP QPS @ ~50 ms average latency

InnoDB buffer pool = 25%, SQL dataset = 100GB
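The 70/30% OLTP mix uses the same sysbench 0.5 oltp.lua invocation shown in the backup slides, with both point SELECTs and index UPDATEs enabled; in the sketch below the 7/3 weighting is an assumption standing in for whatever exact weights were used in the tests.

sysbench --mysql-host=${host} --mysql-port=${mysql_port} \
  --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest \
  --mysql-engine=innodb --oltp-tables-count=32 --oltp-table-size=14000000 \
  --test=/root/benchmarks/sysbench/sysbench/tests/db/oltp.lua \
  --oltp-read-only=off --oltp-simple-ranges=0 --oltp-sum-ranges=0 \
  --oltp-order-ranges=0 --oltp-distinct-ranges=0 \
  --oltp-point-selects=7 --oltp-index-updates=3 --rand-type=uniform \
  --num-threads=${threads} --report-interval=60 --warmup-time=400 \
  --max-time=300 --max-requests=0 --percentile=99 run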

Page 16:

Summary & Conclusions

• NVMe flash storage is a strong fit for low-latency workloads

• Ceph makes a compelling case for database workloads

• With Ceph, 1.4 million random IOPS is achievable in 5U with ~1 ms latency today. Ceph performance is only getting better!

• Using Xeon E5 v4 standard high-volume servers and Intel NVMe SSDs, you can now deploy a high-performance Ceph cluster for database workloads

• Next steps: evaluation on a large-scale cluster; Ceph community collaboration on improving write latency

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark parameters.

Page 17:

Thank you – any questions?

Refer to backup slides for additional configuration and details

Page 18:

Backup

Page 19:

Intel Ceph Contributions

Contributions across the Giant*, Hammer, Infernalis, and Jewel releases (2014-2016):

• New key/value store backend (RocksDB)
• CRUSH placement algorithm improvements (straw2 bucket type)
• BlueStore backend optimizations for NVM
• BlueStore SPDK optimizations
• RADOS I/O hinting (35% better EC write performance)
• Cache tiering with SSDs (write support)
• Cache tiering with SSDs (read support)
• PMStore (NVM-optimized backend based on libpmem)
• RGW and BlueStore compression, encryption (w/ ISA-L, QAT backend)
• Virtual Storage Manager (VSM) open sourced
• CeTune open sourced
• Erasure coding support with ISA-L
• Client-side block cache (librbd)
• Industry-first Ceph cluster to break 1 million 4K random IOPS

Page 20:

Configuration Detail – ceph.conf

[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb
osd objectstore = bluestore
ms_type = async
rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304
bluestore default buffered read = true
auth client required = none
auth cluster required = none
auth service required = none
filestore xattr use omap = true
cluster network = 192.168.142.0/24, 192.168.143.0/24
private network = 192.168.144.0/24, 192.168.145.0/24
log file = /var/log/ceph/$name.log
log to syslog = false
mon compact on trim = false
osd pg bits = 8
osd pgp bits = 8
mon pg warn max object skew = 100000
mon pg warn min per osd = 0
mon pg warn max per osd = 32768
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
perf = true
mutex_perf_counter = true
throttler_perf_counter = false
rbd cache = false

Page 21:

Configuration Detail – ceph.conf (continued)

[osd]
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_op_threads = 32
filestore_queue_max_ops = 5000
filestore_queue_committing_max_ops = 5000
journal_max_write_entries = 1000
journal_queue_max_ops = 3000
objecter_inflight_ops = 102400
filestore_wbthrottle_enable = false
filestore_queue_max_bytes = 1048576000
filestore_queue_committing_max_bytes = 1048576000
journal_max_write_bytes = 1048576000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
osd_mkfs_type = xfs
filestore_max_sync_interval = 10
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_op_threads = 6

[mon]
mon data = /home/bmpa/tmp_cbt/ceph/mon.$id
mon_max_pool_pg_num = 166496
mon_osd_max_split_count = 10000
mon_pg_warn_max_per_osd = 10000

[mon.a]
host = ft02
mon addr = 192.168.142.202:6789

Page 22:

Configuration Detail - CBT YAML File

cluster:
  user: "bmpa"
  head: "ft01"
  clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
  osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
  mons:
    ft02:
      a: "192.168.142.202:6789"
  osds_per_node: 16
  fs: xfs
  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/home/bmpa/cbt/ceph.conf'
  use_existing: False
  newstore_block: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/home/bmpa/tmp_cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    time: 300
    ramp: 300
    vol_size: 10
    mode: ['randrw']
    rwmixread: [0,70,100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: False
    iodepth: [4,8,16,32,64,128]
    osd_ra: [4096]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profile: '2rep'
    log_avg_msec: 250
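With the cluster and benchmark sections above saved to a YAML file, CBT is typically launched roughly as below; the archive directory and file name are placeholders, and the exact command line used for these tests is not shown in the deck.

./cbt.py --archive=/home/bmpa/results ./all-nvme-librbdfio.yaml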

Page 23:

Storage Node Diagram

Two CPU sockets: Socket 0 and Socket 1

Socket 0:
• 2 NVMes
• Intel X540-AT2 (10Gbps)
• 64GB: 8x 8GB 2133 DIMMs

Socket 1:
• 2 NVMes
• 64GB: 8x 8GB 2133 DIMMs

Explore additional optimizations using cgroups, IRQ affinity

Page 24:

High Performance Ceph Node Hardware Building Blocks

• Generally available server designs built for high density and high performance

• High-density 1U standard high-volume server
• Dual-socket 3rd generation Xeon E5 (2699 v3)
• 10 front-removable 2.5" form-factor drive slots, SFF-8639 connector
• Multiple 10Gb network ports, additional slots for 40Gb networking
• Intel DC P3700 NVMe drives are available in the 2.5" drive form factor, allowing easier service in a datacenter environment

Page 25:

MySQL configuration file (my.cnf)

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
datadir = /data
basedir = /usr
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address = 0.0.0.0
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
query_cache_limit = 1M
query_cache_size = 16M
log_error = /var/log/mysql/error.log
expire_logs_days = 10
max_binlog_size = 100M
performance_schema = off
innodb_buffer_pool_size = 25G
innodb_flush_method = O_DIRECT
innodb_log_file_size = 4G
thread_cache_size = 16
innodb_file_per_table
innodb_checksums = 0
innodb_flush_log_at_trx_commit = 0
innodb_write_io_threads = 8
innodb_page_cleaners = 16
innodb_read_io_threads = 8
max_connections = 50000

[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[mysql]

!includedir /etc/mysql/conf.d/

Page 26:

Sysbench commands

prepare:
sysbench --test=/root/benchmarks/sysbench/sysbench/tests/db/parallel_prepare.lua \
  --mysql-user=sbtest --mysql-password=sbtest --oltp-tables-count=32 \
  --num-threads=128 --oltp-table-size=14000000 --mysql-table-engine=innodb \
  --mysql-port=$1 --mysql-host=172.17.0.1 run

READ:
sysbench --mysql-host=${host} --mysql-port=${mysql_port} \
  --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest \
  --mysql-engine=innodb --oltp-tables-count=32 --oltp-table-size=14000000 \
  --test=/root/benchmarks/sysbench/sysbench/tests/db/oltp.lua \
  --oltp-read-only=on --oltp-simple-ranges=0 --oltp-sum-ranges=0 \
  --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 \
  --oltp-point-selects=10 --rand-type=uniform --num-threads=${threads} \
  --report-interval=60 --warmup-time=400 --max-time=300 --max-requests=0 \
  --percentile=99 run

WRITE:
sysbench --mysql-host=${host} --mysql-port=${mysql_port} \
  --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest \
  --mysql-engine=innodb --oltp-tables-count=32 --oltp-table-size=14000000 \
  --test=/root/benchmarks/sysbench/sysbench/tests/db/oltp.lua \
  --oltp-read-only=off --oltp-simple-ranges=0 --oltp-sum-ranges=0 \
  --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=100 \
  --oltp-point-selects=0 --rand-type=uniform --num-threads=${threads} \
  --report-interval=60 --warmup-time=400 --max-time=300 --max-requests=0 \
  --percentile=99 run

Page 27:

Docker Commands

Database containers:
docker run -ti --privileged --volume /sys:/sys --volume /dev:/dev -d -p 2201:22 -p 13306:3306 \
  --cpuset-cpus="1-16,36-43" -m 48G --oom-kill-disable --name database1 \
  ubuntu:14.04.3_20160414-db /bin/bash

Client containers:
docker run -ti -p 3301:22 -d --name client1 ubuntu:14.04.3_20160414-sysbench /bin/bash

Page 28:

RBD Commands

ceph osd pool create database 8192 8192

rbd create --size 204800 vol1 --pool database --image-feature layering

rbd snap create database/vol1@master

rbd snap ls database/vol1

rbd snap protect database/vol1@master

rbd clone database/vol1@master database/vol2

rbd feature disable database/vol2 exclusive-lock object-map fast-diff deep-flatten

rbd flatten database/vol2
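To hand one of these volumes to a MySQL container over kernel RBD (the deployment used in the Sysbench tests), the host maps, formats, and mounts it roughly as sketched below; the device and mount paths are illustrative. Disabling exclusive-lock, object-map, fast-diff, and deep-flatten above is what allows the 3.10-era kernel RBD client to map the cloned image.

rbd map database/vol2            # returns a block device, e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /mnt/database1
mount /dev/rbd0 /mnt/database1   # exposed to the DB container via a volume/bind mount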

Page 29:

An "All-NVMe” high-density Ceph Cluster Configuration

Ceph Storage Cluster

CBT / Zabbix / Monitoring

• 5-Node all-NVMe Ceph Cluster based on Intel Xeon E5-2699v4, 44 core HT, 128GB DDR4

• Storage: Each system with 4x P3700 800GB NVMe, partitioned into 4 OSD’s each, 16 OSD’s total per node• Networking: 2x10GbE public, 2x10GbE cluster, partitioned, replication factor 2

• Ceph 10.1.2 Jewel Release, CentOS 7.2, 3.10.0-327.13.1.el7 Kernel

• 10x FIO/Sysbench Clients: Intel Xeon E5-2699 v3 @ 2.30 GHz, 36 cores w/ HT, 128GB DDR4

• Docker with kernel RBD volumes – 2 database and 2 client containers per node• Database containers – 16 vCPUs, 32GB RAM, 250GB RBD volume• Client containers – 16 vCPUs, 32GB RAM

SuperMicro FatTwin(4x dual-socket XeonE5 v3)

Ce

ph

OS

D1

NVMe1 NVMe3

NVMe2 NVMe4

Ce

ph

OS

D2

Ce

ph

OS

D3

Ce

ph

OS

D4

Ce

ph

OS

D1

6

Ce

ph

OS

D1

NVMe1 NVMe3

NVMe2 NVMe4

Ce

ph

OS

D2

Ce

ph

OS

D3

Ce

ph

OS

D4

Ce

ph

OS

D1

6

Ce

ph

OS

D1

NVMe1 NVMe3

NVMe2 NVMe4

Ce

ph

OS

D2

Ce

ph

OS

D3

Ce

ph

OS

D4

Ce

ph

OS

D1

6

Ce

ph

OS

D1

NVMe1 NVMe3

NVMe2 NVMe4

Ce

ph

OS

D2

Ce

ph

OS

D3

Ce

ph

OS

D4

Ce

ph

OS

D1

6

Ce

ph

OS

D1

NVMe1 NVMe3

NVMe2 NVMe4

Ce

ph

OS

D2

Ce

ph

OS

D3

Ce

ph

OS

D4

Ce

ph

OS

D1

6

SuperMicro 1028U SuperMicro 1028U SuperMicro 1028U SuperMicro 1028U SuperMicro 1028U

Intel Xeon E5 v4 22 Core CPUsIntel P3700 NVMe PCI-e Flash

Easily serviceable NVMe Drives

SuperMicro FatTwin(4x dual-socket XeonE5 v3)

FIO RBD ClientFIO RBD Client

FIO RBD ClientFIO RBD Client

FIO RBD ClientFIO/Sysbench FIO/Sysbench

Intel PCSD(4x dual-socket Xeon E5 v3)

Ceph cluster network (192.168.144.0/24) – 2x10Gbps

Ceph network (192.168.142.0/24) – 2x10Gbps

FIO RBD ClientFIO RBD Client

FIO RBD ClientFIO/Sysbench

SuperMicro FatTwin(1x dual-socket XeonE5 v3)

Ceph MON