Chendi Xue, [email protected], Software Engineer
Jian Zhang, [email protected], Senior Software Engineer
04/2016
Agenda
• Motivation
• Introduction of CeTune
• CeTune Internal
• Ceph performance tuning case studies
• Summary
Motivation
• Background
• Ceph is an open-source, massively scalable, software-defined storage system.
• Ceph is the most popular block storage backend for OpenStack-based cloud storage solutions, and its adoption keeps growing.
• What is the problem?
• End users face numerous challenges in driving the best performance.
• Increasing requests from end users on:
• How to troubleshoot the Ceph cluster?
• How to identify the best tuning knobs among many (500+) parameters?
• How to handle unexpected performance regressions between frequent releases?
• Why CeTune?
• Shorten end users' landing time for Ceph-based storage solutions.
• Close the enterprise-readiness gap for Ceph: stability, usability, performance, etc.
• Build optimized IA-based reference architectures and influence customers to adopt them.
CeTune Architecture
CeTune is built from five modules (Deployer, Benchmarker, Analyzer, Tuner, Visualizer) on top of a shared ConfigHandler (common.py).
• Deployment:
• Install Ceph via apt-get/yum
• Deploy Ceph with OSD, RBD client or radosgw client
• Support clean redeploy and incremental deploy
• Benchmark:
• qemuRBD, fioRBD, COSBench, user-defined workloads
• seqwrite, seqread, randwrite, randread, mixreadwrite
• Analyze:
• System metrics: iostat, sar, interrupts
• Performance counters
• Latency breakdown
• Tuner:
• Ceph configuration tuning
• Pool tuning
• Disk tuning, system tuning
• Visualizer:
• Output as HTML
• Download as CSV
CeTune Terms
• CeTune controller
• Reads the configuration files and controls the process to deploy, benchmark and analyze the collected data.
• CeTune workers
• Controlled by the CeTune controller; act as workload generators and system-metrics collectors.
• CeTune configuration files
• all.conf
• tuner.conf
• testcase.conf
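The configuration files above are plain key=value text; a minimal sketch of how a controller might parse one (the sample keys `head` and `list_server` are hypothetical illustrations, not necessarily CeTune's real option names):

```python
# Parse CeTune-style key=value configuration lines into a dict,
# skipping blank lines and "#" comments.
def parse_conf(lines):
    conf = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

sample = """\
# all.conf sample (keys are hypothetical)
head=ceph-admin
list_server=node1,node2
""".splitlines()

print(parse_conf(sample))
```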
CeTune Workflow
1. Generate the Ceph conf; check whether Ceph needs to be reinstalled and deploy Ceph if so.
2. Enter the benchmark phase: check and apply tuning.
3. Pre-check RBD/RGW status; create and initialize them if necessary.
4. Collect system/Ceph/LTTng logs and start the benchmark.
5. Wait until the benchmark completes or an interrupt signal is received.
6. Run data analysis and save the processed data as a JSON file.
7. Output all reports as HTML files.
8. Apply the user-defined hook script.
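The workflow above can be encoded as an ordered pipeline; each step is a stub here (in CeTune the real work is delegated to the Deploy/Benchmark/Analyzer/Visualizer handlers), so this is only a sketch of the control flow:

```python
# Sketch of the CeTune workflow as an ordered pipeline of steps.
def workflow(need_reinstall=False):
    log = []
    log.append("generate ceph.conf")
    if need_reinstall:
        log.append("reinstall and deploy Ceph")
    log.append("check and apply tuning")
    log.append("pre-check RBD/RGW status, create/init if necessary")
    log.append("collect system/Ceph/LTTng logs, start benchmark")
    log.append("wait for benchmark completion (or interrupt)")
    log.append("analyze data, save result.json")
    log.append("output HTML reports")
    log.append("apply user-defined hook script")
    return log

if __name__ == "__main__":
    for step in workflow(need_reinstall=True):
        print(step)
```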
How to use CeTune
Configuration view: in this view, add the cluster description: which nodes are used, which are the OSD devices, which are the SSD devices, etc.
Configuration view: Ceph tuning (pool configuration and ceph.conf tuning). CeTune refreshes the Ceph configuration before each benchmark test: when a benchmark run starts, CeTune first compares the current Ceph tuning with this configuration, applies tuning if needed, and then starts the test.
How to use CeTune
Configuration view: choose the workload type, result directory, and benchmark configuration (benchmark settings and benchmark testcase settings).
How to use CeTune
Status monitoring view: reports the current CeTune working status, so the user can interrupt CeTune if necessary.
How to use CeTune
Reports view: the result report provides two views. The summary view shows the history report list; double-click a row to open the detailed report of a specific test run. The detail report includes the ceph.conf configuration, cluster description, CeTune process log, and fio/COSBench error logs.
How to use CeTune
Reports view – System metrics: CPU, memory and I/O statistics, plus system runtime logs (iostat, CPU ratio, memory usage).
How to use CeTune
Reports view – Ceph perf counter metrics: Ceph performance counter data.
How to use CeTune
Reports view – Latency breakdown with LTTng: latency breakdown data for the Ceph code paths.
Modules
CeTune comprises five distinct components:
– Controller (Tuner): controls the other four modules; automatically detects and applies tuning differences between tests.
– Deploy Handler: focuses on installing and deploying; currently supports Ceph only.
– Benchmark Handler: supports various benchmark tools and methods; also capable of testing local block devices and user-defined workloads.
– Analyzer Handler: designed to be flexible about adding more analyzers, and outputs a structured JSON file.
– Visualizer Handler: transforms any data in the structured JSON into HTML tables, line-chart graphs and CSV files.
And two interfaces:
– CLI interface
– WebUI interface
The Controller reads the CeTune configuration, drives the DeployHandler, BenchmarkHandler, AnalyzerHandler and VisualizerHandler asynchronously from the Web interface, and reports progress through the CeTune status monitor and the CeTune process log.
Benchmark
• RBD
• fio running inside a VM, or the fio RBD engine
• COSBench
• Uses the radosgw interface to test object storage
• CephFS
• fio CephFS engine (not recommended; a more generic benchmark is planned for CeTune v2)
• Generic devices
• Distributes fio test jobs across multiple nodes and multiple disks
The BenchmarkHandler (Benchmark.py) dispatches to the engine modules qemuRBD.py, fioRBD.py, generic.py, COSBench.py and fioCephFS.py, driven by the Controller, the TunerHandler and testcases.conf, and exposes a plugin hook for user-defined benchmarks.
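The plugin hook implied by the diagram can be sketched as a small registry of benchmark engines; the class and method names here are illustrative, not CeTune's actual API:

```python
# Sketch of a plugin-style benchmark registry: engine modules register
# under a name, and a user-defined workload plugs in the same way.
class Benchmark:
    registry = {}

    @classmethod
    def register(cls, name):
        def wrap(subcls):
            cls.registry[name] = subcls
            return subcls
        return wrap

    def prepare(self):
        pass  # e.g. create and initialize RBD volumes

    def run(self):
        raise NotImplementedError

@Benchmark.register("fioRBD")
class FioRBD(Benchmark):
    def run(self):
        return "fio --ioengine=rbd ..."

@Benchmark.register("user_hook")
class UserHook(Benchmark):
    """A user-defined workload plugged in through the hook."""
    def run(self):
        return "bash my_workload.sh"

if __name__ == "__main__":
    print(Benchmark.registry["fioRBD"]().run())
```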
Analyzer
• System metrics:
• iostat: per-partition I/O statistics
• sar: CPU, memory, NIC
• Interrupts
• top: raw data
• Performance counters:
• Indicate software behavior
• Stable and well formatted in the Ceph code
• Well supported in CeTune
The AnalyzerHandler reads the iostat, sar, perf-counter, fio and COSBench files from the result directory through their respective handlers and merges everything into result.json.
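As a sketch of the per-handler parsing the AnalyzerHandler does, here is one way to turn a single `iostat -x` device line into a metrics dict (the exact columns vary slightly between sysstat versions, so the header is passed in rather than hard-coded):

```python
# Parse one iostat device line against its header row into a dict,
# keeping the device name as a string and converting metrics to float.
def parse_iostat_line(header, line):
    cols = header.split()
    vals = line.split()
    return {k: (vals[i] if i == 0 else float(vals[i]))
            for i, k in enumerate(cols)}

header = "Device r/s w/s rkB/s wkB/s await %util"
line = "sdb 1.0 642.0 4.0 2568.0 102.8 98.5"
metrics = parse_iostat_line(header, line)
print(metrics["Device"], metrics["%util"])
```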
Visualizer
• Reads from result.json and outputs HTML.
• Can be used to visualize other JSON results that follow the same format.
An excerpt of result.json (abridged):
{
  "summary": {
    "run_id": {
      "2": {
        "Status": "Completed\n",
        "Op_size": "64k",
        "Op_Type": "seqwrite",
        "QD": "qd64",
        "Driver": "qemuRBD",
        "SN_Number": 3,
        "CN_Number": 4,
        "Worker": "20",
        "Runtime": "100",
        "IOPS": "12518",
        "BW(MB/s)": "782.922",
        "Latency(ms)": "102.758",
        "SN_IOPS": "3687.939",
        "SN_BW(MB/s)": "1556.812",
        "SN_Latency(ms)": "50.789"
      }
    },
    "Download": {}
  },
  "workload": {},
  "Ceph": {},
  "client": {},
  "vclient": {},
  "runtime": 100,
  "status": "Completed\n",
  "session_name": "2-20-qemuRBD-seqwrite-64k-qd64-40g-0-100-vdb"
}
The JSON maps onto the report layout as tab → table → row → column/data.
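The Visualizer step can be sketched as a simple dict-to-HTML-table flattening (simplified; the real Visualizer also produces line charts and CSV downloads):

```python
# Flatten a list of summary dicts (like result.json's run_id entries)
# into a single HTML table string.
def dict_to_html_table(rows):
    header = "".join(f"<th>{k}</th>" for k in rows[0])
    body = "".join(
        "<tr>" + "".join(f"<td>{v}</td>" for v in r.values()) + "</tr>"
        for r in rows)
    return f"<table><tr>{header}</tr>{body}</table>"

summary = [{"Op_size": "64k", "Op_Type": "seqwrite", "QD": "qd64",
            "IOPS": "12518", "BW(MB/s)": "782.922"}]
print(dict_to_html_table(summary))
```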
Case Study: The metadata overheads
Scenario 1 (omap on OSD device): 4k randwrite, qd8, qemuRBD, 4 servers, 4 clients, 140 RBDs, 1200 sec runtime; fio: 642 IOPS, 2 MB/s, 1609 ms latency; OSD: 4152 IOPS, 27 MB/s, 215 ms latency.
Scenario 2 (omap on separate SSD): 4k randwrite, qd8, qemuRBD, 4 servers, 4 clients, 140 RBDs, 1200 sec runtime; fio: 1579 IOPS, 6 MB/s, 695 ms latency; OSD: 6000 IOPS, 36 MB/s, 373 ms latency.
Above are 4K randwrite results for two scenarios:
• (1) metadata (omap) on the same device as the OSD data;
• (2) omap moved to the same device as the journal (a separate SSD).
From the results, we can see:
• Scenario 2 more than doubled the fio IOPS compared with scenario 1, from 642 to 1579.
• Comparing frontend-to-backend IOPS ratios: scenario 1 gets 1:6.5 and scenario 2 gets 1:3.8.
• Comparing frontend-to-backend BW ratios: scenario 1 gets 1:9.8 and scenario 2 gets 1:5.6.
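The IOPS amplification ratios quoted above can be recomputed directly from the table's numbers:

```python
# Recompute the backend/frontend IOPS amplification from the table.
scenarios = {
    "scenario1 (omap on OSD)": {"fio_iops": 642, "osd_iops": 4152},
    "scenario2 (omap on SSD)": {"fio_iops": 1579, "osd_iops": 6000},
}

for name, s in scenarios.items():
    ratio = s["osd_iops"] / s["fio_iops"]
    print(f"{name}: 1:{ratio:.1f}")
# scenario1: 1:6.5, scenario2: 1:3.8 — moving omap off the data disk
# roughly halves the backend write amplification.
```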
Case Study: The metadata overheads
[Figure: blktrace block-size traces over time for Scenario 1 (omap on OSD device) and Scenario 2 (omap on journal device); chart data omitted.]
This is the blktrace result: using the issued-to-disk block size, we can easily separate omap writes from the 4K random writes. Omap writes happen frequently.
Case Study: Thread number tunings
Run 1: 4k randread, qd8, vdb, 4 servers, 2 clients, 40 RBDs, 400 sec runtime; fio: 3389 IOPS, 13 MB/s, 93 ms latency; OSD: 3729 IOPS, 16 MB/s, 15.9 ms latency.
Run 2: 4k randread, qd8, vdb, 4 servers, 2 clients, 80 RBDs, 300 sec runtime; fio: 3693 IOPS, 14 MB/s, 172 ms latency; OSD: 3761 IOPS, 14 MB/s, 16.4 ms latency.
Long frontend latency, but short backend latency:
• Randread across 40 VMs, each VM capped at 100 IOPS, yet only 3389 IOPS total?
• fio latency is 93 ms, but OSD disk latency is only about 16 ms?
From the CeTune-processed latency graph we get a hint: one OSD's op_latency is as high as 1 sec, but its process_latency is only 25 ms. That means ops are waiting in the OSD queue to be processed. Should we add more osd_op_threads?
Case Study: Thread number tunings
After increasing osd_op_threads, the problem is solved: op_r_latency now matches op_r_process_latency. Fio latency is back to 40 ms, and the OSD-side real processing time is about 25-30 ms, which makes more sense.
Before tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 40 RBDs, 401 sec; fio: 3389 IOPS, 13 MB/s, 93 ms; OSD: 3729 IOPS, 16 MB/s, 15 ms.
Before tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 80 RBDs, 301 sec; fio: 3693 IOPS, 14 MB/s, 172 ms; OSD: 3761 IOPS, 14 MB/s, 16 ms.
After tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 40 RBDs, 400 sec; fio: 3979 IOPS, 15 MB/s, 40 ms; OSD: 3943 IOPS, 15 MB/s, 21 ms.
After tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 80 RBDs, 400 sec; fio: 7441 IOPS, 29 MB/s, 85 ms; OSD: 7295 IOPS, 28 MB/s, 57 ms.
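A tuning like this is applied in ceph.conf; a minimal sketch, where the value is illustrative (the right number depends on core and disk count) and the option applies to the Hammer-era release (0.94.2) used in these tests:

```ini
[osd]
# illustrative value; the pre-Luminous default was 2
osd_op_threads = 8
```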
Case Study: Ceph All Flash Tunings
• Up to 7.6x performance improvement for 4K random write, reaching 140K IOPS.
4K random write tunings (each step builds on the previous one):
• Default: single OSD per SSD
• Tuning-1: 2 OSD instances per SSD
• Tuning-2: Tuning-1 + debug 0
• Tuning-3: Tuning-2 + op_tracker off, fd cache tuning
• Tuning-4: Tuning-3 + jemalloc
• Tuning-5: Tuning-4 + RocksDB to store omap
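A hedged sketch of the kind of ceph.conf knobs behind Tuning-2 and Tuning-3 (debug logging off, op tracker off); the exact values used in the study are not shown on the slide:

```ini
[global]
debug_osd = 0/0
debug_ms = 0/0

[osd]
osd_enable_op_tracker = false
```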
Summary
Ceph is becoming increasingly popular in the PRC storage market.
End users face challenges in driving the best performance out of a Ceph cluster.
CeTune is designed to deploy, benchmark, analyze, tune and visualize a Ceph cluster through a user-friendly WebUI.
CeTune is extensible for third-party workloads.
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
All Flash Setup Configuration Details
• Ceph version is 0.94.2
• XFS as the file system for the data disks
• 4 partitions on each SSD, two of them for OSD daemons
• Replication setting: 2 replicas; 2048 PGs per OSD
Client Cluster
• CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 36C/72T
• Memory: 64 GB
• NIC: 10Gb
• Disks: 1 HDD for OS
Ceph Cluster
• CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 36C/72T
• Memory: 64 GB
• NIC: 10GbE
• Disks: 4 x 400 GB DC3700 SSD (INTEL SSDSC2BB120G4) each cluster
Software, Ceph cluster: OS Ubuntu 14.04.2, kernel 3.16.0, Ceph 0.94.2
Software, client host: OS Ubuntu 14.04.2, kernel 3.16.0
Configuration Details
Client Nodes (Qty: 3)
• CPU: 2 x Intel Xeon E5-2680 @ 2.8 GHz (20 cores, 40 threads)
• Memory: 128 GB (16 x 8GB DDR3 @ 1333 MHz)
• NIC: 2 x 10Gb 82599EB, ECMP (20Gb)
• Disks: 1 HDD for OS
Client VM
• CPU: 1 vCPU (vcpupin)
• Memory: 512 MB
Ceph Nodes
• CPU: 1 x Intel Xeon E3-1275 V2 @ 3.5 GHz (4 cores, 8 threads)
• Memory: 32 GB (4 x 8GB DDR3 @ 1600 MHz)
• NIC: 1 x 82599ES 10GbE SFP+, 4 x 82574L 1GbE RJ45
• HBA/C204: SAS2008 PCI-Express Fusion-MPT SAS-2 / 6 Series/C200 Series Chipset Family SATA AHCI Controller
• Disks:
• 1 x INTEL SSDSC2BW48 2.5'' 480GB for OS
• 1 x Intel P3600 2TB PCI-E SSD (journal)
• 2 x Intel S3500 400GB 2.5'' SSD as journal
• 10 x Seagate ST3000NM0033-9ZM 3.5'' 3TB 7200rpm SATA HDD (data)