Chendi Xue, [email protected], Software Engineer
Jian Zhang, [email protected], Senior Software Engineer
04/2016
Agenda
• Motivation
• Introduction of CeTune
• CeTune Internal
• Ceph performance tuning case studies
• Summary
Motivation
• Background
• Ceph is an open-source, massively scalable, software-defined storage system.
• Ceph is the most popular block storage backend for OpenStack-based cloud storage solutions, and its adoption keeps growing.
• What is the problem?
• End users face numerous challenges in driving the best performance.
• Increasing requests from end users on:
• How to troubleshoot the Ceph cluster?
• How to identify the best tuning knobs among many (500+) parameters?
• How to handle unexpected performance regressions between frequent releases?
• Why CeTune?
• Shorten end users' landing time for Ceph-based storage solutions.
• Close the enterprise-readiness gap for Ceph: stability, usability, performance, etc.
• Build optimized IA-based reference architectures and influence customers to adopt them.
CeTune Architecture
CeTune is built from five modules (Deployer, Benchmarker, Analyzer, Tuner, Visualizer) on top of a shared ConfigHandler (common.py).
• Deployment:
• Install Ceph via apt-get/yum
• Deploy Ceph with OSD, RBD client or radosgw client
• Support clean redeploy and incremental deploy
• Benchmark:
• qemuRBD, fioRBD, COSBench, user-defined workloads
• seqwrite, seqread, randwrite, randread, mixreadwrite
• Analyze:
• System metrics: iostat, sar, interrupts
• Performance counters
• Latency breakdown
• Tuner:
• Ceph configuration tuning
• Pool tuning
• Disk tuning, system tuning
• Visualizer:
• Output as HTML
• Download as CSV
CeTune Terms
• CeTune controller
• Reads the configuration files and controls the process to deploy, benchmark and analyze the collected data.
• CeTune workers
• Controlled by the CeTune controller; act as workload generators and system-metrics collectors.
• CeTune configuration files
• all.conf
• tuner.conf
• testcase.conf
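The configuration files above are plain key=value text; a minimal sketch of how a controller might parse one (the sample keys `head` and `list_server` are hypothetical illustrations, not necessarily CeTune's real option names):

```python
# Parse CeTune-style key=value configuration lines into a dict,
# skipping blank lines and "#" comments.
def parse_conf(lines):
    conf = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

sample = """\
# all.conf sample (keys are hypothetical)
head=ceph-admin
list_server=node1,node2
""".splitlines()

print(parse_conf(sample))
```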
CeTune Workflow
1. Generate the Ceph conf; check whether Ceph needs to be reinstalled and deploy Ceph if so.
2. Enter the benchmark phase: check and apply tuning.
3. Pre-check RBD/RGW status; create and initialize them if necessary.
4. Collect system/Ceph/LTTng logs and start the benchmark.
5. Wait until the benchmark completes or an interrupt signal is received.
6. Run data analysis and save the processed data as a JSON file.
7. Output all reports as HTML files.
8. Apply the user-defined hook script.
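The workflow above can be encoded as an ordered pipeline; each step is a stub here (in CeTune the real work is delegated to the Deploy/Benchmark/Analyzer/Visualizer handlers), so this is only a sketch of the control flow:

```python
# Sketch of the CeTune workflow as an ordered pipeline of steps.
def workflow(need_reinstall=False):
    log = []
    log.append("generate ceph.conf")
    if need_reinstall:
        log.append("reinstall and deploy Ceph")
    log.append("check and apply tuning")
    log.append("pre-check RBD/RGW status, create/init if necessary")
    log.append("collect system/Ceph/LTTng logs, start benchmark")
    log.append("wait for benchmark completion (or interrupt)")
    log.append("analyze data, save result.json")
    log.append("output HTML reports")
    log.append("apply user-defined hook script")
    return log

if __name__ == "__main__":
    for step in workflow(need_reinstall=True):
        print(step)
```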
How to use CeTune
Configuration view: in this view, add the cluster description: which nodes are used, which are the OSD devices, which are the SSD devices, etc.
Configuration view: Ceph tuning (pool configuration and ceph.conf tuning). CeTune refreshes the Ceph configuration before each benchmark test: when a benchmark run starts, CeTune first compares the current Ceph tuning with this configuration, applies tuning if needed, and then starts the test.
How to use CeTune
Configuration view: choose the workload type, result directory, and benchmark configuration (benchmark settings and benchmark testcase settings).
How to use CeTune
Status monitoring view: reports the current CeTune working status, so the user can interrupt CeTune if necessary.
How to use CeTune
Reports view: the result report provides two views. The summary view shows the history report list; double-click a row to open the detailed report of a specific test run. The detail report includes the ceph.conf configuration, cluster description, CeTune process log, and fio/COSBench error logs.
How to use CeTune
Reports view – System metrics: CPU, memory and I/O statistics, plus system runtime logs (iostat, CPU ratio, memory usage).
How to use CeTune
Reports view – Ceph perf counter metrics: Ceph performance counter data.
How to use CeTune
Reports view – Latency breakdown with LTTng: latency breakdown data for the Ceph code paths.
Modules
CeTune comprises five distinct components:
– Controller (Tuner): controls the other four modules; automatically detects and applies tuning differences between tests.
– Deploy Handler: focuses on installing and deploying; currently supports Ceph only.
– Benchmark Handler: supports various benchmark tools and methods; also capable of testing local block devices and user-defined workloads.
– Analyzer Handler: designed to be flexible about adding more analyzers, and outputs a structured JSON file.
– Visualizer Handler: transforms any data in the structured JSON into HTML tables, line-chart graphs and CSV files.
And two interfaces:
– CLI interface
– WebUI interface
The Controller reads the CeTune configuration, drives the DeployHandler, BenchmarkHandler, AnalyzerHandler and VisualizerHandler asynchronously from the Web interface, and reports progress through the CeTune status monitor and the CeTune process log.
Benchmark
• RBD
• fio running inside a VM, or the fio RBD engine
• COSBench
• Uses the radosgw interface to test object storage
• CephFS
• fio CephFS engine (not recommended; a more generic benchmark is planned for CeTune v2)
• Generic devices
• Distributes fio test jobs across multiple nodes and multiple disks
The BenchmarkHandler (Benchmark.py) dispatches to the engine modules qemuRBD.py, fioRBD.py, generic.py, COSBench.py and fioCephFS.py, driven by the Controller, the TunerHandler and testcases.conf, and exposes a plugin hook for user-defined benchmarks.
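The plugin hook implied by the diagram can be sketched as a small registry of benchmark engines; the class and method names here are illustrative, not CeTune's actual API:

```python
# Sketch of a plugin-style benchmark registry: engine modules register
# under a name, and a user-defined workload plugs in the same way.
class Benchmark:
    registry = {}

    @classmethod
    def register(cls, name):
        def wrap(subcls):
            cls.registry[name] = subcls
            return subcls
        return wrap

    def prepare(self):
        pass  # e.g. create and initialize RBD volumes

    def run(self):
        raise NotImplementedError

@Benchmark.register("fioRBD")
class FioRBD(Benchmark):
    def run(self):
        return "fio --ioengine=rbd ..."

@Benchmark.register("user_hook")
class UserHook(Benchmark):
    """A user-defined workload plugged in through the hook."""
    def run(self):
        return "bash my_workload.sh"

if __name__ == "__main__":
    print(Benchmark.registry["fioRBD"]().run())
```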
Analyzer
• System metrics:
• iostat: per-partition I/O statistics
• sar: CPU, memory, NIC
• Interrupts
• top: raw data
• Performance counters:
• Indicate software behavior
• Stable and well formatted in the Ceph code
• Well supported in CeTune
The AnalyzerHandler reads the iostat, sar, perf-counter, fio and COSBench files from the result directory through their respective handlers and merges everything into result.json.
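As a sketch of the per-handler parsing the AnalyzerHandler does, here is one way to turn a single `iostat -x` device line into a metrics dict (the exact columns vary slightly between sysstat versions, so the header is passed in rather than hard-coded):

```python
# Parse one iostat device line against its header row into a dict,
# keeping the device name as a string and converting metrics to float.
def parse_iostat_line(header, line):
    cols = header.split()
    vals = line.split()
    return {k: (vals[i] if i == 0 else float(vals[i]))
            for i, k in enumerate(cols)}

header = "Device r/s w/s rkB/s wkB/s await %util"
line = "sdb 1.0 642.0 4.0 2568.0 102.8 98.5"
metrics = parse_iostat_line(header, line)
print(metrics["Device"], metrics["%util"])
```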
Visualizer
• Reads from result.json and outputs HTML.
• Can be used to visualize other JSON results that follow the same format.
An excerpt of result.json (abridged):
{
  "summary": {
    "run_id": {
      "2": {
        "Status": "Completed\n",
        "Op_size": "64k",
        "Op_Type": "seqwrite",
        "QD": "qd64",
        "Driver": "qemuRBD",
        "SN_Number": 3,
        "CN_Number": 4,
        "Worker": "20",
        "Runtime": "100",
        "IOPS": "12518",
        "BW(MB/s)": "782.922",
        "Latency(ms)": "102.758",
        "SN_IOPS": "3687.939",
        "SN_BW(MB/s)": "1556.812",
        "SN_Latency(ms)": "50.789"
      }
    },
    "Download": {}
  },
  "workload": {},
  "Ceph": {},
  "client": {},
  "vclient": {},
  "runtime": 100,
  "status": "Completed\n",
  "session_name": "2-20-qemuRBD-seqwrite-64k-qd64-40g-0-100-vdb"
}
The JSON maps onto the report layout as tab → table → row → column/data.
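The Visualizer step can be sketched as a simple dict-to-HTML-table flattening (simplified; the real Visualizer also produces line charts and CSV downloads):

```python
# Flatten a list of summary dicts (like result.json's run_id entries)
# into a single HTML table string.
def dict_to_html_table(rows):
    header = "".join(f"<th>{k}</th>" for k in rows[0])
    body = "".join(
        "<tr>" + "".join(f"<td>{v}</td>" for v in r.values()) + "</tr>"
        for r in rows)
    return f"<table><tr>{header}</tr>{body}</table>"

summary = [{"Op_size": "64k", "Op_Type": "seqwrite", "QD": "qd64",
            "IOPS": "12518", "BW(MB/s)": "782.922"}]
print(dict_to_html_table(summary))
```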
Case Study: The metadata overheads
Scenario 1 (omap on OSD device): 4k randwrite, qd8, qemuRBD, 4 servers, 4 clients, 140 RBDs, 1200 sec runtime; fio: 642 IOPS, 2 MB/s, 1609 ms latency; OSD: 4152 IOPS, 27 MB/s, 215 ms latency.
Scenario 2 (omap on separate SSD): 4k randwrite, qd8, qemuRBD, 4 servers, 4 clients, 140 RBDs, 1200 sec runtime; fio: 1579 IOPS, 6 MB/s, 695 ms latency; OSD: 6000 IOPS, 36 MB/s, 373 ms latency.
Above are 4K randwrite results for two scenarios:
• (1) metadata (omap) on the same device as the OSD data;
• (2) omap moved to the same device as the journal (a separate SSD).
From the results, we can see:
• Scenario 2 more than doubled the fio IOPS compared with scenario 1, from 642 to 1579.
• Comparing frontend-to-backend IOPS ratios: scenario 1 gets 1:6.5 and scenario 2 gets 1:3.8.
• Comparing frontend-to-backend BW ratios: scenario 1 gets 1:9.8 and scenario 2 gets 1:5.6.
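The IOPS amplification ratios quoted above can be recomputed directly from the table's numbers:

```python
# Recompute the backend/frontend IOPS amplification from the table.
scenarios = {
    "scenario1 (omap on OSD)": {"fio_iops": 642, "osd_iops": 4152},
    "scenario2 (omap on SSD)": {"fio_iops": 1579, "osd_iops": 6000},
}

for name, s in scenarios.items():
    ratio = s["osd_iops"] / s["fio_iops"]
    print(f"{name}: 1:{ratio:.1f}")
# scenario1: 1:6.5, scenario2: 1:3.8 — moving omap off the data disk
# roughly halves the backend write amplification.
```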
Case Study: The metadata overheads
[Figure: blktrace block-size traces over time for Scenario 1 (omap on OSD device) and Scenario 2 (omap on journal device); chart data omitted.]
This is the blktrace result: using the issued-to-disk block size, we can easily separate omap writes from the 4K random writes. Omap writes happen frequently.
Case Study: Thread number tunings
Run 1: 4k randread, qd8, vdb, 4 servers, 2 clients, 40 RBDs, 400 sec runtime; fio: 3389 IOPS, 13 MB/s, 93 ms latency; OSD: 3729 IOPS, 16 MB/s, 15.9 ms latency.
Run 2: 4k randread, qd8, vdb, 4 servers, 2 clients, 80 RBDs, 300 sec runtime; fio: 3693 IOPS, 14 MB/s, 172 ms latency; OSD: 3761 IOPS, 14 MB/s, 16.4 ms latency.
Long frontend latency, but short backend latency:
• Randread across 40 VMs, each VM capped at 100 IOPS, yet only 3389 IOPS total?
• fio latency is 93 ms, but OSD disk latency is only about 16 ms?
From the CeTune-processed latency graph we get a hint: one OSD's op_latency is as high as 1 sec, but its process_latency is only 25 ms. That means ops are waiting in the OSD queue to be processed. Should we add more osd_op_threads?
Case Study: Thread number tunings
After increasing osd_op_threads, the problem is solved: op_r_latency now matches op_r_process_latency. Fio latency is back to 40 ms, and the OSD-side real processing time is about 25-30 ms, which makes more sense.
Before tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 40 RBDs, 401 sec; fio: 3389 IOPS, 13 MB/s, 93 ms; OSD: 3729 IOPS, 16 MB/s, 15 ms.
Before tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 80 RBDs, 301 sec; fio: 3693 IOPS, 14 MB/s, 172 ms; OSD: 3761 IOPS, 14 MB/s, 16 ms.
After tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 40 RBDs, 400 sec; fio: 3979 IOPS, 15 MB/s, 40 ms; OSD: 3943 IOPS, 15 MB/s, 21 ms.
After tune: 4k randread, qd8, vdb, 4 servers, 2 clients, 80 RBDs, 400 sec; fio: 7441 IOPS, 29 MB/s, 85 ms; OSD: 7295 IOPS, 28 MB/s, 57 ms.
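A tuning like this is applied in ceph.conf; a minimal sketch, where the value is illustrative (the right number depends on core and disk count) and the option applies to the Hammer-era release (0.94.2) used in these tests:

```ini
[osd]
# illustrative value; the pre-Luminous default was 2
osd_op_threads = 8
```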
Case Study: Ceph All Flash Tunings
• Up to 7.6x performance improvement for 4K random write, reaching 140K IOPS.
4K random write tunings (each step builds on the previous one):
• Default: single OSD per SSD
• Tuning-1: 2 OSD instances per SSD
• Tuning-2: Tuning-1 + debug 0
• Tuning-3: Tuning-2 + op_tracker off, fd cache tuning
• Tuning-4: Tuning-3 + jemalloc
• Tuning-5: Tuning-4 + RocksDB to store omap
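A hedged sketch of the kind of ceph.conf knobs behind Tuning-2 and Tuning-3 (debug logging off, op tracker off); the exact values used in the study are not shown on the slide:

```ini
[global]
debug_osd = 0/0
debug_ms = 0/0

[osd]
osd_enable_op_tracker = false
```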
Summary
Ceph is becoming increasingly popular in the PRC storage market.
End users face challenges in driving the best performance out of a Ceph cluster.
CeTune is designed to deploy, benchmark, analyze, tune and visualize a Ceph cluster through a user-friendly WebUI.
CeTune is extensible for third-party workloads.
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
All Flash Setup Configuration Details
• Ceph version is 0.94.2
• XFS as the file system for the data disks
• 4 partitions on each SSD, two of them for OSD daemons
• Replication setting: 2 replicas; 2048 PGs per OSD
Client Cluster
• CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 36C/72T
• Memory: 64 GB
• NIC: 10Gb
• Disks: 1 HDD for OS
Ceph Cluster
• CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 36C/72T
• Memory: 64 GB
• NIC: 10GbE
• Disks: 4 x 400 GB DC3700 SSD (INTEL SSDSC2BB120G4) each cluster
Software, Ceph cluster: OS Ubuntu 14.04.2, kernel 3.16.0, Ceph 0.94.2
Software, client host: OS Ubuntu 14.04.2, kernel 3.16.0
Configuration Details
Client Nodes (Qty: 3)
• CPU: 2 x Intel Xeon E5-2680 @ 2.8 GHz (20 cores, 40 threads)
• Memory: 128 GB (16 x 8GB DDR3 @ 1333 MHz)
• NIC: 2 x 10Gb 82599EB, ECMP (20Gb)
• Disks: 1 HDD for OS
Client VM
• CPU: 1 vCPU (vcpupin)
• Memory: 512 MB
Ceph Nodes
• CPU: 1 x Intel Xeon E3-1275 V2 @ 3.5 GHz (4 cores, 8 threads)
• Memory: 32 GB (4 x 8GB DDR3 @ 1600 MHz)
• NIC: 1 x 82599ES 10GbE SFP+, 4 x 82574L 1GbE RJ45
• HBA/C204: SAS2008 PCI-Express Fusion-MPT SAS-2 / 6 Series/C200 Series Chipset Family SATA AHCI Controller
• Disks:
• 1 x INTEL SSDSC2BW48 2.5'' 480GB for OS
• 1 x Intel P3600 2TB PCI-E SSD (journal)
• 2 x Intel S3500 400GB 2.5'' SSD as journal
• 10 x Seagate ST3000NM0033-9ZM 3.5'' 3TB 7200rpm SATA HDD (data)