CEPH Optimization on SSD
Ilsoo Byun ([email protected])
Network-IT Convergence R&D Center, SDS Tech. Lab
Introduction
About the NIC R&D Center
• COSMOS (Composable, Open, Scalable, Mobile-Oriented System)
• Telco infrastructure innovation
• Building an open, all-IT infrastructure
Software Defined Storage
Traditional Storage vs. Emerging SDS
• Architecture: proprietary H/W and proprietary S/W only (traditional) vs. commodity H/W with open-source S/W available (SDS)
• Benefit: turnkey solution (traditional) vs. low cost, flexible management, and high scalability (SDS)
• Considerations: high cost, inflexible management, limited scalability, and vendor lock-in (traditional) vs. tuning & development effort needed (SDS)

+ SSD
Why does using SSDs make a difference?
• SSD
• Interface: same physical interface as HDDs (SATA/SAS) or same logical interface as HDDs (NVMe)
• Media characteristics: lower latency, higher parallelism
• Why was the SSD so successful?
• Interface compatibility, and
• In mobile systems: higher reliability due to no moving parts
• In enterprise systems: higher performance (one SSD can replace 10–100 HDDs in terms of performance)
SSD Optimization Example
• Linux IO scheduler: noop, CFQ, deadline (see the sketch below)
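As an illustration, a minimal sketch of selecting the noop scheduler through sysfs; the device name /dev/sda and the availability of the legacy single-queue schedulers are assumptions:

```cpp
// Minimal sketch: select the noop I/O scheduler for an SSD via sysfs.
// Assumes the SSD is /dev/sda and the kernel exposes the legacy
// single-queue schedulers (noop/deadline/cfq); must run as root.
#include <fstream>
#include <iostream>

int main() {
    std::ofstream sched("/sys/block/sda/queue/scheduler");
    if (!sched) {
        std::cerr << "cannot open scheduler sysfs node\n";
        return 1;
    }
    // SSDs have no seek penalty, so the reordering done by CFQ/deadline
    // buys little; noop keeps per-IO CPU overhead minimal.
    sched << "noop" << std::endl;
    return 0;
}
```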
Performance Optimization
• Resources: storage (HDD, SSD, …), network (1G, 10G, …), computing (CPU, memory, …)
• Cycle: theory → measurement → analysis → optimization
• Levers: H/W configuration, parameter optimization, source modification
Performance Tuning Principles
• Good design will reduce performance problems:
• Identify the areas that will yield the largest performance boosts over the widest variety of situations, and focus attention on those areas
• Consider peak load, not average load: performance issues arise when loads are high

Source: http://mscom.co.il/downloads/sqlpresentations/12060.pdf
Target Workload
• IOPS-sensitive: block-based (iSCSI, RBD); SSD; block devices, DB; random IO
• Throughput-sensitive: file-based (NFS, CIFS); SSD, HDD; content sharing; sequential IO
• Capacity-sensitive: object-based (S3); HDD; archiving, backup; sequential IO
This talk targets the IOPS-sensitive workload.
CEPH Architecture
[Figure: clients access RADOS through LIBRADOS; CephFS, RGW, and RBD are built on top of LIBRADOS; the cluster consists of MONs (x3), an MDS, and OSDs.]
CRUSH
[Figure: a file is striped into objects; each object is hashed to a placement group, and CRUSH maps placement groups #1 and #2 onto OSDs #1–#5.]
• HASH: pgid = hash(oid) & mask (sketched below)
• Stripe & replication are determined by: 1. pgid, 2. replication factor, 3. CRUSH map, 4. placement rules
• RADOS scalability: placement is computed, not looked up in a central table
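A minimal sketch of the object-to-PG mapping above; std::hash stands in for Ceph's real object hash, and pg_num is an assumed power of two so the mask works:

```cpp
// Minimal sketch of RADOS object-to-PG mapping: pgid = hash(oid) & mask.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

int main() {
    const uint32_t pg_num = 128;       // PGs in the pool; assumed power of two
    const uint32_t mask = pg_num - 1;  // so "& mask" behaves like "% pg_num"
    const std::string oid = "rbd_data.1234.0000000000000001";
    const uint32_t pgid = std::hash<std::string>{}(oid) & mask;
    std::cout << "object " << oid << " -> PG " << pgid << "\n";
    // CRUSH then maps the PG to an ordered list of OSDs using the
    // replication factor, the CRUSH map, and the placement rules.
    return 0;
}
```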
Reference H/W
• Network: 10G (cluster / public)
• Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz x 2
• 256GB memory
• 480GB SATA SSD x 6
• CentOS 7
• Ceph (master-0923)
OSD & ObjectStore
[Figure: a KRBD client (/dev/rbd0) sends an IO request to an OSD; the OSD peers with replica OSDs and applies the IO as a transaction through its ObjectStore backend on the SSD.]
• ObjectStore implementations: FileStore, BlueStore, MemStore
ObjectStore Transaction
• All or nothing
• Ordering
– Together, these provide strong consistency
All or Nothing: how to support the atomicity of a transaction
• FileStore: journal-based
[Figure: 1. write to the journal → 2. ack; every interval (5 ms), 3. store to backing storage → 4. applied]
• BlueStore: RocksDB-based (see the sketch below)
[Figure: 1. write data to storage → 2. update metadata in RocksDB → 3. ack & applied]
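To make the BlueStore side concrete, a minimal sketch using RocksDB's C++ batch API; the key names and the /tmp/kvstore path are illustrative, not Ceph's actual key schema:

```cpp
// Minimal sketch of RocksDB-backed atomicity: all metadata mutations go
// into one WriteBatch, which RocksDB commits atomically through its own
// write-ahead log.
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kvstore", &db);
    assert(s.ok());

    rocksdb::WriteBatch batch;             // the all-or-nothing unit
    batch.Put("onode/obj1/size", "4096");  // metadata mutation
    batch.Put("omap/obj1/key1", "value");  // omap mutation

    rocksdb::WriteOptions wo;
    wo.sync = true;                        // ack only after the WAL is durable
    s = db->Write(wo, &batch);             // commits atomically or not at all
    assert(s.ok());

    delete db;
    return 0;
}
```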
CAP Theorem
• Theorem: you can have at most two of these properties in any shared-data system:
• Consistency
• Availability
• Tolerance to network partitions
Consistency Implementation
• Strong consistency: w(X) = A is applied to all replicas (A A A A) before it is acknowledged, so every subsequent r(X) = A.
• Weak consistency: w(X) = A is acknowledged before all replicas have applied it, so r(X) = ?: a read may return A or a stale value depending on which replica it hits (see the toy sketch below).
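A toy sketch of the difference, assuming four replicas and an old value still present on the stale ones:

```cpp
// Toy model of the two read outcomes on the slide.
#include <array>
#include <iostream>
#include <string>

int main() {
    // After w(X) = A has been acknowledged:
    std::array<std::string, 4> strong = {"A", "A", "A", "A"};      // all replicas updated first
    std::array<std::string, 4> weak   = {"A", "A", "old", "old"};  // propagation still in flight

    std::cout << "strong: r(X) from replica 3 = " << strong[3] << "\n";  // always A
    std::cout << "weak:   r(X) from replica 3 = " << weak[3]   << "\n";  // may be stale
    return 0;
}
```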
CEPH Consistency Model
[Figure: objects are hashed and mapped by CRUSH onto PGs; each PG is served by an acting set of OSDs. A write blocks until the whole acting set has applied it, rather than acknowledging early and settling for causal consistency.]
Why is the consistency model important?
• IBM Cleversafe dsNet applies eventual consistency, so a filesystem layered on top of it can misbehave.
• IBM describes these as transient, non-critical errors:
• Orphaned data on writes
• Read errors on deletes
[Figure: under eventual consistency the ack returns before all replicas (A, B, C) hold the data; under strong consistency the ack returns only after every replica has it.]
Example of a Strongly Consistent Application
• Ceph Tech Talk (2016/1/28) – PostgreSQL on Ceph under Mesos/Aurora with Docker
• Read: databases don't have an IO depth of 64. It's 1.
• Write: databases want each and every transaction to be acknowledged by the storage layer – a full round trip down to the storage layer.
Metadata Processing
BlueStore Design
[Figure: data goes directly to a raw block device via the Allocator; metadata goes to RocksDB, which runs on BlueRocksEnv over BlueFS; BlueFS can span separate WAL, DB, and SLOW block devices.]
Distributed Metadata
• Traditional: clients first query a central metadata server to locate data on the storage nodes, which makes the metadata server a bottleneck.
• Ceph: CRUSH lets clients compute placement themselves; metadata is distributed across the storage nodes alongside the data.
[Class diagram: Onode (atomic_int nref; ghobject_t oid; string key) references its Collection c, its persistent bluestore_onode_t (uint64_t nid; uint64_t size; map<string, bufferptr> attrs), and an ExtentMap. The ExtentMap holds an intrusive::set of Extents, each pointing to a Blob; a vector extent_map_shards of shard_info (uint32_t offset, bytes, extents); and a vector of Shards (bool loaded, bool dirty). The structures are sketched as C++ structs below.]
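The same diagram, sketched as plain C++ structs; field names follow the slide, while the types are simplified stand-ins for Ceph's (e.g. std::string for ghobject_t and bufferptr, std::set for intrusive::set):

```cpp
// Hedged reconstruction of BlueStore's in-memory metadata layout.
#include <atomic>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Blob {};  // shared on-disk data referenced by extents

struct Extent {
    uint64_t logical_offset = 0;
    Blob* blob = nullptr;  // Ceph links these through an intrusive::set
    bool operator<(const Extent& o) const { return logical_offset < o.logical_offset; }
};

struct shard_info { uint32_t offset = 0, bytes = 0, extents = 0; };
struct Shard { bool loaded = false; bool dirty = false; };

struct ExtentMap {
    std::set<Extent> extent_map;                // intrusive::set in Ceph
    std::vector<shard_info> extent_map_shards;  // persistent shard layout
    std::vector<Shard> shards;                  // in-memory per-shard state
};

struct bluestore_onode_t {
    uint64_t nid = 0;   // numeric object id
    uint64_t size = 0;
    std::map<std::string, std::string> attrs;  // map<string, bufferptr> in Ceph
};

struct Collection;  // the collection (PG) this onode belongs to

struct Onode {
    std::atomic<int> nref{0};  // reference count
    std::string oid;           // ghobject_t in Ceph
    std::string key;
    Collection* c = nullptr;
    bluestore_onode_t onode;   // persistent part, encoded into RocksDB
    ExtentMap extent_map;
};

int main() { return 0; }
```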
BlueStore Write Transaction
A 4 KB write to /dev/rbd0 becomes one transaction inside the OSD's BlueStore (see the toy sketch below):
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE → data → block
• Transaction::OP_SETATTRS → metadata → block.db
• Transaction::OP_OMAP_SETKEYS → metadata → block.db
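A toy sketch of that op sequence as a single atomic unit; the enum and descriptions are illustrative, not Ceph's actual ObjectStore::Transaction API:

```cpp
// Toy model: the five ops one 4 KB client write expands into,
// applied as a single all-or-nothing transaction.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

enum class Op { TOUCH, SETALLOCHINT, WRITE, SETATTRS, OMAP_SETKEYS };

int main() {
    const std::vector<std::pair<Op, std::string>> txn = {
        {Op::TOUCH,        "ensure the object exists"},
        {Op::SETALLOCHINT, "pass expected object/write sizes to the allocator"},
        {Op::WRITE,        "4 KB of data -> block device"},
        {Op::SETATTRS,     "object attributes -> RocksDB (block.db)"},
        {Op::OMAP_SETKEYS, "omap keys -> RocksDB (block.db)"},
    };
    for (const auto& op : txn)
        std::cout << op.second << "\n";
    return 0;
}
```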
FileStore Write Transaction
The same 4 KB write to /dev/rbd0 inside the OSD's FileStore:
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE
• Transaction::OP_SETATTRS
• Transaction::OP_OMAP_SETKEYS
All of the ops are first written to the journal on the SSD.
BlueStore Transaction Processing
• OP_TOUCH, OP_SETALLOCHINT, OP_SETATTRS, and OP_OMAP_SETKEYS become RocksDB transactions (metadata updates).
• OP_WRITE goes through a WriteContext: the Allocator prepares space for the data (preparation), and the resulting metadata is then updated in RocksDB.
• bluestore_allocator = bitmap | stupid
Metadata Update Overhead

         Write      Metadata Update   Overhead
HDD      3,000 us   20 us             0.7%
SSD      180 us     20 us             11%
SSD      180 us     40 us ↑           22% ↑

On an HDD the metadata update is noise; on an SSD it is a significant share of the write latency, and it grows as the metadata gets heavier (worked out below).
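As a sanity check, the percentages follow directly from the latencies in the table:

\[
\frac{20\,\mu\text{s}}{3000\,\mu\text{s}} \approx 0.7\%,\qquad
\frac{20\,\mu\text{s}}{180\,\mu\text{s}} \approx 11\%,\qquad
\frac{40\,\mu\text{s}}{180\,\mu\text{s}} \approx 22\%
\]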
Metadata Tuning
• Encode only if the data is modified (Blob::encode)
• Compression, e.g. 00000123 -> 5123 (see the sketch below)
• Sharding
[Figure: a write marks only the touched shards of the Onode metadata dirty, so only those shards need re-encoding.]
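A minimal sketch of the zero-run compression hinted at by the 00000123 -> 5123 example; the exact on-disk encoding in BlueStore may differ:

```cpp
// Encode a fixed-width numeric field as <count of leading zeros><digits>.
#include <iostream>
#include <string>

std::string compress(const std::string& s) {
    std::string::size_type nz = s.find_first_not_of('0');
    if (nz == std::string::npos) nz = s.size();  // the field is all zeros
    return std::to_string(nz) + s.substr(nz);
}

int main() {
    std::cout << compress("00000123") << "\n";  // prints "5123", as on the slide
    return 0;
}
```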
Pre-allocation vs. No Pre-allocation
[Chart: metadata overhead (randwrite, 4 KB, 100 MB): metadata size (KB, 0–100,000) versus shard size (4, 64, 128, 256, 4096, with the default marked), comparing pre-allocation against no pre-allocation.]
IO Path: Sending the Ack
[Figure: a request enters op_wq; the data is written and flushed to the SSD; the metadata is committed via a RocksDB sync transaction (kv_committing); completions pass through the finisher queue; the ack leaves through the messenger pipe's out_q.]
Recap: 1. write → 2. metadata update in RocksDB → 3. ack & applied.
Sharded OpWQ
[Figure: requests are distributed over a sharded op work queue (ShardedOpWQ); each shard has its own workers in front of the ObjectStore.]
• osd_op_num_shards: number of shards
• osd_op_threads: shard workers
• {# of shards} * {# of shard workers} = total threads
• osd_op_queue = prio | wpq
• {PGID} % {osd_op_num_shards} = target shard (see the sketch below)
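A minimal sketch of the shard-selection rule; the shard count and PG ids are illustrative:

```cpp
// Target shard = PGID % osd_op_num_shards: all ops for one PG land on
// the same shard, so per-PG ordering is preserved.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const uint32_t osd_op_num_shards = 5;  // example value
    const std::vector<uint32_t> pgids = {7, 12, 12, 33};
    for (uint32_t pgid : pgids) {
        std::cout << "PG " << pgid << " -> shard "
                  << pgid % osd_op_num_shards << "\n";
    }
    return 0;
}
```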
IO Path Bottleneck
[Figure: the same write path: request → op_wq → write/flush to SSD → RocksDB sync transaction (kv_committing) → finisher queue → pipe out_q → ack. The single shared finisher queue is the bottleneck.]
IO Path Optimized
[Figure: the same write path, but with one finisher queue per shard instead of a single shared queue.]
• bstore_shard_finishers = true (Boolean)
• ~50%↑ IOPS on master as of 09/23 (commit 4619bc09703d429abd41554b693294236c192268)
Next Target
[Figure: the same write path; the remaining bottleneck is the RocksDB sync transaction.]
RocksDB Sync Transaction Overhead
[Figure: RocksDB write path: a request is appended to the transaction log (logfile) and inserted into the memtable; memtables are flushed to SST files in the background; concurrent writers are batch-committed to the log.]
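A toy calculation of why batch commit helps: one log sync is amortized over all writers merged into the batch (the 50 µs sync cost is an assumption):

```cpp
// Amortization of a single log-file sync across batched writers.
#include <iostream>
#include <vector>

int main() {
    const double sync_cost_us = 50.0;  // assumed cost of one log-file sync
    for (int writers : std::vector<int>{1, 4, 16}) {
        // One sync covers every writer merged into the batch.
        std::cout << writers << " writer(s) per commit -> "
                  << sync_cost_us / writers << " us of sync cost each\n";
    }
    return 0;
}
```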
Etc.
CPU Usage
• 4 KB random write: 155 K IOPS
• CPU usage: low and unbalanced across cores
HT & RSS
• Effect of HT (Hyper-Threading): distortion of the apparent CPU usage
• RSS (Receive Side Scaling): spreads NIC receive processing across CPUs
[Figure: dual-socket system; the NIC attaches via PCIe to CPU#1, the sockets are linked by QPI, and each socket has its own DIMMs.]
Source: https://software.intel.com/en-us/articles/introduction-to-hyper-threading-technology
Interrupt Coalescing
• rx_usecs = 10 vs. rx_usecs = 0 (with 0, interrupts are delivered immediately instead of being coalesced)
Optimization Result
• 4 KB random write: 164 K IOPS (10%↑)
Conclusions
• SSDs show different performance characteristics from HDDs
• S/W optimization on SSDs requires in-depth knowledge of the system
• We shared our experience optimizing Ceph:
• Ceph consistency model
• Metadata processing
• Shard finishers
• OS optimizations
Q&A
Thank You