CEPH Optimization on SSD
Ilsoo Byun ([email protected])
Network-IT Convergence R&D Center, SDS Tech. Lab
Introduction
About the NIC R&D Center
• COSMOS (Composable, Open, Scalable, Mobile-Oriented System)
• Telco infrastructure innovation
• Building an open, all-IT infrastructure
Software Defined Storage
Traditional Storage vs. Emerging SDS
• Architecture: proprietary H/W and proprietary S/W only (traditional) vs. commodity H/W with open-source S/W available (SDS)
• Benefit: turnkey solution (traditional) vs. low cost, flexible management, and high scalability (SDS)
• Considerations: high cost, inflexible management, limited scalability, and vendor lock-in (traditional) vs. tuning & development effort needed (SDS)

+ SSD
Why does using SSDs make a difference?
• SSD
• Interface: same physical interface as HDDs (SATA/SAS) or same logical interface as HDDs (NVMe)
• Media characteristics: lower latency, higher parallelism
• Why was the SSD so successful?
• Interface compatibility, and
• In mobile systems: higher reliability due to no moving parts
• In enterprise systems: higher performance (one SSD can replace 10–100 HDDs in terms of performance)
SSD Optimization Example
• Linux IO scheduler: noop, CFQ, deadline (see the sketch below)
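As an illustration, a minimal sketch of selecting the noop scheduler through sysfs; the device name /dev/sda and the availability of the legacy single-queue schedulers are assumptions:

```cpp
// Minimal sketch: select the noop I/O scheduler for an SSD via sysfs.
// Assumes the SSD is /dev/sda and the kernel exposes the legacy
// single-queue schedulers (noop/deadline/cfq); must run as root.
#include <fstream>
#include <iostream>

int main() {
    std::ofstream sched("/sys/block/sda/queue/scheduler");
    if (!sched) {
        std::cerr << "cannot open scheduler sysfs node\n";
        return 1;
    }
    // SSDs have no seek penalty, so the reordering done by CFQ/deadline
    // buys little; noop keeps per-IO CPU overhead minimal.
    sched << "noop" << std::endl;
    return 0;
}
```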
Performance Optimization
• Resources: storage (HDD, SSD, …), network (1G, 10G, …), computing (CPU, memory, …)
• Cycle: theory → measurement → analysis → optimization
• Levers: H/W configuration, parameter optimization, source modification
Performance Tuning Principles
• Good design will reduce performance problems:
• Identify the areas that will yield the largest performance boosts over the widest variety of situations, and focus attention on those areas
• Consider peak load, not average load: performance issues arise when loads are high

Source: http://mscom.co.il/downloads/sqlpresentations/12060.pdf
Target Workload
• IOPS-sensitive: block-based (iSCSI, RBD); SSD; block devices, DB; random IO
• Throughput-sensitive: file-based (NFS, CIFS); SSD, HDD; content sharing; sequential IO
• Capacity-sensitive: object-based (S3); HDD; archiving, backup; sequential IO
This talk targets the IOPS-sensitive workload.
CEPH Architecture
[Figure: clients access RADOS through LIBRADOS; CephFS, RGW, and RBD are built on top of LIBRADOS; the cluster consists of MONs (x3), an MDS, and OSDs.]
CRUSH
[Figure: a file is striped into objects; each object is hashed to a placement group, and CRUSH maps placement groups #1 and #2 onto OSDs #1–#5.]
• HASH: pgid = hash(oid) & mask (sketched below)
• Stripe & replication are determined by: 1. pgid, 2. replication factor, 3. CRUSH map, 4. placement rules
• RADOS scalability: placement is computed, not looked up in a central table
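A minimal sketch of the object-to-PG mapping above; std::hash stands in for Ceph's real object hash, and pg_num is an assumed power of two so the mask works:

```cpp
// Minimal sketch of RADOS object-to-PG mapping: pgid = hash(oid) & mask.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

int main() {
    const uint32_t pg_num = 128;       // PGs in the pool; assumed power of two
    const uint32_t mask = pg_num - 1;  // so "& mask" behaves like "% pg_num"
    const std::string oid = "rbd_data.1234.0000000000000001";
    const uint32_t pgid = std::hash<std::string>{}(oid) & mask;
    std::cout << "object " << oid << " -> PG " << pgid << "\n";
    // CRUSH then maps the PG to an ordered list of OSDs using the
    // replication factor, the CRUSH map, and the placement rules.
    return 0;
}
```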
Reference H/W
• Network: 10G (cluster / public)
• Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz x 2
• 256GB memory
• 480GB SATA SSD x 6
• CentOS 7
• Ceph (master-0923)
OSD & ObjectStore
[Figure: a KRBD client (/dev/rbd0) sends an IO request to an OSD; the OSD peers with replica OSDs and applies the IO as a transaction through its ObjectStore backend on the SSD.]
• ObjectStore implementations: FileStore, BlueStore, MemStore
ObjectStore Transaction
• All or nothing
• Ordering
– Together, these provide strong consistency
All or Nothing: how to support the atomicity of a transaction
• FileStore: journal-based
[Figure: 1. write to the journal → 2. ack; every interval (5 ms), 3. store to backing storage → 4. applied]
• BlueStore: RocksDB-based (see the sketch below)
[Figure: 1. write data to storage → 2. update metadata in RocksDB → 3. ack & applied]
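To make the BlueStore side concrete, a minimal sketch using RocksDB's C++ batch API; the key names and the /tmp/kvstore path are illustrative, not Ceph's actual key schema:

```cpp
// Minimal sketch of RocksDB-backed atomicity: all metadata mutations go
// into one WriteBatch, which RocksDB commits atomically through its own
// write-ahead log.
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kvstore", &db);
    assert(s.ok());

    rocksdb::WriteBatch batch;             // the all-or-nothing unit
    batch.Put("onode/obj1/size", "4096");  // metadata mutation
    batch.Put("omap/obj1/key1", "value");  // omap mutation

    rocksdb::WriteOptions wo;
    wo.sync = true;                        // ack only after the WAL is durable
    s = db->Write(wo, &batch);             // commits atomically or not at all
    assert(s.ok());

    delete db;
    return 0;
}
```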
CAP Theorem
• Theorem: you can have at most two of these properties in any shared-data system:
• Consistency
• Availability
• Tolerance to network partitions
Consistency Implementation
• Strong consistency: w(X) = A is applied to all replicas (A A A A) before it is acknowledged, so every subsequent r(X) = A.
• Weak consistency: w(X) = A is acknowledged before all replicas have applied it, so r(X) = ?: a read may return A or a stale value depending on which replica it hits (see the toy sketch below).
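A toy sketch of the difference, assuming four replicas and an old value still present on the stale ones:

```cpp
// Toy model of the two read outcomes on the slide.
#include <array>
#include <iostream>
#include <string>

int main() {
    // After w(X) = A has been acknowledged:
    std::array<std::string, 4> strong = {"A", "A", "A", "A"};      // all replicas updated first
    std::array<std::string, 4> weak   = {"A", "A", "old", "old"};  // propagation still in flight

    std::cout << "strong: r(X) from replica 3 = " << strong[3] << "\n";  // always A
    std::cout << "weak:   r(X) from replica 3 = " << weak[3]   << "\n";  // may be stale
    return 0;
}
```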
CEPH Consistency Model
[Figure: objects are hashed and mapped by CRUSH onto PGs; each PG is served by an acting set of OSDs. A write blocks until the whole acting set has applied it, rather than acknowledging early and settling for causal consistency.]
Why is the consistency model important?
• IBM Cleversafe dsNet applies eventual consistency, so a filesystem layered on top of it can misbehave.
• IBM describes these as transient, non-critical errors:
• Orphaned data on writes
• Read errors on deletes
[Figure: under eventual consistency the ack returns before all replicas (A, B, C) hold the data; under strong consistency the ack returns only after every replica has it.]
Example of a Strongly Consistent Application
• Ceph Tech Talk (2016/1/28) – PostgreSQL on Ceph under Mesos/Aurora with Docker
• Read: databases don't have an IO depth of 64. It's 1.
• Write: databases want each and every transaction to be acknowledged by the storage layer – a full round trip down to the storage layer.
Metadata Processing
BlueStore Design
[Figure: data goes directly to a raw block device via the Allocator; metadata goes to RocksDB, which runs on BlueRocksEnv over BlueFS; BlueFS can span separate WAL, DB, and SLOW block devices.]
Distributed Metadata
• Traditional: clients first query a central metadata server to locate data on the storage nodes, which makes the metadata server a bottleneck.
• Ceph: CRUSH lets clients compute placement themselves; metadata is distributed across the storage nodes alongside the data.
[Class diagram: Onode (atomic_int nref; ghobject_t oid; string key) references its Collection c, its persistent bluestore_onode_t (uint64_t nid; uint64_t size; map<string, bufferptr> attrs), and an ExtentMap. The ExtentMap holds an intrusive::set of Extents, each pointing to a Blob; a vector extent_map_shards of shard_info (uint32_t offset, bytes, extents); and a vector of Shards (bool loaded, bool dirty). The structures are sketched as C++ structs below.]
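The same diagram, sketched as plain C++ structs; field names follow the slide, while the types are simplified stand-ins for Ceph's (e.g. std::string for ghobject_t and bufferptr, std::set for intrusive::set):

```cpp
// Hedged reconstruction of BlueStore's in-memory metadata layout.
#include <atomic>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Blob {};  // shared on-disk data referenced by extents

struct Extent {
    uint64_t logical_offset = 0;
    Blob* blob = nullptr;  // Ceph links these through an intrusive::set
    bool operator<(const Extent& o) const { return logical_offset < o.logical_offset; }
};

struct shard_info { uint32_t offset = 0, bytes = 0, extents = 0; };
struct Shard { bool loaded = false; bool dirty = false; };

struct ExtentMap {
    std::set<Extent> extent_map;                // intrusive::set in Ceph
    std::vector<shard_info> extent_map_shards;  // persistent shard layout
    std::vector<Shard> shards;                  // in-memory per-shard state
};

struct bluestore_onode_t {
    uint64_t nid = 0;   // numeric object id
    uint64_t size = 0;
    std::map<std::string, std::string> attrs;  // map<string, bufferptr> in Ceph
};

struct Collection;  // the collection (PG) this onode belongs to

struct Onode {
    std::atomic<int> nref{0};  // reference count
    std::string oid;           // ghobject_t in Ceph
    std::string key;
    Collection* c = nullptr;
    bluestore_onode_t onode;   // persistent part, encoded into RocksDB
    ExtentMap extent_map;
};

int main() { return 0; }
```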
BlueStore Write Transaction
A 4 KB write to /dev/rbd0 becomes one transaction inside the OSD's BlueStore (see the toy sketch below):
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE → data → block
• Transaction::OP_SETATTRS → metadata → block.db
• Transaction::OP_OMAP_SETKEYS → metadata → block.db
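A toy sketch of that op sequence as a single atomic unit; the enum and descriptions are illustrative, not Ceph's actual ObjectStore::Transaction API:

```cpp
// Toy model: the five ops one 4 KB client write expands into,
// applied as a single all-or-nothing transaction.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

enum class Op { TOUCH, SETALLOCHINT, WRITE, SETATTRS, OMAP_SETKEYS };

int main() {
    const std::vector<std::pair<Op, std::string>> txn = {
        {Op::TOUCH,        "ensure the object exists"},
        {Op::SETALLOCHINT, "pass expected object/write sizes to the allocator"},
        {Op::WRITE,        "4 KB of data -> block device"},
        {Op::SETATTRS,     "object attributes -> RocksDB (block.db)"},
        {Op::OMAP_SETKEYS, "omap keys -> RocksDB (block.db)"},
    };
    for (const auto& op : txn)
        std::cout << op.second << "\n";
    return 0;
}
```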
FileStore Write Transaction
The same 4 KB write to /dev/rbd0 inside the OSD's FileStore:
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE
• Transaction::OP_SETATTRS
• Transaction::OP_OMAP_SETKEYS
All of the ops are first written to the journal on the SSD.
BlueStore Transaction Processing
• OP_TOUCH, OP_SETALLOCHINT, OP_SETATTRS, and OP_OMAP_SETKEYS become RocksDB transactions (metadata updates).
• OP_WRITE goes through a WriteContext: the Allocator prepares space for the data (preparation), and the resulting metadata is then updated in RocksDB.
• bluestore_allocator = bitmap | stupid
Metadata Update Overhead

         Write      Metadata Update   Overhead
HDD      3,000 us   20 us             0.7%
SSD      180 us     20 us             11%
SSD      180 us     40 us ↑           22% ↑

On an HDD the metadata update is noise; on an SSD it is a significant share of the write latency, and it grows as the metadata gets heavier (worked out below).
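As a sanity check, the percentages follow directly from the latencies in the table:

\[
\frac{20\,\mu\text{s}}{3000\,\mu\text{s}} \approx 0.7\%,\qquad
\frac{20\,\mu\text{s}}{180\,\mu\text{s}} \approx 11\%,\qquad
\frac{40\,\mu\text{s}}{180\,\mu\text{s}} \approx 22\%
\]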
Metadata Tuning
• Encode only if the data is modified (Blob::encode)
• Compression, e.g. 00000123 -> 5123 (see the sketch below)
• Sharding
[Figure: a write marks only the touched shards of the Onode metadata dirty, so only those shards need re-encoding.]
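A minimal sketch of the zero-run compression hinted at by the 00000123 -> 5123 example; the exact on-disk encoding in BlueStore may differ:

```cpp
// Encode a fixed-width numeric field as <count of leading zeros><digits>.
#include <iostream>
#include <string>

std::string compress(const std::string& s) {
    std::string::size_type nz = s.find_first_not_of('0');
    if (nz == std::string::npos) nz = s.size();  // the field is all zeros
    return std::to_string(nz) + s.substr(nz);
}

int main() {
    std::cout << compress("00000123") << "\n";  // prints "5123", as on the slide
    return 0;
}
```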
Pre-allocation vs. No Pre-allocation
[Chart: metadata overhead (randwrite, 4 KB, 100 MB): metadata size (KB, 0–100,000) versus shard size (4, 64, 128, 256, 4096, with the default marked), comparing pre-allocation against no pre-allocation.]
IO Path: Sending the Ack
[Figure: a request enters op_wq; the data is written and flushed to the SSD; the metadata is committed via a RocksDB sync transaction (kv_committing); completions pass through the finisher queue; the ack leaves through the messenger pipe's out_q.]
Recap: 1. write → 2. metadata update in RocksDB → 3. ack & applied.
Sharded OpWQ
[Figure: requests are distributed over a sharded op work queue (ShardedOpWQ); each shard has its own workers in front of the ObjectStore.]
• osd_op_num_shards: number of shards
• osd_op_threads: shard workers
• {# of shards} * {# of shard workers} = total threads
• osd_op_queue = prio | wpq
• {PGID} % {osd_op_num_shards} = target shard (see the sketch below)
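A minimal sketch of the shard-selection rule; the shard count and PG ids are illustrative:

```cpp
// Target shard = PGID % osd_op_num_shards: all ops for one PG land on
// the same shard, so per-PG ordering is preserved.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const uint32_t osd_op_num_shards = 5;  // example value
    const std::vector<uint32_t> pgids = {7, 12, 12, 33};
    for (uint32_t pgid : pgids) {
        std::cout << "PG " << pgid << " -> shard "
                  << pgid % osd_op_num_shards << "\n";
    }
    return 0;
}
```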
IO Path Bottleneck
[Figure: the same write path: request → op_wq → write/flush to SSD → RocksDB sync transaction (kv_committing) → finisher queue → pipe out_q → ack. The single shared finisher queue is the bottleneck.]
IO Path Optimized
[Figure: the same write path, but with one finisher queue per shard instead of a single shared queue.]
• bstore_shard_finishers = true (Boolean)
• ~50%↑ IOPS on master as of 09/23 (commit 4619bc09703d429abd41554b693294236c192268)
Next Target
[Figure: the same write path; the remaining bottleneck is the RocksDB sync transaction.]
RocksDB Sync Transaction Overhead
[Figure: RocksDB write path: a request is appended to the transaction log (logfile) and inserted into the memtable; memtables are flushed to SST files in the background; concurrent writers are batch-committed to the log.]
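A toy calculation of why batch commit helps: one log sync is amortized over all writers merged into the batch (the 50 µs sync cost is an assumption):

```cpp
// Amortization of a single log-file sync across batched writers.
#include <iostream>
#include <vector>

int main() {
    const double sync_cost_us = 50.0;  // assumed cost of one log-file sync
    for (int writers : std::vector<int>{1, 4, 16}) {
        // One sync covers every writer merged into the batch.
        std::cout << writers << " writer(s) per commit -> "
                  << sync_cost_us / writers << " us of sync cost each\n";
    }
    return 0;
}
```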
Etc.
CPU Usage
• 4 KB random write: 155 K IOPS
• CPU usage: low and unbalanced across cores
HT & RSS
• Effect of HT (Hyper-Threading): distortion of the apparent CPU usage
• RSS (Receive Side Scaling): spreads NIC receive processing across CPUs
[Figure: dual-socket system; the NIC attaches via PCIe to CPU#1, the sockets are linked by QPI, and each socket has its own DIMMs.]
Source: https://software.intel.com/en-us/articles/introduction-to-hyper-threading-technology
Interrupt Coalescing
• rx_usecs = 10 vs. rx_usecs = 0 (with 0, interrupts are delivered immediately instead of being coalesced)
Optimization Result
• 4 KB random write: 164 K IOPS (10%↑)
Conclusions
• SSDs show different performance characteristics from HDDs
• S/W optimization on SSDs requires in-depth knowledge of the system
• We shared our experience optimizing Ceph:
• Ceph consistency model
• Metadata processing
• Shard finishers
• OS optimizations
Q&A
Thank You