Understanding Write Behaviors of Storage Backends in Ceph Object Store

Dong-Yun Lee, Kisik Jeong, Sang-Hoon Han, Jin-Soo Kim, Joo-Young Hwang†, and Sangyeun Cho†
How Ceph Amplifies Writes

Ceph offers a block device service to the client ("I can provide a block device service for you!"). A single client write, however, turns into several kinds of I/O inside the object store:
• Ceph data: the client data itself
• Ceph metadata: describes the data
• Ceph journal: makes updates atomic and consistent
• File system metadata and file system journal: added underneath by the local file system (XFS)

In addition, every write is replicated for reliability, so all of the above is repeated on each replica OSD.

A single write can generate several hidden I/Os. What happens in the long term?
How Much Does Ceph Amplify Writes?

<Figure: WAF breakdown for FileStore (x-axis: WAF, 0–16)>
• Ceph data: 3
• Ceph metadata: 0.702
• Ceph journal: 5.994
• File system metadata: 0.419
• File system journal: 3.635

Writes are amplified by over 13x (see the arithmetic below).
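To make the arithmetic explicit, here is a minimal sketch that sums the breakdown, assuming each value above is that component's contribution to WAF, normalized to the volume the client wrote:

```python
# Per-component WAF contributions for FileStore, taken from the figure above.
breakdown = {
    "Ceph data": 3.0,        # 3 replicas of the client data
    "Ceph metadata": 0.702,
    "Ceph journal": 5.994,   # write-ahead journal traffic across replicas
    "FS metadata": 0.419,
    "FS journal": 3.635,
}

# WAF = total bytes reaching storage / bytes written by the client.
waf = sum(breakdown.values())
print(f"Total WAF = {waf:.2f}")  # Total WAF = 13.75 -> "over 13x"
```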
Our Motivation & Goal

How and why are writes so highly amplified in Ceph?
• We want the exact numbers

Why do we focus on write amplification?
• WAF (Write Amplification Factor) affects overall performance
• On SSDs, it shortens device lifespan
• Larger WAF → smaller effective bandwidth
• Redundant "journaling of journal" may exist

Goals of this paper:
• Understand the write behaviors of Ceph
• Empirically study write amplification in Ceph
Outline
Introduction
Background
Evaluation environment
Result & Analysis
Conclusion
Architecture of Ceph RADOS Block Device (RBD)

<Figure: Ceph RBD I/O path, from the client I/O stack down to the OSD internals*>
• The client I/O stack accesses the Ceph RADOS Block Device through /dev/rbd, backed by librbd and librados.
• Data written to the block device is striped into 4 MB objects.
• For each object, the CRUSH algorithm determines which OSD the object should be sent to over the public network.
• The cluster consists of OSDs (Ceph Object Storage Daemons: OSD #0, OSD #1, OSD #2, ...) and MONs (Ceph MONitor Daemons).
• Each object is replicated 3x across different OSDs for reliability.
• Inside each OSD, a transaction carrying data, attributes, and key-value data is handed to a Ceph storage backend (object store): FileStore (on XFS), KStore, or BlueStore.

*Figure courtesy: Sébastien Han, https://www.sebastien-han.fr
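To illustrate the striping step, here is a hedged sketch of mapping a block-device offset to a 4 MB object; the `rbd_object_for` helper and the naming scheme are illustrative assumptions, not the exact librbd logic:

```python
OBJECT_SIZE = 4 << 20  # RBD stripes the block device into 4 MB objects

def rbd_object_for(image_prefix: str, byte_offset: int):
    """Map a block-device byte offset to (object name, offset in object).

    Hypothetical helper for illustration; real librbd naming differs.
    """
    obj_index = byte_offset // OBJECT_SIZE
    obj_offset = byte_offset % OBJECT_SIZE
    return f"{image_prefix}.{obj_index:016x}", obj_offset

# A write at block-device offset 10 GiB maps to object 2560 (0xa00),
# offset 0 within that object.
name, off = rbd_object_for("rbd_data.abc123", 10 << 30)
print(name, off)  # rbd_data.abc123.0000000000000a00 0
```

Each such object name is then fed to CRUSH to pick the target OSDs.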
Ceph Storage Backends: (1) FileStore

Manages each object as a file.

Write flow in FileStore:
1. Write-ahead journaling, for consistency and performance
2. Performs the actual write to the file after journaling; this can be absorbed by the page cache
3. Calls syncfs and flushes journal entries every 5 seconds

<Breakdown of FileStore: objects, metadata, and attributes land on an XFS file system (HDDs or SSDs), producing Ceph data, Ceph metadata (LevelDB DB + WAL), FS metadata, and FS journal traffic; the Ceph journal lives on an NVMe SSD.>
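The write flow can be sketched as below; this is a hedged Python model of the three steps (all names are hypothetical, and the real FileStore is C++ inside the OSD daemon):

```python
import os
import time

class FileStoreSketch:
    """Toy model of FileStore's write path, not the real implementation."""

    def __init__(self, journal_fd: int, data_dir: str):
        self.journal_fd = journal_fd  # Ceph journal, e.g. on an NVMe SSD
        self.data_dir = data_dir      # object files on an XFS mount

    def write(self, obj: str, data: bytes) -> None:
        # 1. Write-ahead journaling: persist the transaction first,
        #    for both consistency and (sequential-write) performance.
        os.write(self.journal_fd, data)
        os.fsync(self.journal_fd)
        # 2. Apply the write to the object's file. No fsync here: the
        #    dirty pages can be absorbed by the page cache.
        with open(os.path.join(self.data_dir, obj), "ab") as f:
            f.write(data)

    def sync_loop(self) -> None:
        # 3. Every 5 seconds, sync the file system, after which the
        #    corresponding journal entries can be flushed (discarded).
        while True:
            time.sleep(5)
            os.sync()  # stand-in for syncfs(2) on the XFS mount
```

Note the double write: every byte hits the journal once and the file system once, before any replication.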
Ceph Storage Backends: (2) KStore

Uses existing key-value stores:
• Encapsulates everything into key-value pairs
• Supports LevelDB, RocksDB, and Kinetic Store

Write flow in KStore:
1. Simply calls key-value APIs with the key-value pair

<Breakdown of KStore: objects, metadata, and attributes all go through LevelDB or RocksDB (DB + WAL) on an XFS file system (HDDs or SSDs), producing Ceph data, Ceph metadata, compaction, FS metadata, and FS journal traffic.>
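A hedged sketch of that single step, with a plain dict standing in for LevelDB/RocksDB (the key layout is invented for illustration):

```python
class KStoreSketch:
    """Everything--data, metadata, attributes--becomes key-value pairs."""

    def __init__(self):
        # Stand-in for LevelDB/RocksDB. A real store also writes a WAL
        # and later rewrites data during compaction (hence the traffic).
        self.kv = {}

    def write(self, obj: str, offset: int, data: bytes, attrs: dict) -> None:
        batch = {
            f"data/{obj}/{offset:016x}": data,  # object payload
            f"meta/{obj}": b"size,mtime,...",   # object metadata
        }
        batch.update({f"attr/{obj}/{k}": v for k, v in attrs.items()})
        self.kv.update(batch)  # one atomic batch put via the key-value API
```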
Ceph Storage Backends: (3) BlueStore

Designed to avoid the limitations of FileStore:
• No double-write of data due to journaling: data is stored directly on the raw device

Write flow in BlueStore:
1. Puts data onto the raw block device
2. Sets metadata in RocksDB running on BlueFS (a user-level file system)

<Breakdown of BlueStore: objects go to the raw device (HDDs or SSDs) as Ceph data plus zero-filled data; metadata and attributes go to RocksDB (DB on BlueFS, NVMe SSD) and the RocksDB WAL (on BlueFS, NVMe SSD).>
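A hedged two-step sketch of this flow (the structure is simplified; real BlueStore is considerably more elaborate):

```python
MIN_EXTENT = 64 << 10  # BlueStore's minimum extent size is 64 KB

class BlueStoreSketch:
    """Data goes straight to the raw device; metadata to RocksDB on BlueFS."""

    def __init__(self, raw_dev, rocksdb):
        self.dev = raw_dev   # raw block device (HDD/SSD), no file system
        self.db = rocksdb    # RocksDB on BlueFS (NVMe SSD), with its own WAL
        self.next_free = 0   # sequential extent-allocation cursor

    def write(self, obj: str, data: bytes) -> None:
        # 1. Put the data directly onto the raw block device--no
        #    journal double-write of the data itself.
        offset = self.allocate_extent(len(data))
        self.dev.seek(offset)
        self.dev.write(data)
        # 2. Set the object -> extent metadata in RocksDB on BlueFS.
        self.db.put(f"onode/{obj}".encode(), offset.to_bytes(8, "little"))

    def allocate_extent(self, length: int) -> int:
        # Extents are handed out sequentially in 64 KB units; a request
        # shorter than an extent leaves a hole that gets zero-filled.
        offset = self.next_free
        self.next_free += -(-length // MIN_EXTENT) * MIN_EXTENT
        return offset
```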
Outline
Introduction
Background
Evaluation environment
Result & Analysis
Conclusion
H/W Evaluation Environment

OSD Servers (x4)
• Model: DELL R730
• Processor: Intel® Xeon® CPU E5-2640 v3
• Memory: 32 GB
• Storage: HGST UCTSSC600 600 GB x4, Samsung PM1633 960 GB x4, Intel® 750 series 400 GB x2

Admin Server / Client (x1)
• Model: DELL R730XD
• Processor: Intel® Xeon® CPU E5-2640 v3
• Memory: 128 GB

Switches (x2)
• Public network: DELL N4032, 10 Gbps Ethernet
• Storage network: Mellanox SX6012, 40 Gbps InfiniBand
S/W Evaluation Environment

Linux kernel 4.4.43
• With some modifications to collect and classify write requests

Ceph Jewel LTS (v10.2.5)

From the client side:
• Configure a 64 GiB KRBD volume
− 4 OSDs per OSD server (16 OSDs in total)
• Generate workloads using fio while collecting traces and diskstats

Two different workloads:
• Microbenchmark
• Long-term workload
Outline
Introduction
Background
Evaluation environment
Result & Analysis
Conclusion
Key Findings in Microbenchmark

FileStore
• Large WAF when the write request size is small (over 40 for 4 KB writes)
• WAF converges to 6 (3x from replication, 2x from Ceph journaling)

KStore
• Sudden WAF jumps due to memtable flushes
• WAF converges to 3 (from replication)

BlueStore
• Because the minimum extent size is 64 KB (see the sketch below):
− Zero-filled data is written for the hole within the object
− Small overwrites use WAL (Write-Ahead Logging) to guarantee durability
• WAF converges to 3 (from replication)
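To illustrate the BlueStore findings, here is a hedged model of where a small write's bytes end up given the 64 KB minimum extent size (the thresholds come from the slide; the decision logic is deliberately simplified):

```python
MIN_EXTENT = 64 << 10  # 64 KB minimum extent size

def bluestore_small_write(length: int, is_overwrite: bool) -> dict:
    """Simplified model: bytes written per category for one small request."""
    if is_overwrite:
        # Small overwrites are first logged to the RocksDB WAL to
        # guarantee durability, then applied--so they are written twice.
        return {"rocksdb_wal": length, "ceph_data": length}
    # A fresh small write still occupies a whole 64 KB extent; the hole
    # within the object is filled with zeroes.
    padded = -(-length // MIN_EXTENT) * MIN_EXTENT
    return {"ceph_data": length, "zero_filled": padded - length}

print(bluestore_small_write(4 << 10, is_overwrite=False))
# {'ceph_data': 4096, 'zero_filled': 61440}: 16x before replication
```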
Long-Term Workload Methodology

We focus on 4 KB random writes:
• In VDI, most writes are random and 4 KB in size

Workload scenario:
1. Install Ceph and create a KRBD partition
2. Drop the page cache, call sync, and wait for 600 seconds
3. Issue 4 KB random writes at QD=128 until the total write amount reaches 90% of the capacity (57.6 GiB)

Run the tests with 16 HDDs first, then repeat with 16 SSDs.

Calculate and break down WAF for each time period (see the measurement sketch below).
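One way to compute WAF from raw block-device counters is sketched below, using /proc/diskstats (the field layout is standard on Linux; the per-component classification in our results relies on kernel modifications not shown here, and the device names are hypothetical):

```python
def sectors_written(device: str) -> int:
    """Return the cumulative 'sectors written' counter for a device.

    In /proc/diskstats, field 3 is the device name and field 10 is
    sectors written; Linux counts sectors as 512 bytes.
    """
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])
    raise KeyError(device)

def waf(devices, host_bytes: float, baseline: dict) -> float:
    """WAF = bytes the devices wrote / bytes the client wrote."""
    device_bytes = sum((sectors_written(d) - baseline[d]) * 512
                       for d in devices)
    return device_bytes / host_bytes

# Usage: snapshot before the run, then compute after fio finishes.
osds = ["sdb", "sdc", "sdd", "sde"]          # hypothetical OSD devices
baseline = {d: sectors_written(d) for d in osds}
# ... run fio: 4 KiB random writes, QD=128, 57.6 GiB total ...
print(waf(osds, 57.6 * 2**30, baseline))
```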
Long-term Results: (1) FileStore

<Figure: write amount (GB per interval) and IOPS over time, broken down into Ceph data + Ceph metadata, Ceph journal, file system metadata, and file system journal; one panel for HDD, one for SSD>

HDD:
• Performance throttling due to slow HDDs
• Large fluctuation due to repeated throttling

SSD:
• No throttling: SAS SSDs with XFS are fast enough
• The Ceph journal absorbs the random writes first
Long-term Results: (2) KStore (RocksDB)

<Figure: write amount (GB per interval) and IOPS over time, broken down into Ceph data, Ceph metadata, compaction, file system metadata, and file system journal; one panel for HDD, one for SSD>

• Huge compaction traffic in both cases
• Fluctuation gets worse over time
Long-term Results: (3) BlueStore

<Figure: write amount (GB per interval) and IOPS over time, broken down into Ceph data, RocksDB, RocksDB WAL, compaction, and zero-filled data; one panel for HDD, one for SSD>

• Zero-filled data traffic is huge at first
• Since extents are allocated in sequential order, the first incoming writes become sequential writes
• SSDs show superior random write performance
Overall Results of WAF

<Figure: WAF breakdown for FileStore, KStore (RocksDB), and BlueStore, each on HDD and SSD; x-axis: WAF, 0–45 with a broken axis extending to 60–80; components: Ceph data, Ceph metadata, Ceph journal, compaction, zero-filled data, file system metadata, file system journal>

FileStore:
• WAF for the Ceph journal is about 6 in both cases, not 3 → the Ceph journal triples the write traffic
• FS metadata + FS journal = 4.2 (HDD), 4.054 (SSD)

KStore (RocksDB):
• Huge compaction traffic matters

BlueStore:
• WAF is still larger than FileStore's even without zero-filled data
• Compaction traffic for RocksDB is not negligible
Conclusion

Writes are amplified by more than 13x in Ceph
• No matter which storage backend is used

FileStore
• External Ceph journaling triples the write traffic
• File system overhead exceeds the original data traffic

KStore
• Suffers from huge compaction overhead

BlueStore
• Small write requests are logged in the RocksDB WAL
• Zero-filled data and compaction traffic are not negligible

Thank you!