Understanding Write Behaviors of Storage Backends in Ceph Object Store

Dong-Yun Lee, Kisik Jeong, Sang-Hoon Han, Jin-Soo Kim, Joo-Young Hwang†, and Sangyeun Cho†
How Ceph Amplifies Writes

Ceph offers a block device service to the client ("I can provide a block device service for you!"). A single client write, however, turns into several kinds of I/O inside the object store:
• Ceph data: the client data itself
• Ceph metadata: describes the data
• Ceph journal: makes updates atomic and consistent
• File system metadata and file system journal: added underneath by the local file system (XFS)

In addition, every write is replicated for reliability, so all of the above is repeated on each replica OSD.

A single write can generate several hidden I/Os. What happens in the long term?
How Much Does Ceph Amplify Writes?

<Figure: WAF breakdown for FileStore (x-axis: WAF, 0–16)>
• Ceph data: 3
• Ceph metadata: 0.702
• Ceph journal: 5.994
• File system metadata: 0.419
• File system journal: 3.635

Writes are amplified by over 13x (see the arithmetic below).
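To make the arithmetic explicit, here is a minimal sketch that sums the breakdown, assuming each value above is that component's contribution to WAF, normalized to the volume the client wrote:

```python
# Per-component WAF contributions for FileStore, taken from the figure above.
breakdown = {
    "Ceph data": 3.0,        # 3 replicas of the client data
    "Ceph metadata": 0.702,
    "Ceph journal": 5.994,   # write-ahead journal traffic across replicas
    "FS metadata": 0.419,
    "FS journal": 3.635,
}

# WAF = total bytes reaching storage / bytes written by the client.
waf = sum(breakdown.values())
print(f"Total WAF = {waf:.2f}")  # Total WAF = 13.75 -> "over 13x"
```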
Our Motivation & Goal

How and why are writes so highly amplified in Ceph?
• We want the exact numbers

Why do we focus on write amplification?
• WAF (Write Amplification Factor) affects overall performance
• On SSDs, it shortens device lifespan
• Larger WAF → smaller effective bandwidth
• Redundant "journaling of journal" may exist

Goals of this paper:
• Understand the write behaviors of Ceph
• Empirically study write amplification in Ceph
Outline
Introduction
Background
Evaluation environment
Result & Analysis
Conclusion
Architecture of Ceph RADOS Block Device (RBD)

<Figure: Ceph RBD I/O path, from the client I/O stack down to the OSD internals*>
• The client I/O stack accesses the Ceph RADOS Block Device through /dev/rbd, backed by librbd and librados.
• Data written to the block device is striped into 4 MB objects.
• For each object, the CRUSH algorithm determines which OSD the object should be sent to over the public network.
• The cluster consists of OSDs (Ceph Object Storage Daemons: OSD #0, OSD #1, OSD #2, ...) and MONs (Ceph MONitor Daemons).
• Each object is replicated 3x across different OSDs for reliability.
• Inside each OSD, a transaction carrying data, attributes, and key-value data is handed to a Ceph storage backend (object store): FileStore (on XFS), KStore, or BlueStore.

*Figure courtesy: Sébastien Han, https://www.sebastien-han.fr
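To illustrate the striping step, here is a hedged sketch of mapping a block-device offset to a 4 MB object; the `rbd_object_for` helper and the naming scheme are illustrative assumptions, not the exact librbd logic:

```python
OBJECT_SIZE = 4 << 20  # RBD stripes the block device into 4 MB objects

def rbd_object_for(image_prefix: str, byte_offset: int):
    """Map a block-device byte offset to (object name, offset in object).

    Hypothetical helper for illustration; real librbd naming differs.
    """
    obj_index = byte_offset // OBJECT_SIZE
    obj_offset = byte_offset % OBJECT_SIZE
    return f"{image_prefix}.{obj_index:016x}", obj_offset

# A write at block-device offset 10 GiB maps to object 2560 (0xa00),
# offset 0 within that object.
name, off = rbd_object_for("rbd_data.abc123", 10 << 30)
print(name, off)  # rbd_data.abc123.0000000000000a00 0
```

Each such object name is then fed to CRUSH to pick the target OSDs.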
Ceph Storage Backends: (1) FileStore

Manages each object as a file.

Write flow in FileStore:
1. Write-ahead journaling, for consistency and performance
2. Performs the actual write to the file after journaling; this can be absorbed by the page cache
3. Calls syncfs and flushes journal entries every 5 seconds

<Breakdown of FileStore: objects, metadata, and attributes land on an XFS file system (HDDs or SSDs), producing Ceph data, Ceph metadata (LevelDB DB + WAL), FS metadata, and FS journal traffic; the Ceph journal lives on an NVMe SSD.>
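The write flow can be sketched as below; this is a hedged Python model of the three steps (all names are hypothetical, and the real FileStore is C++ inside the OSD daemon):

```python
import os
import time

class FileStoreSketch:
    """Toy model of FileStore's write path, not the real implementation."""

    def __init__(self, journal_fd: int, data_dir: str):
        self.journal_fd = journal_fd  # Ceph journal, e.g. on an NVMe SSD
        self.data_dir = data_dir      # object files on an XFS mount

    def write(self, obj: str, data: bytes) -> None:
        # 1. Write-ahead journaling: persist the transaction first,
        #    for both consistency and (sequential-write) performance.
        os.write(self.journal_fd, data)
        os.fsync(self.journal_fd)
        # 2. Apply the write to the object's file. No fsync here: the
        #    dirty pages can be absorbed by the page cache.
        with open(os.path.join(self.data_dir, obj), "ab") as f:
            f.write(data)

    def sync_loop(self) -> None:
        # 3. Every 5 seconds, sync the file system, after which the
        #    corresponding journal entries can be flushed (discarded).
        while True:
            time.sleep(5)
            os.sync()  # stand-in for syncfs(2) on the XFS mount
```

Note the double write: every byte hits the journal once and the file system once, before any replication.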
Ceph Storage Backends: (2) KStore

Uses existing key-value stores:
• Encapsulates everything into key-value pairs
• Supports LevelDB, RocksDB, and Kinetic Store

Write flow in KStore:
1. Simply calls key-value APIs with the key-value pair

<Breakdown of KStore: objects, metadata, and attributes all go through LevelDB or RocksDB (DB + WAL) on an XFS file system (HDDs or SSDs), producing Ceph data, Ceph metadata, compaction, FS metadata, and FS journal traffic.>
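A hedged sketch of that single step, with a plain dict standing in for LevelDB/RocksDB (the key layout is invented for illustration):

```python
class KStoreSketch:
    """Everything--data, metadata, attributes--becomes key-value pairs."""

    def __init__(self):
        # Stand-in for LevelDB/RocksDB. A real store also writes a WAL
        # and later rewrites data during compaction (hence the traffic).
        self.kv = {}

    def write(self, obj: str, offset: int, data: bytes, attrs: dict) -> None:
        batch = {
            f"data/{obj}/{offset:016x}": data,  # object payload
            f"meta/{obj}": b"size,mtime,...",   # object metadata
        }
        batch.update({f"attr/{obj}/{k}": v for k, v in attrs.items()})
        self.kv.update(batch)  # one atomic batch put via the key-value API
```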
Ceph Storage Backends: (3) BlueStore

Designed to avoid the limitations of FileStore:
• No double-write of data due to journaling: data is stored directly on the raw device

Write flow in BlueStore:
1. Puts data onto the raw block device
2. Sets metadata in RocksDB running on BlueFS (a user-level file system)

<Breakdown of BlueStore: objects go to the raw device (HDDs or SSDs) as Ceph data plus zero-filled data; metadata and attributes go to RocksDB (DB on BlueFS, NVMe SSD) and the RocksDB WAL (on BlueFS, NVMe SSD).>
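A hedged two-step sketch of this flow (the structure is simplified; real BlueStore is considerably more elaborate):

```python
MIN_EXTENT = 64 << 10  # BlueStore's minimum extent size is 64 KB

class BlueStoreSketch:
    """Data goes straight to the raw device; metadata to RocksDB on BlueFS."""

    def __init__(self, raw_dev, rocksdb):
        self.dev = raw_dev   # raw block device (HDD/SSD), no file system
        self.db = rocksdb    # RocksDB on BlueFS (NVMe SSD), with its own WAL
        self.next_free = 0   # sequential extent-allocation cursor

    def write(self, obj: str, data: bytes) -> None:
        # 1. Put the data directly onto the raw block device--no
        #    journal double-write of the data itself.
        offset = self.allocate_extent(len(data))
        self.dev.seek(offset)
        self.dev.write(data)
        # 2. Set the object -> extent metadata in RocksDB on BlueFS.
        self.db.put(f"onode/{obj}".encode(), offset.to_bytes(8, "little"))

    def allocate_extent(self, length: int) -> int:
        # Extents are handed out sequentially in 64 KB units; a request
        # shorter than an extent leaves a hole that gets zero-filled.
        offset = self.next_free
        self.next_free += -(-length // MIN_EXTENT) * MIN_EXTENT
        return offset
```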
Outline
Introduction
Background
Evaluation environment
Result & Analysis
Conclusion
H/W Evaluation Environment

OSD Servers (x4)
• Model: DELL R730
• Processor: Intel® Xeon® CPU E5-2640 v3
• Memory: 32 GB
• Storage: HGST UCTSSC600 600 GB x4, Samsung PM1633 960 GB x4, Intel® 750 series 400 GB x2

Admin Server / Client (x1)
• Model: DELL R730XD
• Processor: Intel® Xeon® CPU E5-2640 v3
• Memory: 128 GB

Switches (x2)
• Public network: DELL N4032, 10 Gbps Ethernet
• Storage network: Mellanox SX6012, 40 Gbps InfiniBand
S/W Evaluation Environment

Linux kernel 4.4.43
• With some modifications to collect and classify write requests

Ceph Jewel LTS (v10.2.5)

From the client side:
• Configure a 64 GiB KRBD volume
− 4 OSDs per OSD server (16 OSDs in total)
• Generate workloads using fio while collecting traces and diskstats

Two different workloads:
• Microbenchmark
• Long-term workload
Outline
Introduction
Background
Evaluation environment
Result & Analysis
Conclusion
Key Findings in Microbenchmark

FileStore
• Large WAF when the write request size is small (over 40 for 4 KB writes)
• WAF converges to 6 (3x from replication, 2x from Ceph journaling)

KStore
• Sudden WAF jumps due to memtable flushes
• WAF converges to 3 (from replication)

BlueStore
• Because the minimum extent size is 64 KB (see the sketch below):
− Zero-filled data is written for the hole within the object
− Small overwrites use WAL (Write-Ahead Logging) to guarantee durability
• WAF converges to 3 (from replication)
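To illustrate the BlueStore findings, here is a hedged model of where a small write's bytes end up given the 64 KB minimum extent size (the thresholds come from the slide; the decision logic is deliberately simplified):

```python
MIN_EXTENT = 64 << 10  # 64 KB minimum extent size

def bluestore_small_write(length: int, is_overwrite: bool) -> dict:
    """Simplified model: bytes written per category for one small request."""
    if is_overwrite:
        # Small overwrites are first logged to the RocksDB WAL to
        # guarantee durability, then applied--so they are written twice.
        return {"rocksdb_wal": length, "ceph_data": length}
    # A fresh small write still occupies a whole 64 KB extent; the hole
    # within the object is filled with zeroes.
    padded = -(-length // MIN_EXTENT) * MIN_EXTENT
    return {"ceph_data": length, "zero_filled": padded - length}

print(bluestore_small_write(4 << 10, is_overwrite=False))
# {'ceph_data': 4096, 'zero_filled': 61440}: 16x before replication
```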
Long-Term Workload Methodology

We focus on 4 KB random writes:
• In VDI, most writes are random and 4 KB in size

Workload scenario:
1. Install Ceph and create a KRBD partition
2. Drop the page cache, call sync, and wait for 600 seconds
3. Issue 4 KB random writes at QD=128 until the total write amount reaches 90% of the capacity (57.6 GiB)

Run the tests with 16 HDDs first, then repeat with 16 SSDs.

Calculate and break down WAF for each time period (see the measurement sketch below).
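One way to compute WAF from raw block-device counters is sketched below, using /proc/diskstats (the field layout is standard on Linux; the per-component classification in our results relies on kernel modifications not shown here, and the device names are hypothetical):

```python
def sectors_written(device: str) -> int:
    """Return the cumulative 'sectors written' counter for a device.

    In /proc/diskstats, field 3 is the device name and field 10 is
    sectors written; Linux counts sectors as 512 bytes.
    """
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])
    raise KeyError(device)

def waf(devices, host_bytes: float, baseline: dict) -> float:
    """WAF = bytes the devices wrote / bytes the client wrote."""
    device_bytes = sum((sectors_written(d) - baseline[d]) * 512
                       for d in devices)
    return device_bytes / host_bytes

# Usage: snapshot before the run, then compute after fio finishes.
osds = ["sdb", "sdc", "sdd", "sde"]          # hypothetical OSD devices
baseline = {d: sectors_written(d) for d in osds}
# ... run fio: 4 KiB random writes, QD=128, 57.6 GiB total ...
print(waf(osds, 57.6 * 2**30, baseline))
```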
Long-term Results: (1) FileStore

<Figure: write amount (GB per interval) and IOPS over time, broken down into Ceph data + Ceph metadata, Ceph journal, file system metadata, and file system journal; one panel for HDD, one for SSD>

HDD:
• Performance throttling due to slow HDDs
• Large fluctuation due to repeated throttling

SSD:
• No throttling: SAS SSDs with XFS are fast enough
• The Ceph journal absorbs the random writes first
Long-term Results: (2) KStore (RocksDB)

<Figure: write amount (GB per interval) and IOPS over time, broken down into Ceph data, Ceph metadata, compaction, file system metadata, and file system journal; one panel for HDD, one for SSD>

• Huge compaction traffic in both cases
• Fluctuation gets worse over time
Long-term Results: (3) BlueStore

<Figure: write amount (GB per interval) and IOPS over time, broken down into Ceph data, RocksDB, RocksDB WAL, compaction, and zero-filled data; one panel for HDD, one for SSD>

• Zero-filled data traffic is huge at first
• Since extents are allocated in sequential order, the first incoming writes become sequential writes
• SSDs show superior random write performance
Overall Results of WAF

<Figure: WAF breakdown for FileStore, KStore (RocksDB), and BlueStore, each on HDD and SSD; x-axis: WAF, 0–45 with a broken axis extending to 60–80; components: Ceph data, Ceph metadata, Ceph journal, compaction, zero-filled data, file system metadata, file system journal>

FileStore:
• WAF for the Ceph journal is about 6 in both cases, not 3 → the Ceph journal triples the write traffic
• FS metadata + FS journal = 4.2 (HDD), 4.054 (SSD)

KStore (RocksDB):
• Huge compaction traffic matters

BlueStore:
• WAF is still larger than FileStore's even without zero-filled data
• Compaction traffic for RocksDB is not negligible
Conclusion

Writes are amplified by more than 13x in Ceph
• No matter which storage backend is used

FileStore
• External Ceph journaling triples the write traffic
• File system overhead exceeds the original data traffic

KStore
• Suffers from huge compaction overhead

BlueStore
• Small write requests are logged in the RocksDB WAL
• Zero-filled data and compaction traffic are not negligible

Thank you!