Page 1: pblk – the OCSSD FTL

Copyright © 2018 CNEX Labs

Linux FAST Summit'18 • Javier González

Page 2: Read Latency with 0% Writes

[Figure: Random Read 4K – read latency vs. percentile.]

Page 3: Read Latency with 20% Writes

[Figure: Random Read 4K + Random Write 4K – read latency vs. percentile. Significant outliers: worst case 30X, up to 4 ms!]

Page 4: NAND capacity keeps growing

Source: William Tidwell – The Harder Alternative: Managing NAND capacity in the 3D age

▪ Capacity is outgrowing bandwidth

▪ Small form factors only aggravate the problem

Page 5: Host-based QoS is best effort

[Figure: Traditional SSD – LUN 0 … LUN N-1 shared by App 1 and two instances of App 2.]

▪ Increased capacity forces schedulers to share resources that could previously be dedicated

▪ Applications optimize their internal I/O structures to "help" SSDs do a better job:

- Log-structured databases

- Journaled File Systems

- Host Garbage Collection

▪ The SSD re-orders, re-schedules, and re-maps I/Os based on hints and patterns

▪ There is no good host/device interface for common QoS

- Both sides fight each other for QoS!

Page 6: Open-Channel SSD Approach

[Figure: Open-Channel SSD vs. traditional SSD – on the Open-Channel SSD, App 1 and the App 2 instances each own dedicated LUNs out of LUN 0 … LUN N-1; on the traditional SSD, all applications share all LUNs.]

▪ Tier 1 applications' I/O patterns are:

- well understood

- heavily modified to be workload-optimized (OS support too)

▪ Tier X (>1) applications can be easily compartmentalized

▪ Tier 1 and Tier X applications are forced to coexist in order to maximize resource utilization

Page 7: Open-Channel Comparison

[Figure: comparison of a traditional SSD, a fully host-managed Open-Channel SSD (1.2), and a host-driven Open-Channel SSD (2.0).]

Page 8: Open-Channel SSD 2.0

▪ New NAND generations only need to integrate media-specific changes on each SSD generation

▪ The interface with the host remains the same

- Advances as in NVMe: OCSSD 2.1, 2.2, etc.

▪ Host software is media-agnostic

- Media abstracted by generic geometry

- Wear-level indexes and thresholds

▪ Media-specific actions through feedback loop

- Refresh data

Page 9: Open-Channel SSD 2.0

1. Identification: Expose the geometry of the SSD

- Parallelism: # Channels, # LUNs, # Chunks, and number of LBAs within a chunk.

- Media timings – Read, write, and erase.

- Write requirements – Minimum write size and optimal write size

2. I/O submission: Richer I/O interfaces

- Support for vector I/O (R/W/E) using scatter/gather address list

- Support for NVMe read and write semantics (zoned devices)

- Continuous access – no maintenance windows

3. Host / SSD communication: Richer admin interfaces

- Chunk states through Report Chunk command (get log page)

• LBA start address

• Write pointer (host guarantees to write sequentially within a chunk)

• Block State (Free, Open, Full, Bad)

• Wear Index

- Active NAND management feedback loop using NVMe AER.

• Drive tells host to rewrite chunks when necessary.
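
The chunk state log can be pictured as an array of per-chunk descriptors along the lines of the C sketch below. The field names and layout are illustrative only; the OCSSD 2.0 specification defines the actual wire format.

/* Sketch of a per-chunk descriptor as returned by Report Chunk.
 * Layout and widths are illustrative, not the 2.0 wire format. */
#include <stdint.h>
#include <stdio.h>

enum chunk_state { CHUNK_FREE, CHUNK_OPEN, CHUNK_FULL, CHUNK_BAD };

struct chunk_desc {
    uint64_t slba;        /* LBA start address of the chunk            */
    uint64_t write_ptr;   /* next writable LBA; host appends past this */
    uint8_t  state;       /* Free, Open, Full, or Bad                  */
    uint8_t  wear_index;  /* relative wear of the chunk                */
};

int main(void)
{
    struct chunk_desc c = { .slba = 0, .write_ptr = 64,
                            .state = CHUNK_OPEN, .wear_index = 3 };

    printf("chunk@%llu: wp=%llu state=%u wear=%u\n",
           (unsigned long long)c.slba, (unsigned long long)c.write_ptr,
           c.state, c.wear_index);
    return 0;
}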

[Figure: Logical Block Address with geometry encoded – from MSB to LSB, the LBA packs Channel | LUN | Chunk | Sector. On the drive, the NVMe interface fronts a media controller with LUNs behind Channel 0 and Channel 1.]
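
To make the encoding concrete, here is a minimal C sketch of packing and unpacking such an LBA. The field widths are invented for illustration; a real host derives them from the geometry reported at identification instead of hard-coding them.

/* Sketch: pack/unpack an OCSSD 2.0-style LBA laid out MSB-to-LSB as
 * Channel | LUN | Chunk | Sector. Field widths are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define SEC_BITS 12   /* 4096 LBAs per chunk   (hypothetical) */
#define CHK_BITS 10   /* 1024 chunks per LUN   (hypothetical) */
#define LUN_BITS  3   /* 8 LUNs per channel    (hypothetical) */

static uint64_t lba_pack(uint64_t ch, uint64_t lun, uint64_t chk, uint64_t sec)
{
    return (ch << (LUN_BITS + CHK_BITS + SEC_BITS)) |
           (lun << (CHK_BITS + SEC_BITS)) |
           (chk << SEC_BITS) | sec;
}

int main(void)
{
    uint64_t lba = lba_pack(2, 5, 77, 130);

    printf("lba=0x%llx ch=%llu lun=%llu chk=%llu sec=%llu\n",
           (unsigned long long)lba,
           (unsigned long long)(lba >> (LUN_BITS + CHK_BITS + SEC_BITS)),
           (unsigned long long)((lba >> (CHK_BITS + SEC_BITS)) & ((1ULL << LUN_BITS) - 1)),
           (unsigned long long)((lba >> SEC_BITS) & ((1ULL << CHK_BITS) - 1)),
           (unsigned long long)(lba & ((1ULL << SEC_BITS) - 1)));
    return 0;
}

Because the geometry is encoded in the address itself, the host can target a specific parallel unit just by constructing the LBA.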

Page 10: LightNVM Architecture

NVMe Device Driver

- Detection of OCSSD

- Implement support for commands

LightNVM Subsystem

- Core functionality

- Target management (e.g., pblk)

- Sysfs integration

High-level I/O Interface

- Block device using pblk

- Application integration with liblightnvm

[Figure: LightNVM stack – application(s) and file system in user space; pblk, the LightNVM subsystem, and the NVMe device driver in kernel space; the Open-Channel SSD hardware below. Interfaces between the layers: geometry, PPA addressing, vectored R/W/E, and optional scalar read/write.]
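
For the user-space side, the sketch below opens a device with liblightnvm and prints its geometry. It assumes the liblightnvm API of this period (nvm_dev_open, nvm_dev_get_geo, nvm_geo_pr) and a device node at /dev/nvme0n1; check the library headers for the exact signatures.

/* Sketch: query an OCSSD's geometry from user space via liblightnvm. */
#include <stdio.h>
#include <liblightnvm.h>

int main(void)
{
    struct nvm_dev *dev = nvm_dev_open("/dev/nvme0n1"); /* assumed node */
    if (!dev) {
        perror("nvm_dev_open");
        return 1;
    }

    /* Dump channels, LUNs, chunks, sector size, etc. */
    nvm_geo_pr(nvm_dev_get_geo(dev));

    nvm_dev_close(dev);
    return 0;
}

Built against the library (e.g., cc geo.c -llightnvm), this prints the same geometry that pblk's data placement works from.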

Page 11: pblk – Host-side Flash Translation Layer

▪ Multi-target support – I/O isolation

▪ Fully associative L2P table (4KB mapping)

▪ Host-side write buffer to guarantee reads and writes (toy read-path sketch below)

▪ Cost-based garbage collector, using valid sector count as the metric

▪ Capacity-based rate limiter, a function of the user and GC I/O present in the write buffer

▪ Scan-based L2P recovery: scan metadata in closed lines and the OOB area in open lines

▪ sysfs interface for statistics and tuning
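
To illustrate how the fully associative L2P table and the host-side write buffer cooperate on reads, here is a toy read path: an entry either names a device address or, via a flag bit, a write-buffer slot, so reads of not-yet-persisted data are served from host memory. All names and sizes are invented; this is not pblk's actual layout.

/* Toy sketch (invented names): an L2P entry is either a device address
 * or, when the top bit is set, a slot in the host-side write buffer. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IN_BUFFER (1ULL << 63)

static uint64_t l2p[16];          /* lba -> ppa, or IN_BUFFER|slot */
static uint8_t  wbuf[4][4096];    /* tiny ring write buffer        */

/* Serve a 4KB read: copy from the buffer if the newest version still
 * lives there, otherwise return the device address to read from. */
static int read_sector(uint64_t lba, uint8_t *dst, uint64_t *ppa)
{
    uint64_t e = l2p[lba];

    if (e & IN_BUFFER) {
        memcpy(dst, wbuf[e & ~IN_BUFFER], 4096);
        return 0;                 /* hit in host memory          */
    }
    *ppa = e;
    return 1;                     /* caller issues a device read */
}

int main(void)
{
    uint8_t page[4096];
    uint64_t ppa = 0;

    memset(wbuf[1], 0xab, 4096);
    l2p[7] = IN_BUFFER | 1;       /* lba 7 still buffered, slot 1 */
    l2p[9] = 0x1234;              /* lba 9 already on media       */

    printf("lba 7 served from %s\n", read_sector(7, page, &ppa) ? "device" : "buffer");
    printf("lba 9 served from %s (ppa=0x%llx)\n",
           read_sector(9, page, &ppa) ? "device" : "buffer",
           (unsigned long long)ppa);
    return 0;
}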

Page 12: pblk: Responsibilities and Location

▪ pblk: /drivers/lightnvm

▪ LightNVM: /drivers/lightnvm & /drivers/nvme/host/lightnvm.c & /include/linux/lightnvm.h


Page 13: pblk: I/O path

[Figure: user I/O threads and the GC I/O thread enter through generic_make_rq. Writes land in a ring write buffer holding user data plus per-entry write context; a configurable mapping strategy places buffer entries onto lines (Line 0 … Line N, striped across parallel units P0 … PN) while respecting media constraints of the Open-Channel SSD.]

Buffered write path (sketched in C below):

1. Reserve space in buffer

2. Copy user data

3. Save write context

4. Complete I/O to block layer

Submission path:

1. Map buffer L2P in current line

- Update L2P table on wrap-up

2. Map metadata for previous line

3. Map erase for next line

4. Submit I/O set

Completion path:

1. Update buffer pointers

2. Deal with write/erase errors

Read path: L2P lookup (served from the write buffer or the device)
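
A minimal single-producer sketch of the buffered write path above: reserve a slot in the ring, copy the user data, record the write context, and complete the I/O to the block layer before anything reaches media. Names and sizes are illustrative, not pblk's.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ENTRIES 8                     /* ring size (illustrative) */

struct wctx { uint64_t lba; };        /* per-entry write context  */

static uint8_t     ring[ENTRIES][4096];
static struct wctx ctx[ENTRIES];
static unsigned    head, tail;        /* producer / consumer positions */

/* Steps 1-4 of the buffered write path for one 4KB sector. */
static int buffered_write(uint64_t lba, const void *data)
{
    if (head - tail == ENTRIES)
        return -1;                    /* 1. no space: apply back-pressure */

    unsigned slot = head % ENTRIES;
    memcpy(ring[slot], data, 4096);   /* 2. copy user data      */
    ctx[slot].lba = lba;              /* 3. save write context  */
    head++;
    printf("lba %llu completed to block layer (slot %u)\n",
           (unsigned long long)lba, slot);   /* 4. complete I/O  */
    return 0;
}

int main(void)
{
    uint8_t data[4096] = { 0 };

    buffered_write(42, data);
    buffered_write(43, data);
    /* A separate write thread would later map the slots between
     * tail and head onto the current line and submit them.      */
    return 0;
}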

Page 14: pblk: Garbage Collection

▪ Cost-based recycling mode based on valid sector count (victim selection sketched below)

- Wear-leveling being implemented (depending on 2.0)

▪ Naïve GC (current implementation)

- Requires rate-limiter to guarantee space for GC

- Introduces write amplification

- Unpredictable bandwidth (steady state)

▪ Hot / Cold data separation (in-progress)

- Improve write amplification

- Predictable steady state (static / dynamic)

- Use LUN bandwidth as a natural rate limiter

- GC dedicated write buffer – enable vector copy

▪ Two GC modes are available:

- Move data using the host’s CPU

- Use vector copy command

• Move data directly in the controller

[Figure: hot/cold data separation – new (hot) data and GC (cold) data are written to separate lines, and the over-provisioned area can be allocated statically or dynamically relative to the host-usable area.]
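
The victim selection referenced above can be sketched as follows: among closed lines, pick the one with the fewest valid sectors, which minimizes the data GC must move per reclaimed line. The structures are invented for illustration; pblk keeps the real valid sector counts in emeta and in memory.

#include <stdint.h>
#include <stdio.h>

struct line { int id; int closed; uint32_t vsc; }; /* vsc = valid sector count */

/* Cost-based policy: return the closed line with the lowest VSC. */
static struct line *gc_pick_victim(struct line *lines, int n)
{
    struct line *victim = NULL;

    for (int i = 0; i < n; i++) {
        if (!lines[i].closed)
            continue;
        if (!victim || lines[i].vsc < victim->vsc)
            victim = &lines[i];
    }
    return victim;                    /* NULL if nothing is reclaimable */
}

int main(void)
{
    struct line lines[] = {
        { 0, 1, 900 }, { 1, 1, 120 }, { 2, 0, 40 }, { 3, 1, 480 },
    };
    struct line *v = gc_pick_victim(lines, 4);

    if (v)
        printf("GC victim: line %d (%u valid sectors to move)\n", v->id, v->vsc);
    return 0;
}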

Page 15: pblk – Status

▪ Fairly stable for its age – targeting production in 2018

- All basic functionality implemented

▪ Ongoing features

- Hot / Cold data separation

- RAILS: Trade write bandwidth and capacity for latency – implemented by Heiner Litz

- Wear-leveling

- FTL log

▪ Generalization

- Can it be useful for append-only file systems to manage random areas (e.g., metadata)?

- Convert into device mapper. Ideas?

- Port pblk to user space. Ideas?

▪ Integrations

- Implement data placement and scheduling into F2FS. Other proposals?


Page 16: Open-Channel SSD Ecosystem – Status

▪ Active community

- Multiple drives in development by commercial SSD vendors

- Multiple contributions to open-source

- Active research using Open-Channel SSDs

▪ Growing software stack

- LightNVM subsystem since Linux kernel 4.4.

- User-space library (liblightnvm) support from Linux kernel 4.11.

- pblk host FTL available from Linux kernel 4.12.

▪ Joint Development Framework (consortium) being formed in 2018

- Apply industry input and standardize (CSPs, NAND vendors, SSD vendors, controller vendors)

- Result in the form of 2.1, 3.0, or something else?


Page 17: pblk – the OCSSD FTL

Linux FAST Summit'18 • Javier González

Page 18: pblk: Data Placement

▪ L2P table

- 4KB granularity (1GB per 1TB)

▪ Pre-populated bitmap encoding map (*)

- Bitmap encodes bad blocks and metadata

- Saves expensive calculations on the fast path (+1 vs. division/modulus); see the sketch below

- Trivial to change the striping strategy

▪ L2P mapping is decoupled from I/O scheduling

- Simplifies adding new mapping strategies

- Simplifies error handling

- Does not necessarily affect disk format

- Default:

• Stripe across channels and LUNs to optimize for throughput

• Metadata at beginning and ending of each line

▪ (*) missing patch for non-power-of-2 NAND configurations

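The "+1 vs. division/modulus" point, in toy form: precompute the usable offsets of a line once (skipping bad-block and metadata positions), so the fast path merely advances an index instead of deriving placement with divisions and modulus per sector. Entirely illustrative; pblk's real structure is a bitmap over the line.

#include <stdio.h>

#define LINE_SECS 16

/* One-time setup: collect the usable offsets of a line, skipping
 * positions occupied by bad blocks or line metadata. */
static int build_map(const int *skip, int *offsets)
{
    int n = 0;

    for (int off = 0; off < LINE_SECS; off++)
        if (!skip[off])
            offsets[n++] = off;
    return n;
}

int main(void)
{
    /* Hypothetical line: offsets 0-1 hold smeta, offset 9 is bad. */
    int skip[LINE_SECS] = { [0] = 1, [1] = 1, [9] = 1 };
    int offsets[LINE_SECS];
    int n = build_map(skip, offsets);

    /* Fast path: mapping the next sector is just "+1" into the table;
     * no division or modulus to rediscover the striping layout. */
    for (int cur = 0; cur < n && cur < 5; cur++)
        printf("sector %d -> line offset %d\n", cur, offsets[cur]);
    return 0;
}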

Page 19: pblk: I/O Scheduling

▪ Goals

- Fully utilize the bandwidth of the media

• 1 core (E5-2620, 2.4GHz) can move ~3.7GB/s (~1MIOPS)

- Minimize impact of reaching steady state (i.e., user + GC)

- Rate-limit user and GC I/O according to the device’s capacity

▪ Single write thread (loop sketched below)

- Submits user write I/Os as buffer entries are mapped

- Submits write I/Os for previous line metadata

• Align with user data to minimize disturbances

- Submits erase I/Os for next line

• Align with user data to minimize disturbances

• Distribute the cost of erasing across all lines

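The write-thread policy above, schematically: user data is submitted as it is mapped, the previous line's metadata write rides along with it, and the next line's erase is issued ahead of time, spreading the erase cost over all lines. The submit functions are stand-ins, not pblk's internals.

#include <stdio.h>

#define LINES 4

/* Stand-ins for the real submission primitives. */
static void submit_user_io(int line)   { printf("line %d: user writes\n", line); }
static void submit_line_meta(int line) { printf("line %d: metadata write\n", line); }
static void submit_erase(int line)     { printf("line %d: erase\n", line); }

int main(void)
{
    for (int line = 0; line < LINES; line++) {
        submit_user_io(line);             /* drain mapped buffer entries */
        if (line > 0)
            submit_line_meta(line - 1);   /* previous line's metadata,
                                             aligned with user data      */
        if (line + 1 < LINES)
            submit_erase(line + 1);       /* next line erased ahead of
                                             use; cost amortized across
                                             all lines                   */
    }
    return 0;
}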

Page 20: pblk: Recovery

▪ Per-line metadata:

- Distributed log across lines (user / GC)


▪ smeta

- Mark line as “open” when it is allocated

- Give line a sequence number

- Create a reverse line list

- Store the LUNs forming the line

- Store active write LUNs

▪ emeta

- Replicate smeta for consistency

- Store updated bad block bitmap for line

- Store the L2P portion for the line (LBA list)

- Store valid sector count (VSC) for all lines

▪ Per-page metadata:

- 16 bytes per 4KB

- Store the LBA mapped to each 4KB sector in the OOB area (8 bytes)

▪ Recovery: scan all lines and reconstruct the L2P table in order – first closed lines, then open lines (see the sketch below)
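
A compressed sketch of that ordered scan: sort lines by the sequence number recorded in smeta and replay their LBA lists oldest-first, so newer mappings naturally overwrite older ones in the reconstructed L2P table. The structures are invented; real recovery also consults the OOB area of open lines.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct rline {
    uint64_t seq;                 /* sequence number from smeta */
    int      nsec;
    uint64_t lbas[4];             /* emeta LBA list (toy size)  */
    uint64_t ppas[4];             /* where those LBAs live      */
};

static int by_seq(const void *a, const void *b)
{
    const struct rline *x = a, *y = b;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

int main(void)
{
    uint64_t l2p[16] = { 0 };
    struct rline lines[] = {
        { 7, 2, { 3, 4 }, { 100, 101 } },  /* newer line rewrote lba 3 */
        { 5, 2, { 3, 9 }, { 50, 51 } },
    };

    /* Replay oldest line first; newer mappings overwrite older ones. */
    qsort(lines, 2, sizeof(lines[0]), by_seq);
    for (int i = 0; i < 2; i++)
        for (int s = 0; s < lines[i].nsec; s++)
            l2p[lines[i].lbas[s]] = lines[i].ppas[s];

    printf("lba 3 -> ppa %llu (from the newer line, seq 7)\n",
           (unsigned long long)l2p[3]);
    return 0;
}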

Page 21: pblk: Debug, Tracing and Monitoring

▪ Monitor pblk’s state through sysfs

- /sys/class/nvme/nvme0/nvme0n1/lightnvm (static device information)

- /sys/block/$PBLK_BLOCK_DEV/pblk/

▪ Debug mode that allows sanity checks on all command submissions and internal state

- CONFIG_NVM_DEBUG=y

▪ Implementing tracepoints

- Better tool integration

- Less performance impact

▪ Implementing pblk tool

- Equivalent to mkfs, but for an FTL

- Allow sanity check, migration, recovery, etc.

- Use liblightnvm


Page 22: Multi-Tenant Workloads

[Figure: latency under multi-tenant workloads, NVMe SSD vs. pblk on OCSSD – 2 tenants (1W/1R), 4 tenants (3W/1R), 8 tenants (7W/1R).]

Page 23: pblk: getting started

▪ Instantiate pblk using nvme-cli tool

- Example: sudo nvme lnvm create -d nvme0n1 -t pblk -n pblk0 -b 0 -e 127 -f

- Block device in /dev/pblk0


▪ QEMU

- OCSSD backend in QEMU. Simulates controller/media constraints

- Repository: git@github.com:OpenChannelSSD/qemu-nvme.git

- Look at options in hw/block/nvme.c

- nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=4,lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,ldebug=0

▪ Available CNEX SDK for research and collaboration

Page 24: Internals of an SSD

[Figure: Solid-State Drive – a host interface (Read/Write) in front of a media controller driving channels X and Y of parallel units (Read/Write/Erase).]

Media controller responsibilities:

- Tens of parallel units!

- Flash Translation Layer: transforms R/W/E into R/W and manages media constraints

- Media error handling and retention management: ECC, RAID, retention

- Media timings: Read (50-100us), Write (1-10ms), Erase (3-15ms)

Page 25: Open-Channel SSD Benefits

▪ Allow software to innovate faster than hardware

- Decouple placement and scheduling from media management

- Workload-specific optimizations

▪ Rapid enablement of new NAND generations

- Reusing FTL logic in the host allows for faster time to market

- Decoupled architectures are less error-prone

▪ Support a broad set of applications on shared hardware

- Guarantee parallelism and I/O isolation

- Do not require maintenance windows

▪ Vendor neutrality and supply chain diversity

- Standardized specification supported by cloud and device vendors

- Similar model to standard NVMe



Page 29: Status

▪ Active community

- Multiple drives in development by commercial SSD vendors

- Multiple contributions to open-source

- Active research using Open-Channel SSDs

▪ Growing software stack

- LightNVM subsystem since Linux kernel 4.4.

- User-space library (liblightnvm) support from Linux kernel 4.11.

- pblk host FTL available from Linux kernel 4.12.

▪ CNEX | Microsoft strategic collaboration on Open-Channel SSDs announced at FMS 2017

- Joint Development Framework (consortium) being formed in 2018

Page 30: CNEX Labs, Inc.

Teaming with NAND Flash manufacturers and industry leaders in storage and networking to deliver the next big innovation for solid-state storage.