pblk – the OCSSD FTL
Linux FAST Summit’18 • Javier González
Copyright © 2018 CNEX Labs
Read Latency with 0% Writes
[Figure: Read latency percentiles for a 4K random read workload]
Read Latency with 20% Writes
[Figure: Read latency percentiles for a 4K random read + 4K random write workload]
Significant outliers! Worst case is 30×: 4ms!
NAND Capacity Keeps Growing
▪ Capacity is outgrowing bandwidth
▪ Small form factors only aggravate the problem
Source: William Tidwell – The Harder Alternative: Managing NAND Capacity in the 3D Age
Host-based QoS is Best Effort
[Figure: Traditional SSD with LUN 0 … LUN N-1 shared among App 1 and App 2]
▪ Increased capacity forces schedulers to share resources that could previously be dedicated
▪ Applications optimize their internal I/O structures to “help” SSDs do a better job:
- Log-structured databases
- Journaled file systems
- Host garbage collection
▪ The SSD re-orders, re-schedules, and re-maps I/Os based on hints and patterns
▪ There is no good host/device interface for coordinated QoS
- Both sides end up fighting each other for QoS!
Open-Channel SSD Approach
[Figure: Traditional SSD vs. Open-Channel SSD, each exposing LUN 0 … LUN N-1 to App 1 and App 2]
▪ Tier 1 applications’ I/O patterns are:
- well understood
- heavily modified to be workload-optimized (with OS support too)
▪ Tier X (X > 1) applications can easily be compartmentalized
▪ Tier 1 and Tier X applications are forced to coexist in order to maximize resource utilization
Open-Channel Comparison
- Traditional SSD
- Open-Channel SSD (1.2): fully host-managed
- Open-Channel SSD (2.0): host-driven
Open-Channel SSD 2.0
▪ New NAND generations only require media-specific changes to be integrated in each SSD generation
▪ The interface with the host remains the same
- Advances as in NVMe: OCSSD 2.1, 2.2, etc.
▪ Host software is media-agnostic
- Media abstracted by a generic geometry
- Wear-level indexes and thresholds
▪ Media-specific actions through a feedback loop
- e.g., refresh data
Open-Channel SSD 2.0
1. Identification: Expose the geometry of the SSD
- Parallelism: # Channels, # LUNs, # Chunks, and number of LBAs within a chunk.
- Media timings – Read, write, and erase.
- Write requirements – Minimum write size and optimal write size
2. I/O submission: Richer I/O interfaces
- Support for vector I/O (R/W/E) using scatter/gather address list
- Support for NVMe read and write semantics (zoned devices)
- Continuous access – no maintenance windows
3. Host / SSD communication: Richer admin interfaces
- Chunk states through Report Chunk command (get log page)
• LBA start address
• Write pointer (host guarantees to write sequentially within a chunk)
• Block State (Free, Open, Full, Bad)
• Wear Index
- Active NAND management feedback loop using NVMe AER.
• Drive tells host to rewrite chunks when necessary.
Logical Block Address with Geometry Encoded
[Figure: LBA layout from MSB to LSB – Channel | LUN | Chunk | Sector]
[Figure: Solid-state drive internals – NVMe interface, media controller, Channel 0 and Channel 1, each holding multiple LUNs]
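Because the geometry is encoded directly in the LBA, address manipulation reduces to shifts and masks. Below is a minimal C sketch of encoding and decoding such an address; the struct name ocssd_addr_format and its bit-width fields are illustrative assumptions, not the exact OCSSD 2.0 identify layout.

#include <stdint.h>

/* Illustrative geometry bit widths; real values come from the device's
 * identify geometry data. */
struct ocssd_addr_format {
    unsigned int sec_bits;  /* sector within chunk */
    unsigned int chk_bits;  /* chunk within LUN */
    unsigned int lun_bits;  /* LUN within channel */
    unsigned int ch_bits;   /* channel */
};

/* Compose a geometry-encoded LBA: Channel | LUN | Chunk | Sector (MSB..LSB). */
static inline uint64_t ocssd_lba_encode(const struct ocssd_addr_format *f,
                                        uint64_t ch, uint64_t lun,
                                        uint64_t chk, uint64_t sec)
{
    return (ch  << (f->lun_bits + f->chk_bits + f->sec_bits)) |
           (lun << (f->chk_bits + f->sec_bits)) |
           (chk << f->sec_bits) |
            sec;
}

/* Split an LBA back into its geometry components. */
static inline void ocssd_lba_decode(const struct ocssd_addr_format *f,
                                    uint64_t lba, uint64_t *ch, uint64_t *lun,
                                    uint64_t *chk, uint64_t *sec)
{
    *sec = lba & ((1ULL << f->sec_bits) - 1);
    *chk = (lba >> f->sec_bits) & ((1ULL << f->chk_bits) - 1);
    *lun = (lba >> (f->sec_bits + f->chk_bits)) & ((1ULL << f->lun_bits) - 1);
    *ch  = (lba >> (f->sec_bits + f->chk_bits + f->lun_bits)) & ((1ULL << f->ch_bits) - 1);
}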
LightNVM Architecture
▪ NVMe Device Driver
- Detection of OCSSDs
- Implements support for the OCSSD commands
▪ LightNVM Subsystem
- Core functionality
- Target management (e.g., pblk)
- sysfs integration
▪ High-level I/O Interface
- Block device using pblk
- Application integration with liblightnvm
[Figure: LightNVM stack – Application(s) and File System in user/kernel space, on top of pblk, the LightNVM Subsystem, and the NVMe Device Driver, which reach the Open-Channel SSD through the geometry, vectored R/W/E, and optional scalar read/write (PPA addressing) interfaces]
pblk – Host-side Flash Translation Layer
▪ Multi-target support – I/O isolation
▪ Fully associative L2P table (4KB mapping)
▪ Host-side write buffer to guarantee reads and writes
▪ Cost-based garbage collector, using the valid sector count as its metric
▪ Capacity-based rate limiter, a function of the user and GC I/O present in the write buffer
▪ Scan-based L2P recovery: scan metadata in closed lines and the OOB area in open lines
▪ sysfs interface for statistics and tuning
pblk: Responsibilities and Location
▪ pblk: /drivers/lightnvm
▪ LightNVM: /drivers/lightnvm & /drivers/nvme/host/lightnvm.c & /include/linux/lightnvm.h
pblk: I/O path
[Figure: User I/O threads and the GC I/O thread enter via generic_make_rq; writes are staged in a ring write buffer (user data + write context) and mapped onto lines (Line 0 … Line N) across parallel units (P0 … PN) with a configurable mapping strategy that respects media constraints; reads go through an L2P lookup to the Open-Channel SSD]
▪ Write (buffered; sketched below):
1. Reserve space in the buffer
2. Copy user data
3. Save the write context
4. Complete the I/O to the block layer
▪ Submission path:
1. Map buffer entries (L2P) onto the current line
- Update the L2P table on wrap-up
2. Map metadata for the previous line
3. Map erases for the next line
4. Submit the I/O set
▪ Completion path:
1. Update buffer pointers
2. Deal with write/erase errors
▪ Read: L2P lookup
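A minimal C sketch of the buffered write step (reserve, copy, save context), assuming a simplified single-producer ring; the ring_wb / rb_entry names and sizes are illustrative, not pblk’s actual structures.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RB_ENTRIES 1024          /* illustrative ring size */
#define SECTOR_SZ  4096

struct rb_entry {
    uint64_t lba;                /* logical address of this 4KB sector (write context) */
    uint8_t  data[SECTOR_SZ];    /* staged user data */
};

struct ring_wb {
    struct rb_entry entries[RB_ENTRIES];
    unsigned int head;           /* next slot to fill (user/GC writers) */
    unsigned int tail;           /* next slot to map and submit (write thread) */
};

/* Steps 1-3 of the write path: reserve space, copy user data, save the
 * write context. Returns false when the buffer is full, in which case
 * the caller backs off (this is where the rate limiter kicks in). */
static bool wb_write_sector(struct ring_wb *rb, uint64_t lba, const void *buf)
{
    unsigned int next = (rb->head + 1) % RB_ENTRIES;

    if (next == rb->tail)
        return false;            /* buffer full */

    rb->entries[rb->head].lba = lba;
    memcpy(rb->entries[rb->head].data, buf, SECTOR_SZ);
    rb->head = next;
    /* Step 4: the I/O can now be completed to the block layer; the write
     * thread later maps buffered entries onto the current line. */
    return true;
}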
pblk: Garbage Collection
▪ Cost-based recycling based on the valid sector count (victim selection sketched after the figure)
- Wear-leveling being implemented (depends on 2.0)
▪ Naïve GC (current implementation)
- Requires a rate limiter to guarantee space for GC
- Introduces write amplification
- Unpredictable bandwidth (steady state)
▪ Hot / cold data separation (in progress)
- Improves write amplification
- Predictable steady state (static / dynamic over-provisioning)
- Uses LUN bandwidth as a natural rate limiter
- Dedicated GC write buffer – enables vector copy
▪ Two GC modes are available:
- Move data using the host’s CPU
- Use the vector copy command
• Moves data directly in the controller
[Figure: Hot / cold data separation – hot and cold data go to separate lines; new data and GC data are written into the host-usable area and the over-provisioned area, with either static or dynamic allocation of the over-provisioning]
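A minimal C sketch of cost-based victim selection: pick the closed line with the fewest valid sectors, since recycling it moves the least data. The line_demo structure is illustrative, not pblk’s actual line descriptor.

#include <stdbool.h>
#include <stddef.h>

struct line_demo {
    unsigned int id;
    unsigned int valid_secs;   /* valid sector count (VSC) for the line */
    bool closed;               /* only closed (fully written) lines are candidates */
};

/* Return the cheapest victim, or NULL if no closed line exists. */
static struct line_demo *gc_pick_victim(struct line_demo *lines, size_t nr_lines)
{
    struct line_demo *victim = NULL;
    size_t i;

    for (i = 0; i < nr_lines; i++) {
        if (!lines[i].closed)
            continue;
        if (!victim || lines[i].valid_secs < victim->valid_secs)
            victim = &lines[i];
    }
    return victim;
}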
pblk – Status
▪ Fairly stable for its age – targeting production in 2018
- All basic functionality implemented
▪ Ongoing features
- Hot / cold data separation
- RAILS: trade write bandwidth and capacity for latency – implemented by Heiner Litz
- Wear-leveling
- FTL log
▪ Generalization
- Can it be useful for append-only file systems to manage random-write areas (e.g., metadata)?
- Convert into a device mapper. Ideas?
- Port pblk to user space. Ideas?
▪ Integrations
- Implement data placement and scheduling in F2FS. Other proposals?
Open-Channel SSD Ecosystem – Status
▪ Active community
- Multiple drives in development by commercial SSD vendors
- Multiple contributions to open source
- Active research using Open-Channel SSDs
▪ Growing software stack
- LightNVM subsystem since Linux kernel 4.4
- User-space library (liblightnvm) support from Linux kernel 4.11
- pblk host FTL available from Linux kernel 4.12
▪ Joint Development Framework (consortium) being formed in 2018
- Apply industry input and standardize (CSPs, NAND vendors, SSD vendors, controller vendors)
- Result in the form of 2.1, 3.0, or something else?
pblk: Data Placement
▪ L2P table
- 4KB granularity (1GB per 1TB; see the arithmetic below)
▪ Pre-populated bitmap encoding the mapping (*)
- The bitmap encodes bad blocks and metadata
- Saves expensive calculations on the fast path (+1 vs. division/modulus)
- Trivial to change the striping strategy
▪ L2P mapping is decoupled from I/O scheduling
- Simplifies adding new mapping strategies
- Simplifies error handling
- Does not necessarily affect the on-disk format
- Default:
• Stripe across channels and LUNs to optimize for throughput
• Metadata at the beginning and end of each line
▪ (*) Missing patch for non-power-of-2 NAND configurations
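The 1GB-per-1TB figure follows directly from the mapping granularity, assuming 4-byte L2P entries (an assumption for this back-of-the-envelope estimate): 1 TB / 4 KB ≈ 2^28 sectors, and 2^28 entries × 4 B = 1 GiB of mapping state per terabyte of media.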
pblk: I/O Scheduling
▪ Goals
- Fully utilize the bandwidth of the media
• 1 core (E5-2620, 2.4GHz) can move ~3.7GB/s (~1M IOPS)
- Minimize the impact of reaching steady state (i.e., user + GC I/O)
- Rate-limit user and GC I/O according to the device’s capacity (sketched below)
▪ Single write thread
- Submits user write I/Os as buffer entries are mapped
- Submits write I/Os for the previous line’s metadata
• Aligned with user data to minimize disturbances
- Submits erase I/Os for the next line
• Aligned with user data to minimize disturbances
• Distributes the cost of erasing across all lines
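A minimal C sketch of capacity-based rate limiting: as free capacity shrinks, a larger share of the write budget is reserved for GC so it can reclaim space. The rl_demo structure and the thresholds are illustrative assumptions, not pblk’s actual policy.

struct rl_demo {
    unsigned int total_credits;   /* max outstanding 4KB writes in the buffer */
    unsigned int free_lines;
    unsigned int total_lines;
};

/* Credits granted to user writes; the remainder goes to GC writes. */
static unsigned int rl_user_credits(const struct rl_demo *rl)
{
    unsigned int free_pct = 100 * rl->free_lines / rl->total_lines;

    if (free_pct > 50)            /* plenty of space: no GC pressure */
        return rl->total_credits;
    if (free_pct > 20)            /* getting tight: split the budget */
        return rl->total_credits / 2;
    if (free_pct > 5)             /* nearly full: mostly GC */
        return rl->total_credits / 4;
    return 0;                     /* emergency: only GC writes proceed */
}

static unsigned int rl_gc_credits(const struct rl_demo *rl)
{
    return rl->total_credits - rl_user_credits(rl);
}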
pblk: Recovery
▪ Per-line metadata:
- Distributed log across lines (user / GC)
▪ smeta (start of line)
- Marks the line as “open” when it is allocated
- Gives the line a sequence number
- Creates a reverse line list
- Stores the LUNs forming the line
- Stores the active write LUNs
▪ emeta (end of line)
- Replicates smeta for consistency
- Stores the updated bad block bitmap for the line
- Stores the line’s portion of the L2P table (LBA list)
- Stores the valid sector count (VSC) for all lines
▪ Per-page metadata:
- 16 bytes per 4KB
- Stores the LBA mapped to each 4KB sector in the OOB area (8 bytes)
▪ Recovery: scan all lines and reconstruct the L2P table in order – first closed lines, then open lines (the metadata is sketched below)
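A minimal C sketch of the per-line and per-sector metadata listed above; the struct names, field widths, and MAX_LINES_DEMO constant are illustrative assumptions, not pblk’s on-media format.

#include <stdint.h>

#define MAX_LINES_DEMO 128

/* smeta: written when a line is opened. */
struct line_smeta_demo {
    uint32_t line_id;
    uint32_t seq_nr;                 /* sequence number for replay ordering */
    uint32_t prev_line_id;           /* reverse line list */
    uint64_t lun_bitmap;             /* LUNs forming the line */
    uint64_t active_lun_bitmap;      /* LUNs actively being written */
};

/* emeta: written when a line is closed. */
struct line_emeta_demo {
    struct line_smeta_demo smeta;    /* replicated for consistency */
    uint64_t bad_block_bitmap;       /* updated bad-block map for the line */
    uint32_t vsc[MAX_LINES_DEMO];    /* valid sector count for all lines */
    uint64_t lba_list[];             /* L2P portion: LBA of each sector in this line */
};

/* Per-sector out-of-band metadata (16 bytes per 4KB sector). */
struct sec_oob_demo {
    uint64_t lba;                    /* LBA mapped to this sector */
    uint64_t reserved;
};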
pblk: Debug, Tracing and Monitoring
▪ Monitor pblk’s state through sysfs
- /sys/class/nvme/nvme0/nvme0n1/lightnvm (static device information)
- /sys/block/$PBLK_BLOCK_DEV/pblk/
▪ Debug mode that allows sanity checks on all command submissions and internal state
- CONFIG_NVM_DEBUG=y
▪ Implementing trace points
- Better tool integration
- Less performance impact
▪ Implementing a pblk tool
- Equivalent to mkfs, but for an FTL
- Allows sanity checks, migration, recovery, etc.
- Uses liblightnvm
Multi-Tenant Workloads
[Figure: NVMe SSD vs. pblk on an Open-Channel SSD under 2 tenants (1W/1R), 4 tenants (3W/1R), and 8 tenants (7W/1R)]
pblk: getting started
▪ Instantiate pblk using the nvme-cli tool
- Example: sudo nvme lnvm create -d nvme0n1 -t pblk -n pblk0 -b 0 -e 127 -f
- Block device appears at /dev/pblk0
▪ QEMU
- OCSSD backend in QEMU; simulates controller/media constraints
- Repository: [email protected]:OpenChannelSSD/qemu-nvme.git
- Look at the options in hw/block/nvme.c
- nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=4,lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,ldebug=0
▪ CNEX SDK available for research and collaboration
Internals of an SSD
[Figure: Solid-state drive – the host interface (read/write) sits in front of the media controller, which manages Channel X / Channel Y and tens of parallel units via read/write/erase]
▪ Media controller responsibilities:
- Flash Translation Layer – transforms R/W/E into R/W
- Media error handling and retention management – ECC, RAID, retention
- Managing media constraints
▪ Media timings: read (50-100us), write (1-10ms), erase (3-15ms)
▪ Tens of parallel units!
Open-Channel SSD Benefits
▪ Allow software to innovate faster than hardware
- Decouple placement and scheduling from media management
- Workload-specific optimizations
▪ Rapid enablement of new NAND generations
- Reusing FTL logic on the host allows for faster time to market
- A decoupled architecture is less error-prone
▪ Support a broad set of applications on shared hardware
- Guarantee parallelism and I/O isolation
- No maintenance windows required
▪ Vendor neutrality and supply chain diversity
- Standardized specification supported by cloud and device vendors
- Similar model to standard NVMe
Status
▪ Active community
- Multiple drives in development by commercial SSD vendors
- Multiple contributions to open source
- Active research using Open-Channel SSDs
▪ Growing software stack
- LightNVM subsystem since Linux kernel 4.4
- User-space library (liblightnvm) support from Linux kernel 4.11
- pblk host FTL available from Linux kernel 4.12
▪ CNEX | Microsoft strategic collaboration on Open-Channel SSDs announced at FMS 2017
- Joint Development Framework (consortium) being formed in 2018
CNEX Labs, Inc.
Teaming with NAND Flash manufacturers and industry leaders in storage and networking to deliver the next big innovation for solid-state storage.
Copyright © 2018 CNEX Labs