pblk – the OCSSD FTL
Linux FAST Summit’18 • Javier González
Copyright © 2018 CNEX Labs
Read Latency with 0% Writes
[Figure: Read latency percentiles for a 4K random read workload]
Read Latency with 20% Writes
[Figure: Read latency percentiles for a 4K random read + 4K random write workload]
Significant outliers! Worst case is 30×: 4ms!
NAND Capacity Keeps Growing
▪ Capacity is outgrowing bandwidth
▪ Small form factors only aggravate the problem
Source: William Tidwell – The Harder Alternative: Managing NAND Capacity in the 3D Age
Host-based QoS is Best Effort
[Figure: Traditional SSD with LUN 0 … LUN N-1 shared among App 1 and App 2]
▪ Increased capacity forces schedulers to share resources that could previously be dedicated
▪ Applications optimize their internal I/O structures to “help” SSDs do a better job:
- Log-structured databases
- Journaled file systems
- Host garbage collection
▪ The SSD re-orders, re-schedules, and re-maps I/Os based on hints and patterns
▪ There is no good host/device interface for coordinated QoS
- Both sides end up fighting each other for QoS!
Open-Channel SSD Approach
[Figure: Traditional SSD vs. Open-Channel SSD, each exposing LUN 0 … LUN N-1 to App 1 and App 2]
▪ Tier 1 applications’ I/O patterns are:
- well understood
- heavily modified to be workload-optimized (with OS support too)
▪ Tier X (X > 1) applications can easily be compartmentalized
▪ Tier 1 and Tier X applications are forced to coexist in order to maximize resource utilization
Open-Channel Comparison
- Traditional SSD
- Open-Channel SSD (1.2): fully host-managed
- Open-Channel SSD (2.0): host-driven
Open-Channel SSD 2.0
▪ New NAND generations only require media-specific changes to be integrated in each SSD generation
▪ The interface with the host remains the same
- Advances as in NVMe: OCSSD 2.1, 2.2, etc.
▪ Host software is media-agnostic
- Media abstracted by a generic geometry
- Wear-level indexes and thresholds
▪ Media-specific actions through a feedback loop
- e.g., refresh data
Open-Channel SSD 2.0
1. Identification: Expose the geometry of the SSD
- Parallelism: # Channels, # LUNs, # Chunks, and number of LBAs within a chunk.
- Media timings – Read, write, and erase.
- Write requirements – Minimum write size and optimal write size
2. I/O submission: Richer I/O interfaces
- Support for vector I/O (R/W/E) using scatter/gather address list
- Support for NVMe read and write semantics (zoned devices)
- Continuous access – no maintenance windows
3. Host / SSD communication: Richer admin interfaces
- Chunk states through Report Chunk command (get log page)
• LBA start address
• Write pointer (host guarantees to write sequentially within a chunk)
• Block State (Free, Open, Full, Bad)
• Wear Index
- Active NAND management feedback loop using NVMe AER.
• Drive tells host to rewrite chunks when necessary.
Logical Block Address with Geometry Encoded
[Figure: LBA layout from MSB to LSB – Channel | LUN | Chunk | Sector]
[Figure: Solid-state drive internals – NVMe interface, media controller, Channel 0 and Channel 1, each holding multiple LUNs]
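Because the geometry is encoded directly in the LBA, address manipulation reduces to shifts and masks. Below is a minimal C sketch of encoding and decoding such an address; the struct name ocssd_addr_format and its bit-width fields are illustrative assumptions, not the exact OCSSD 2.0 identify layout.

#include <stdint.h>

/* Illustrative geometry bit widths; real values come from the device's
 * identify geometry data. */
struct ocssd_addr_format {
    unsigned int sec_bits;  /* sector within chunk */
    unsigned int chk_bits;  /* chunk within LUN */
    unsigned int lun_bits;  /* LUN within channel */
    unsigned int ch_bits;   /* channel */
};

/* Compose a geometry-encoded LBA: Channel | LUN | Chunk | Sector (MSB..LSB). */
static inline uint64_t ocssd_lba_encode(const struct ocssd_addr_format *f,
                                        uint64_t ch, uint64_t lun,
                                        uint64_t chk, uint64_t sec)
{
    return (ch  << (f->lun_bits + f->chk_bits + f->sec_bits)) |
           (lun << (f->chk_bits + f->sec_bits)) |
           (chk << f->sec_bits) |
            sec;
}

/* Split an LBA back into its geometry components. */
static inline void ocssd_lba_decode(const struct ocssd_addr_format *f,
                                    uint64_t lba, uint64_t *ch, uint64_t *lun,
                                    uint64_t *chk, uint64_t *sec)
{
    *sec = lba & ((1ULL << f->sec_bits) - 1);
    *chk = (lba >> f->sec_bits) & ((1ULL << f->chk_bits) - 1);
    *lun = (lba >> (f->sec_bits + f->chk_bits)) & ((1ULL << f->lun_bits) - 1);
    *ch  = (lba >> (f->sec_bits + f->chk_bits + f->lun_bits)) & ((1ULL << f->ch_bits) - 1);
}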
LightNVM Architecture
▪ NVMe Device Driver
- Detection of OCSSDs
- Implements support for the OCSSD commands
▪ LightNVM Subsystem
- Core functionality
- Target management (e.g., pblk)
- sysfs integration
▪ High-level I/O Interface
- Block device using pblk
- Application integration with liblightnvm
[Figure: LightNVM stack – Application(s) and File System in user/kernel space, on top of pblk, the LightNVM Subsystem, and the NVMe Device Driver, which reach the Open-Channel SSD through the geometry, vectored R/W/E, and optional scalar read/write (PPA addressing) interfaces]
pblk – Host-side Flash Translation Layer
▪ Multi-target support – I/O isolation
▪ Fully associative L2P table (4KB mapping)
▪ Host-side write buffer to guarantee reads and writes
▪ Cost-based garbage collector, using the valid sector count as its metric
▪ Capacity-based rate limiter, a function of the user and GC I/O present in the write buffer
▪ Scan-based L2P recovery: scan metadata in closed lines and the OOB area in open lines
▪ sysfs interface for statistics and tuning
pblk: Responsibilities and Location
▪ pblk: /drivers/lightnvm
▪ LightNVM: /drivers/lightnvm & /drivers/nvme/host/lightnvm.c & /include/linux/lightnvm.h
pblk: I/O path
[Figure: User I/O threads and the GC I/O thread enter via generic_make_rq; writes are staged in a ring write buffer (user data + write context) and mapped onto lines (Line 0 … Line N) across parallel units (P0 … PN) with a configurable mapping strategy that respects media constraints; reads go through an L2P lookup to the Open-Channel SSD]
▪ Write (buffered; sketched below):
1. Reserve space in the buffer
2. Copy user data
3. Save the write context
4. Complete the I/O to the block layer
▪ Submission path:
1. Map buffer entries (L2P) onto the current line
- Update the L2P table on wrap-up
2. Map metadata for the previous line
3. Map erases for the next line
4. Submit the I/O set
▪ Completion path:
1. Update buffer pointers
2. Deal with write/erase errors
▪ Read: L2P lookup
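A minimal C sketch of the buffered write step (reserve, copy, save context), assuming a simplified single-producer ring; the ring_wb / rb_entry names and sizes are illustrative, not pblk’s actual structures.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RB_ENTRIES 1024          /* illustrative ring size */
#define SECTOR_SZ  4096

struct rb_entry {
    uint64_t lba;                /* logical address of this 4KB sector (write context) */
    uint8_t  data[SECTOR_SZ];    /* staged user data */
};

struct ring_wb {
    struct rb_entry entries[RB_ENTRIES];
    unsigned int head;           /* next slot to fill (user/GC writers) */
    unsigned int tail;           /* next slot to map and submit (write thread) */
};

/* Steps 1-3 of the write path: reserve space, copy user data, save the
 * write context. Returns false when the buffer is full, in which case
 * the caller backs off (this is where the rate limiter kicks in). */
static bool wb_write_sector(struct ring_wb *rb, uint64_t lba, const void *buf)
{
    unsigned int next = (rb->head + 1) % RB_ENTRIES;

    if (next == rb->tail)
        return false;            /* buffer full */

    rb->entries[rb->head].lba = lba;
    memcpy(rb->entries[rb->head].data, buf, SECTOR_SZ);
    rb->head = next;
    /* Step 4: the I/O can now be completed to the block layer; the write
     * thread later maps buffered entries onto the current line. */
    return true;
}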
pblk: Garbage Collection
▪ Cost-based recycling based on the valid sector count (victim selection sketched after the figure)
- Wear-leveling being implemented (depends on 2.0)
▪ Naïve GC (current implementation)
- Requires a rate limiter to guarantee space for GC
- Introduces write amplification
- Unpredictable bandwidth (steady state)
▪ Hot / cold data separation (in progress)
- Improves write amplification
- Predictable steady state (static / dynamic over-provisioning)
- Uses LUN bandwidth as a natural rate limiter
- Dedicated GC write buffer – enables vector copy
▪ Two GC modes are available:
- Move data using the host’s CPU
- Use the vector copy command
• Moves data directly in the controller
[Figure: Hot / cold data separation – hot and cold data go to separate lines; new data and GC data are written into the host-usable area and the over-provisioned area, with either static or dynamic allocation of the over-provisioning]
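A minimal C sketch of cost-based victim selection: pick the closed line with the fewest valid sectors, since recycling it moves the least data. The line_demo structure is illustrative, not pblk’s actual line descriptor.

#include <stdbool.h>
#include <stddef.h>

struct line_demo {
    unsigned int id;
    unsigned int valid_secs;   /* valid sector count (VSC) for the line */
    bool closed;               /* only closed (fully written) lines are candidates */
};

/* Return the cheapest victim, or NULL if no closed line exists. */
static struct line_demo *gc_pick_victim(struct line_demo *lines, size_t nr_lines)
{
    struct line_demo *victim = NULL;
    size_t i;

    for (i = 0; i < nr_lines; i++) {
        if (!lines[i].closed)
            continue;
        if (!victim || lines[i].valid_secs < victim->valid_secs)
            victim = &lines[i];
    }
    return victim;
}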
pblk – Status
▪ Fairly stable for its age – targeting production in 2018
- All basic functionality implemented
▪ Ongoing features
- Hot / cold data separation
- RAILS: trade write bandwidth and capacity for latency – implemented by Heiner Litz
- Wear-leveling
- FTL log
▪ Generalization
- Can it be useful for append-only file systems to manage random-write areas (e.g., metadata)?
- Convert into a device mapper. Ideas?
- Port pblk to user space. Ideas?
▪ Integrations
- Implement data placement and scheduling in F2FS. Other proposals?
Open-Channel SSD Ecosystem – Status
▪ Active community
- Multiple drives in development by commercial SSD vendors
- Multiple contributions to open source
- Active research using Open-Channel SSDs
▪ Growing software stack
- LightNVM subsystem since Linux kernel 4.4
- User-space library (liblightnvm) support from Linux kernel 4.11
- pblk host FTL available from Linux kernel 4.12
▪ Joint Development Framework (consortium) being formed in 2018
- Apply industry input and standardize (CSPs, NAND vendors, SSD vendors, controller vendors)
- Result in the form of 2.1, 3.0, or something else?
pblk: Data Placement
▪ L2P table
- 4KB granularity (1GB per 1TB; see the arithmetic below)
▪ Pre-populated bitmap encoding the mapping (*)
- The bitmap encodes bad blocks and metadata
- Saves expensive calculations on the fast path (+1 vs. division/modulus)
- Trivial to change the striping strategy
▪ L2P mapping is decoupled from I/O scheduling
- Simplifies adding new mapping strategies
- Simplifies error handling
- Does not necessarily affect the on-disk format
- Default:
• Stripe across channels and LUNs to optimize for throughput
• Metadata at the beginning and end of each line
▪ (*) Missing patch for non-power-of-2 NAND configurations
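The 1GB-per-1TB figure follows directly from the mapping granularity, assuming 4-byte L2P entries (an assumption for this back-of-the-envelope estimate): 1 TB / 4 KB ≈ 2^28 sectors, and 2^28 entries × 4 B = 1 GiB of mapping state per terabyte of media.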
pblk: I/O Scheduling
▪ Goals
- Fully utilize the bandwidth of the media
• 1 core (E5-2620, 2.4GHz) can move ~3.7GB/s (~1M IOPS)
- Minimize the impact of reaching steady state (i.e., user + GC I/O)
- Rate-limit user and GC I/O according to the device’s capacity (sketched below)
▪ Single write thread
- Submits user write I/Os as buffer entries are mapped
- Submits write I/Os for the previous line’s metadata
• Aligned with user data to minimize disturbances
- Submits erase I/Os for the next line
• Aligned with user data to minimize disturbances
• Distributes the cost of erasing across all lines
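A minimal C sketch of capacity-based rate limiting: as free capacity shrinks, a larger share of the write budget is reserved for GC so it can reclaim space. The rl_demo structure and the thresholds are illustrative assumptions, not pblk’s actual policy.

struct rl_demo {
    unsigned int total_credits;   /* max outstanding 4KB writes in the buffer */
    unsigned int free_lines;
    unsigned int total_lines;
};

/* Credits granted to user writes; the remainder goes to GC writes. */
static unsigned int rl_user_credits(const struct rl_demo *rl)
{
    unsigned int free_pct = 100 * rl->free_lines / rl->total_lines;

    if (free_pct > 50)            /* plenty of space: no GC pressure */
        return rl->total_credits;
    if (free_pct > 20)            /* getting tight: split the budget */
        return rl->total_credits / 2;
    if (free_pct > 5)             /* nearly full: mostly GC */
        return rl->total_credits / 4;
    return 0;                     /* emergency: only GC writes proceed */
}

static unsigned int rl_gc_credits(const struct rl_demo *rl)
{
    return rl->total_credits - rl_user_credits(rl);
}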
pblk: Recovery
▪ Per-line metadata:
- Distributed log across lines (user / GC)
▪ smeta (start of line)
- Marks the line as “open” when it is allocated
- Gives the line a sequence number
- Creates a reverse line list
- Stores the LUNs forming the line
- Stores the active write LUNs
▪ emeta (end of line)
- Replicates smeta for consistency
- Stores the updated bad block bitmap for the line
- Stores the line’s portion of the L2P table (LBA list)
- Stores the valid sector count (VSC) for all lines
▪ Per-page metadata:
- 16 bytes per 4KB
- Stores the LBA mapped to each 4KB sector in the OOB area (8 bytes)
▪ Recovery: scan all lines and reconstruct the L2P table in order – first closed lines, then open lines (the metadata is sketched below)
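A minimal C sketch of the per-line and per-sector metadata listed above; the struct names, field widths, and MAX_LINES_DEMO constant are illustrative assumptions, not pblk’s on-media format.

#include <stdint.h>

#define MAX_LINES_DEMO 128

/* smeta: written when a line is opened. */
struct line_smeta_demo {
    uint32_t line_id;
    uint32_t seq_nr;                 /* sequence number for replay ordering */
    uint32_t prev_line_id;           /* reverse line list */
    uint64_t lun_bitmap;             /* LUNs forming the line */
    uint64_t active_lun_bitmap;      /* LUNs actively being written */
};

/* emeta: written when a line is closed. */
struct line_emeta_demo {
    struct line_smeta_demo smeta;    /* replicated for consistency */
    uint64_t bad_block_bitmap;       /* updated bad-block map for the line */
    uint32_t vsc[MAX_LINES_DEMO];    /* valid sector count for all lines */
    uint64_t lba_list[];             /* L2P portion: LBA of each sector in this line */
};

/* Per-sector out-of-band metadata (16 bytes per 4KB sector). */
struct sec_oob_demo {
    uint64_t lba;                    /* LBA mapped to this sector */
    uint64_t reserved;
};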
pblk: Debug, Tracing and Monitoring
▪ Monitor pblk’s state through sysfs
- /sys/class/nvme/nvme0/nvme0n1/lightnvm (static device information)
- /sys/block/$PBLK_BLOCK_DEV/pblk/
▪ Debug mode that allows sanity checks on all command submissions and internal state
- CONFIG_NVM_DEBUG=y
▪ Implementing trace points
- Better tool integration
- Less performance impact
▪ Implementing a pblk tool
- Equivalent to mkfs, but for an FTL
- Allows sanity checks, migration, recovery, etc.
- Uses liblightnvm
Multi-Tenant Workloads
[Figure: NVMe SSD vs. pblk on an Open-Channel SSD under 2 tenants (1W/1R), 4 tenants (3W/1R), and 8 tenants (7W/1R)]
pblk: getting started
▪ Instantiate pblk using the nvme-cli tool
- Example: sudo nvme lnvm create -d nvme0n1 -t pblk -n pblk0 -b 0 -e 127 -f
- Block device appears at /dev/pblk0
▪ QEMU
- OCSSD backend in QEMU; simulates controller/media constraints
- Repository: [email protected]:OpenChannelSSD/qemu-nvme.git
- Look at the options in hw/block/nvme.c
- nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=4,lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,ldebug=0
▪ CNEX SDK available for research and collaboration
Internals of an SSD
[Figure: Solid-state drive – the host interface (read/write) sits in front of the media controller, which manages Channel X / Channel Y and tens of parallel units via read/write/erase]
▪ Media controller responsibilities:
- Flash Translation Layer – transforms R/W/E into R/W
- Media error handling and retention management – ECC, RAID, retention
- Managing media constraints
▪ Media timings: read (50-100us), write (1-10ms), erase (3-15ms)
▪ Tens of parallel units!
Open-Channel SSD Benefits
▪ Allow software to innovate faster than hardware
- Decouple placement and scheduling from media management
- Workload-specific optimizations
▪ Rapid enablement of new NAND generations
- Reusing FTL logic on the host allows for faster time to market
- A decoupled architecture is less error-prone
▪ Support a broad set of applications on shared hardware
- Guarantee parallelism and I/O isolation
- No maintenance windows required
▪ Vendor neutrality and supply chain diversity
- Standardized specification supported by cloud and device vendors
- Similar model to standard NVMe
Status
▪ Active community
- Multiple drives in development by commercial SSD vendors
- Multiple contributions to open source
- Active research using Open-Channel SSDs
▪ Growing software stack
- LightNVM subsystem since Linux kernel 4.4
- User-space library (liblightnvm) support from Linux kernel 4.11
- pblk host FTL available from Linux kernel 4.12
▪ CNEX | Microsoft strategic collaboration on Open-Channel SSDs announced at FMS 2017
- Joint Development Framework (consortium) being formed in 2018
CNEX Labs, Inc.
Teaming with NAND Flash manufacturers and industry leaders in storage and networking to deliver the next big innovation for solid-state storage.
Copyright © 2018 CNEX Labs