Computational Storage over NVMe
FPGAs and Accelerating Storage Workloads
Sean Gibb, PhD, VP Software
NVMe Developer Days 2018 San Diego, CA
What is Computational Storage?
• SSDs, NICs and other storage devices have compute capabilities
• How do these devices currently inform the host about their capabilities?
• How does the host handle these devices?
[Figure: memory-centric compute. The host CPU connects to memory over DDR, to a Smart NVMe SSD and a Smart HBA over PCIe, and to Smart Remote Storage over a fabric (Ethernet? RDMA? TCP/IP? GenZ?).]
Composability and Computational Storage
• NVMe SSDs have drastically increased storage I/O bandwidth
• Storage and analytics workloads are increasingly taxing the host CPU
• Composable systems allow resources to be shared efficiently across the data center
• Computational storage is a step towards composable systems
• Moving compute to or near the data reduces latency and power
• FPGAs and other programmable systems provide compelling solutions
Source: Mark Carlson (SNIA)
NoLoad = NVMe Offload
[Figure: a NoLoad device (U.2 or COTS PCIe FPGA card) provides a service alongside storage (NVMe SSDs, HDDs, etc.) over a low-latency interconnect (PCIe).]
Why NVMe for Computational Storage?
• Storage offload requires:
  • Low latency
  • High throughput
  • Low CPU overhead
  • Multicore awareness
  • QoS awareness
• Storage offload benefits from:
  • Disaggregation
  • Peer-to-peer
• NVMe provides:
  • Low latency
  • High throughput
  • Low CPU overhead
  • Multicore awareness
  • QoS awareness
  • Disaggregation
  • Peer-to-peer capabilities
Why develop and maintain a custom driver when NVMe's capabilities align so well with offload engine needs, and world-class driver writers are already working on the NVMe driver for you? The real question is: "Why not NVMe?"
NoLoad Controller Architecture
• Host CPU communicates with offload engines via NVMe controller using standard NVMe commands
• NVMe controller pushes and pulls commands and data via DMA engine
• The NVMe controller is an in-house-developed soft controller running on a RISC-V core
• The board has external DDR for engines that require large data storage
• The controller supports a CMB for command queues and data (using a portion of the DDR)
• Developed an offload engine wrapper to handle the details of NVMe

[Figure: NoLoad™ Accelerator Board. The host CPU connects over PCIe to the FPGA, which contains the PCIe controller and DMA engine, the NVMe controller, the accelerators, the CMB, and a DDR controller attached to external DDR.]
NVMe for Storage Offload
• Presents as an NVMe 1.3 device with multiple namespaces
  • One namespace per offload engine
• Offload engines map to namespaces and are discovered using the Identify Namespace command
  • Vendor-specific fields provide offload-engine-specific information
• Offload engines are configured and initialized via in-situ data path configuration
  • Input data and configuration are transferred using NVMe Writes to the namespace associated with the offload engine
• Output data and status are retrieved using NVMe Reads from the namespace associated with the offload engine (a sketch of this flow follows the list)
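A minimal sketch of this flow on Linux with the inbox driver appears below. The device paths, namespace ID, and the meaning of the vendor-specific bytes are assumptions for illustration; only the Identify opcode (0x06, CNS 0) and the vendor-specific byte range (384-4095 in NVMe 1.3) come from the specification.

```c
/* Minimal sketch, assuming a NoLoad-style device at /dev/nvme0 whose
 * offload engine is namespace 1 (/dev/nvme0n1). Paths and the
 * vendor-specific layout are illustrative, not from the talk. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    /* 1. Discover the engine: Identify Namespace (opcode 0x06, CNS 0).
     * NVMe 1.3 reserves bytes 384-4095 of the returned structure as
     * vendor specific; engine-specific information can live there. */
    uint8_t id[4096] = { 0 };
    int fd = open("/dev/nvme0", O_RDONLY);
    struct nvme_admin_cmd cmd = {
        .opcode   = 0x06,                     /* Identify           */
        .nsid     = 1,                        /* engine's namespace */
        .addr     = (uint64_t)(uintptr_t)id,
        .data_len = sizeof(id),
        .cdw10    = 0,                        /* CNS 0: namespace   */
    };
    if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("identify");
        return 1;
    }
    printf("vendor-specific byte 384 = 0x%02x\n", id[384]);

    /* 2. Push input/configuration with NVMe Writes and pull output
     * and status with NVMe Reads: plain pwrite()/pread() on the
     * namespace block device suffices with the inbox driver. */
    int ns = open("/dev/nvme0n1", O_RDWR);
    if (ns < 0) { perror("open ns"); return 1; }

    char in[4096] = "input data for the offload engine";
    char out[4096];
    if (pwrite(ns, in, sizeof(in), 0) < 0)  perror("write input");
    if (pread(ns, out, sizeof(out), 0) < 0) perror("read output");

    close(ns);
    close(fd);
    return 0;
}
```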
NVMe for Storage Offload
• The in-house NVMe controller supports advanced features
  • Queue and data CMB, SGLs, NVMe-oF
• Supports peer-to-peer (P2P) operation
• No customized drivers required: all inbox drivers!
• Leverage industry-standard NVMe test tools
  • FIO and nvme-cli assist with deployment and benchmarking
• Take advantage of the rich NVMe ecosystem
  • Can leverage servers and storage systems developed for NVMe
libnoload
• Developed a user API to assist with common tasks associated with acceleration over NVMe
• Provides C and C++ libraries
  • Handles discovering NoLoad adapters and enumerating offload engines on the adapters
  • Provides support to lock/unlock offload engines
  • Provides thin wrappers over system calls for writing data to and reading results back from the offload engines
  • Handles seamless integration with our offload engine interface IP
• The API is BSD licensed
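The slides do not reproduce the libnoload API itself. Purely to illustrate the lock/write/read flow it wraps, here is a hypothetical sketch built on plain system calls; every noload_* name below is invented for this example and is not the real API.

```c
/* Hypothetical sketch only: the noload_* helpers are invented to
 * illustrate the flow libnoload wraps (open engine -> lock -> write
 * input -> read results); they are NOT the real library API. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Imagined "engine handle": the namespace block device fd, with
 * flock() standing in for the library's engine lock. */
static int noload_open_engine(const char *ns_dev)
{
    int fd = open(ns_dev, O_RDWR);
    if (fd >= 0 && flock(fd, LOCK_EX) != 0) { close(fd); fd = -1; }
    return fd;
}

static void noload_close_engine(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}

int main(void)
{
    char in[4096] = "job input", out[4096];

    int e = noload_open_engine("/dev/nvme0n1"); /* engine = namespace */
    if (e < 0) { perror("open engine"); return 1; }

    /* Thin wrappers over system calls: write input, read results. */
    if (pwrite(e, in, sizeof(in), 0) < 0)  perror("write");
    if (pread(e, out, sizeof(out), 0) < 0) perror("read");

    noload_close_engine(e);
    return 0;
}
```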
[Figure: software stack. Applications, management tools (nvme-cli, nvme-of, ...), SPDK, and libnoload sit in userspace, layered above the OS and the hardware.]
NVMe-over-Fabrics (NVMe-oF)
• NVMe-oF allows namespaces to be shared across networks
• Expose NVMe namespaces to client machines using inbox drivers
• NoLoad is a standard namespace:
• Can share it in the same way as any other NVMe device
[Figure: clients connect over an RDMA or TCP/IP network to servers hosting generic NVMe SSDs and a NoLoad™ U.2.]
NVMe-oF
• Clients request to borrow namespace(s) from server
• Recall that offload engines map to namespaces
• Client given access to the namespace (aka offload engine) over the connection
NVMe-oF
• Clients see newly acquired namespaces as local NVMe block devices
• Normal NVMe operations can be executed as if the resources were locally attached to the client machine (a connect sketch follows)
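Under the hood this is the standard Linux fabrics connect path: writing an options string to /dev/nvme-fabrics, which is what nvme-cli's connect command does. A minimal sketch follows; the transport, address, and NQN are placeholder values, not details from the talk.

```c
/* Sketch of an NVMe-oF connect via /dev/nvme-fabrics. On success the
 * kernel creates a new controller (e.g. /dev/nvme1) whose namespaces,
 * here offload engines, appear as ordinary local block devices. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder target over an RDMA (e.g. RoCE) transport; the
     * address and subsystem NQN are invented for this example. */
    const char *opts =
        "transport=rdma,traddr=192.168.1.10,trsvcid=4420,"
        "nqn=nqn.2018-01.example:noload-deflate";

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

    if (write(fd, opts, strlen(opts)) < 0)
        perror("connect");

    close(fd);
    return 0;
}
```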
Case Study: DEFLATE-oF
• Demonstrates GZIP compression-over-Fabrics
  • Both RoCE and TCP/IP networking
  • Both the NoLoad and a generic NVMe SSD are located on the remote server
  • U.2 accelerator form factor (Gen 3 x4)
• The local client running the application is unaware that it is using an over-Fabrics DEFLATE offload engine
  • User-space code is exactly the same whether direct-attached, over-Fabrics, or peer-to-peer
• Demonstrated at SC 2018
[Figure: an x86 client running the application connects via a NIC over a high-speed network to a server hosting the NoLoad™ U.2 and generic NVMe SSDs.]
Case Study: DEFLATE-oF
Case Study: P2P DEFLATE
• Demonstrates direct-attached GZIP compression in an enclosure
• Both NoLoads and generic NVMe SSDs located in an HPE DL385 AMD EPYC server
• U.2 accelerator form factor (Gen 3 x4)
• Enabled by the peer-to-peer (P2P) Linux framework, which has now been upstreamed into Linus' kernel
• Demonstrated at SC 2018

[Figure: 6 NoLoad™ U.2 devices and 18 generic NVMe SSDs.]
Case Study: P2P DEFLATE
• Achieved 150 Gb/s performance
  • 70x more efficient than software
  • Software: 42 Gb/s at 100% utilization across 2 sockets (64 physical cores)
  • It would take four 2U servers to match this throughput
• 4% CPU load
  • This small CPU load is dedicated to servicing the NVMe commands to the drives and NoLoad devices
• No load on CPU memory
  • Achieved thanks to P2P transfers
P2P DMA and Memory Usage
• NVMe CMB (Controller Memory Buffer) is a PCIe BAR that can be used for Submission and Completion Queues, PRPs, SGLs, and data
• PCIe drivers can register memory or request access to memory for DMA
  • P2P DMA allows us to bypass CPU DRAM (see the sketch below)
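The framework behind this is the pci_p2pdma kernel API merged in Linux 4.20. As a hedged kernel-side illustration (not the actual NoLoad or NVMe driver code; the BAR number and usage are assumptions), a driver could expose its CMB BAR to peer devices like this:

```c
/* Sketch: exposing a CMB BAR through the Linux P2P DMA framework
 * (kernel 4.20+). BAR number and error handling are illustrative. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static int expose_cmb(struct pci_dev *pdev)
{
	/* Register BAR 2 (assumed to back the CMB) as P2P-capable memory. */
	int ret = pci_p2pdma_add_resource(pdev, 2,
					  pci_resource_len(pdev, 2), 0);
	if (ret)
		return ret;

	/* Let other drivers (e.g. an NVMe-oF target) find and use it. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}

static void *cmb_buffer(struct pci_dev *pdev, size_t len)
{
	/* Buffers allocated here live in the device BAR, so DMA between
	 * two PCIe devices never touches host DRAM. */
	return pci_alloc_p2pmem(pdev, len);
}
```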
[Figure: traditional DMAs vs. P2P DMAs. Traditional: data moving between NVMe SSD A and NVMe SSD B bounces through CPU DRAM over DDR. P2P: in a JBOF with a PCIe switch, NVMe SSD A DMAs directly into the CMB of a NoLoad 250-U2 with compression, bypassing CPU DRAM.]
Computational Storage Standardization
• Standardization efforts are gaining momentum
  • Started with a BoF session at FMS 2018
  • SNIA recently ratified the Computational Storage Technical Working Group (TWG)
  • SNIA issued a press release about the Computational Storage TWG on November 12
  • Driving work in multiple standards
• Standardizing NVMe computational storage introduces standards-based ways of interfacing with PCIe-attached, storage-centric compute
• Standardization is in its early days
  • If you're interested, get involved by checking out www.snia.org/computational