Computational Storage over NVMe
FPGAs and Accelerating Storage Workloads
Sean Gibb, PhD, VP Software
NVMe Developer Days 2018 San Diego, CA
What is Computational Storage?
• SSDs, NICs and other storage devices have compute capabilities
• How do these devices currently inform the host about their capabilities?
• How does the host handle these devices?
[Figure: memory-centric compute. The host CPU connects to memory over DDR, to a Smart NVMe SSD and a Smart HBA over PCIe, and to Smart Remote Storage over a fabric (Ethernet? RDMA? TCP/IP? GenZ?).]
Composability and Computational Storage
• NVMe SSDs have drastically increased storage I/O bandwidth
• Storage and analytics workloads are increasingly taxing the host CPU
• Composable systems allow resources to be shared efficiently across the data center
• Computational storage is a step towards composable systems
• Moving compute to or near the data reduces latency and power
• FPGAs and other programmable systems provide compelling solutions
Source: Mark Carlson (SNIA)
NoLoad = NVMe Offload
[Figure: a NoLoad device (U.2 or COTS PCIe FPGA card) provides a service alongside storage (NVMe SSDs, HDDs, etc.) over a low-latency interconnect (PCIe).]
Why NVMe for Computational Storage?
• Storage offload requires:
  • Low latency
  • High throughput
  • Low CPU overhead
  • Multicore awareness
  • QoS awareness
• Storage offload benefits from:
  • Disaggregation
  • Peer-to-peer
• NVMe provides:
  • Low latency
  • High throughput
  • Low CPU overhead
  • Multicore awareness
  • QoS awareness
  • Disaggregation
  • Peer-to-peer capabilities
Why develop and maintain a custom driver when NVMe's capabilities align so well with offload engine needs, and world-class driver writers are already working on the NVMe driver for you? The real question is: "Why not NVMe?"
NoLoad Controller Architecture
• Host CPU communicates with offload engines via NVMe controller using standard NVMe commands
• NVMe controller pushes and pulls commands and data via DMA engine
• The NVMe controller is an in-house-developed soft controller running on a RISC-V core
• The board has external DDR for engines that require large data storage
• The controller supports a CMB for command queues and data (using a portion of the DDR)
• Developed an offload engine wrapper to handle the details of NVMe

[Figure: NoLoad™ Accelerator Board. The host CPU connects over PCIe to the FPGA, which contains the PCIe controller and DMA engine, the NVMe controller, the accelerators, the CMB, and a DDR controller attached to external DDR.]
NVMe for Storage Offload
• Presents as an NVMe 1.3 device with multiple namespaces
  • One namespace per offload engine
• Offload engines map to namespaces and are discovered using the Identify Namespace command
  • Vendor-specific fields provide offload-engine-specific information
• Offload engines are configured and initialized via in-situ data path configuration
  • Input data and configuration are transferred using NVMe Writes to the namespace associated with the offload engine
• Output data and status are retrieved using NVMe Reads from the namespace associated with the offload engine (a sketch of this flow follows the list)
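A minimal sketch of this flow on Linux with the inbox driver appears below. The device paths, namespace ID, and the meaning of the vendor-specific bytes are assumptions for illustration; only the Identify opcode (0x06, CNS 0) and the vendor-specific byte range (384-4095 in NVMe 1.3) come from the specification.

```c
/* Minimal sketch, assuming a NoLoad-style device at /dev/nvme0 whose
 * offload engine is namespace 1 (/dev/nvme0n1). Paths and the
 * vendor-specific layout are illustrative, not from the talk. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    /* 1. Discover the engine: Identify Namespace (opcode 0x06, CNS 0).
     * NVMe 1.3 reserves bytes 384-4095 of the returned structure as
     * vendor specific; engine-specific information can live there. */
    uint8_t id[4096] = { 0 };
    int fd = open("/dev/nvme0", O_RDONLY);
    struct nvme_admin_cmd cmd = {
        .opcode   = 0x06,                     /* Identify           */
        .nsid     = 1,                        /* engine's namespace */
        .addr     = (uint64_t)(uintptr_t)id,
        .data_len = sizeof(id),
        .cdw10    = 0,                        /* CNS 0: namespace   */
    };
    if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("identify");
        return 1;
    }
    printf("vendor-specific byte 384 = 0x%02x\n", id[384]);

    /* 2. Push input/configuration with NVMe Writes and pull output
     * and status with NVMe Reads: plain pwrite()/pread() on the
     * namespace block device suffices with the inbox driver. */
    int ns = open("/dev/nvme0n1", O_RDWR);
    if (ns < 0) { perror("open ns"); return 1; }

    char in[4096] = "input data for the offload engine";
    char out[4096];
    if (pwrite(ns, in, sizeof(in), 0) < 0)  perror("write input");
    if (pread(ns, out, sizeof(out), 0) < 0) perror("read output");

    close(ns);
    close(fd);
    return 0;
}
```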
NVMe for Storage Offload
• The in-house NVMe controller supports advanced features
  • Queue and data CMB, SGLs, NVMe-oF
• Supports peer-to-peer (P2P) operation
• No customized drivers required: all inbox drivers!
• Leverage industry-standard NVMe test tools
  • FIO and nvme-cli assist with deployment and benchmarking
• Take advantage of the rich NVMe ecosystem
  • Can leverage servers and storage systems developed for NVMe
libnoload
• Developed a user API to assist with common tasks associated with acceleration over NVMe
• Provides C and C++ libraries
  • Handles discovering NoLoad adapters and enumerating offload engines on the adapters
  • Provides support to lock/unlock offload engines
  • Provides thin wrappers over system calls for writing data to and reading results back from the offload engines
  • Handles seamless integration with our offload engine interface IP
• The API is BSD licensed
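The slides do not reproduce the libnoload API itself. Purely to illustrate the lock/write/read flow it wraps, here is a hypothetical sketch built on plain system calls; every noload_* name below is invented for this example and is not the real API.

```c
/* Hypothetical sketch only: the noload_* helpers are invented to
 * illustrate the flow libnoload wraps (open engine -> lock -> write
 * input -> read results); they are NOT the real library API. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Imagined "engine handle": the namespace block device fd, with
 * flock() standing in for the library's engine lock. */
static int noload_open_engine(const char *ns_dev)
{
    int fd = open(ns_dev, O_RDWR);
    if (fd >= 0 && flock(fd, LOCK_EX) != 0) { close(fd); fd = -1; }
    return fd;
}

static void noload_close_engine(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}

int main(void)
{
    char in[4096] = "job input", out[4096];

    int e = noload_open_engine("/dev/nvme0n1"); /* engine = namespace */
    if (e < 0) { perror("open engine"); return 1; }

    /* Thin wrappers over system calls: write input, read results. */
    if (pwrite(e, in, sizeof(in), 0) < 0)  perror("write");
    if (pread(e, out, sizeof(out), 0) < 0) perror("read");

    noload_close_engine(e);
    return 0;
}
```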
[Figure: software stack. Applications, management tools (nvme-cli, nvme-of, ...), SPDK, and libnoload sit in userspace, layered above the OS and the hardware.]
NVMe-over-Fabrics (NVMe-oF)
• NVMe-oF allows namespaces to be shared across networks
• Expose NVMe namespaces to client machines using inbox drivers
• NoLoad is a standard namespace:
• Can share it in the same way as any other NVMe device
[Figure: clients connect over an RDMA or TCP/IP network to servers hosting generic NVMe SSDs and a NoLoad™ U.2.]
NVMe-oF
• Clients request to borrow namespace(s) from server
• Recall that offload engines map to namespaces
• Client given access to the namespace (aka offload engine) over the connection
NVMe-oF
• Clients see newly acquired namespaces as local NVMe block devices
• Normal NVMe operations can be executed as if the resources were locally attached to the client machine (a connect sketch follows)
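Under the hood this is the standard Linux fabrics connect path: writing an options string to /dev/nvme-fabrics, which is what nvme-cli's connect command does. A minimal sketch follows; the transport, address, and NQN are placeholder values, not details from the talk.

```c
/* Sketch of an NVMe-oF connect via /dev/nvme-fabrics. On success the
 * kernel creates a new controller (e.g. /dev/nvme1) whose namespaces,
 * here offload engines, appear as ordinary local block devices. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder target over an RDMA (e.g. RoCE) transport; the
     * address and subsystem NQN are invented for this example. */
    const char *opts =
        "transport=rdma,traddr=192.168.1.10,trsvcid=4420,"
        "nqn=nqn.2018-01.example:noload-deflate";

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

    if (write(fd, opts, strlen(opts)) < 0)
        perror("connect");

    close(fd);
    return 0;
}
```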
Case Study: DEFLATE-oF
• Demonstrates GZIP compression-over-Fabrics
  • Both RoCE and TCP/IP networking
  • Both the NoLoad and a generic NVMe SSD are located on the remote server
  • U.2 accelerator form factor (Gen 3 x4)
• The local client running the application is unaware that it is using an over-Fabrics DEFLATE offload engine
  • User-space code is exactly the same whether direct-attached, over-Fabrics, or peer-to-peer
• Demonstrated at SC 2018
[Figure: an x86 client running the application connects via a NIC over a high-speed network to a server hosting the NoLoad™ U.2 and generic NVMe SSDs.]
Case Study: DEFLATE-oF
Case Study: P2P DEFLATE
• Demonstrates direct-attached GZIP compression in an enclosure
• Both NoLoads and generic NVMe SSDs located in an HPE DL385 AMD EPYC server
• U.2 accelerator form factor (Gen 3 x4)
• Enabled by the peer-to-peer (P2P) Linux framework, which has now been upstreamed into Linus' kernel
• Demonstrated at SC 2018

[Figure: 6 NoLoad™ U.2 devices and 18 generic NVMe SSDs.]
Case Study: P2P DEFLATE
• Achieved 150 Gb/s performance
  • 70x more efficient than software
  • Software: 42 Gb/s at 100% utilization across 2 sockets (64 physical cores)
  • It would take four 2U servers to match this throughput
• 4% CPU load
  • This small CPU load is dedicated to servicing the NVMe commands to the drives and NoLoad devices
• No load on CPU memory
  • Achieved thanks to P2P transfers
P2P DMA and Memory Usage
• NVMe CMB (Controller Memory Buffer) is a PCIe BAR that can be used for Submission and Completion Queues, PRPs, SGLs, and data
• PCIe drivers can register memory or request access to memory for DMA
  • P2P DMA allows us to bypass CPU DRAM (see the sketch below)
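The framework behind this is the pci_p2pdma kernel API merged in Linux 4.20. As a hedged kernel-side illustration (not the actual NoLoad or NVMe driver code; the BAR number and usage are assumptions), a driver could expose its CMB BAR to peer devices like this:

```c
/* Sketch: exposing a CMB BAR through the Linux P2P DMA framework
 * (kernel 4.20+). BAR number and error handling are illustrative. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static int expose_cmb(struct pci_dev *pdev)
{
	/* Register BAR 2 (assumed to back the CMB) as P2P-capable memory. */
	int ret = pci_p2pdma_add_resource(pdev, 2,
					  pci_resource_len(pdev, 2), 0);
	if (ret)
		return ret;

	/* Let other drivers (e.g. an NVMe-oF target) find and use it. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}

static void *cmb_buffer(struct pci_dev *pdev, size_t len)
{
	/* Buffers allocated here live in the device BAR, so DMA between
	 * two PCIe devices never touches host DRAM. */
	return pci_alloc_p2pmem(pdev, len);
}
```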
[Figure: traditional DMAs vs. P2P DMAs. Traditional: data moving between NVMe SSD A and NVMe SSD B bounces through CPU DRAM over DDR. P2P: in a JBOF with a PCIe switch, NVMe SSD A DMAs directly into the CMB of a NoLoad 250-U2 with compression, bypassing CPU DRAM.]
Computational Storage Standardization
• Standardization efforts are gaining momentum
  • Started with a BoF session at FMS 2018
  • SNIA recently ratified the Computational Storage Technical Working Group (TWG)
  • SNIA issued a press release about the Computational Storage TWG on November 12
  • Driving work in multiple standards
• Standardizing NVMe computational storage introduces standards-based ways of interfacing with PCIe-attached, storage-centric compute
• Standardization is in its early days
  • If you're interested, get involved by checking out www.snia.org/computational