Data Processing at the Speed of 100 Gbps using Apache Crail
Patrick Stuedi, IBM Research
The CRAIL Project: Overview

[Architecture diagram: a data processing framework (e.g., Spark, TensorFlow, λ compute)
sits on top of Crail Store through FS/HDFS/KV/Streaming interfaces and the Spark-IO,
Albis, and Pocket modules. Crail Store runs over RDMA, TCP, NVMeF, and SPDK on diverse
media (DRAM, NVMe, PCM, GPU, …), connected by a fast network, e.g., 100 Gbps RoCE
(100 Gbps bandwidth, 10 μs latency).]

Crail targets:
– fast sharing of ephemeral data
– shuffle/broadcast acceleration
– efficient storage of relational data
– data sharing for serverless applications
Outline
● Why CRAIL
● Crail Store
● Workload-specific I/O processing: file format, shuffle engine, serverless
● Use cases: disaggregation; workloads: SQL, machine learning
#1 Performance Challenge

Sorting application (HotNets’16): a reduce task fetches a chunk over the network and
then processes it. Each chunk traverses the entire stack:

[Diagram: the data processing framework (sorter, serializer) runs in the JVM on top of
Netty and sockets; the network path goes through TCP/IP, Ethernet, and the NIC; the
storage path goes through the filesystem, block layer, iSCSI, and the SSD.]

Software overheads are spread over the entire stack.
#2 Diversity

Diverse hardware technologies, complex programming APIs (e.g., SPDK, RDMA verbs), and many frameworks.
#3 Ephemeral Data

Ephemeral data has unique properties (e.g., a wide range of I/O sizes), both in
serverless (AWS Lambda) applications and in Spark applications.
CRAIL Approach (1)

Abstract hardware via a high-level storage interface:
– hardware-specific plugins (storage tiers)
– most I/O operations can conveniently be implemented on a storage abstraction
– open question: what is the right API? (FS, KV, ?)
CRAIL Approach (2)

Filesystem-like interface:
● Hierarchical namespace (see the sketch below)
– Helps organize data (shuffle, tmp, etc.) for different jobs
● Separate data and metadata planes
– Reading/writing involves a block metadata lookup
– Cheap on a low-latency network (a few μs)
– Flexible: data objects can be of arbitrary size
● Specific data types
– Key-value files: last create wins
– Shuffle files: efficient reading of multiple files in a directory
● Let applications control the details
– Data placement policy: which storage node or storage tier to use
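As a concrete illustration, the namespace can be driven through Crail's HDFS-compatible
adaptor. This is a minimal sketch assuming the adaptor is deployed and reachable under a
crail:// URI; the host, port, paths, and payload are placeholders, not a prescribed layout.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ShuffleOnCrail {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint for the Crail namenode (assumption; adjust to your deployment).
    val fs = FileSystem.get(new URI("crail://namenode:9060"), new Configuration())

    // One directory per shuffle stage, one file per map task.
    val dir = new Path("/shuffle/1")
    fs.mkdirs(dir)
    val out = fs.create(new Path(dir, "f0"))
    out.write("partition-0 data".getBytes("UTF-8"))
    out.close()

    // A reduce task enumerates the directory and reads every file in it.
    fs.listStatus(dir).foreach { status =>
      val in = fs.open(status.getPath)
      val buf = new Array[Byte](status.getLen.toInt)
      in.readFully(buf)
      in.close()
    }
    fs.close()
  }
}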
CRAIL Approach (3)

Careful software design:
● Leverage user-level APIs
– RDMA, NVMf, DPDK, SPDK
● Separate data from control operations
– Memory allocation, string parsing, etc.
● Efficient non-blocking operations (pattern sketched below)
– Avoid an army of threads; let the hardware do the work
● Leverage byte-addressable storage
– Transmit no more data than what is read/written
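The non-blocking pattern can be sketched independently of Crail's actual client API
(whose names and types differ): a single application thread issues many operations up
front and reaps completions afterwards, instead of parking one thread per request. The
issueRead helper below is a hypothetical stand-in for an asynchronous storage read.

import java.util.concurrent.CompletableFuture

object NonBlockingReads {
  // Hypothetical stand-in: in a real deployment the completion comes from the
  // hardware (RDMA/NVMf), not from a helper thread pool.
  def issueRead(blockId: Int): CompletableFuture[Array[Byte]] =
    CompletableFuture.supplyAsync(() => Array.fill[Byte](4096)(0))

  def main(args: Array[String]): Unit = {
    val pending = (0 until 64).map(issueRead)     // issue all reads without blocking
    val total = pending.map(_.join().length).sum  // reap completions afterwards
    println(s"read $total bytes from a single application thread")
  }
}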
Crail Store: Architecture

● Hierarchical namespace, multiple data types
● Distributed storage over DRAM and flash
● Files may span multiple storage tiers and can still be read like a single file
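Tiering is a deployment-time choice: the tiers a cluster offers are listed in Crail's
configuration, and the store fills earlier tiers before spilling to later ones. A hedged
crail-site.conf sketch, assuming the RDMA/DRAM and NVMf tier plugins shipped with Crail
(property and class names may differ across versions):

# crail-site.conf (sketch): DRAM tier first, NVMe-over-Fabrics flash tier second
crail.storage.types  org.apache.crail.storage.rdma.RdmaStorageTier,org.apache.crail.storage.nvmf.NvmfStorageTier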
Crail Store: Deployment Modes

[Diagram: application compute nodes, DRAM storage servers, flash storage servers, and a
metadata server, arranged in three modes: compute/storage co-located, compute/storage
disaggregated, and flash-only disaggregation.]
Crail Store: Read Throughput

[Figure, left: single-client (1 core) read throughput in Gbit/s vs. buffer size
(128 B - 1 MB), Crail vs. Alluxio on the DRAM tier. Right: throughput in GB/s vs.
buffer size (128 B - 512 KB) for NVMf direct, NVMf buffered, and DRAM buffered.]

Crail reaches line speed at an I/O size of 1 KB, with a single client running on one core only.
Crail Store: Read Latency

[Figure: read latency in μs vs. buffer size (4 B - 256 KB). Left, remote DRAM: Crail
(lookup & read, and lookup only) vs. RAMCloud reads from C and Java clients. Right:
remote NVMe SSD (3D XPoint) over NVMf.]

Crail remote read latencies (DRAM and NVM) are very close to the hardware latencies.
Metadata Server Scalability

[Figure: metadata IOPS (millions) vs. number of clients (0-70), for one, two, and four
namenodes and a varying number of network interfaces.]

A single metadata server can process 10M Crail lookup ops/sec; additional namenodes scale this further.
Running Workloads: MapReduce

(Crail stack as in the overview, with the Spark-IO shuffle module highlighted.)
Spark GroupBy (80M keys, 4K)

[Diagram: map tasks in executors SparkExec0..N hash-append records to per-reducer files
(/shuffle/1/f0..fN, /shuffle/2/f0..fN, /shuffle/3/f0..fN) in Crail; reduce tasks fetch a
whole bucket by reading a directory. map: hashAppend(file); reduce: fetchBucket(dir).]

[Figure: network throughput (Gbit/s) over elapsed time (seconds), vanilla Spark vs.
Spark/Crail with 1, 4, and 8 cores: Spark/Crail improves throughput by roughly 5x,
2.5x, and 2x, respectively.]
val pairs = sc.parallelize(1 to tasks, tasks).flatMap(_ => {
  var values = new Array[(Long, Array[Byte])](numKeys)
  values = initValues(values)
  values
}).cache()

pairs.groupByKey().count()
Sorting 12.8 TB on 128 Nodes

[Figure: per-task sorting time (sec) and network throughput (Gb/s) vs. task ID, vanilla
Spark vs. Spark/Crail.]

The Spark/Crail sorting rate is only 27% slower than the rate of the 2016 winner of the
sorting benchmark, a native C distributed sorting implementation.
DRAM & Flash Disaggregation

[Diagram: Spark executors SparkExec0..N on compute nodes, Crail storage nodes separate.
Bar chart: 200 GB sorting runtime in seconds, split into map and reduce phases, for
Crail/DRAM, Crail/NVMe, and vanilla/Alluxio (input/output/shuffle) configurations.]

Using Crail, a Spark 200 GB sorting workload can be run with memory and flash
disaggregated at no extra cost.
Running Workloads: SQL

(Crail stack as in the overview, with the Albis file format highlighted.)
Reading Relational Data

[Figure: read goodput in Gbps for common file formats.]

None of the common file formats delivers performance close to the hardware speed.
Revisiting Design Principles

● Traditional assumption: CPU is fast, I/O is slow
– Use compression, encoding, etc.
– Pack data and metadata together
– Avoid metadata lookups
● This assumption is a mismatch for fast I/O hardware!
● Albis: a new file format designed for fast I/O hardware
● Albis design principles (illustrated below)
– Avoid CPU pressure: no compression, encoding, etc.
– Simple metadata management
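To make the principle concrete, the sketch below stores rows as plain fixed-width binary
with no compression or encoding, so reading is bounded by I/O rather than CPU. This
illustrates the design principle only; it is not Albis's actual on-disk layout.

import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

object RawRows {
  case class Row(key: Long, value: Int) // fixed 12 bytes per row, no encoding

  def write(path: String, rows: Seq[Row]): Unit = {
    val out = new DataOutputStream(new FileOutputStream(path))
    rows.foreach { r => out.writeLong(r.key); out.writeInt(r.value) }
    out.close()
  }

  def read(path: String, n: Int): Seq[Row] = {
    val in = new DataInputStream(new FileInputStream(path))
    val rows = (0 until n).map(_ => Row(in.readLong(), in.readInt()))
    in.close()
    rows
  }
}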
Reading Relational Data with Albis

Albis on Crail delivers 2-30x performance improvements over the other formats.
Running Workloads: Serverless

(Crail stack as in the overview, with Pocket highlighted.)
Serverless Computing

● Data sharing is implemented via remote storage
– Enables fast and fine-grained scaling
● Problem: existing storage platforms are not suitable
– Slow (e.g., S3)
– No dynamic scaling (e.g., Redis)
– Designed for either small or large data sets
● Can we use Crail? Not as is.
– Most clouds don't support RDMA, NVMf, etc.
– Crail lacks automatic & elastic resource management
Pocket
● An elastic distributed data store for ephemeral data sharing in serverless analytics (usage pattern sketched below)
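The lambda-side usage pattern is a simple put/get exchange of named intermediates. The
interface below is a hypothetical stand-in, not Pocket's actual client library:

// Hypothetical sketch of the pattern: one lambda puts intermediate data,
// a downstream lambda gets it by name within the same job.
trait EphemeralStore {
  def put(jobId: String, key: String, data: Array[Byte]): Unit
  def get(jobId: String, key: String): Array[Byte]
}

def mapper(store: EphemeralStore, jobId: String): Unit =
  store.put(jobId, "shuffle/part-0", Array[Byte](1, 2, 3))

def reducer(store: EphemeralStore, jobId: String): Int =
  store.get(jobId, "shuffle/part-0").length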
Pocket: Resource Utilization

[Figure: execution time (sec) vs. resource usage ($/hr) for different storage allocations.]

Pocket dynamically right-sizes storage resources (nodes, media) to find a spot with a
good performance/price ratio, cost-effectively allocating resources based on
user/framework hints.
Conclusions

● Effectively using high-performance I/O hardware for data processing is challenging
● Crail is an attempt to re-think how data processing systems should interact with network and storage hardware:
– User-level I/O
– Storage disaggregation
– Memory/flash convergence
– Elastic resource provisioning
References

● Crail: A High-Performance I/O Architecture for Distributed Data Processing, IEEE Data Engineering Bulletin, 2017
● Albis: High-Performance File Format for Big Data Systems, USENIX ATC'18
● Navigating Storage for Serverless Computing, USENIX ATC'18
● Pocket: Ephemeral Storage for Serverless Analytics, OSDI'18 (to appear)
● Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit'17
● Serverless Machine Learning using Crail, Spark Summit'18
● Apache Crail: http://crail.apache.org