Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters
Dipti Shankar
Dr. Dhabaleswar K. Panda (Advisor), Dr. Xiaoyi Lu (Co-Advisor)
Department of Computer Science & Engineering, The Ohio State University
Outline
• Introduction and Problem Statement
• Research Highlights and Results
• Broader Impact
• Conclusion & Future Avenues
Key-Value Storage in HPC and Data Centers
• General-purpose distributed memory-centric storage
– Aggregates spare memory from multiple nodes (e.g., Memcached)
• Accelerating online and offline analytics in High-Performance Computing (HPC) environments
• Our basis: current high-performance and hybrid key-value stores for modern HPC clusters
– High-performance network interconnects (e.g., InfiniBand): low end-to-end latencies with IP-over-InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA)
– 'DRAM+SSD' hybrid memory designs: extend storage capacity beyond DRAM using high-speed SSDs
High-Performance Non-Blocking API Semantics
• Heterogeneous storage-aware key-value stores (e.g., 'DRAM + PCIe/NVMe-SSD')
– Higher data retention at the cost of SSD I/O; suitable for out-of-memory scenarios
– Performance limited by blocking API semantics
• Goal: Achieve near in-memory speeds while still exploiting hybrid memory
• Approach: Novel non-blocking API semantics extending the RDMA-Libmemcached library (see the sketch below)
– memcached_(iset/iget/bset/bget) APIs for SET/GET
– memcached_(test/wait) APIs for progressing communication
D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, "High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits", IPDPS 2016.
[Figure: Client/server architecture. The client-side Libmemcached library issues blocking or non-blocking (NonB-i/NonB-b, progressed via Test/Wait) API flows over an RDMA-enhanced communication library; the server pairs the RDMA-enhanced communication library with a hybrid slab manager (RAM+SSD).]
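To illustrate the intended usage, here is a minimal sketch of the non-blocking Set flow in C. The prototypes of memcached_iset/memcached_test/memcached_wait are assumptions written in the style of libmemcached; in particular, the request-handle argument is hypothetical and the actual RDMA-Libmemcached signatures may differ.

/* Sketch: issue a non-blocking SET, overlap computation, then poll.
 * memcached_iset/test/wait signatures are assumed, not authoritative. */
#include <stdint.h>
#include <string.h>
#include <libmemcached/memcached.h>

void nonblocking_set_example(memcached_st *memc)
{
    const char *key = "k1", *value = "v1";
    uint64_t req; /* hypothetical handle for the in-flight request */

    /* Return immediately after issuing the SET, before server-side SSD I/O */
    if (memcached_iset(memc, key, strlen(key), value, strlen(value),
                       (time_t)0, (uint32_t)0, &req) != MEMCACHED_SUCCESS)
        return;

    /* Overlap window: perform useful computation while the request progresses */

    /* Poll for completion; memcached_wait(memc, &req) would block instead */
    while (memcached_test(memc, &req) != MEMCACHED_SUCCESS) {
        /* ... more overlapped work ... */
    }
}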
High-Performance Non-Blocking API Semantics
• Set/Get Latency with Non-Blocking API: Up to 8x gain in overall latency vs. blocking API semantics over RDMA+SSD hybrid design
• Up to 2.5x gain in throughput observed at client; Ability to overlap request and response phases to hide SSD I/O overheads
[Figure: Average latency (us) of Set/Get for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b. Latency breakdown: miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory). The non-blocking variants overlap SSD I/O access and approach in-memory latency.]
Fast Online Erasure Coding with RDMA
• Erasure Coding (EC): a low storage-overhead alternative to replication
• Bottlenecks: (1) encode/decode compute overheads; (2) communication overhead of scattering/gathering distributed data/parity chunks
• Goal: Make online EC viable for key-value stores
• Approach: Non-blocking RDMA-aware semantics to enable compute/communication overlap (see the sketch below)
• Encode/decode offload capabilities integrated into the Memcached client (CE/CD) and server (SE/SD)
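A rough sketch of the client-encode (CE) path under these semantics, in C. Here rs_encode() is an illustrative stand-in for a Reed-Solomon encoder (e.g., one provided by a library such as Jerasure), the chunk-key scheme is invented for the example, and the non-blocking calls reuse the hypothetical iset/wait prototypes from the earlier sketch.

/* Sketch: client-side encode (CE) plus non-blocking scatter of k data
 * and m parity chunks; encoding of the next object can overlap with
 * the RDMA transfers of this one. rs_encode() is illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <libmemcached/memcached.h>

/* Illustrative encoder: splits buf into k data chunks, adds m parity chunks */
void rs_encode(const char *buf, size_t len, int k, int m,
               char *chunks[], size_t *chunk_len);

void ec_set_example(memcached_st *memc, const char *key,
                    const char *value, size_t len)
{
    enum { K = 3, M = 2 };           /* RS(3,2), as on the slide */
    char *chunks[K + M];
    size_t chunk_len;
    uint64_t reqs[K + M];            /* hypothetical request handles */

    /* Split the value into K data chunks and compute M parity chunks */
    rs_encode(value, len, K, M, chunks, &chunk_len);

    /* Scatter all K+M chunks with non-blocking SETs */
    for (int i = 0; i < K + M; i++) {
        char ckey[256];
        snprintf(ckey, sizeof(ckey), "%s:%d", key, i); /* illustrative keys */
        memcached_iset(memc, ckey, strlen(ckey), chunks[i], chunk_len,
                       (time_t)0, (uint32_t)0, &reqs[i]);
    }

    /* Overlap point: start encoding the next object here, then reap */
    for (int i = 0; i < K + M; i++)
        memcached_wait(memc, &reqs[i]);
}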
[Figure: EC performance trade-off, plotting performance against memory overhead: replication is high-performance but incurs high memory overhead, no fault tolerance (No FT) is memory-efficient, and the ideal scenario is both memory-efficient and high-performance.]
• RS(3,2) => 66% storage overhead vs. 200% for Replication-Factor=3
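As a quick check of these figures (standard erasure-coding accounting, not specific to this design), RS(k, m) adds m parity chunks per k data chunks, while r-way replication stores r - 1 extra copies:

\[
\text{overhead}_{RS(k,m)} = \frac{m}{k} \;\Rightarrow\; RS(3,2):\ \frac{2}{3} \approx 66\%,
\qquad
\text{overhead}_{rep(r)} = r - 1 \;\Rightarrow\; r = 3:\ 200\%.
\]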
Fast Online Erasure Coding with RDMA
D. Shankar, X. Lu, and D. K. Panda, "High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads", 37th International Conference on Distributed Computing Systems (ICDCS 2017).
• Experiments with YCSB: Online EC vs. Asynchronous Replication
– 150 clients on 10 nodes of the SDSC Comet cluster (IB FDR + 24-core Intel Haswell) against a 5-node RDMA-Memcached cluster
– (1) CE-CD gains ~1.34x for update-heavy workloads, with SE-CD on par; (2) CE-CD and SE-CD are on par for read-heavy workloads
[Figure: Latency comparison on an Intel Westmere + QDR-IB cluster, Rep=3 vs. RS(3,2).]
Co-Designing Key-Value Store-based Burst Buffer over PFS
• Offline data analytics use case: Boldio, a hybrid and resilient key-value store-based burst-buffer system over Lustre
– Overcomes local storage limitations on HPC nodes while retaining the performance benefits of data locality
– Lightweight, transparent interface to Hadoop/Spark applications
• Accelerating I/O-intensive Big Data workloads (see the sketch below)
– Non-blocking RDMA-Libmemcached APIs to maximize overlap
– Client-based replication or online erasure coding with RDMA for resilience
– Asynchronous persistence to the Lustre parallel file system at the RDMA-Memcached servers
D. Shankar, X. Lu, and D. K. Panda, "Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O", IEEE International Conference on Big Data 2016 (Short Paper).
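To make the write path concrete, here is a minimal client-side sketch in C, again reusing the hypothetical iset/wait extensions from the earlier sketch; the chunk size and key-naming scheme are illustrative, not Boldio's actual layout.

/* Sketch: stripe a file buffer into chunks, issue non-blocking SETs for
 * all chunks, then wait once. Servers persist to Lustre asynchronously,
 * keeping the parallel file system off the critical write path. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <libmemcached/memcached.h>

#define CHUNK_SIZE (512 * 1024)   /* illustrative chunk size */
#define MAX_CHUNKS 128            /* illustrative in-flight limit */

void boldio_write(memcached_st *memc, const char *file,
                  const char *buf, size_t len)
{
    uint64_t reqs[MAX_CHUNKS];    /* hypothetical request handles */
    size_t n = 0;

    for (size_t off = 0; off < len && n < MAX_CHUNKS; off += CHUNK_SIZE, n++) {
        char key[256];
        size_t sz = (len - off < CHUNK_SIZE) ? len - off : CHUNK_SIZE;
        snprintf(key, sizeof(key), "%s:%zu", file, n); /* file + chunk index */
        memcached_iset(memc, key, strlen(key), buf + off, sz,
                       (time_t)0, (uint32_t)0, &reqs[n]);
    }
    /* One wait for all in-flight chunks hides SSD/Lustre write latency */
    for (size_t i = 0; i < n; i++)
        memcached_wait(memc, &reqs[i]);
}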
Co-Designing Key-Value Store-based Burst Buffer over PFS
• TestDFSIO on SDSC Gordon Cluster (16-core Intel Sandy Bridge and IB QDR) with 16-node MapReduce Cluster + 4-node Boldio Cluster
• Boldio sustains 3x and 6.7x gains in read and write throughput, respectively, over stand-alone Lustre