Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters

Dipti Shankar ([email protected])
Advisor: Dhabaleswar K. (DK) Panda ([email protected])
Co-Advisor: Dr. Xiaoyi Lu ([email protected])
Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu
High-Performance Big Data (HiBD) Project, http://hibd.cse.ohio-state.edu

Introduction

• Key-Value Stores (e.g., Memcached) serve as the heart of many production-scale distributed systems and databases, accelerating Online and Offline Analytics in High-Performance Computing (HPC) environments
  (Figure: typical deployment — web frontend servers acting as Memcached clients, Memcached servers, and database servers connected over high-performance networks)
• Our basis: high-performance and hybrid key-value storage
  • Remote Direct Memory Access (RDMA) over high-performance network interconnects (e.g., InfiniBand, RoCE)
  • 'DRAM+NVMe/NVRAM' hybrid memory designs
• Research focus: designing a high-performance key-value storage system that can leverage (1) RDMA-capable networks, (2) heterogeneous I/O, and (3) compute capabilities on HPC clusters
• Goals: (1) end-to-end performance, (2) scalability, (3) resilience / high availability

Research Framework

• High-performance, hybrid, and resilient key-value storage targeting two classes of application workloads:
  • Online data processing (high-performance cache): e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads
  • Offline data analytics (burst-buffer and persistent store): e.g., burst-buffer over PFS for Hadoop MapReduce/Spark
• Design components layered over a heterogeneity-aware (DRAM/NVRAM/NVMe/PFS) key-value storage engine:
  • Non-blocking RDMA-aware API extensions
  • Fast online erasure coding (reliability)
  • NVRAM-aware communication protocols (persistence)
  • Accelerations on SIMD-based (e.g., GPU) architectures (scalability)
• Modern HPC system architecture: volatile/non-volatile memory technologies (DRAM, NVRAM), multi-core nodes (w/ SIMD units), GPUs, RDMA-capable networks (InfiniBand, 40GbE RoCE), parallel file system (Lustre), and local storage (PCIe/NVMe SSDs)

High-Performance Non-Blocking API Semantics

• Motivation: hybrid 'DRAM+PCIe/NVMe-SSD' key-value stores
  • Higher data retention; fast random reads
  • Performance limited by blocking API semantics
• Goal: achieve near in-memory speeds while being able to exploit hybrid memory
• Approach: novel non-blocking API semantics (a usage sketch follows this section)
  • Extensions to the RDMA-based Libmemcached library
  • memcached_(iset/bset/bget) for SET/GET operations; memcached_(test/wait) for progressing communication
  • Ability to overlap request and response phases; hide SSD I/O overheads
  (Figure: blocking vs. non-blocking API flow — client-side Libmemcached library and RDMA-enhanced communication library, server-side hybrid slab manager (RAM+SSD))
  (Figure 4: non-blocking Memcached API semantics — request/response timeline contrasting blocking SET/GET with non-blocking SET/GET for two key-value pairs)
• Results:
  • Average SET/GET latency breakdown across IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, and H-RDMA-Opt-NonB-i/b, covering miss penalty (backend DB access), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory); the non-blocking designs overlap SSD I/O access to reach near in-memory latency
  • Aggregate throughput (ops/sec) for read-only (100% GET) and write-heavy (50% GET : 50% SET) workloads comparing H-RDMA-Blocking, H-RDMA-NonB-iget, and H-RDMA-NonB-bget
  • Request/response overlap of 80-90%; up to 8x gain in overall latency vs. blocking API semantics
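Below is a minimal usage sketch of the non-blocking extensions listed above. memcached_iset() and memcached_wait() are the extension names given in this work, but the prototypes shown here and the memcached_request_st handle are assumptions modeled on libmemcached's blocking memcached_set(); the actual RDMA-Memcached (HiBD) signatures may differ.

    /*
     * Minimal sketch: overlapping two SETs with the non-blocking extensions.
     * NOTE: the iset/wait prototypes and memcached_request_st are ASSUMED
     * here for illustration; the real RDMA-Memcached signatures may differ.
     */
    #include <libmemcached/memcached.h>
    #include <string.h>

    void pipeline_two_sets(memcached_st *memc,
                           const char *k1, const char *v1, size_t v1_len,
                           const char *k2, const char *v2, size_t v2_len)
    {
        memcached_request_st req1, req2;   /* assumed per-request handles */

        /* Issue both SETs without blocking on the server replies. */
        memcached_iset(memc, k1, strlen(k1), v1, v1_len,
                       /*expiration=*/0, /*flags=*/0, &req1);
        memcached_iset(memc, k2, strlen(k2), v2, v2_len, 0, 0, &req2);

        /* Unrelated client work can run here: the request/response phases
         * of both operations (and any server-side SSD I/O) overlap with it. */

        /* Progress the communication and block until both complete. */
        memcached_wait(memc, &req1);
        memcached_wait(memc, &req2);
    }

Issuing both SETs before waiting lets the second request's transfer proceed while the server is still performing slab allocation or SSD writes for the first, which is the source of the request/response overlap reported above.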
Fast Online Erasure Coding with RDMA

• Erasure Coding (EC): a storage-efficient alternative to replication for resilience (storage accounting is sketched after this section)
  (Figure: design space — replication delivers high performance at a high memory overhead, no fault tolerance avoids the overhead, and the ideal scenario is both memory-efficient and high-performance)
  (Figure: resilience cost for Rep=3 vs. RS(3,2) — replication with 3 copies incurs 200% storage overhead, while erasure coding with RS(3,2) incurs 66% storage overhead)
• Goal: making online EC viable for key-value stores
• Bottlenecks: (1) encode/decode computation, (2) scattering/gathering the data/parity chunks
• Approach: non-blocking RDMA-aware semantics to enable compute/communication overlap
  • Encode/decode offloading integrated into the Memcached client (CE/CD) and server (SE/SD)
  (Figure: four offloading schemes — Client-Encode/Client-Decode, Server-Encode/Client-Decode, Server-Encode/Server-Decode, and Client-Encode/Server-Decode — showing a value D split into data chunks D1..D3 and parity chunks P1..P2 scattered/gathered across the KV server cluster, with T_comm overlapped with T_async_set/T_async_get)
• Results:
  • Total SET latency for value sizes 512 B to 1 MB on SDSC Comet (1 server + 100 clients over 32 nodes), comparing Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD (annotated gains: ~1.6x and ~2.8x)
  • Yahoo! Cloud Serving Benchmark (YCSB) aggregate throughput (Kops/sec) for YCSB-A and YCSB-B with 4K/16K/32K values, comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD (annotated gains: ~1.5x and ~1.34x); setups include SDSC Comet (150 YCSB clients over 10 nodes, 5-node Memcached server cluster) and an Intel Westmere + QDR-IB cluster, Rep=3 vs. RS(3,2)
  • Online EC vs. async. replication: (1) update-heavy: CE-CD outperforms, SE-CD is on par; (2) read-heavy: CE-CD and SE-CD are on par
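To make the storage argument concrete, the short program below works out the accounting behind the Rep=3 vs. RS(3,2) comparison above. The 1 MB value and the simple chunking are illustrative only; the actual designs perform Reed-Solomon encode/decode at the client or server (CE/CD, SE/SD) and overlap the chunk scatter/gather with non-blocking communication.

    /*
     * Storage accounting for RS(k,m) erasure coding vs. replication.
     * Illustrative only: splits one value into k data + m parity chunks
     * and compares the stored bytes against 3-way replication.
     */
    #include <stdio.h>

    int main(void)
    {
        const int k = 3, m = 2;             /* RS(3,2): 3 data + 2 parity chunks */
        const int replicas = 3;             /* replication factor for comparison */
        const size_t value_size = 1 << 20;  /* 1 MB value, as in the SET latency plot */

        size_t chunk = (value_size + k - 1) / k;      /* per-chunk payload        */
        size_t ec_total = (size_t)(k + m) * chunk;    /* data + parity stored     */
        size_t rep_total = (size_t)replicas * value_size;

        printf("RS(%d,%d): chunk=%zu B, stored=%zu B, overhead=%.0f%%\n",
               k, m, chunk, ec_total,
               100.0 * (ec_total - value_size) / value_size);
        printf("Rep=%d  : stored=%zu B, overhead=%.0f%%\n",
               replicas, rep_total,
               100.0 * (rep_total - value_size) / value_size);
        return 0;
    }

Running it prints roughly 67% overhead for RS(3,2) versus 200% for three-way replication, in line with the 66% vs. 200% comparison above.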
Co-Designing Key-Value Store-based Burst Buffer over PFS

• Motivation: Big Data I/O infrastructure (e.g., HDFS, Alluxio) vs. HPC storage (e.g., GPFS, Lustre); limited 'data locality' in HPC
• Bottleneck: heavy reliance on the PFS (limited I/O bandwidth) affects data-intensive Big Data applications
• Approach: a high-performance, hybrid, and resilient key-value store-based burst-buffer over Lustre (Boldio)
  • Hadoop I/O over Lustre via a transparent FileSystem plugin for Hadoop MapReduce/Spark
  • No dependence on local storage at the compute nodes
  • Resilience via asynchronous replication or online EC
  • RDMA-Memcached as burst-buffer servers + non-blocking client APIs for efficient I/O pipelines (a write-path sketch follows this section)
  (Architecture of Boldio: Hadoop I/O applications (MapReduce, Spark) use the Hadoop FileSystem class abstraction (LocalFileSystem) co-designed with BoldioFileSystem; the burst-buffer Libmemcached client couples the non-blocking API, ARPE (CE/CD/Rep), and an RDMA-enhanced communication engine; the Memcached server cluster combines a hybrid memory manager (RAM/SSD), ARPE (SE/SD), an RDMA-enhanced communication engine, and a persistence manager over the Lustre parallel file system (MDS/MDT, OSS/OST))
• Results:
  • TestDFSIO on the SDSC Gordon cluster (16-core Intel Sandy Bridge + IB QDR), 16-node Hadoop cluster + 4-node Boldio cluster; performance gains over designs like Alluxio (Tachyon) over PFS
  • Aggregate write/read throughput (MBps) for 60 GB and 100 GB datasets: Boldio vs. direct-over-Lustre, with ~6.7x (write) and ~3x (read) gains (8-core Intel Westmere + IB QDR, 8-node Hadoop cluster)
  • WordCount, InvIndx, CloudBurst, and Spark TeraGen on a 4-node Boldio cluster over Lustre: Lustre-Direct vs. Alluxio-Remote vs. Boldio (annotated gain: 21%)
  • DFSIO aggregate write/read throughput (MBps) on a 5-node Boldio cluster over Lustre: Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC; no EC overhead (Rep=3 vs. RS(3,2))
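As referenced above, here is a hypothetical sketch of how a Boldio-style burst-buffer write path could pipeline file blocks into the KV server cluster using the non-blocking SETs from the earlier sketch. The memcached_iset()/memcached_wait() prototypes and the memcached_request_st handle are the same assumptions as before; BLOCK_SIZE, WINDOW_DEPTH, the key-naming scheme, and boldio_write() itself are illustrative only and are not the actual BoldioFileSystem implementation.

    /*
     * Hypothetical sketch of a Boldio-style burst-buffer write path that
     * pipelines file blocks into the KV server cluster with non-blocking SETs.
     * The iset/wait prototypes, memcached_request_st, BLOCK_SIZE, WINDOW_DEPTH,
     * and the key naming scheme are all ASSUMED for illustration.
     */
    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE   (512 * 1024)   /* assumed burst-buffer block size */
    #define WINDOW_DEPTH 8              /* assumed max in-flight requests  */

    void boldio_write(memcached_st *memc, const char *file_id,
                      const char *data, size_t len)
    {
        memcached_request_st win[WINDOW_DEPTH];   /* assumed request handles */
        char key[128];
        size_t off = 0;
        int inflight = 0, blk = 0;

        while (off < len) {
            size_t n = (len - off < BLOCK_SIZE) ? (len - off) : BLOCK_SIZE;
            snprintf(key, sizeof(key), "%s:blk:%d", file_id, blk++);

            /* Issue the block write without waiting for the server reply. */
            memcached_iset(memc, key, strlen(key), data + off, n, 0, 0,
                           &win[inflight++]);
            off += n;

            /* Bound the window so client-side chunking overlaps with
             * server-side hybrid RAM/SSD persistence. */
            if (inflight == WINDOW_DEPTH) {
                for (int i = 0; i < WINDOW_DEPTH; i++)
                    memcached_wait(memc, &win[i]);
                inflight = 0;
            }
        }
        for (int i = 0; i < inflight; i++)   /* drain remaining requests */
            memcached_wait(memc, &win[i]);
    }

Keeping a small window of outstanding SETs lets client-side chunking overlap with server-side persistence to the hybrid RAM/SSD store and, eventually, Lustre.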
Exploring Opportunities with NVRAM and RDMA

• Emerging non-volatile memory technologies (NVRAM)
• Potential: byte-addressable and persistent; capable of RDMA
• Observation: RDMA writes into NVRAM need to guarantee remote durability
• Opportunity: RDMA-based persistence protocols for NVRAM systems
  (Figure: architecture of an NVRAM-based system — two nodes, Node0 and Node1, each acting as initiator/target with cores (L1/L2), shared L3, a memory controller attached to DRAM and NVRAM, and an I/O controller/HCA on the PCIe bus; the durability path covers RDMA into and from NVRAM, with the local durability point at the target)

Conclusion and Future Work

• The proposed framework enables key-value storage systems to exploit the capabilities of HPC clusters to maximize performance and scalability, while ensuring data resilience/availability
• It provides efficient non-blocking API semantics for designing efficient read/write pipelines, with resilience via RDMA-aware asynchronous replication and fast online EC
• Future work for this thesis (works-in-progress):
  • Explore opportunities for exploiting SIMD compute capabilities (e.g., GPU, AVX); end-to-end SIMD-aware key-value storage system designs
  • Co-design memory-centric data-intensive applications over key-value stores: (1) read-intensive graph-based workloads (e.g., LinkBench, RedisGraph), (2) a key-value store engine for Parameter Server frameworks for ML workloads

Software Distribution

• The RDMA-based Memcached and non-blocking API designs (RDMA-Memcached) proposed in this research are available to the community as part of the HiBD project: http://hibd.cse.ohio-state.edu/#memcached
• Micro-benchmarks and a YCSB plugin for RDMA-Memcached are available as part of the OSU HiBD Micro-benchmark Suite (OHB): http://hibd.cse.ohio-state.edu/#microbenchmarks

Acknowledgements

This research is supported in part by National Science Foundation grants #CNS-1513120, #IIS-1636846, and #CCF-1822987.

References

[1] D. Shankar, X. Lu, and D. K. Panda, "High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads", 37th International Conference on Distributed Computing Systems (ICDCS 2017)
[2] D. Shankar, X. Lu, and D. K. Panda, "Boldio: A Hybrid and Resilient Burst-Buffer Over Lustre for Accelerating Big Data I/O", 2016 IEEE International Conference on Big Data (IEEE BigData 2016) [Short Paper]
[3] D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, "High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits", 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016)
[4] D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, "Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads", 2015 IEEE International Conference on Big Data (IEEE BigData 2015) [Short Paper]
[5] D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, "Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL", 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015) [Poster]