Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters

Dipti Shankar ([email protected])
Advisor: Dhabaleswar K. (DK) Panda ([email protected])
Co-Advisor: Dr. Xiaoyi Lu ([email protected])
Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu
High-Performance Big Data (HiBD) Project, http://hibd.cse.ohio-state.edu

Introduction

• Key-Value Stores (e.g., Memcached) serve as the heart of many production-scale distributed systems and databases, accelerating Online and Offline Analytics in High-Performance Computing (HPC) environments
  (Figure: typical deployment — web frontend servers acting as Memcached clients, Memcached servers, and database servers connected over high-performance networks)
• Our basis: high-performance and hybrid key-value storage
  • Remote Direct Memory Access (RDMA) over high-performance network interconnects (e.g., InfiniBand, RoCE)
  • 'DRAM+NVMe/NVRAM' hybrid memory designs
• Research focus: designing a high-performance key-value storage system that can leverage (1) RDMA-capable networks, (2) heterogeneous I/O, and (3) compute capabilities on HPC clusters
• Goals: (1) end-to-end performance, (2) scalability, (3) resilience / high availability

Research Framework

• High-performance, hybrid, and resilient key-value storage targeting two classes of application workloads:
  • Online data processing (high-performance cache): e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads
  • Offline data analytics (burst-buffer and persistent store): e.g., burst-buffer over PFS for Hadoop MapReduce/Spark
• Design components layered over a heterogeneity-aware (DRAM/NVRAM/NVMe/PFS) key-value storage engine:
  • Non-blocking RDMA-aware API extensions
  • Fast online erasure coding (reliability)
  • NVRAM-aware communication protocols (persistence)
  • Accelerations on SIMD-based (e.g., GPU) architectures (scalability)
• Modern HPC system architecture: volatile/non-volatile memory technologies (DRAM, NVRAM), multi-core nodes (w/ SIMD units), GPUs, RDMA-capable networks (InfiniBand, 40GbE RoCE), parallel file system (Lustre), and local storage (PCIe/NVMe SSDs)

High-Performance Non-Blocking API Semantics

• Motivation: hybrid 'DRAM+PCIe/NVMe-SSD' key-value stores
  • Higher data retention; fast random reads
  • Performance limited by blocking API semantics
• Goal: achieve near in-memory speeds while being able to exploit hybrid memory
• Approach: novel non-blocking API semantics (a usage sketch follows this section)
  • Extensions to the RDMA-based Libmemcached library
  • memcached_(iset/bset/bget) for SET/GET operations; memcached_(test/wait) for progressing communication
  • Ability to overlap request and response phases; hide SSD I/O overheads
  (Figure: blocking vs. non-blocking API flow — client-side Libmemcached library and RDMA-enhanced communication library, server-side hybrid slab manager (RAM+SSD))
  (Figure 4: non-blocking Memcached API semantics — request/response timeline contrasting blocking SET/GET with non-blocking SET/GET for two key-value pairs)
• Results:
  • Average SET/GET latency breakdown across IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, and H-RDMA-Opt-NonB-i/b, covering miss penalty (backend DB access), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory); the non-blocking designs overlap SSD I/O access to reach near in-memory latency
  • Aggregate throughput (ops/sec) for read-only (100% GET) and write-heavy (50% GET : 50% SET) workloads comparing H-RDMA-Blocking, H-RDMA-NonB-iget, and H-RDMA-NonB-bget
  • Request/response overlap of 80-90%; up to 8x gain in overall latency vs. blocking API semantics
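Below is a minimal usage sketch of the non-blocking extensions listed above. memcached_iset() and memcached_wait() are the extension names given in this work, but the prototypes shown here and the memcached_request_st handle are assumptions modeled on libmemcached's blocking memcached_set(); the actual RDMA-Memcached (HiBD) signatures may differ.

    /*
     * Minimal sketch: overlapping two SETs with the non-blocking extensions.
     * NOTE: the iset/wait prototypes and memcached_request_st are ASSUMED
     * here for illustration; the real RDMA-Memcached signatures may differ.
     */
    #include <libmemcached/memcached.h>
    #include <string.h>

    void pipeline_two_sets(memcached_st *memc,
                           const char *k1, const char *v1, size_t v1_len,
                           const char *k2, const char *v2, size_t v2_len)
    {
        memcached_request_st req1, req2;   /* assumed per-request handles */

        /* Issue both SETs without blocking on the server replies. */
        memcached_iset(memc, k1, strlen(k1), v1, v1_len,
                       /*expiration=*/0, /*flags=*/0, &req1);
        memcached_iset(memc, k2, strlen(k2), v2, v2_len, 0, 0, &req2);

        /* Unrelated client work can run here: the request/response phases
         * of both operations (and any server-side SSD I/O) overlap with it. */

        /* Progress the communication and block until both complete. */
        memcached_wait(memc, &req1);
        memcached_wait(memc, &req2);
    }

Issuing both SETs before waiting lets the second request's transfer proceed while the server is still performing slab allocation or SSD writes for the first, which is the source of the request/response overlap reported above.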
Fast Online Erasure Coding with RDMA

• Erasure Coding (EC): a storage-efficient alternative to replication for resilience (storage accounting is sketched after this section)
  (Figure: design space — replication delivers high performance at a high memory overhead, no fault tolerance avoids the overhead, and the ideal scenario is both memory-efficient and high-performance)
  (Figure: resilience cost for Rep=3 vs. RS(3,2) — replication with 3 copies incurs 200% storage overhead, while erasure coding with RS(3,2) incurs 66% storage overhead)
• Goal: making online EC viable for key-value stores
• Bottlenecks: (1) encode/decode computation, (2) scattering/gathering the data/parity chunks
• Approach: non-blocking RDMA-aware semantics to enable compute/communication overlap
  • Encode/decode offloading integrated into the Memcached client (CE/CD) and server (SE/SD)
  (Figure: four offloading schemes — Client-Encode/Client-Decode, Server-Encode/Client-Decode, Server-Encode/Server-Decode, and Client-Encode/Server-Decode — showing a value D split into data chunks D1..D3 and parity chunks P1..P2 scattered/gathered across the KV server cluster, with T_comm overlapped with T_async_set/T_async_get)
• Results:
  • Total SET latency for value sizes 512 B to 1 MB on SDSC Comet (1 server + 100 clients over 32 nodes), comparing Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD (annotated gains: ~1.6x and ~2.8x)
  • Yahoo! Cloud Serving Benchmark (YCSB) aggregate throughput (Kops/sec) for YCSB-A and YCSB-B with 4K/16K/32K values, comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD (annotated gains: ~1.5x and ~1.34x); setups include SDSC Comet (150 YCSB clients over 10 nodes, 5-node Memcached server cluster) and an Intel Westmere + QDR-IB cluster, Rep=3 vs. RS(3,2)
  • Online EC vs. async. replication: (1) update-heavy: CE-CD outperforms, SE-CD is on par; (2) read-heavy: CE-CD and SE-CD are on par
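To make the storage argument concrete, the short program below works out the accounting behind the Rep=3 vs. RS(3,2) comparison above. The 1 MB value and the simple chunking are illustrative only; the actual designs perform Reed-Solomon encode/decode at the client or server (CE/CD, SE/SD) and overlap the chunk scatter/gather with non-blocking communication.

    /*
     * Storage accounting for RS(k,m) erasure coding vs. replication.
     * Illustrative only: splits one value into k data + m parity chunks
     * and compares the stored bytes against 3-way replication.
     */
    #include <stdio.h>

    int main(void)
    {
        const int k = 3, m = 2;             /* RS(3,2): 3 data + 2 parity chunks */
        const int replicas = 3;             /* replication factor for comparison */
        const size_t value_size = 1 << 20;  /* 1 MB value, as in the SET latency plot */

        size_t chunk = (value_size + k - 1) / k;      /* per-chunk payload        */
        size_t ec_total = (size_t)(k + m) * chunk;    /* data + parity stored     */
        size_t rep_total = (size_t)replicas * value_size;

        printf("RS(%d,%d): chunk=%zu B, stored=%zu B, overhead=%.0f%%\n",
               k, m, chunk, ec_total,
               100.0 * (ec_total - value_size) / value_size);
        printf("Rep=%d  : stored=%zu B, overhead=%.0f%%\n",
               replicas, rep_total,
               100.0 * (rep_total - value_size) / value_size);
        return 0;
    }

Running it prints roughly 67% overhead for RS(3,2) versus 200% for three-way replication, in line with the 66% vs. 200% comparison above.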
Co-Designing Key-Value Store-based Burst Buffer over PFS

• Motivation: Big Data I/O infrastructure (e.g., HDFS, Alluxio) vs. HPC storage (e.g., GPFS, Lustre); limited 'data locality' in HPC
• Bottleneck: heavy reliance on the PFS (limited I/O bandwidth) affects data-intensive Big Data applications
• Approach: a high-performance, hybrid, and resilient key-value store-based burst-buffer over Lustre (Boldio)
  • Hadoop I/O over Lustre via a transparent FileSystem plugin for Hadoop MapReduce/Spark
  • No dependence on local storage at the compute nodes
  • Resilience via asynchronous replication or online EC
  • RDMA-Memcached as burst-buffer servers + non-blocking client APIs for efficient I/O pipelines (a write-path sketch follows this section)
  (Architecture of Boldio: Hadoop I/O applications (MapReduce, Spark) use the Hadoop FileSystem class abstraction (LocalFileSystem) co-designed with BoldioFileSystem; the burst-buffer Libmemcached client couples the non-blocking API, ARPE (CE/CD/Rep), and an RDMA-enhanced communication engine; the Memcached server cluster combines a hybrid memory manager (RAM/SSD), ARPE (SE/SD), an RDMA-enhanced communication engine, and a persistence manager over the Lustre parallel file system (MDS/MDT, OSS/OST))
• Results:
  • TestDFSIO on the SDSC Gordon cluster (16-core Intel Sandy Bridge + IB QDR), 16-node Hadoop cluster + 4-node Boldio cluster; performance gains over designs like Alluxio (Tachyon) over PFS
  • Aggregate write/read throughput (MBps) for 60 GB and 100 GB datasets: Boldio vs. direct-over-Lustre, with ~6.7x (write) and ~3x (read) gains (8-core Intel Westmere + IB QDR, 8-node Hadoop cluster)
  • WordCount, InvIndx, CloudBurst, and Spark TeraGen on a 4-node Boldio cluster over Lustre: Lustre-Direct vs. Alluxio-Remote vs. Boldio (annotated gain: 21%)
  • DFSIO aggregate write/read throughput (MBps) on a 5-node Boldio cluster over Lustre: Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC; no EC overhead (Rep=3 vs. RS(3,2))
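As referenced above, here is a hypothetical sketch of how a Boldio-style burst-buffer write path could pipeline file blocks into the KV server cluster using the non-blocking SETs from the earlier sketch. The memcached_iset()/memcached_wait() prototypes and the memcached_request_st handle are the same assumptions as before; BLOCK_SIZE, WINDOW_DEPTH, the key-naming scheme, and boldio_write() itself are illustrative only and are not the actual BoldioFileSystem implementation.

    /*
     * Hypothetical sketch of a Boldio-style burst-buffer write path that
     * pipelines file blocks into the KV server cluster with non-blocking SETs.
     * The iset/wait prototypes, memcached_request_st, BLOCK_SIZE, WINDOW_DEPTH,
     * and the key naming scheme are all ASSUMED for illustration.
     */
    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE   (512 * 1024)   /* assumed burst-buffer block size */
    #define WINDOW_DEPTH 8              /* assumed max in-flight requests  */

    void boldio_write(memcached_st *memc, const char *file_id,
                      const char *data, size_t len)
    {
        memcached_request_st win[WINDOW_DEPTH];   /* assumed request handles */
        char key[128];
        size_t off = 0;
        int inflight = 0, blk = 0;

        while (off < len) {
            size_t n = (len - off < BLOCK_SIZE) ? (len - off) : BLOCK_SIZE;
            snprintf(key, sizeof(key), "%s:blk:%d", file_id, blk++);

            /* Issue the block write without waiting for the server reply. */
            memcached_iset(memc, key, strlen(key), data + off, n, 0, 0,
                           &win[inflight++]);
            off += n;

            /* Bound the window so client-side chunking overlaps with
             * server-side hybrid RAM/SSD persistence. */
            if (inflight == WINDOW_DEPTH) {
                for (int i = 0; i < WINDOW_DEPTH; i++)
                    memcached_wait(memc, &win[i]);
                inflight = 0;
            }
        }
        for (int i = 0; i < inflight; i++)   /* drain remaining requests */
            memcached_wait(memc, &win[i]);
    }

Keeping a small window of outstanding SETs lets client-side chunking overlap with server-side persistence to the hybrid RAM/SSD store and, eventually, Lustre.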
Exploring Opportunities with NVRAM and RDMA

• Emerging non-volatile memory technologies (NVRAM)
• Potential: byte-addressable and persistent; capable of RDMA
• Observation: RDMA writes into NVRAM need to guarantee remote durability
• Opportunity: RDMA-based persistence protocols for NVRAM systems
  (Figure: architecture of an NVRAM-based system — two nodes, Node0 and Node1, each acting as initiator/target with cores (L1/L2), shared L3, a memory controller attached to DRAM and NVRAM, and an I/O controller/HCA on the PCIe bus; the durability path covers RDMA into and from NVRAM, with the local durability point at the target)

Conclusion and Future Work

• The proposed framework enables key-value storage systems to exploit the capabilities of HPC clusters to maximize performance and scalability, while ensuring data resilience/availability
• It provides efficient non-blocking API semantics for designing efficient read/write pipelines, with resilience via RDMA-aware asynchronous replication and fast online EC
• Future work for this thesis (works-in-progress):
  • Explore opportunities for exploiting SIMD compute capabilities (e.g., GPU, AVX); end-to-end SIMD-aware key-value storage system designs
  • Co-design memory-centric data-intensive applications over key-value stores: (1) read-intensive graph-based workloads (e.g., LinkBench, RedisGraph), (2) a key-value store engine for Parameter Server frameworks for ML workloads

Software Distribution

• The RDMA-based Memcached and non-blocking API designs (RDMA-Memcached) proposed in this research are available to the community as part of the HiBD project: http://hibd.cse.ohio-state.edu/#memcached
• Micro-benchmarks and a YCSB plugin for RDMA-Memcached are available as part of the OSU HiBD Micro-benchmark Suite (OHB): http://hibd.cse.ohio-state.edu/#microbenchmarks

Acknowledgements

This research is supported in part by National Science Foundation grants #CNS-1513120, #IIS-1636846, and #CCF-1822987.

References

[1] D. Shankar, X. Lu, and D. K. Panda, "High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads", 37th International Conference on Distributed Computing Systems (ICDCS 2017)
[2] D. Shankar, X. Lu, and D. K. Panda, "Boldio: A Hybrid and Resilient Burst-Buffer Over Lustre for Accelerating Big Data I/O", 2016 IEEE International Conference on Big Data (IEEE BigData 2016) [Short Paper]
[3] D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, "High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits", 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016)
[4] D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, "Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads", 2015 IEEE International Conference on Big Data (IEEE BigData 2015) [Short Paper]
[5] D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, "Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL", 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015) [Poster]