Designing Scalable and Efficient I/O Middleware for Fault-Resilient HPC Clusters

Raghunath Raja Chandrasekar
Advised by: Dhabaleswar K. Panda
Committee: K. Mohror (LLNL), P. Sadayappan (OSU), R. Teodorescu (OSU)

Abstract

This dissertation proposes a cross-layer framework that leverages the hierarchy in modern storage media to design scalable, low-overhead fault-tolerance mechanisms that are inherently I/O-bound. The key components of the framework are: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and non-volatile memory technologies; Stage-FS, a lightweight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures.

Problem Statement

• Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?
• How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic?
• How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?
• How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?
• Can low-overhead, timely failure-prediction mechanisms be designed for proactive failure avoidance and recovery?

Dissertation Research Framework

[Framework diagram: HPC scientific applications rely on fault-tolerance techniques (checkpoint-restart, process migration) built on scalable and efficient I/O middleware. System-level mechanisms: hierarchical data staging, QoS-aware checkpointing, inline compression for data staging, efficient in-memory checkpointing, and checkpointing for heterogeneous systems. Application-assisted mechanisms: low-overhead fault prediction and energy-aware checkpointing protocols. These mutually beneficial mechanisms target NVM, Flash/SSDs, interconnects (InfiniBand, 10GigE, etc.), accelerators (MIC, GPU), and parallel file systems (Lustre, PVFS, etc.).]

Ongoing and Future Work

• Inline-compression strategies for the data-staging framework
  - Compression is traditionally considered only for space-constrained systems, but a more compact representation of data also enables more efficient network data movement.
  - How compressible are application- and system-generated checkpoints?
  - Is inline checkpoint compression a viable strategy to reduce data-movement overheads in a data-staging framework? What are the trade-offs involved?
• Energy-efficient checkpointing protocols
  - Energy is one of "the most pervasive" challenges for Exascale computing: power budgets are imposed system-wide, and job scheduling and accounting are becoming power-aware.
  - I/O accounts for a significant portion of job wallclock time.
  - Are there opportunities to reduce energy consumption during checkpointing? How can existing I/O middleware be made power-conscious?

Key Designs and Results

Hierarchical RDMA-Based Checkpoint Data Staging

Checkpointing overhead is reduced by 8.3x with the staging approach.

[Architecture diagram: MPI applications sit atop I/O libraries (POSIX, HDF5, MPI-IO, NetCDF, etc.) and MPI libraries (MVAPICH2, OpenMPI, etc.), communicate over an InfiniBand interconnect fabric, and write to a backend parallel file system (Lustre, GPFS, PVFS, etc.).]

I/O Quality-of-Service Aware Checkpointing

[Diagram: the QoS-aware data-staging framework partitions client nodes 1..N into staging groups (1..N) in front of the parallel file system.]
[Diagram, continued: each staging group of client nodes (processes 0..7 per node) is served by a staging server with a local SSD; staging servers connect through an IB switch and a storage network switch to the parallel file system.]

[Results: for Anelastic Wave Propagation (64 MPI processes), normalized runtime rises by 17.9% with I/O noise under the default configuration, but by only 8% when the I/O noise is isolated. A large-message bandwidth plot (bandwidth in MB/s vs. message size in bytes) compares the default and QoS-aware I/O configurations under I/O noise, showing a gap of roughly 20%.]

Efficient In-Memory Checkpointing

[Architecture diagram: on each compute node, the MPI application checkpoints through SCR into CRUISE, which stores data in RAM/persistent memory alongside node-local storage (RAM disk, SSD, HDD); a local RDMA agent uses get_chunk_meta_list() and get_data_region() to hand checkpoint chunks to a remote RDMA agent, which drains them to the parallel file system.]

[Results (run on Sequoia at LLNL; 50 MB checkpoints, 10 iterations, 4 MB chunks): aggregate checkpoint bandwidth (TB/s, log scale) vs. node count from 1K to 96K nodes for memory, CRUISE, and RAM disk. CRUISE reaches 1.21 PB/s at 64 processes per node and 1.16 PB/s at 32 processes per node (3 million and 1.5 million processes), versus 58.9 TB/s for RAM disk.]

Checkpoint-Restart for Heterogeneous Systems

Measured host-to-Xeon-Phi (MIC) transfer bandwidth over PCIe, with the fraction of peak IB FDR bandwidth (6397 MB/s) in parentheses:

                      Sandy Bridge       Ivy Bridge
Same socket
  Read from MIC       962 MB/s (15%)     3421 MB/s (54%)
  Write to MIC        5280 MB/s (83%)    6396 MB/s (100%)
Different socket
  Read from MIC       370 MB/s (6%)      247 MB/s (4%)
  Write to MIC        1075 MB/s (17%)    1179 MB/s (19%)

[Architecture diagram: application processes on the host CPU and the Xeon Phi (connected via PCIe and QPI) are intercepted by the MIC-Check Interception library (MCI) linked with MVAPICH; a MIC-Check Proxy (MCP) with buffer pools and I/O threads writes the checkpoints to the parallel file system.]

[Results: checkpoint time (sec) vs. node count (1-128) for 1, 4, 16, and 32 I/O threads.]

Low-Overhead Fault Prediction

[Diagram: an FTB-IPMI daemon on the front-end node publishes sensor events through an FTB agent to the Fault-Tolerance Backplane; FTB agents on client nodes 1..N deliver the events to applications, MPI libraries, and file systems.]
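CRUISE exposes in-memory checkpoint data to the RDMA agents as fixed-size chunks (4 MB in the Sequoia runs) with per-chunk metadata. The sketch below is an illustration only, not CRUISE's actual implementation: a minimal in-memory checkpoint object that splits a checkpoint into fixed-size chunks and loosely mirrors the get_chunk_meta_list()/get_data_region() interface named in the diagram. The class name and everything beyond those two method names are invented for illustration.

```python
# Hypothetical sketch of a chunked in-memory checkpoint store.
# Only the 4 MB chunk size and the two accessor names come from
# the CRUISE material above; the rest is illustrative.

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks, as in the Sequoia runs

class InMemoryCheckpoint:
    def __init__(self, name: str, data: bytes):
        self.name = name
        self._data = data  # would live in RAM/persistent memory

    def get_chunk_meta_list(self):
        """Return (offset, length) metadata for each fixed-size chunk."""
        n = len(self._data)
        return [(off, min(CHUNK_SIZE, n - off))
                for off in range(0, n, CHUNK_SIZE)]

    def get_data_region(self, offset: int, length: int) -> bytes:
        """Return the bytes an RDMA agent would read for one chunk."""
        return self._data[offset:offset + length]

ckpt = InMemoryCheckpoint("rank0.ckpt", bytes(2 * CHUNK_SIZE + 123))
meta = ckpt.get_chunk_meta_list()
assert len(meta) == 3                     # two full chunks plus a tail
assert meta[-1] == (2 * CHUNK_SIZE, 123)  # tail chunk offset and length
```

A remote agent that walks the metadata list and pulls each region reassembles the checkpoint exactly, which is the property the real staging pipeline relies on.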
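The central idea behind the staging designs above — block the application only for a fast local write, then drain the snapshot to the remote file system in the background — can be sketched as follows. This is a schematic under stated assumptions: a background thread and local directories stand in for SSDs, RDMA agents, and Lustre, and the function name is invented for illustration.

```python
import os
import shutil
import threading

# Sketch of asynchronous checkpoint staging: the caller blocks only
# for the fast local-tier write; a background thread drains the file
# to the slow backend, standing in for SSD -> parallel-FS movement.

def checkpoint(data: bytes, local_dir: str, backend_dir: str,
               name: str) -> threading.Thread:
    local_path = os.path.join(local_dir, name)
    with open(local_path, "wb") as f:      # fast local tier (e.g. SSD)
        f.write(data)

    def drain():                           # slow remote tier (e.g. Lustre)
        shutil.copy(local_path, os.path.join(backend_dir, name))

    t = threading.Thread(target=drain)
    t.start()                              # application resumes immediately
    return t                               # join() before the next checkpoint
```

A real staging system must additionally throttle this background traffic — the role Stage-QoS plays — so the drain does not contend with the application's inter-process communication.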
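The viability question posed under future work — does inline compression pay for itself? — reduces to comparing compression time plus the transfer time of the smaller payload against the transfer time of the raw payload. A back-of-the-envelope sketch of that comparison, with zlib standing in for whatever compressor a real staging framework would use and the link bandwidth as a made-up parameter:

```python
import time
import zlib

def inline_compression_wins(data: bytes, link_bw_bytes_per_s: float,
                            level: int = 1) -> bool:
    """Crude viability test: compress inline only if
    (compression time + compressed transfer time) < raw transfer time."""
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    t_compress = time.perf_counter() - t0
    t_raw = len(data) / link_bw_bytes_per_s
    t_staged = t_compress + len(compressed) / link_bw_bytes_per_s
    return t_staged < t_raw

# Highly compressible (all-zero) 8 MB "checkpoint" over a modest link:
ckpt = bytes(1 << 23)
print(inline_compression_wins(ckpt, link_bw_bytes_per_s=1e9))
```

The answer flips with checkpoint entropy and link speed, which is exactly the trade-off the future-work bullets set out to quantify.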