Designing Scalable and Efficient I/O Middleware for Fault-Resilient HPC Clusters
Source: sc14.supercomputing.org/sites/all/themes/sc14/files/... (retrieved 2016-05-27)


Designing Scalable and Efficient I/O Middleware for Fault-Resilient HPC Clusters Raghunath Raja Chandrasekar

Abstract

This dissertation proposes a cross-layer framework that leverages the hierarchy of storage media available on modern HPC systems to design scalable, low-overhead fault-tolerance mechanisms, which are inherently I/O-bound. The key components of the framework include: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and non-volatile memory technologies; Stage-FS, a lightweight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures.

Ongoing and Future Work

Inline-compression strategies for the data-staging framework:
• Traditionally considered only for space-constrained systems
• A better representation of the data enables more efficient network data movement
• How compressible are application- and system-generated checkpoints?
• Is inline checkpoint compression a viable strategy for reducing data-movement overheads in a data-staging framework, and what are the trade-offs involved?
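One way to explore the compressibility question above is to probe a checkpoint buffer with a fast compressor before shipping it. The sketch below is illustrative only: the checkpoint layout is synthetic, and zlib stands in for whatever inline compressor a staging framework would actually use.

```python
# Sketch: probing checkpoint compressibility with zlib (stdlib).
# The checkpoint contents below are synthetic, not from the dissertation.
import struct
import zlib

def make_synthetic_checkpoint(n_doubles=10000):
    """Build a buffer resembling a dense array of simulation state."""
    # Smoothly varying values compress far better than random noise.
    return b"".join(struct.pack("<d", i * 0.001) for i in range(n_doubles))

def compression_ratio(buf, level=1):
    """Return original_size / compressed_size at a given zlib level.
    Low levels trade ratio for speed -- the relevant knob when the goal
    is cheaper network data movement rather than minimal storage."""
    compressed = zlib.compress(buf, level)
    return len(buf) / len(compressed)

ckpt = make_synthetic_checkpoint()
ratio = compression_ratio(ckpt)
print(f"{len(ckpt)} bytes, ratio {ratio:.2f}x")
```

A ratio near 1.0 for a given application would argue against paying the CPU cost of inline compression; a high ratio suggests real savings in network data movement.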

Energy-efficient checkpointing protocols:
• Energy is one of "the most pervasive" challenges for exascale computing
• Power budgets are imposed system-wide
• Power-aware job scheduling and accounting
• I/O accounts for a significant portion of job wallclock time
• Are there opportunities to reduce energy consumption during checkpointing?
• How can existing I/O middleware be made power-conscious?
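Since I/O accounts for a significant share of wallclock time, a first-order way to frame the energy question is: energy spent checkpointing grows with the time nodes block on I/O. The model below is a back-of-the-envelope sketch; every parameter (checkpoint size, bandwidths, node power) is hypothetical and chosen only to illustrate the trade-off.

```python
# First-order model: energy burned across all nodes while they block on
# checkpoint I/O. All numbers are hypothetical, for illustration only.
def checkpoint_energy_j(ckpt_gb, bw_gbps, nodes, node_power_w):
    """Energy (joules) = node power * node count * drain time."""
    t_sec = ckpt_gb / bw_gbps          # time to drain one checkpoint
    return node_power_w * nodes * t_sec

# Blocking write to the parallel FS vs. draining to a faster local tier:
e_pfs = checkpoint_energy_j(ckpt_gb=32, bw_gbps=1.0, nodes=1024, node_power_w=300)
e_ssd = checkpoint_energy_j(ckpt_gb=32, bw_gbps=8.0, nodes=1024, node_power_w=300)
print(f"PFS: {e_pfs / 1e6:.1f} MJ  local tier: {e_ssd / 1e6:.1f} MJ")
```

Under these assumptions, the faster local tier cuts blocked-time energy by the bandwidth ratio, which hints at why staging is attractive from a power perspective as well.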

Hierarchical RDMA-Based Checkpoint Data Staging

Advised by: Dhabaleswar K. Panda
Committee: K. Mohror (LLNL), P. Sadayappan (OSU), R. Teodorescu (OSU)

[Figure: Dissertation research framework. HPC scientific applications rely on fault-tolerance techniques (checkpoint-restart, process migration), which are served by the proposed scalable and efficient I/O middleware: system-level mechanisms (hierarchical data-staging, QoS-aware checkpointing, inline compression for data staging), application-assisted mechanisms (efficient in-memory checkpointing, checkpointing for heterogeneous systems), and mutually-beneficial mechanisms (low-overhead fault prediction, energy-aware checkpointing protocols). The middleware targets hardware such as NVM, Flash/SSDs, IB and 10GigE interconnects, MIC and GPU accelerators, and Lustre/PVFS file systems.]

Problem Statement

• Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?
• How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic?
• How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?
• How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?
• Can low-overhead, timely failure-prediction mechanisms be designed for proactive failure avoidance and recovery?

Dissertation Research Framework

I/O Quality-of-Service Aware Checkpointing

Efficient In-Memory Checkpointing

Checkpoint-Restart for Heterogeneous Systems

Low-Overhead Fault Prediction

Checkpointing overhead reduced by 8.3x with the staging approach
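The 8.3x reduction comes from taking the slow transfer off the application's critical path: the application writes to a fast local tier and resumes, while a background agent drains the snapshot to the remote file system. A minimal sketch of that pattern, with illustrative names rather than the Stage-FS API:

```python
# Sketch of hierarchical data staging: the application "writes" a
# checkpoint to a fast local tier and continues, while a background
# thread drains it to a simulated parallel file system.
import queue
import threading

local_tier = queue.Queue()          # stands in for a burst buffer / SSD
remote_fs = []                      # stands in for the parallel FS

def drain_worker():
    while True:
        snap = local_tier.get()
        if snap is None:            # shutdown sentinel
            break
        remote_fs.append(snap)      # the slow transfer, off the critical path
        local_tier.task_done()

t = threading.Thread(target=drain_worker)
t.start()

for step in range(3):
    local_tier.put(f"snapshot-{step}")  # fast local write; app resumes immediately

local_tier.join()                   # wait for the drain to finish
local_tier.put(None)
t.join()
print(remote_fs)
```

The application's perceived checkpoint cost is only the fast local write; the drain overlaps with computation.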

[Figure: System software stack. MPI applications use I/O libraries (POSIX, HDF5, MPI-IO, NetCDF, etc.) and MPI libraries (MVAPICH2, Open MPI, etc.) over an InfiniBand interconnect fabric, backed by a parallel file system (Lustre, GPFS, PVFS, etc.).]

QoS-Aware Data-Staging Framework

[Figure: Client nodes 1..N are partitioned into staging groups 1..N; each group is served by a staging server with a local SSD, which drains data to the parallel file system.]

[Figure: Normalized runtime of Anelastic Wave Propagation (64 MPI processes) under three configurations: default, with I/O noise (17.9% slowdown), and with I/O noise isolated (8%).]

[Figure: Large-message bandwidth (MB/s) vs. message size (bytes) for default and QoS-aware I/O in the presence of I/O noise, showing a ~20% gap.]
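The bandwidth recovery above comes from capping how much of the network the staging traffic may consume. One common way to enforce such a cap, shown here purely as a sketch (the class and rates are made up, not the Stage-QoS implementation), is a token bucket: a staging send is admitted only if enough tokens have accumulated.

```python
# Sketch of a QoS cap on staging traffic as a token bucket: staging
# sends are admitted only while tokens remain, bounding the bandwidth
# they can steal from inter-process communication. Rates are made up.
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps        # sustained staging budget (bytes/sec)
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, nbytes, now):
        """Refill by elapsed time, then admit the send if tokens suffice."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

bucket = TokenBucket(rate_bps=1_000_000, burst_bytes=64_000)
# Three 64 KB staging sends at t=0s, t=0.01s, t=0.2s; the middle one is
# denied because the budget has not refilled yet.
sent = sum(64_000 for t in (0.0, 0.01, 0.2) if bucket.allow(64_000, t))
print(sent)
```

Denied sends would be queued and retried, smoothing staging traffic so MPI messages see less contention.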

[Figure: Staging topology. Four groups of client nodes (0..7 each) connect through an IB switch to a staging server with a local SSD; the staging server reaches the parallel file system through a storage-network switch.]

CRUISE

[Figure: CRUISE architecture. On each compute node, the MPI application checkpoints through SCR into CRUISE, which stores data in RAM/persistent memory and node-local storage (RAM disk, SSD, HDD); local and remote RDMA agents drain the data to the parallel file system.]

[Figure: RDMA-based drain protocol (steps 1-9), in which the agents use get_chunk_meta_list() and get_data_region() to locate and transfer checkpoint chunks.]

Experimental setup: run on Sequoia @ LLNL; 50 MB checkpoints; 10 iterations; 4 MB chunks.
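The drain protocol above can be sketched in miniature: the agent asks the in-memory store for the buffer backing a checkpoint and for per-chunk metadata, then pulls it chunk by chunk (an RDMA read in the real system). The two function names come from the figure; their signatures and the store layout here are guesses for illustration.

```python
# Sketch of the chunked drain: get_data_region() / get_chunk_meta_list()
# are named in the protocol figure; their behavior here is assumed.
CHUNK = 4 * 1024 * 1024                      # 4 MB chunks, as in the Sequoia runs

store = {"ckpt.0": bytes(10 * 1024 * 1024)}  # 10 MB in-memory checkpoint

def get_data_region(name):
    """Return the raw buffer backing a checkpoint file."""
    return store[name]

def get_chunk_meta_list(name, chunk=CHUNK):
    """Return (offset, length) pairs describing each chunk."""
    size = len(store[name])
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]

# The "remote agent" pulls one chunk at a time and reassembles:
region = get_data_region("ckpt.0")
meta = get_chunk_meta_list("ckpt.0")
pulled = b"".join(region[off:off + length] for off, length in meta)
print(len(pulled), len(meta))
```

Fixed-size chunks let the agents pipeline transfers and bound per-operation RDMA registration costs.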

[Figure: Aggregate checkpoint bandwidth (TB/s, log scale) vs. node count (1K-96K) for memory, CRUISE, and RAM disk. CRUISE reaches 1.21 PB/s at 64 processes per node (3 million processes) and 1.16 PB/s at 32 processes per node (1.5 million processes), compared to 58.9 TB/s for RAM disk.]

MIC I/O bandwidth (peak IB FDR bandwidth: 6397 MB/s):

                                  Sandy Bridge      Ivy Bridge
Same socket, read from MIC        962 MB/s (15%)    3421 MB/s (54%)
Same socket, write to MIC         5280 MB/s (83%)   6396 MB/s (100%)
Different socket, read from MIC   370 MB/s (6%)     247 MB/s (4%)
Different socket, write to MIC    1075 MB/s (17%)   1179 MB/s (19%)

[Figure: The CPU and Xeon Phi communicate over PCIe, with QPI linking sockets.]

MCI = MIC-Check Interception Library; MCP = MIC-Check Proxy

[Figure: MIC-Check architecture. Application processes on the host and the Xeon Phi link against MCI and MVAPICH; checkpoint writes are forwarded to the MCP on the host, which uses buffer pools and I/O threads to write to the parallel file system.]
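The proxy pattern the MCP box suggests can be sketched as a producer-consumer: coprocessor-side writes claim a buffer from a fixed pool, and a team of I/O threads drains the buffers and recycles them. This is an illustration of the pattern, not the MIC-Check implementation; all names and sizes are made up.

```python
# Sketch of a proxy with a fixed buffer pool and I/O threads: writes
# block when the pool is exhausted, bounding host-side memory use.
import queue
import threading

POOL, WORKERS = 4, 2
free_bufs = queue.Queue()
for _ in range(POOL):
    free_bufs.put(bytearray(1 << 20))   # fixed 1 MB buffers, reused forever

work = queue.Queue()
written = []

def io_thread():
    while True:
        item = work.get()
        if item is None:                # shutdown sentinel
            break
        buf, n = item
        written.append(n)               # stand-in for a parallel-FS write
        free_bufs.put(buf)              # return the buffer to the pool

threads = [threading.Thread(target=io_thread) for _ in range(WORKERS)]
for t in threads:
    t.start()

for n in range(8):                      # 8 "writes" share the 4 buffers
    buf = free_bufs.get()               # blocks if the pool is exhausted
    work.put((buf, n))

for _ in threads:
    work.put(None)
for t in threads:
    t.join()
print(sorted(written))
```

The pool size caps memory pressure on the host, while the thread count is the tuning knob the scaling figure below varies.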

[Figure: Checkpoint data path, steps 1-4; checkpoint time (sec, 0-16) vs. number of nodes (1-128) for 1, 4, 16, and 32 I/O threads.]

FTB-IPMI

[Figure: FTB-IPMI deployment. The FTB-IPMI daemon runs with an FTB_Agent on the front-end node; clients 1..N each run an FTB_Agent alongside their applications, MPI library, and file systems. All agents are connected through the Fault-Tolerance Backplane.]
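The out-of-band monitoring loop can be sketched as: poll sensor readings, and publish an event on the backplane when a threshold is crossed, so subscribers can react before a failure. The sensors, thresholds, and event shape below are invented for illustration; the real system reads IPMI sensors and publishes over the Fault-Tolerance Backplane.

```python
# Sketch of out-of-band fault prediction: one polling sweep publishes
# warning events to backplane subscribers. All values are made up.
subscribers = []                        # stand-in for backplane subscribers

def publish(event):
    """Deliver an event to every registered subscriber."""
    for callback in subscribers:
        callback(event)

def poll_once(readings, temp_limit_c=85.0):
    """One sweep over per-node sensor readings; warn on hot nodes."""
    for node, temp in readings.items():
        if temp > temp_limit_c:
            publish({"node": node, "type": "TEMP_HIGH", "value": temp})

events = []
subscribers.append(events.append)       # a subscriber that just records events

poll_once({"client1": 72.0, "client2": 91.5, "clientN": 68.0})
print(events)
```

Because the daemon polls out-of-band (from the front-end node), the monitoring cost stays off the compute nodes' critical path, which is the "low-overhead" property the problem statement asks for.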
