Designing Scalable and Efficient I/O Middleware for Fault-Resilient HPC Clusters Raghunath Raja Chandrasekar

Abstract

Problem Statement

Key Designs and Results

Ongoing and Future Work

This dissertation proposes a cross-layer framework that leverages the hierarchy of storage media available on modern supercomputers to design scalable, low-overhead fault-tolerance mechanisms, which are inherently I/O-bound. The key components of the framework include CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and non-volatile memory technologies; Stage-FS, a lightweight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures.

• Inline-compression strategies for the data-staging framework
  • Traditionally considered for space-constrained systems
  • A more compact representation of the data enables more efficient network data movement
  • How compressible are application-/system-generated checkpoints?
  • Is inline checkpoint-compression a viable strategy to reduce data-movement overheads in a data-staging framework? What are the trade-offs involved? (A minimal compression sketch follows below.)
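To make the inline-compression question concrete, the following minimal sketch compresses one checkpoint chunk with zlib before it would be handed to the staging layer. It only illustrates the technique under assumed parameters (synthetic data, 1 MB chunk, Z_BEST_SPEED); it is not code from the dissertation's staging framework.

/* Hypothetical sketch: compress a checkpoint chunk with zlib before it is
 * handed to the staging layer. Build: cc ckpt_zip_sketch.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Compress 'len' bytes of checkpoint data; returns a malloc'd buffer and
 * stores the compressed size in *out_len. Caller frees the buffer. */
static unsigned char *compress_chunk(const unsigned char *buf, size_t len,
                                     size_t *out_len)
{
    uLongf bound = compressBound(len);          /* worst-case output size */
    unsigned char *out = malloc(bound);
    if (!out) return NULL;
    if (compress2(out, &bound, buf, len, Z_BEST_SPEED) != Z_OK) {
        free(out);
        return NULL;
    }
    *out_len = bound;                           /* actual compressed size */
    return out;
}

int main(void)
{
    /* Fake "checkpoint" data: highly regular, so it compresses well. */
    size_t len = 1 << 20;
    unsigned char *ckpt = calloc(1, len);
    size_t clen = 0;
    unsigned char *packed = compress_chunk(ckpt, len, &clen);
    if (packed)
        printf("chunk: %zu bytes -> %zu bytes (%.1fx)\n",
               len, clen, (double)len / clen);
    free(packed);
    free(ckpt);
    return 0;
}

Whether this pays off in practice depends on the compressibility of real checkpoints and on whether the CPU cost of compression is hidden by the reduction in transfer time, which are exactly the trade-offs the questions above ask about.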

• Energy-efficient checkpointing protocols
  • Energy is one of "the most pervasive" challenges for Exascale computing
  • Power budgets are imposed system-wide
  • Power-aware job scheduling and accounting
  • I/O accounts for a significant portion of job wallclock time
  • Are there opportunities to reduce energy consumption during checkpointing?
  • How can existing I/O middleware be made power-conscious? (A measurement sketch follows below.)
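As a rough way to approach the energy question above, the sketch below brackets a checkpoint write with reads of the Linux powercap (RAPL) package-energy counter. The sysfs path is Intel- and Linux-specific and varies across platforms, and this measurement harness is an assumption for illustration, not part of the dissertation's middleware.

/* Hypothetical measurement sketch: estimate package energy consumed by a
 * checkpoint write using the Linux powercap (RAPL) counter. */
#include <stdio.h>

#define RAPL_ENERGY "/sys/class/powercap/intel-rapl:0/energy_uj"

/* Read the cumulative package energy counter in microjoules. */
static long long read_energy_uj(void)
{
    long long uj = -1;
    FILE *f = fopen(RAPL_ENERGY, "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    long long before = read_energy_uj();

    /* ... perform the checkpoint write here ... */

    long long after = read_energy_uj();
    if (before >= 0 && after >= before)
        printf("checkpoint consumed %.3f J\n", (after - before) / 1e6);
    else
        printf("RAPL counter unavailable or wrapped\n");
    return 0;
}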

Hierarchical RDMA-Based Checkpoint Data Staging

Advised by: Dhabaleswar K. Panda

Committee: K. Mohror (LLNL), P. Sadayappan (OSU), R. Teodorescu (OSU)

[Figure: Dissertation research framework. HPC scientific applications rely on fault-tolerance techniques (checkpoint-restart, process-migration) built on scalable and efficient I/O middleware: system-level mechanisms (hierarchical data-staging, QoS-aware checkpointing, inline compression for data staging, efficient in-memory checkpointing, checkpointing heterogeneous systems) together with application-assisted and mutually-beneficial mechanisms (low-overhead fault-prediction, energy-aware checkpointing protocols), layered over NVM, Flash/SSDs, InfiniBand/10GigE, MIC/GPU, and Lustre/PVFS.]

• Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?

• How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic?

• How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?

• How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?

• Can low-overhead, timely failure-prediction mechanisms be designed for proactive failure avoidance and recovery?

Dissertation Research Framework

I/O Quality-of-Service Aware Checkpointing

Efficient In-Memory Checkpointing

Checkpoint-Restart for Heterogeneous Systems

Low-Overhead Fault Prediction

Checkpointing overhead reduced by 8.3x with the staging approach

[Figure: Hierarchical RDMA-based checkpoint data-staging software stack — MPI applications, I/O libraries (POSIX, HDF5, MPI-IO, NetCDF, etc.), MPI libraries (MVAPICH2, OpenMPI, etc.), the InfiniBand interconnect fabric, and a backend parallel filesystem (Lustre, GPFS, PVFS, etc.).]
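To illustrate the staging approach behind the result above, here is a minimal two-phase checkpoint sketch: the snapshot is first written to fast node-local storage so the application can resume, and a background thread then drains it to the parallel filesystem. The paths, sizes, and the plain copy thread are assumptions; the actual Stage-FS design uses RDMA-based agents and burst buffers rather than a simple file copy.

/* Minimal staging sketch (assumed paths, not the Stage-FS implementation).
 * Build: cc stage_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOCAL_CKPT "/tmp/ckpt.0"          /* node-local burst buffer / SSD   */
#define PFS_CKPT   "/lustre/ckpt.0"       /* hypothetical parallel-FS target */

static void write_file(const char *path, const void *buf, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (f) { fwrite(buf, 1, len, f); fclose(f); }
}

/* Background drain: copy the local snapshot to the parallel filesystem. */
static void *drain_to_pfs(void *arg)
{
    (void)arg;
    FILE *in = fopen(LOCAL_CKPT, "rb");
    FILE *out = fopen(PFS_CKPT, "wb");
    char buf[1 << 20];
    size_t n;
    if (in && out)
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
    if (in) fclose(in);
    if (out) fclose(out);
    return NULL;
}

int main(void)
{
    size_t len = 8 << 20;
    char *snapshot = malloc(len);
    memset(snapshot, 0xAB, len);

    write_file(LOCAL_CKPT, snapshot, len);   /* phase 1: fast local write   */

    pthread_t drainer;                       /* phase 2: asynchronous drain */
    pthread_create(&drainer, NULL, drain_to_pfs, NULL);

    /* ... application computation continues here, overlapped with drain ... */

    pthread_join(drainer, NULL);             /* snapshot globally visible   */
    free(snapshot);                          /* before the next checkpoint  */
    return 0;
}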

QoS-Aware Data-Staging Framework

[Figure: QoS-aware data-staging architecture — client nodes 1..N are organized into staging groups 1..N, each served by a staging server with a local SSD that drains checkpoint data to the parallel filesystem.]

[Plot: Normalized runtime of Anelastic Wave Propagation (64 MPI processes) under "default with I/O noise" and "I/O noise isolated" configurations; the annotated runtime overheads are 17.9% and 8%, respectively.]

[Plot: Large-message bandwidth (MB/s) versus message size (bytes) for the default, QoS-aware I/O, and with-I/O-noise configurations; the annotated gap is roughly 20%.]

[Figure: Staging-group topology — client nodes (each running processes 0..7) connect through an IB switch to a staging server with a local SSD, which drains data through a storage network switch to the parallel filesystem.]
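One simple way a QoS-aware stager can leave network headroom for inter-process MPI traffic is to pace its own drain bandwidth. The sketch below throttles staging writes to a fixed, hypothetical cap; the actual Stage-QoS mechanism is file-system agnostic and may differ substantially from this simple pacing loop.

/* Illustrative rate-limited drain loop (assumed cap and paths; not the
 * actual Stage-QoS design): pace staging writes so the checkpoint stream
 * leaves headroom for inter-process MPI traffic. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CHUNK_BYTES   (4UL << 20)       /* 4 MB staging chunk            */
#define CAP_BYTES_SEC (500UL << 20)     /* hypothetical 500 MB/s ceiling */

int main(void)
{
    FILE *in  = fopen("/tmp/ckpt.0", "rb");        /* local snapshot     */
    FILE *out = fopen("/lustre/ckpt.0", "wb");     /* remote filesystem  */
    static char buf[CHUNK_BYTES];
    size_t n;

    if (!in || !out) return 1;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fwrite(buf, 1, n, out);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* If the chunk went out faster than the cap allows, sleep off the
         * difference so average bandwidth stays at or below CAP_BYTES_SEC. */
        double took   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double budget = (double)n / CAP_BYTES_SEC;
        if (budget > took)
            usleep((useconds_t)((budget - took) * 1e6));
    }
    fclose(in);
    fclose(out);
    return 0;
}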

CRUISE

[Figure: CRUISE in-memory checkpointing architecture — an MPI application on the compute nodes checkpoints through SCR into CRUISE-managed RAM/persistent memory alongside node-local storage (RAM disk, SSD, HDD); a local RDMA agent uses get_chunk_meta_list() and get_data_region() to expose checkpoint chunks, and a remote RDMA agent drains them to the parallel file system. Experiments were run on Sequoia at LLNL with 50 MB checkpoints, 10 iterations, and 4 MB chunks.]
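The CRUISE figure names two calls, get_chunk_meta_list() and get_data_region(), that the RDMA agents use to locate checkpoint data in memory. The sketch below mimics that flow with hypothetical stand-in signatures and a toy in-memory store; it is not CRUISE's actual API, and only illustrates how a local agent might enumerate chunks and obtain their memory regions before an RDMA transfer.

/* Conceptual agent-side flow suggested by the CRUISE figure. The struct and
 * the two helper functions below are hypothetical stand-ins, NOT CRUISE's
 * actual API. */
#include <stdio.h>
#include <stdlib.h>

#define NCHUNKS    4
#define CHUNK_SIZE (4u << 20)            /* 4 MB chunks, as in the poster */

struct chunk_meta {                      /* hypothetical metadata record  */
    size_t offset;
    size_t length;
};

static char *ckpt_store;                 /* toy stand-in for the in-memory */
                                         /* checkpoint file                */

/* Hypothetical stand-in for get_chunk_meta_list(): report the chunks that
 * make up the checkpoint file. */
static int get_chunk_meta_list(struct chunk_meta *list, int max)
{
    int n = NCHUNKS < max ? NCHUNKS : max;
    for (int i = 0; i < n; i++) {
        list[i].offset = (size_t)i * CHUNK_SIZE;
        list[i].length = CHUNK_SIZE;
    }
    return n;
}

/* Hypothetical stand-in for get_data_region(): translate a chunk's offset
 * into the address of its backing memory so it can be registered for RDMA. */
static void *get_data_region(size_t offset)
{
    return ckpt_store + offset;
}

int main(void)
{
    ckpt_store = calloc(NCHUNKS, CHUNK_SIZE);

    struct chunk_meta chunks[NCHUNKS];
    int n = get_chunk_meta_list(chunks, NCHUNKS);

    for (int i = 0; i < n; i++) {
        void *region = get_data_region(chunks[i].offset);
        /* Real system: register 'region' and RDMA-transfer it to the remote
         * agent; here we only report what would be moved. */
        printf("chunk %d: %zu bytes at %p\n", i, chunks[i].length, region);
    }
    free(ckpt_store);
    return 0;
}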

[Plot: Aggregate checkpoint bandwidth (TB/s, log scale) versus node count (1K to 96K) for memory, CRUISE, and RAM disk — CRUISE reaches 1.21 PB/s at 64 processes per node (3 million processes) and 1.16 PB/s at 32 processes per node (1.5 million processes), compared with 58.9 TB/s for RAM disk.]

Bandwidth for reads from and writes to the MIC (percentages relative to the peak IB FDR bandwidth of 6397 MB/s):

                                   Sandy Bridge        Ivy Bridge
Same Socket        Read from MIC   962 MB/s (15%)      3421 MB/s (54%)
                   Write to MIC    5280 MB/s (83%)     6396 MB/s (100%)
Different Socket   Read from MIC   370 MB/s (6%)       247 MB/s (4%)
                   Write to MIC    1075 MB/s (17%)     1179 MB/s (19%)

[Figure: MIC-Check architecture — application processes on the host CPU and the Xeon Phi (connected over PCIe/QPI) link against MVAPICH and the MIC-Check Interception Library (MCI); intercepted I/O is forwarded to the MIC-Check Proxy (MCP), which uses buffer pools and I/O threads on the host to write to the parallel file system.]
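As the figure legend indicates, MIC-Check pairs an interception library (MCI) with a host-side proxy (MCP) that owns buffer pools and I/O threads. The sketch below shows a generic LD_PRELOAD-style write() interposer of the kind such a library could be built on; the fd filter, the forward_to_proxy() helper, and the build line are hypothetical, and the real MCI targets the Xeon Phi environment and forwards data to the host proxy rather than just logging.

/* Illustrative LD_PRELOAD-style interposer (not the actual MCI library).
 * Build: cc -shared -fPIC -o libmci_sketch.so mci_sketch.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* In a real design this would enqueue the buffer to the host proxy (MCP)
 * over the PCIe transport; here it only reports the interception. */
static void forward_to_proxy(int fd, const void *buf, size_t count)
{
    (void)buf;
    fprintf(stderr, "[mci-sketch] fd=%d: %zu bytes would go to the proxy\n",
            fd, count);
}

ssize_t write(int fd, const void *buf, size_t count)
{
    /* Look up the real write() so non-checkpoint traffic is untouched. */
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

    if (fd > 2)                       /* crude filter: skip stdio streams */
        forward_to_proxy(fd, buf, count);

    return real_write(fd, buf, count);
}

A toy run would look like LD_PRELOAD=./libmci_sketch.so ./app; on an actual coprocessor deployment the interception and transport layers would, of course, differ.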

[Plot: Checkpoint time (seconds) versus node count (1 to 128) with 1, 4, 16, and 32 I/O threads.]

[Figure: FTB-IPMI deployment — the FTB-IPMI daemon and an FTB_Agent run on the front-end node and monitor client nodes 1..N out-of-band; each client runs its own FTB_Agent, and the applications, MPI libraries, and filesystems on every node consume fault events through the Fault-Tolerance Backplane.]
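FTB-IPMI monitors node health out-of-band and publishes events that clients consume through the Fault-Tolerance Backplane. The sketch below captures only the flavor of such a monitor: it periodically polls sensors with the common ipmitool CLI, flags readings whose status field is not "ok", and prints an event line. The polling interval, parsing, and event format are assumptions; whether FTB-IPMI itself shells out to ipmitool is not stated here, and the real daemon publishes through the FTB agents rather than printing to stdout.

/* Sketch of out-of-band sensor polling in the spirit of FTB-IPMI (not the
 * actual daemon). Column layout of ipmitool output varies by BMC, so the
 * parsing here is deliberately crude. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define POLL_SECONDS 60

static void poll_sensors(void)
{
    FILE *p = popen("ipmitool sensor", "r");
    if (!p) {
        perror("ipmitool");
        return;
    }
    char line[512];
    while (fgets(line, sizeof line, p)) {
        /* Fields are '|'-separated: name | value | unit | status | ... */
        char *name   = strtok(line, "|");
        char *value  = strtok(NULL, "|");
        char *unit   = strtok(NULL, "|");
        char *status = strtok(NULL, "|");
        if (name && value && unit && status && !strstr(status, "ok"))
            printf("EVENT sensor_warning: %s status=%s\n", name, status);
    }
    pclose(p);
}

int main(void)
{
    for (;;) {                       /* daemon-style polling loop */
        poll_sensors();
        sleep(POLL_SECONDS);
    }
    return 0;
}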