Introduction to MPI
Dhabaleswar K. (DK) Panda, The Ohio State University (http://www.cse.ohio-state.edu/~panda)
Presented at the HPC Advisory Council Workshop, Lugano 2011, by Sayantan Sur, The Ohio State University (http://www.cse.ohio-state.edu/~surs)
Performance Issues (Cont'd)
– Contention at the source and destination adapter(s)
– CPU involvement/overhead
• Different algorithms based on system size and message size (see the sketch below)
• Multi-core-aware algorithms for the emerging multi-core platforms
• Topology-aware algorithms that adapt dynamically to the underlying network topology
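To make size-based algorithm selection concrete, here is a minimal sketch, not MVAPICH's actual logic: a broadcast wrapper that uses a binomial tree for small messages and falls back to the library's MPI_Bcast for large ones. The names my_bcast, bcast_binomial and the threshold value are purely illustrative.

/* Hedged sketch: threshold-based collective algorithm selection.
   Chooses a broadcast algorithm by message size; threshold and
   algorithms are illustrative, not MVAPICH's internal tuning. */
#include <mpi.h>
#include <stddef.h>

#define SMALL_MSG_THRESHOLD 8192   /* hypothetical switch-over point (bytes) */

/* Binomial-tree broadcast: log2(P) steps, good for small (latency-bound) messages. */
static void bcast_binomial(void *buf, int count, MPI_Datatype dt,
                           int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = (rank - root + size) % size;       /* rank relative to root */

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rel < mask) {                         /* already has the data: forward it */
            int dst = rel + mask;
            if (dst < size)
                MPI_Send(buf, count, dt, (dst + root) % size, 0, comm);
        } else if (rel < (mask << 1)) {           /* receives exactly once, this step */
            int src = rel - mask;
            MPI_Recv(buf, count, dt, (src + root) % size, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}

/* Pick an algorithm based on the message size in bytes. */
void my_bcast(void *buf, int count, MPI_Datatype dt, int root, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(dt, &type_size);
    if ((size_t)count * type_size <= SMALL_MSG_THRESHOLD)
        bcast_binomial(buf, count, dt, root, comm);   /* latency-bound regime */
    else
        MPI_Bcast(buf, count, dt, root, comm);        /* defer to the library */
}

Production libraries add further regimes (medium messages, shared-memory leader-based algorithms for multi-core nodes, topology-aware trees), but the switch-on-size structure is the basic idea.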
Obtaining Scalable Performance
• The performance of an application should increase as the system size increases
– Strong scaling: the problem size is kept constant as the system size increases
– Weak scaling: the problem size grows along with the system size (both regimes are sketched below)
• Depends on
– Structure of the application
– Underlying algorithms being used
– Performance of the MPI library
• All performance issues (as indicated earlier) matter for the MPI library
• Additional issues
– Network topology
– Mapping of processes to cores (block and cyclic, across nodes and within nodes)
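As a concrete illustration of the two scaling regimes, here is a minimal sketch; the problem sizes and the "work" are hypothetical placeholders for a real solver.

/* Hedged sketch: per-process workload under strong vs. weak scaling. */
#include <mpi.h>
#include <stdio.h>

#define GLOBAL_N   (1L << 24)   /* hypothetical total problem size (strong scaling) */
#define PER_PROC_N (1L << 20)   /* hypothetical per-process size (weak scaling)     */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Strong scaling: total work is fixed, so each process gets less as P grows. */
    long strong_local_n = GLOBAL_N / nprocs;

    /* Weak scaling: per-process work is fixed, so total work grows with P. */
    long weak_local_n = PER_PROC_N;

    if (rank == 0)
        printf("P=%d  strong: %ld elems/proc  weak: %ld elems/proc (total %ld)\n",
               nprocs, strong_local_n, weak_local_n, weak_local_n * (long)nprocs);

    MPI_Finalize();
    return 0;
}

Under strong scaling the per-process computation shrinks as P grows, so communication overheads in the MPI library increasingly dominate the run time, which is why the issues listed above matter most at large scale.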
Memory Scalability of MPI Library in Large-Scale Systems
• Does the memory needed by the MPI library increase with system size?
• Different transport protocols with IB
– Reliable Connection (RC) is the most common
– Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive messages from any sender
– The buffer requirement can increase with system size
• Connections need to be established across processes under RC
– Each connection requires a certain amount of memory for its associated data structures
– The memory required for all connections can increase with system size (a rough estimate is sketched below)
• Both issues have become critical as large-scale IB deployments have taken place
– Being addressed by the IB specification (SRQ, XRC, UD/RC/XRC hybrid) and the MPI library (will be discussed more on Day 2)
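A back-of-the-envelope estimate of how per-process memory can grow under a fully connected RC model. All constants below are made-up placeholders, not measured MVAPICH or InfiniBand values.

/* Hedged sketch: per-process MPI memory footprint with one RC connection
   and a set of pre-posted buffers per peer. Sizes are illustrative only. */
#include <stdio.h>

#define QP_BYTES_PER_CONN   (64 * 1024)  /* hypothetical memory per RC connection  */
#define PREPOSTED_BUF_BYTES (8 * 1024)   /* hypothetical pre-posted receive buffer */
#define BUFS_PER_PEER       16           /* hypothetical buffers posted per sender */

int main(void)
{
    for (long p = 1024; p <= 1024L * 1024; p *= 8) {
        /* Fully connected RC: one connection plus a buffer set per peer. */
        long rc_bytes = (p - 1) * (QP_BYTES_PER_CONN +
                                   (long)BUFS_PER_PEER * PREPOSTED_BUF_BYTES);
        printf("%8ld processes: ~%8.1f MB per process for RC connections/buffers\n",
               p, rc_bytes / (1024.0 * 1024.0));
    }
    return 0;
}

This linear growth per process is exactly what the mechanisms named above target: SRQ removes the per-peer buffer term by sharing receive buffers, while XRC and UD/RC hybrids shrink the per-connection term.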
Fault-Tolerance and Resiliency
• Millions of cores and components in next-generation multi-PetaFlop and ExaFlop systems
• Components are bound to fail
• The Mean Time Between Failures (MTBF) has to remain high so that Exascale applications can run efficiently (a rough estimate follows below)
• Two broad kinds of failures
– Network failures (adapter, link and switch)
– Node or process failures
• InfiniBand provides multiple schemes, such as CRC, end-to-end reliability, Reliable Connection (RC) mode and Automatic Path Migration (APM), to handle network-related errors
• Can the MPI library be made resilient? (Day 2)
• Can the MPI library support efficient checkpoint-restart and process migration for process/node failures? (Day 2)
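Why MTBF matters at scale: assuming independent component failures, the system-level MTBF shrinks roughly in proportion to the number of components. With hypothetical numbers (a 5-year component MTBF and 100,000 failure-prone components, both illustrative assumptions):

MTBF_system ≈ MTBF_component / N_components ≈ 43,800 h / 100,000 ≈ 0.44 h ≈ 26 minutes

At that rate, an application that runs for more than about half an hour cannot expect to finish without some resiliency mechanism, which is what motivates the MPI-level checkpoint-restart and migration support discussed on Day 2.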
Power-Aware Designs
• Power consumption is becoming a significant issue for the design and deployment of multi-PetaFlop and ExaFlop systems
• All hardware components (CPU, memory, storage, network adapter, switches and links) are being re-designed with lower power consumption in mind
• The targeted goal is 20 MW for an ExaFlop system in 2018-2020
• Can we make the MPI library power-aware?
– Polling-based schemes are common in MPI libraries to receive messages and act upon them quickly (contrasted with blocking progress in the sketch below)
– Continuous polling by the CPU consumes a lot of power
– Can the CPUs be run at a lower speed while large collective operations are taking place?
– Can we design power-aware collective schemes? (Day 2)
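To make the polling-vs-blocking trade-off concrete, here is a minimal sketch using only standard MPI calls; it is not MVAPICH's internal progress engine, and it assumes at least two processes. The first branch burns CPU cycles polling MPI_Test; the second lets the library block in MPI_Wait, which an implementation with blocking progress can turn into an interrupt-driven, lower-power wait.

/* Hedged sketch: polling vs. blocking completion of a pending receive.
   Run with at least 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, use_polling = 1, payload = 42;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        MPI_Irecv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);

        if (use_polling) {
            /* Polling: the CPU spins at 100% until the message arrives;
               reaction time is quick, but the power cost is high. */
            int done = 0;
            while (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        } else {
            /* Blocking: the library may sleep until the network signals
               completion, letting the core drop into a low-power state. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        printf("received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}

The MVAPICH 1.2 feature list later in this deck notes both polling and blocking support for communication progress, which is exactly this trade-off exposed to the user.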
Presentation Overview
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
MVAPICH/MVAPICH2 Software
• High-Performance MPI Library for IB, 10GE/iWARP and RoCE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Latest releases: MVAPICH 1.2 and MVAPICH2 1.6
– Used by more than 1,500 organizations in 60 countries (registered at the OSU site voluntarily)
– More than 57,000 downloads from the OSU site directly
– Empowering many TOP500 production clusters during the last eight years
– Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distros
– Also supports the uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/
MVAPICH-1 Architecture
[Architecture diagram: MVAPICH (MPI-1) 1.2 sits on top of five interfaces: #1 OpenFabrics/Gen2 (single-rail), #2 OpenFabrics/Gen2-Hybrid (single-rail), #3 PSM, #4 Shared-Memory and #5 TCP/IP; also shown are VAPI, Gen2-Multirail and uDAPL (deprecated). Underlying hardware: InfiniBand (Mellanox) over PCI-X, PCIe and PCIe-Gen2 (SDR, DDR & QDR); InfiniBand (QLogic) over PCIe & HT (SDR, DDR & QDR); single nodes/laptops with multi-core processors. Major computing platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ...]
Major Features of MVAPICH 1.2
• OpenFabrics-Gen2
– Scalable job start-up with mpirun_rsh, support for SLURM
– RC and XRC support
– Flexible message coalescing
– Multi-core-aware point-to-point communication
– User-defined processor affinity for multi-core platforms
– Multi-core-optimized collective communication
– Asynchronous and scalable on-demand connection management
– RDMA Write and RDMA Read-based protocols
– Lock-free asynchronous progress for better overlap between computation and communication
– Polling and blocking support for communication progress
– Multi-pathing support leveraging the LMC mechanism on large fabrics
– Network-level fault tolerance with Automatic Path Migration (APM)
– Mem-to-mem reliable data transfer mode (for detection of I/O errors with 32-bit CRC)
– Network Fault Resiliency
Major Features of MVAPICH 1.2 (Continued)
• OpenFabrics-Gen2-Hybrid
– Interface introduced in 1.1; replaces the UD interface in 1.0
– Targeted at emerging multi-thousand-core clusters to achieve the best performance with a minimal memory footprint
– Most of the features as in Gen2
– Adaptive selection during run-time (based on application and system characteristics) to switch between RC and UD (or between XRC and UD) transports
– Multiple buffer organization with XRC support
MVAPICH2 Architecture (Latest Release 1.6)
[Architecture diagram, analogous to the MVAPICH-1 diagram above, covering all the different PCI interfaces. Major computing platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ...]
MVAPICH2 1.6 Features
• Support for GPUDirect
• Using LiMIC2 for true one-sided intra-node RMA transfer to avoid extra memory copies (a basic RMA example is sketched below)
• Upgraded to LiMIC2 version 0.5.4
• Removed the limitation on the number of concurrent windows in RMA operations
• Support for InfiniBand Quality of Service (QoS) with multiple virtual lanes
• Support for 3D torus topology
• Enhanced support for multi-threaded applications
• Fast checkpoint-restart support with an aggregation scheme
• Job pause-migration-restart framework for pro-active fault tolerance
• Support for the new standardized Fault-Tolerance Backplane (FTB) events for the CR and migration frameworks
• Dynamic detection of multiple InfiniBand adapters, used by default in multi-rail configurations
• Support for process-to-rail binding policies (bunch, scatter and user-defined) in multi-rail configurations
• Enhanced and optimized algorithms for MPI_Reduce and MPI_Allreduce for small and medium message sizes
• XRC support with the Hydra process manager
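For readers unfamiliar with the one-sided (RMA) operations and windows mentioned above, here is a minimal, generic MPI-2 example. It uses only standard MPI calls and nothing MVAPICH-specific; the intra-node LiMIC2 path is transparent to code like this.

/* Minimal MPI-2 one-sided example: every rank puts its rank id
   into a window exposed by rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Rank 0 exposes an array of nprocs ints; other ranks expose nothing. */
    int *base = NULL;
    MPI_Aint win_size = (rank == 0) ? (MPI_Aint)nprocs * sizeof(int) : 0;
    if (rank == 0)
        base = calloc(nprocs, sizeof(int));

    MPI_Win win;
    MPI_Win_create(base, win_size, sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* Fence-synchronized epoch: each rank writes directly into rank 0's memory. */
    MPI_Win_fence(0, win);
    MPI_Put(&rank, 1, MPI_INT, 0 /* target rank */, rank /* displacement */,
            1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("slot %d = %d\n", i, base[i]);
        free(base);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}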
Support for Multiple Interfaces/Adapters
• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
– All IB adapters supporting OpenFabrics/Gen2
• QLogic/PSM
– QLogic adapters
• OpenFabrics/Gen2-iWARP
– Chelsio and Intel-NetEffect
• RoCE
– ConnectX-EN
• uDAPL
– Linux-IB, Solaris-IB and any other adapter supporting uDAPL
• TCP/IP
– Any adapter supporting the TCP/IP interface
• Shared-Memory channel
– For running applications on a node with multi-core processors (laptops, SMP systems)
Presentation Overview
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
MVAPICH2 Inter-Node Performance: Ping-Pong Latency
[Two charts: latency (us) vs. message size (bytes) for MVAPICH2-1.6, one panel for small messages and one for large messages. Small-message latency is 1.56 us. Platform: Intel Westmere 2.53 GHz with a Mellanox ConnectX-2 QDR adapter. The measurement follows the standard ping-pong pattern sketched below.]
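A minimal sketch of the ping-pong latency measurement behind such charts, in the spirit of the OSU micro-benchmarks but simplified; the iteration count and the single fixed message size are illustrative, and the reported latency is half the average round-trip time. Run with two processes.

/* Hedged sketch: two-process ping-pong latency test. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, msg_size = 8;               /* message size in bytes (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = calloc(msg_size, 1);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        printf("%d bytes: %.2f us one-way latency\n",
               msg_size, elapsed * 1e6 / (2.0 * ITERATIONS));

    free(buf);
    MPI_Finalize();
    return 0;
}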
MVAPICH2 Inter-Node Performance: Bandwidth
[Two charts: bandwidth (MB/s) vs. message size (bytes) for MVAPICH2-1.6, one panel for uni-directional bandwidth and one for bi-directional bandwidth. Peak uni-directional bandwidth is 3394 MB/s; peak bi-directional bandwidth is 6539 MB/s. Platform: Intel Westmere 2.53 GHz with a Mellanox ConnectX-2 QDR adapter. The uni-directional measurement typically uses a window of non-blocking sends, as sketched below.]
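A minimal sketch of a uni-directional bandwidth test, again in the spirit of the OSU micro-benchmarks; the window size, iteration count and message size are illustrative choices. Run with two processes.

/* Hedged sketch: uni-directional bandwidth test. Rank 0 streams a window
   of non-blocking sends; rank 1 pre-posts matching receives and acks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                          /* messages in flight per iteration */
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, msg_size = 1 << 20;          /* 1 MB messages (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(msg_size);
    MPI_Request reqs[WINDOW];
    char ack = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0) {
        double total_mb = (double)msg_size * WINDOW * ITERS / 1e6;
        printf("%d bytes: %.1f MB/s\n", msg_size, total_mb / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}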
Performance of HPC Applications on TACC Ranger using MVAPICH + IB