Martin Schulz
Technische Universität München
Fakultät für Informatik
SOS Workshop
Asheville, NC, USA
March 2019
Adapt or Die: The Challenge for MPI in a Post-Exascale World
Rising complexity of architectures
• On node: accelerators, deep memory hierarchies, …
• Off node: new I/O systems, high-dim. networks, …
Rising complexity of applications
• New algorithms
• Ensemble computation for UQ, scale-bridging, …
Holistic HW/SW Co-Design to map applications to architectures
• Substantial work in the HW and Application layers
• Includes design of the middleware layer
Challenges will get even harder
• Severe resource limitations in power/energy, network, I/O, …
• Increased variability even on homogeneous systems
• Complex workflows with varying demands
• New workloads with new requirements and pain points
Fighting the Uphill Battle to Exascale
Adaptivity is needed for efficient resource utilization
• Worst case provisioning is a waste of resources
• But limited resources lead to contention, variability, …
• Need to actively manage resources (e.g., for power, I/O, …)
Adaptivity is needed to counteract variability
• Increasing variability will be the new normal
• Need to actively balance and shift workloads
Adaptivity is needed to manage complex workflows
• Single, static applications in pure SPMD style will be a thing of the past
• Coupling of components for UQ and/or scale bridging
• Need to actively schedule components with varying demands
Adaptivity is needed by new workloads, especially in ML/DL/AI
This adaptivity must be managed and exploited across the whole software stack:
hardware, middleware and application – and that includes MPI
Adaptivity will be Key
MPI’s Philosophy
• MPI is running the show – once started very little external control
• MPI is controlling progress – little interaction with other runtimes
• MPI has fixed resources – MPI_COMM_WORLD
Some of the tried approaches
• Dynamic process management (since MPI 2)
• Helper threads (for BG/L)
• Many research projects on external progress, FT, …
Clearly not sufficient, even for HPC
• Moldability is present but adopted nowhere
• Malleability is not available at all; HPC applications use checkpoint/restart (C/R) instead
• Use of secondary communication systems is growing
PLUS: We are not addressing needs of new communities!
Question: (How) Can We Make MPI More Adaptive?
Where Is MPI When it Comes to Adaptivity?
Need to maintain the key flavor of MPI
• Dynamicity should be limited and controlled / coarse grained
• Communicators are static and won’t change all the time (or ever?)
• Keep the well known communication constructs
• “Pockets/Phases” of program need to behave as before
• Inner kernels don’t change
• Low learning curve and easy change-over for MPI versed users
Need to work with external resource managers and runtimes
• Two-Way Communication of the MPI runtime with the outside world
• Ability to express needs and requirements
• Ability to react to external events and changing conditions
• Has to be able to support resource sharing
• Should still be agnostic to other resource usages
Application logic needs to stay in the application
• No automatic data re-distribution
• No rewriting of application state
Things to Consider Towards a Malleable MPI
Research project on “Invasive Computing”
• State of the art scientific applications utilize algorithms with evolving properties
• AMR, changing meshes, …
• The current assignment of fixed resources to these applications is suboptimal
➢ “Invading instead of wasting HPC resources”
Approach:
• Resource manager initiates
shrink or grow
• Application checks for changes
during adaptation window
• Typically done at iteration
boundaries
• Enables controlled change of
MPI_COMM_WORLD
Example: iMPI, a.k.a. Elastic MPI
Gerndt, Compres, et al.
MPI_Init_adapt(…)
• Initializes the library in adaptive mode
MPI_Probe_adapt(…)
• Probes the resource manager for adaptations
MPI_Comm_adapt_begin(…)
• Marks the beginning of an adaptation window
• Provides a set of helper communicators
MPI_Comm_adapt_commit(…)
• Marks the end of an adaptation window
• Sets adapted MPI_COMM_WORLD
Proposed Resource Negotiation API
Gerndt, Compres, et al.
int main(int argn, char **argc) {
    int local_status, adapt;
    MPI_Init_adapt(&argn, &argc, &local_status);
    for (...) {
        MPI_Probe_adapt(&adapt, ...);
        if (local_status == MPI_ADAPT_STATUS_JOINING
            || adapt == MPI_ADAPT_TRUE) {
            MPI_Comm_adapt_begin(...);
            // adaptation window's body with
            // data redistribution code
            MPI_Comm_adapt_commit(...);
        }
        // compute and MPI code
    }
    return 0;
}
Code Example
Gerndt, Compres, et al.
Example: Earthquake Simulation
Gerndt, Compres, et al.
How to integrate this into the MPI standard?
• What kind of adaptivity to support?
• Cooperative or not?
• How fine-grained should, and how coarse-grained can, the adaptation be?
• How to interface with runtimes and resource managers?
• New SRI initiative based on ideas from PMI and PMIx
• How to guide/decide on adaptation?
How Could We Extend the Concept of iMPI
• Capturing best configurations
• Predicting runtime and power usage
• Estimating network load
Prerequisite: continuous monitoring
• Across all components and systems
• Across the entire software stack
• For all users and applications
Need To Understand Workloads
[Figure: software stack – Application; Libraries; Prg. Model (Msg., PGAS, DSL); OS/Comm. (MPI, Thrds., Tasks); Hardware (CPU, NUMA, Netw.) – spanned by stack-wide data collection & semantic correlation]
LRZ’s Data Center DataBase (DCDB)
[Figure: DCDB architecture – a dcdbpusher with IPMI, perf events, XML (Clustsafe), SNMP, BACnet, and sysfs plugins publishes sensor data via MQTT to a collect agent; a sensor data cache and database interface feed libdcdb and REST APIs for operations monitoring, user/admin access, and data analysis. Legend: planned / ongoing / in use.]
Source: Michael Ott, Daniele Tafani, LRZ
Must include application information
• Key: progress information
• Interfaces like LLNL’s Caliper can help
• Integration of low-level resources like hardware counters
From the MPI perspective:
• Must include MPI internal configuration information → MPI_T CVARs
• Must include MPI internal performance data → MPI_T PVARs
• Must include adaptivity trigger points within MPI → MPI_T Events
• Must include analysis hooks from several layers
Need To Understand Workloads
MPI Profiling Interface offers convenient hooks
• One tool, user controlled
• Ideal for performance tools
Mechanisms to support multiple tools
• Tool stacks (e.g., PnMPI)
• Still under user control
Needed: tool stacks with multiple players
• System configurations
• Runtime enhancements
• Tools
• Under one roof, with one interface
• Independent from each other
• Function pointer based
➢ Code name: QMPI
Prototype coming to a system near you soon!
Hooks for Ubiquitous Analysis
[Figure: Application on top of MPI libraries, with a tracing tool, a profiling tool, an end-user tool, a system monitor, and a power runtime all hooked in side by side]
How to integrate this into the MPI standard?
• What kind of adaptivity to support?
• Cooperative or not?
• How fine-grained should the adaptation be?
• How to interface with runtimes and resource managers?
• New SRI initiative based on ideas from PMI and PMIx
• How to guide adaptation?
Adaptation of communicators is problematic
• iMPI only supports COMM_WORLD
• Need subsetting
• Changes assumptions of application significantly
• Adaptation of libraries not simple
If we only had a way to provide
• Initialize multiple and possibly changing COMM_WORLDS
• Provide isolation between communicators
• Query process sets from the runtime
Coming back to: How Could We Extend the Concept of iMPI
Started as simple “local” initialization
• Grown to an “amorphous being”
• Could be a concept to support adaptivity
• Could be a concept to support fault tolerance
• Could be a concept to support isolation
Basic scheme
1. Get local access to the MPI library
Get a Session Handle
2. Query the underlying run-time system
Get a “set” of processes
3. Determine the processes you want
Create an MPI_Group
4. Create a communicator with just those processes
Create an MPI_Comm
MPI Sessions Is the Answer! Is MPI Sessions the Answer?
[Figure: MPI Sessions – an MPI_Session handle yields a set of processes, from which an MPI_Group and then an MPI_Comm are derived]
MPI Session’s intended goals
• No more implicit MPI_COMM_WORLD
• Enable runtime information to flow into MPI
• Creation of communicators without parent communicators
Within a single MPI Session
• Query process sets
• Derive (static) groups
• Derive (static) communicators
• Static MPI bubbles/universes/enclaves …
Is MPI Sessions the Answer? Perhaps
[Figure: an MPI_Session yielding multiple sets of processes, with MPI_Group and MPI_Comm objects derived from them – what if …]
Options
What if: process sets could change over time?
Within a session
• Query new set sizes
• Create new communicators
• User code for proper switch-over
Issue 1: How to ensure everyone sees the same process set?
• Set versioning
• Communicator creation fails if derived from groups derived from sets with
different versions
• Iteration until successful creation of a new communicator
Issue 2: Integration of new MPI processes
• Need trigger mechanism to notify existing processes
• Enlarged communicator created with new processes
• Distribute data
• Free old communicator
Option 1: Dynamic Set Management
What if: a runtime/RM could influence/terminate a session?
On changing resources a runtime invalidates a session bubble
• Based on the used process set(s)
• Once bubble is invalidated
• Either disallow communication (return error)
• Or issue warning
Application can/has to react to runtime input
• Create new session bubble with new processes
• Redistribute data
• Cleanup and delete old session bubble
Could also be a clean recipe for fault tolerance
• Resource isolation
• Clean composability
Option 2: Runtime Impact on Bubbles
Ability to reason about all MPI objects that are
• … derived from the same local session(s)
• … part of the same process groups
• … create their own isolated resources
These objects form a natural group
• Isolated from the world in terms of communication
• Could be revoked without global impact
• Granularity of malleability
• Keeping flavor of MPI
• Maintaining coarse granularity
• Clean concept for FT
Open issue: non-“sessioned” MPI objects
• Datatypes
• Info objects
• MPI Tools Information Interface
Let’s Make Pigs Fly: Towards a Global Session (Bubbles) Concept
Systems Require More Adaptivity
• Key to support future resource-constrained systems
• But: the current MPI is not flexible enough
We need the ability to dynamically adapt
• Step 1: Introspection
MPI_T & QMPI are important parts of this
• Step 2: Support runtime/RM interactions
Two-way negotiation abilities → SRI efforts
• Step 3: Enable changing resources
Growing and shrinking process sets
MPI Sessions is a good step, but …
• Need ability to reason about global concepts
• Need ability to query external influence
• Need ability to capture and query changes
Needs to be Use Case Driven - Join the Discussion!
MPI Needs to Get More Adaptive