Page 1:

Managed by UT-Battelle for the Department of Energy

MPI Must Evolve or Die!

Research sponsored by ASCR

Al Geist, Oak Ridge National Laboratory

September 9, 2008

EuroPVM-MPI Conference, Dublin, Ireland
Heterogeneous multi-core

Page 2:

Exascale software solution can’t rely on “Then a miracle occurs”

Hardware developer describing exascale design says “Then a miracle occurs”

System software engineer replies “I think you should be more explicit”

Page 3:

Acknowledgements

Harness Research Project (Geist, Dongarra, Sunderam). The same team that created PVM has continued the exploration of heterogeneous and adaptive computing.

Acknowledge the team members whose ideas and research on the Harness project are being presented in this talk.

Bob Manchek, Graham Fagg, June Donato

Jelena Pješivac-Grbović, George Bosilca, Thara Angskun

Magdalena Slawinska, Jaroslaw Slawinski, Edgar Gabriel

Research sponsored by ASCR

Interesting observation

PVM use is starting to grow again. Support questions have doubled in the past year. We are even getting queries from HPC users who are desperate for fault tolerance.

Apologies to anyone I missed

Page 4:

Example of a Petaflops System - ORNL (late 2008)
Multi-core, homogeneous, multiple programming models

DOE Cray “Baker” 1 Petaflops system

13,944 dual-socket, 8-core SMP “nodes” with 16 GB

27,888 quad-core Barcelona processors, 2.3 GHz (37 Gflops each)

223 TB memory (2GB/core)

200+ GB/s disk bandwidth

10 PB storage

6.5 MW system power

150 cabinets, 3,400 ft²

Liquid-cooled cabinets
Compute Node Linux operating system
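
As a quick consistency check of the listed specs (my arithmetic, not from the slide): 27,888 processors × 4 cores × 2 GB/core = 223,104 GB ≈ 223 TB of memory, and 27,888 processors × 37 Gflops ≈ 1.03 Pflops peak.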

Page 5:

MPI Dominates Petascale Communication
Survey of top HPC open science applications

[Chart: surveyed applications grouped under “Must have” MPI and “Can use” MPI]

Page 6:

The answer is MPI. What is the question?

Applications may continue to use MPI due to:
• Inertia – these codes take decades to create and validate
• Nothing better – developers need a BIG incentive to rewrite (not 50%)
Meanwhile, communication libraries are being changed to exploit new petascale systems, giving applications more life.
• Hardware support for MPI is pushing this out even further

Business as usual has been to improve latency and/or bandwidth.
• But large-scale, many-core, heterogeneous architectures require us to think further outside the box

It is not business as usual inside petascale communication libraries:
• Hierarchical algorithms
• Hybrid algorithms
• Dynamic algorithm selection
• Fault tolerance

Page 7:

Hierarchical Algorithms

Hierarchical algorithm designs seek to consolidate information at different levels of the architecture to reduce the number of messages and contention on the interconnect.

Architecture Levels: socket, node, board, cabinet, switch, system

PVM Project studied hierarchical collective algorithms using clusters of clusters (simple 2-level model)

Communication within cluster was 10X faster than between clusters

Found improvements in the range of 2X-5X, but this was not pursued because HPC machines at the time had only one level. It needs rethinking for petascale systems.
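
A minimal sketch of the two-level idea (my illustration, not Harness code): split MPI_COMM_WORLD into per-node communicators, reduce inside each node first, then let the node leaders combine results across the interconnect. The node_id argument is assumed to come from some node-identification mechanism on the target machine.

```c
#include <mpi.h>

/* Two-level (node/system) allreduce sketch. Assumes the caller supplies a
 * node_id that is identical for all ranks sharing a node. */
double two_level_allreduce_sum(double local, int node_id)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Level 1: group the ranks that share a node. */
    MPI_Comm_split(MPI_COMM_WORLD, node_id, 0, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Reduce within the node to the node leader (rank 0 of node_comm). */
    double node_sum = 0.0;
    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Level 2: only the node leaders talk across the interconnect. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);
    double global_sum = 0.0;
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Leaders broadcast the global result back within their node. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    MPI_Comm_free(&node_comm);
    return global_sum;
}
```

Only one message per node crosses the slow level, which is the effect the cluster-of-clusters study above was measuring.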

Page 8:

Hybrid Algorithms

Hybrid algorithm designs use different algorithms at different levels of the architecture, for example using a shared memory algorithm within a node or on an accelerator board such as Cell, and a message passing algorithm between nodes.

The PVM Project studied hybrid message-passing algorithms using heterogeneous parallel virtual machines.

Communication optimized for the custom HW within each computer

Today all MPI implementations do this to some extent. But there is more to be done for new heterogeneous systems
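
A minimal hybrid sketch (my example, using OpenMP threads for the intra-node shared-memory part rather than the Cell case on the slide): threads reduce over shared memory with no messages at all, then one MPI process per node combines the partial sums. The internode_comm argument is assumed to contain exactly one rank per node.

```c
#include <mpi.h>

/* Hybrid reduction: shared memory (OpenMP) inside the node, message
 * passing (MPI) between nodes. */
double hybrid_sum(const double *x, long n, MPI_Comm internode_comm)
{
    double node_sum = 0.0;

    /* Intra-node: threads reduce over shared memory; no messages. */
    #pragma omp parallel for reduction(+ : node_sum)
    for (long i = 0; i < n; i++)
        node_sum += x[i];

    /* Inter-node: one MPI process per node combines the partial sums. */
    double global_sum = 0.0;
    MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  internode_comm);
    return global_sum;
}
```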

[Image: Roadrunner]

Page 9:

Adaptive Communication Libraries

The algorithm is dynamically selected from a set of collective communication algorithms based on multiple metrics such as:

• Number of tasks being sent to
• Where they are located in the system
• The size of the message being sent
• The physical topology and particular quirks of the system

At run time, a decision function is invoked to select the “best” algorithm for the particular collective call.

Steps in the optimization process:
1. Implementation of different MPI collective algorithms
2. MPI collective algorithm performance information (“optimal” MPI collective operation implementation)
3. Decision / algorithm selection process
4. Decision function – automatically generate code based on step 3

Harness Project explored having adaptive MPI collectives
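
To make the run-time decision function concrete, here is an illustrative (made-up) example for a broadcast: the algorithm is picked from the communicator size and message size. The thresholds and the algorithm set are invented for the sketch; in Harness they would be generated automatically from the measured performance data described in the steps above.

```c
#include <stddef.h>

/* Illustrative decision function: pick a broadcast algorithm from the
 * number of tasks and the message size. Thresholds are placeholders; a
 * tuned library derives them from measurements on the target machine. */
enum bcast_alg { BCAST_LINEAR, BCAST_BINOMIAL, BCAST_SCATTER_ALLGATHER };

static enum bcast_alg choose_bcast(int comm_size, size_t msg_bytes)
{
    if (comm_size <= 8)
        return BCAST_LINEAR;            /* few tasks: root sends directly  */
    if (msg_bytes < 4096)
        return BCAST_BINOMIAL;          /* small messages: latency-bound   */
    return BCAST_SCATTER_ALLGATHER;     /* large messages: bandwidth-bound */
}
```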

Page 10:

Harness Adaptive Collective Communication

Flow of the Harness approach: MPI collective algorithm implementations → exhaustive testing (or performance modeling) → optimal MPI collective implementation → decision process → decision function.

Performed just once on a given machine.

Page 11:

Decision / Algorithm Selection Process
Three Different Approaches Explored

Parametric data modeling: use algorithm performance models to select the algorithm with the shortest completion time (Hockney, LogGP, PLogP, …)

Image encoding techniques: use graphics encoding algorithms to capture information about algorithm switching points

Statistical learning methods: use statistical learning methods to find patterns in algorithm performance data and to construct decision systems

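As an example of the parametric approach (my sketch, with standard textbook cost estimates rather than Harness’s own models): the Hockney model writes the time to send m bytes as α + βm, and the decision reduces to comparing the modelled completion times of the candidate algorithms.

```c
#include <math.h>

/* Hockney-model (T = alpha + beta*m) comparison of two broadcast
 * algorithms for p processes and an m-byte message. alpha (latency) and
 * beta (time per byte) are measured once per machine. */
static double t_binomial(int p, double m, double alpha, double beta)
{
    return ceil(log2((double)p)) * (alpha + beta * m);
}

static double t_scatter_allgather(int p, double m, double alpha, double beta)
{
    return (log2((double)p) + p - 1) * alpha
         + 2.0 * ((double)(p - 1) / p) * beta * m;
}

/* Returns nonzero if the binomial tree is predicted to be faster. */
static int binomial_is_faster(int p, double m, double alpha, double beta)
{
    return t_binomial(p, m, alpha, beta)
         < t_scatter_allgather(p, m, alpha, beta);
}
```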

Page 12:

Fault Tolerant Communication

Harness Project was where FT-MPI was created to explore ways that MPI could be modified to allow applications to “run through” faults.

Accomplishments of FT-MPI research:
• Define the behavior of MPI in case an error occurs
• Give the application the possibility to recover from a node failure
• A regular, non fault-tolerant MPI program will run using FT-MPI
• Stick to the MPI-1 and MPI-2 specification as closely as possible (e.g. no additional function calls)
• Provide the notification to the application
• Provide recovery options for the application to exploit if desired

What FT-MPI does not do:
• Recover user data (e.g. automatic check-pointing)
• Provide transparent fault tolerance

Page 13:

FT-MPI recovery options

ABORT: just abort the job, as other MPI implementations do

BLANK: leave a hole in the communicator

SHRINK: re-order processes to make a contiguous communicator (some ranks change)

REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD

Key to allowing MPI applications to “run through” faults: developing a COMM_CREATE that can build a new MPI_COMM_WORLD. Four options were explored (abort, blank, shrink, rebuild).

As a convenience a fifth option to shrink or rebuild ALL communicators inside an application at once was also investigated.
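
A schematic of the “run through” pattern (not the FT-MPI API, whose calls are not reproduced here; this sketch uses only standard MPI-2 pieces): the error handler is switched to MPI_ERRORS_RETURN so a failure is reported to the application instead of aborting the job, and the application then decides how to recover. Under FT-MPI’s REBUILD option the library would re-spawn the lost processes and hand back a usable MPI_COMM_WORLD; restoring user data remains the application’s job.

```c
#include <mpi.h>
#include <stdio.h>

/* Iterate a computation but survive communicator failures, schematically. */
void iterate_with_recovery(double *state, int n)
{
    /* Report errors to the caller instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    for (int step = 0; step < 1000; step++) {
        int rc = MPI_Allreduce(MPI_IN_PLACE, state, n, MPI_DOUBLE,
                               MPI_SUM, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            /* A peer has failed. With FT-MPI the notification arrives here
             * and, under REBUILD, the communicator comes back with the lost
             * ranks re-spawned; the application restores its own data
             * (e.g. from a checkpoint) and retries the step. */
            fprintf(stderr, "communicator failure at step %d\n", step);
            /* application-level recovery would go here */
        }
    }
}
```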

Page 14:

Future of Fault Tolerant Communication

The fault tolerant capabilities and datatypes in FT-MPI are now becoming a part of the OpenMPI effort.

Fault Tolerance is under consideration by the MPI forum as a possible addition to the MPI-3 standard


Page 15:

Getting Applications to Use this Stuff

All these new multi-core, heterogeneous algorithms and techniques are for naught if we don’t get the science teams to use them.

ORNL’s Leadership Computing Facility uses a couple of key methods to get the latest algorithms and system-specific features used by the science teams: Science Liaisons and Centers of Excellence.

Science Liaisons from the Scientific Computing Group are assigned to every science team on the leadership system. Their duties include:

• Scaling algorithms to the required size
• Application and library code optimization and scaling
• Exploiting parallel I/O & other technologies in apps
• More…

Page 16:

Centers of Excellence

ORNL has a Cray Center of Excellence and a Lustre Center of Excellence. One of their missions is to have vendor engineers engage directly with users to help them with the latest techniques to get scalable performance.


But having science liaisons and help from vendor engineers is not a scalable solution for the larger community of users, so we are creating…

Page 17:

Harness Workbench for Science Teams
Eclipse (Parallel Tools Platform)

Help the user by building a tool that can apply basic knowledge of developer, admin, and vendor

Integrated with runtime

Available to LCF science liaisons this summer

Page 18:

Next Generation Runtime
Scalable Tool Communication Infrastructure (STCI)

Harness runtime environment (underlying the Harness workbench, adaptive communication, and fault recovery) – which was generalized into STCI

Open runtime environment, OpenRTE (underlying OpenMPI) – adopted the emerging RTE

Scalable Tool Communication Infrastructure: execution context, sessions, communications, persistence, security

High-performance, scalable, resilient, and portable communications and process control services for user and system tools: the parallel run-time environment (MPI), application correctness tools, performance analysis tools, and system monitoring and management

Page 19:

Petascale to Exascale requires a new approach:

Try to break the cycle of HW vendors throwing the latest giant system over the fence and leaving it to the system software developers and applications to figure out how to use the latest HW (billion-way parallelism at exascale)

Try to get applications to rethink their algorithms and even their physics in order to better match what the HW can give them (memory wall isn’t going away)

Meet in the middle – change what “balanced system” means

Synergistically Developing Architecture and Algorithms Together

Creating a Revolution in Evolution

The Institute for Advanced Architectures and Algorithms has been established as a Sandia/ORNL joint effort to facilitate the co-design of architectures and algorithms in order to create synergy in their respective evolutions.

Page 20:

Summary

It is not business as usual for petascale communication:
• No longer just about improved latency and bandwidth
• But MPI is not going away

Communication libraries are adapting:
• Hierarchical algorithms
• Hybrid algorithms
• Dynamically selected algorithms
• Allowing “run through” fault tolerance

But we have to get applications to use these new ideas

Going to exascale, communication needs a fundamental shift: break the deadly cycle of hardware being thrown “over the fence” for the software developers to figure out how to use.

Is this crazy talk?

Evolve or Die

Page 21:

Questions?
