Page 1

ADVANCED SCIENTIFIC COMPUTING

Dr.-Ing. Morris Riedel, Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland; Research Group Leader, Juelich Supercomputing Centre, Germany

Parallel Programming with MPI
August 31, 2017
Room Ág-303

High Performance Computing

LECTURE 3

Page 2

Review of Lecture 2 – Parallelization Fundamentals

Strategies for Parallelization


Terminology

Modified from [1] Introduction to High Performance Computing for Scientists and Engineers

(SPMD Example)

Modified from [2] Caterham F1 team races past competition with HPC

(MPMD Example)


Page 3

Outline of the Course

1. High Performance Computing

2. Parallelization Fundamentals

3. Parallel Programming with MPI

4. Advanced MPI Techniques

5. Parallel Algorithms & Data Structures

6. Parallel Programming with OpenMP

7. Hybrid Programming & Patterns

8. Debugging & Profiling Techniques

9. Performance Optimization & Tools

10. Scalable HPC Infrastructures & GPUs


11. Scientific Visualization & Steering

12. Terrestrial Systems & Climate

13. Systems Biology & Bioinformatics

14. Molecular Systems & Libraries

15. Computational Fluid Dynamics

16. Finite Elements Method

17. Machine Learning & Data Mining

18. Epilogue

+ additional practical lectures for our hands-on exercises in context


Page 4

Outline

Message Passing Interface (MPI)
Review Distributed Memory Systems
Point-to-Point Message Passing Functions
Understanding MPI Collectives
MPI Rank & Communicators
Standardization & Portability

MPI Parallel Programming Basics
Environment with Libraries & Modules
Thinking Parallel
Basic Building Blocks of a Parallel Program
Code Compilation & Parallel Executions
Simple PingPong Application Example


Promises from previous lecture(s):

Lecture 1: Lecture 3 & 4 will give in-depth details on the distributed-memory programming model with MPI


Page 5

Message Passing Interface (MPI)


Page 6

Distributed-Memory Computers Reviewed

Processors communicate via Network Interfaces (NI). The NI mediates the connection to a communication network. This setup is rarely found in this pure form today; it reflects the programming model view.

A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

Modified from [1] Introduction to High Performance Computing for Scientists and Engineers

Programming Model: Message Passing

Page 7

Programming with Distributed Memory using MPI

No remote memory access on distributed-memory systems: processes P1 … P5 need to 'send messages' back and forth between each other.

Many free Message Passing Interface (MPI) libraries are available. Programming is tedious & complicated, but it is the most flexible method.

Distributed-memory programming enables explicit message passing as communication between processors

MPI is the dominant distributed-memory programming standard today (v3.1)

[3] MPI Standard


Page 8

What is MPI?

'Communication library' abstracting from the low-level network view
Offers 500+ available functions to communicate between computing nodes
Practice reveals: parallel applications often require just ~12 (!) functions
Includes routines for efficient 'parallel I/O' (using the underlying hardware)

Supports 'different ways of communication'
'Point-to-point communication' between two computing nodes (P ↔ P)
Collective functions involve 'N computing nodes in useful communication'

Deployment on Supercomputers
Installed on (almost) all parallel computers
Different languages: C, Fortran, Python, R, etc.
Careful: different versions might be installed

Recall: 'computing nodes' are independent computing processors (that may also have N cores each) and are all part of one big parallel computer.


Page 9

Message Passing: Exchanging Data with Send/Receive

Each processor has its own data in its memory that cannot be seen/accessed by other processors.

(Figure: an HPC machine with compute nodes P1 … P6, each pairing a processor P with its own memory M; local values DATA: 17, 06, 19, 80 are exchanged via point-to-point communications and arrive as NEW: 17 and NEW: 06 on other nodes.)

Page 10

Collective Functions: Broadcast (one-to-many)

Lecture 5 will provide some more detailed examples of how collective communication is used

Broadcast distributes the same data to many or even all other processors.

(Figure: the root node's DATA: 17 is sent to all other nodes, which receive it as NEW: 17; their own DATA: 06, 19, 80 stay local.)
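As an illustrative sketch (not from the original slide), a broadcast of a single integer from rank 0 could look as follows in C, using the standard MPI_Bcast call; the variable names are assumptions:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) data = 17;   /* only the root process holds the value initially */
    /* every process calls MPI_Bcast; afterwards all ranks hold the same value 17 */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d now has data %d\n", rank, data);
    MPI_Finalize();
    return 0;
}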

Page 11

Collective Functions: Scatter (one-to-many)

Scatter distributes different data to many or even all other processors.

Lecture 5 will provide some more detailed examples of how collective communication is used

(Figure: the root node's DATA: 10, 20, 30 are distributed so that each other node receives a different element as NEW: 10, NEW: 20, NEW: 30; the nodes' own DATA: 06, 19, 80 stay local.)
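As an illustrative sketch (not from the original slide), scattering one different integer to each process with the standard MPI_Scatter call might look like this; the array contents and variable names are assumptions, and the sketch assumes exactly 4 processes:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, recv;
    int senddata[4] = {10, 20, 30, 40};   /* only meaningful on the root (rank 0) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* rank 0 distributes one different element to every process (including itself) */
    MPI_Scatter(senddata, 1, MPI_INT, &recv, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d received %d\n", rank, recv);
    MPI_Finalize();
    return 0;
}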

Page 12

Collective Functions: Gather (many-to-one)

Gather collects data from many or even all other processors to one specific processor.

Lecture 5 will provide some more detailed examples of how collective communication is used

(Figure: the values DATA: 06, 19, 80 held by the other nodes are collected on one node as NEW: 06, NEW: 19, NEW: 80, next to its own DATA: 17.)
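As an illustrative sketch (not from the original slide), collecting one integer from every process on rank 0 with the standard MPI_Gather call; variable names are assumptions and the receive buffer here is sized for at most 4 processes:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, i;
    int mydata, gathered[4];              /* gathered[] is only used on the root; sketch assumes size <= 4 */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    mydata = rank * 10;                   /* each process contributes its own value */
    /* rank 0 collects one element from every process into gathered[] */
    MPI_Gather(&mydata, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("Element from rank %d: %d\n", i, gathered[i]);
    MPI_Finalize();
    return 0;
}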

Page 13

Collective Functions: Reduce (many-to-one)

Reduce combines collection with computation based on data from many or even all other processors.

Lecture 5 will provide some more detailed examples of how collective communication is used

Usage of reduce includes finding a global minimum or maximum, sum, or product of the different data located at different processors.

(Figure: the values DATA: 17, 06, 19, 80 on the different nodes are combined with '+' – a global sum as example – and the result NEW: 122 is delivered to one node.)
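As an illustrative sketch (not from the original slide), a global sum with the standard MPI_Reduce call and the predefined MPI_SUM operation; variable names are assumptions:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, mydata, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    mydata = rank + 1;   /* each process holds its own local value */
    /* combine all local values with MPI_SUM; only rank 0 receives the global sum */
    MPI_Reduce(&mydata, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Global sum: %d\n", sum);
    MPI_Finalize();
    return 0;
}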

Page 14

MPI Communicators & MPI Rank

Each MPI activity specifies the context in which a corresponding function is performed
MPI_COMM_WORLD (region/context of all processes)
Create (sub-)groups of the processes / virtual groups of processes
Perform communications only within these sub-groups, easily and with well-defined processes
Using communicators wisely in collective functions can reduce the number of affected processors
MPI rank is a unique number for each processor

[4] LLNL MPI Tutorial

Lecture 4 will provide pieces of information about the often used MPI cartesian communicator

(numbers reflect the unique identity of a processor, named 'MPI rank')
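The slide does not show code for creating sub-groups; as a minimal sketch (an assumption, not the slide's example), MPI_Comm_split can divide MPI_COMM_WORLD into two sub-communicators, here by an illustrative even/odd 'color':

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_rank, sub_rank, color;
    MPI_Comm subcomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    color = world_rank % 2;   /* even ranks form one group, odd ranks the other */
    /* all processes passing the same color end up in the same sub-communicator */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);   /* each process gets a new rank within its sub-group */
    printf("World rank %d has rank %d in sub-group %d\n", world_rank, sub_rank, color);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}

Collective functions called on subcomm then involve only the processes of that sub-group.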

Page 15

Is MPI yet another network library?

TCP/IP and socket programming libraries are plentifully available
Do we need a dedicated communication & network protocols library?
Goal: simplify parallel programming, focus on applications

Selected reasons:
Designed for performance within large parallel computers (e.g. no security)
Supports various interconnects between 'computing nodes' (hardware)
Offers various benefits like 'reliable messages' or 'in-order arrivals'

MPI is not designed to handle arbitrary communication in computer networks and is thus very special:
Not good for clients that constantly establish/close connections again and again (this would have very slow performance in MPI)
Not good for internet chat clients or Web service servers in the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)

Page 16

Parallel Applications & Simulations use MPI

Pro: Network communication is relatively hidden and supported
Contra: Programming with MPI still requires using 'parallelization methods'
Not easy: writing 'technical code' well integrated in 'problem-domain code'

Example: Race Car Simulation (cf. Lecture 2)
Apply a good parallelization method (e.g. domain decomposition)
Manually write good MPI code for the (technical) communication between processors (e.g. across 1024 cores)
Integrate the technical code well with the problem-domain code (e.g. computational fluid dynamics & airflow)

Modified from [2] Caterham F1 team

Page 17

MPI as Open Standard

Many vendors have provided supercomputers/clusters in the past and still do today
Libraries in addition to the OS are required to support message passing
Proprietary and vendor-specific libraries existed up until the early 1990s

The MPI 'Joint Standardization' Forum
Members from many organizations define and maintain the MPI standard
MPI 1.0 in 1994; MPI 2.0 in 1997; MPI 2.2 in 2009; MPI 3.0 is getting used

[3] MPI Standard


Page 18

MPI Standard enables Portability

Key reasons for requiring a standard programming library:
Technical advancement in supercomputers is extremely fast
Parallel computing experts switch organizations and face another system
Applications using proprietary libraries were not portable
Either whole applications had to be created from scratch, or time-consuming code updates were needed

MPI changed this & is the dominant parallel programming model

MPI is an open standard that significantly supports the portability of parallel applications.

(Figure: porting a parallel MPI application from HPC Machine A to HPC Machine B; both provide an MPI library.)

Page 19

[Video] Introducing MPI – Summary

[5] Introducing MPI, YouTube Video


Page 20

MPI Parallel Programming Basics


Page 21

Starting Parallel Programming

Check access to the cluster machine
Check the MPI standard implementation and its version
Often SSH is used to remotely access clusters

OpenMPI – 'Open Source High Performance Computing'
E.g. Openmpi-x86_64; openmpi/1.3.6

Other implementations exist
E.g. the MPICH implementation, 'High-Performance Portable MPI' (we don't use this in this course)

[6] OpenMPI

[7] MPICH

Practical Lecture 3.1 will provide more insights on how to use MPI within a cluster environment


Page 22

HPC Machine Environment – OS & Hostname

Most parallel computers/supercomputers have UNIX operating systems
Exceptions exist: HPC Windows server machines

Often UNIX commands are necessary to work productively
Tools exist to abstract from the underlying UNIX and OS technical aspects
Examples: Parallel Tools Platform (PTP), middleware UNICORE, etc.

Example: 'hostname -A' command on the JOTUNN cluster

[8] Parallel Tools Platform
[9] UNICORE

Practical Lecture 3.1 consists of details on MPI with batch systems and concept of login nodes


Page 23

HPC Machine Environment – Compiler & Modules

Knowledge of the installed compilers is essential (e.g. C, Fortran90, etc.)
Different versions and types of compilers exist (Intel, GNU, MPI, etc.)
E.g. mpicc pingpong.c -o pingpong

Module environment tool
Avoids manually setting up environment information for every application
Simplifies shell initialization and lets users easily modify their environment
Modules can be loaded and unloaded

module avail
Lists all available modules on the HPC system (e.g. compilers, MPI, etc.)

module load
Loads particular modules into the current work environment
E.g. module load gnu openmpi

Practical Lecture 3.1 consists of details on MPI with using compilers & available system modules


Page 24

Start ‘Thinking’ Parallel

Parallel MPI programs know about the existence of the other processes of the program and what their own role is in the bigger picture.

MPI programs are written in a sequential programming language, but executed in parallel. The same MPI program runs on all processes (SPMD).

Data exchange is key for the design of applications
Sending/receiving data at specific times in the program
No shared memory for sharing variables with other remote processes
Messages can be simple variables (e.g. a word) or complex structures

Start with the basic building blocks using MPI
Building up the 'parallel computing environment'

Recall: SPMD stands for Single Program Multiple Data.

(Figure: four processes P running the same program.)
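As an illustrative sketch (not from the original slide) of the SPMD idea: the very same program is started on every process, and only the MPI rank decides which branch a process executes:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("I am the coordinating process (rank 0)\n");    /* e.g. distributes work */
    else
        printf("I am a worker process with rank %d\n", rank);  /* e.g. computes a sub-domain */
    MPI_Finalize();
    return 0;
}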

Page 25

(MPI) Basic Building Blocks: A main() function

The main() function is automatically started when launching a C program.

Normally the 'return code' denotes whether the program exit was ok (0) or problematic (-1).

Practice view: resiliency is not part of MPI (e.g. automatic restarts and error handling); therefore this is rarely used in practice.

‘standard C programming…’


Page 26

(MPI) Basic Building Blocks: Variables & Output

Libraries can be used by including C header files – here, the library for screen outputs, for example.

Two integer variables that are later useful for working with specific data obtained from the MPI library.

Output with printf using the stdio library: 'Hello World' and which process is printing, out of the summary of all n processes.

‘standard C programming…’
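As a sketch of the 'standard C' building blocks described on this and the previous slide (main(), two integer variables, and a printf output), with assumed variable names rank and size:

#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;       /* later filled with values obtained from the MPI library */
    rank = 0; size = 1;   /* placeholders until the MPI calls are added */
    printf("Hello World: process %d of %d\n", rank, size);
    return 0;             /* 0 = ok, -1 would signal a problem */
}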


Page 27

MPI Basic Building Blocks: Header & Init/Finalize

'standard C programming including MPI library use…'

Libraries can be used by including C header files – here, the MPI library is included.

The MPI_Init() function initializes the MPI environment and can take inputs via the main() function arguments.

MPI_Finalize() shuts down the MPI environment (after this statement no parallel execution of the code can take place).
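A sketch of the same skeleton with the MPI header, MPI_Init() and MPI_Finalize() added (again an illustration, not the slide's exact listing):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;            /* still unused here; filled in on the next slide */
    MPI_Init(&argc, &argv);    /* initializes the MPI environment, may use the command-line arguments */
    /* ... parallel part of the program goes here ... */
    MPI_Finalize();            /* shuts down the MPI environment; no parallel execution afterwards */
    return 0;
}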


Page 28

MPI Basic Building Blocks: Rank & Size Variables

‘standard C programming including MPI library use…’

The MPI_COMM_WORLD communicator constant denotes the 'region of communication', here all processes.

The MPI_Comm_size() function determines the overall number n of processes in the parallel program and stores it in the variable size.

The MPI_Comm_rank() function determines the unique identifier of each process and stores it in the variable rank, with values (0 … n-1).
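Putting the building blocks of this and the previous slides together, a sketch of the complete program (variable names assumed) could be:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* overall number n of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique identifier 0 ... n-1 of this process */
    printf("Hello World from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compiled with an MPI compiler wrapper such as mpicc and started with mpirun (cf. the following slides), every process prints one such line.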


Page 29

Compiling & Executing an MPI program

Compilers and linkers need various pieces of information about where include files and libraries can be found
E.g. C header files like 'mpi.h', or Fortran modules via 'use MPI'
Compiling is different for each programming language

Executing the MPI program on 4 processors
Normally via batch system allocations (cf. SLURM on the JOTUNN cluster)
Manual start-up example:

Output of the program
The order of the outputs can vary because the I/O screen is a 'serial resource'

(Figure: four processes, each with processor P and memory M, each printing 'hello' in parallel.)

$> mpirun -np 4 ./hello

This creates 4 processes that produce output in parallel.


Page 30

Practice: Our 4 CPU Program alongside many other Programs

[10] LLView Tool

Maybe our program!


Page 31

(Blocking) Point-to-Point Communication

MPI messages are defined as an array of elements of a particular MPI data type
Basic data types (MPI_INT, MPI_LONG, …)
Derived types (can be specifically defined)

The data types on the sender and receiver sides must match
Otherwise the message passing step/transfer will not succeed

Point-to-point communication takes place between exactly one sender and exactly one receiver.

Both ends are identified uniquely by their ranks.

(Figure: two nodes with processor P and memory M, identified as rank 0 and rank 1.)

[3] MPI Standard


Page 32

Detailed View: The Role of the System Buffer

[4] LLNL MPI Tutorial


Page 33

Sending an MPI Message: MPI_Send

(Figure: rank 0 calls MPI_Send() to transfer a message to rank 1.)

MPI_Send() performs a blocking send: it blocks until the message data has been safely stored away so that the send buffer can be reused (which may, but need not, mean the message has already been received by the destination process).

buf – initial address of send buffer (choice)
count – number of elements in send buffer (non-negative integer)
datatype – datatype of each send buffer element (handle)
dest – rank of destination (integer)
tag – message tag (integer)
comm – communicator (handle)
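For reference, the C binding of this function as defined in the MPI standard (not printed on the slide):

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);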

[3] MPI Standard


Page 34

Receiving an MPI Message: MPI_Recv

(Figure: rank 0 calls MPI_Send() and rank 1 calls MPI_Recv(); time proceeds downwards.)

MPI_Recv() performs a blocking receive for a message (it blocks until the message has arrived).

buf – initial address of receive buffer (choice)
count – maximum number of elements in receive buffer (integer)
datatype – datatype of each receive buffer element (handle)
source – rank of source (integer)
tag – message tag (integer)
comm – communicator (handle)
status – status object (Status)
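For reference, the C binding of this function as defined in the MPI standard (not printed on the slide):

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);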

[3] MPI Standard


Page 35

Summary of the Parallel Environment & Message Passing

Modified from [4] LLNL MPI Tutorial

(Figure: several nodes, each with processor P and memory M, connected as one parallel environment exchanging messages.)

Page 36

MPI PingPong Program with Message Passing

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    /* this example assumes exactly two MPI processes */
    int numtasks, rank, dest, source, rc, count, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1; source = 1;
        /* rank 0 sends first, then waits for the return ping */
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        /* rank 1 receives first, then sends the ping back */
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d \n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

    MPI_Finalize();
    return 0;
}

Lecture 5 consists of further advanced examples of using several MPI functions in applications

Simple PingPong Parallel Program: MPI rank 0 pings rank 1 and awaits the return ping (check the Send/Recv order of calling the functions here…).

Function MPI_Get_count() counts the number of received elements


Page 37

[Video] Open MPI

[11] YouTube Video, What is Open MPI


Page 38

Lecture Bibliography


Page 39

Lecture Bibliography

[1] Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall/CRC Computational Science, ISBN 143981192X

[2] Caterham F1 Team Races Past Competition with HPC, Online: http://insidehpc.com/2013/08/15/caterham-f1-team-races-past-competition-with-hpc

[3] The MPI Standard, Online: http://www.mpi-forum.org/docs/

[4] LLNL MPI Tutorial, Online: https://computing.llnl.gov/tutorials/mpi/

[5] HPC – Introducing MPI, YouTube Video, Online: http://www.youtube.com/watch?v=kHV6wmG35po

[6] OpenMPI, ‘Open Source High Performance Computing’, Online: http://www.open-mpi.org/

[7] MPICH, ‘High-Performance Portable MPI’, Online: http://www.mpich.org/

[8] Parallel Tools Platform, Online: http://www.eclipse.org/ptp/

[9] UNICORE Middleware, Online: https://www.unicore.eu/

[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html

[11] YouTube Video, What is OpenMPI, Online: http://www.youtube.com/watch?v=D0-xSWBGNAw


Page 40