
(Open) MPI, Parallel Computing, Life, the Universe, and Everything


Jeff Squyres

This talk is a general discussion of the current state of Open MPI, and a deep dive on two new features:

1. The flexible process affinity system (I presented many of these slides at the Madrid EuroMPI'13 conference in September 2013).
2. The MPI-3 "MPI_T" tools interface.

I originally gave this talk at Lawrence Berkeley Labs on Thursday, November 7, 2013.
Transcript
Page 1: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

{Open} MPI, Parallel Computing, Life, the Universe, and Everything

Dr. Jeffrey M. Squyres

November 7, 2013

Page 2: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Open MPI

PACX-MPI, LAM/MPI, LA-MPI, FT-MPI, Sun CT 6

Project founded in 2003 after intense discussions between multiple open source MPI implementations

Page 3: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Open_MPI_Init()

shell$ svn log https://svn.open-mpi.org/svn/ompi -r 1
------------------------------------------------------------------------
r1 | jsquyres | 2003-11-22 11:36:58 -0500 (Sat, 22 Nov 2003) | 2 lines

First commit
------------------------------------------------------------------------
shell$

Page 4: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Open_MPI_Current_status()

shell$ svn log https://svn.open-mpi.org/svn/ompi -r HEAD
------------------------------------------------------------------------
r29619 | brbarret | 2013-11-06 09:14:24 -0800 (Wed, 06 Nov 2013) | 2 lines

update ignore file
------------------------------------------------------------------------
shell$

Page 5: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Open MPI 2014 membership

13 members, 15 contributors, 2 partners

Page 6: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Fun stats

•  ohloh.net says:
§  819,741 lines of code
§  Average 10-20 committers at a time
§  “Well-commented source code”

•  I rank in top-25 ohloh stats for:
§  C
§  Automake
§  Shell script
§  Fortran (…ouch)

Page 7: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Current status

•  Version 1.6.5 / stable series
§  Unlikely to see another release

•  Version 1.7.3 / feature series
§  v1.7.4 due (hopefully) by end of 2013
§  Plan to transition to v1.8 in Q1 2014

Page 8: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPI conformance

•  MPI-2.2 conformant as of v1.7.3
§  Finally finished several 2.2 issues that no one really cares about

•  MPI-3 conformance just missing new RMA
§  Tracked on wiki: https://svn.open-mpi.org/trac/ompi/wiki/MPIConformance
§  Hope to be done by v1.7.4

Page 9: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

New MPI-3 features

•  Mo’ betta Fortran bindings
§  You should “use mpi_f08”. Really.
•  Matched probe
•  Sparse and neighborhood collectives
•  “MPI_T” tools interface
•  Nonblocking communicator duplication
•  Noncollective communicator creation
•  Hindexed block datatype

Page 10: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

New Open MPI features

•  Better support for more runtime systems
§  PMI2 scalability, etc.
•  New generalized processor affinity system
•  Better CUDA support
•  Java MPI bindings (!)
•  Transports:
§  Cisco usNIC support
§  Mellanox MXM2 and hcoll support
§  Portals 4 support

Page 11: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

My new favorite random feature

•  mpirun CLI option <tab> completion
§  Bash and zsh
§  Contributed by Nathan Hjelm, LANL

shell$ mpirun --mca btl_usnic_<tab>
btl_usnic_cq_num            -- Number of completion queue
btl_usnic_eager_limit       -- Eager send limit (0 = use
btl_usnic_if_exclude        -- Comma-delimited list of de
btl_usnic_if_include        -- Comma-delimited list of de
btl_usnic_max_btls          -- Maximum number of usNICs t
btl_usnic_mpool             -- Name of the memory pool to
btl_usnic_prio_rd_num       -- Number of pre-posted prior
btl_usnic_prio_sd_num       -- Maximum priority send desc
btl_usnic_priority_limit    -- Max size of "priority" mes
btl_usnic_rd_num            -- Number of pre-posted recei
btl_usnic_retrans_timeout   -- Number of microseconds bef
btl_usnic_rndv_eager_limit  -- Eager rendezvous limit (0
btl_usnic_sd_num            -- Maximum send descriptors t

Page 12: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Two features to discuss in detail…

1.  “MPI_T” interface
2.  Flexible process affinity system

Page 13: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPI_T interface

Page 14: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPI_T interface

•  Added in MPI-3.0

•  So-called “MPI_T” because all the functions start with that prefix
§  T = tools

•  APIs to get/set MPI implementation values
§  Control variables (e.g., implementation tunables)
§  Performance variables (e.g., run-time stats)

Page 15: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPI_T control variables (“cvar”)

•  Another interface to MCA param values
•  In addition to existing methods:
§  mpirun CLI options
§  Environment variables
§  Config file(s)
•  Allows tools / applications to programmatically list all OMPI MCA params (the existing methods are sketched below)
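
For example, all three existing methods can set the same MCA parameter; a sketch, where btl_tcp_if_include is just an illustrative parameter:

shell$ mpirun --mca btl_tcp_if_include eth0 -np 4 a.out
shell$ env OMPI_MCA_btl_tcp_if_include=eth0 mpirun -np 4 a.out
shell$ echo "btl_tcp_if_include = eth0" >> $HOME/.openmpi/mca-params.conf

MPI_T control variables (described next) add a fourth, programmatic way to reach the same values.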

Page 16: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPI_T cvar example

•  MPI_T_cvar_get_num()
§  Returns the number of control variables

•  MPI_T_cvar_get_info(index, …) returns (both calls are used in the sketch below):
§  String name and description
§  Verbosity level (see next slide)
§  Type of the variable (integer, double, etc.)
§  Type of MPI object (communicator, etc.)
§  “Writability” scope
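
A minimal C sketch of this enumeration (error checking omitted; note that the MPI_T interface is initialized separately from MPI itself):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, num, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&num);
    for (i = 0; i < num; ++i) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        /* Fills in the name, description, verbosity, type, object
           binding, and writability scope of control variable i */
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity,
                            &datatype, &enumtype,
                            desc, &desc_len, &bind, &scope);
        printf("%d: %s -- %s\n", i, name, desc);
    }

    MPI_T_finalize();
    return 0;
}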

Page 17: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Verbosity levels

Level name      Level description
USER_BASIC      Basic information of interest to users
USER_DETAIL     Detailed information of interest to users
USER_ALL        All remaining information of interest to users
TUNER_BASIC     Basic information of interest for tuning
TUNER_DETAIL    Detailed information of interest for tuning
TUNER_ALL       All remaining information of interest for tuning
MPIDEV_BASIC    Basic information for MPI implementers
MPIDEV_DETAIL   Detailed information for MPI implementers
MPIDEV_ALL      All remaining information for MPI implementers

Page 18: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Open MPI interpretation of verbosity levels

Who the variable is for:

1.  User
§  Parameters required for correctness
§  As few as possible
2.  Tuner
§  Tweak MPI performance
§  Resource levels, etc.
3.  MPI developer
§  For Open MPI devs

How detailed it is:

1.  Basic: even for less-advanced users and tuners
2.  Detailed: useful, but you won’t need to change them often
3.  All: anything else

Page 19: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

“Writeability” scope

Level name   Level description
CONSTANT     Read-only, constant value
READONLY     Read-only, but the value may change
LOCAL        Writing is a local operation
GROUP        Writing must be done as a group, and all values must be consistent
GROUP_EQ     Writing must be done as a group, and all values must be exactly the same
ALL          Writing must be done by all processes, and all values must be consistent
ALL_EQ       Writing must be done by all processes, and all values must be exactly the same

Page 20: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Reading / writing a cvar

•  MPI_T_cvar_handle_alloc(index, handle, …)
§  Allocates an MPI_T handle
§  Binds it to a specific MPI handle (e.g., a communicator), or BIND_NO_OBJECT
•  MPI_T_cvar_read(handle, buf)
•  MPI_T_cvar_write(handle, buf)
(see the sketch below)

→ OMPI has very, very few writable control variables after MPI_INIT
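
A sketch of reading one control variable, assuming MPI_T_init_thread() has already been called, <stdio.h> and <mpi.h> are included, and cvar_index names an int-typed cvar bound to no MPI object:

void print_cvar_value(int cvar_index)
{
    int val, count;
    MPI_T_cvar_handle handle;

    /* NULL object handle because the cvar is bound to no MPI object */
    MPI_T_cvar_handle_alloc(cvar_index, NULL, &handle, &count);
    MPI_T_cvar_read(handle, &val);
    printf("value = %d\n", val);
    MPI_T_cvar_handle_free(&handle);
}

MPI_T_cvar_write() is the symmetric call for the few variables that remain writable.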

Page 21: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPI_T Performance variables (“pvar”)

•  New information available from OMPI
§  Run-time statistics of implementation details
§  Similar interface to control variables (see the sketch below)
•  Not many available in OMPI yet
•  Cisco usNIC BTL exports 24 pvars
§  Per usNIC interface
§  Stats about underlying network

(more details to be provided in usNIC talk)
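
A sketch of reading one performance variable; unlike cvars, pvar access always goes through a session. Assumes MPI_T_init_thread() has already been called and pvar_index names an unsigned-long counter bound to no MPI object:

void print_pvar_value(int pvar_index)
{
    MPI_T_pvar_session session;
    MPI_T_pvar_handle handle;
    int count;
    unsigned long value;

    MPI_T_pvar_session_create(&session);
    MPI_T_pvar_handle_alloc(session, pvar_index, NULL, &handle, &count);
    MPI_T_pvar_start(session, handle);  /* needed for non-continuous pvars;
                                           continuous ones are always running */
    /* ... run the communication you want to measure ... */
    MPI_T_pvar_read(session, handle, &value);
    printf("pvar value = %lu\n", value);
    MPI_T_pvar_handle_free(session, &handle);
    MPI_T_pvar_session_free(&session);
}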

Page 22: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Process affinity system

Page 23: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Locality matters

•  Goals:
§  Minimize data transfer distance
§  Reduce network congestion and contention

•  …and this matters inside the server, too!

Page 24: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

[hwloc lstopo diagram: Machine (128GB) with two NUMA nodes (64GB each); each NUMA node holds one socket of 8 cores, each core with two PUs (hardware threads), private L1d/L1i (32KB) and L2 (256KB) caches, and a shared 20MB L3; PCI NICs (eth0-eth7) and disks (sda, sdb) hang off the sockets]

Intel Xeon E5-2690 (“Sandy Bridge”): 2 sockets, 8 cores per socket, 64GB per socket

Page 25: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

[same hwloc lstopo diagram, annotated to highlight the 1G NICs, the 10G NICs, the per-core L1 and L2 caches, the shared L3, and that hyperthreading is enabled (two PUs per core)]

Intel Xeon E5-2690 (“Sandy Bridge”): 2 sockets, 8 cores per socket, 64GB per socket

Page 26: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

A user’s playground

The intent of this work is to provide a mechanism that allows users to explore the process-placement space within the scope of their own applications.

Page 27: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Two complementary systems

•  Simple:
§  mpirun --bind-to [ core | socket | … ] …
§  mpirun --by[ node | slot | … ] …
§  …etc.

•  Flexible:
§  LAMA: Locality Aware Mapping Algorithm

Page 28: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

LAMA

•  Supports a wide range of regular mapping patterns
§  Drawn from much prior work
§  Most notably, heavily inspired by the BlueGene/P and /Q mapping systems

Page 29: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Launching MPI applications

•  Three steps in MPI process placement:
1.  Mapping
2.  Ordering
3.  Binding

•  Let's discuss how these work in Open MPI

Page 30: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

1. Mapping

•  Create a layout of processes-to-resources

[diagram: a 4×4 grid of servers and a large pool of MPI processes; the mapping step lays the processes out across the servers]

Page 31: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Mapping

•  MPI's runtime must create a map, pairing processes to processors (and memory)
•  Basic technique:
§  Gather hwloc topologies from the allocated nodes
§  The mapping agent then makes a plan for which resources are assigned to which processes

Page 32: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Mapping agent

•  The act of planning mappings:
§  Specify which process will be launched on each server
§  Identify if any hardware resource will be oversubscribed

•  Processes are mapped at the resolution of a single processing unit (PU)
§  The smallest unit of allocation: a hardware thread
§  In HPC, usually the same as a processor core

Page 33: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Oversubscription

•  Common / usual definition:
§  When a single PU is assigned more than one process

•  Complicating the definition:
§  Some applications may need more than one PU per process (multithreaded applications)

•  How can the user express what their application means by “oversubscription”?

Page 34: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

2. Ordering: by “slot”

Assigning MCW ranks to mapped processes

[diagram: the cluster with MCW ranks assigned in “by slot” order: consecutive ranks fill all of one node's PUs before moving on to the next node]

Page 35: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

2. Ordering: by node

Assigning MCW ranks to mapped processes

[diagram: the same cluster with MCW ranks assigned in “by node” order: consecutive ranks are dealt round-robin across nodes, so each node holds a strided set of ranks (e.g., node 0 holds ranks 0, 16, 32, 48, …)]

Page 36: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Ordering

•  Each process must be assigned a unique rank in MPI_COMM_WORLD

•  Two common types of ordering:
§  Natural: the order in which processes are mapped determines their rank in MCW
§  Sequential: processes are numbered sequentially, starting at the first processing unit and continuing until the last processing unit

Page 37: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

3. Binding

•  Launch processes and enforce the layout

[diagram: the dual-socket hwloc topology repeated for several servers, with MCW ranks (0-15 on the first server, 16-31 on the second, and so on) highlighted on the PUs they are bound to]

Page 38: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Binding

•  The process-launching agent works with the OS to limit where each process can run:
1.  No restrictions
2.  A limited set of restrictions
3.  Specific resource restrictions

•  “Binding width”
§  The number of PUs to which a process is bound

Page 39: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Command Line Interface (CLI)

•  4 levels of abstraction for the user:
§  Level 1: None
§  Level 2: Simple, common patterns
§  Level 3: LAMA regular process-layout patterns
§  Level 4: Irregular patterns (not described in this talk)

Page 40: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

CLI: Level 1 (none)

•  No mapping or binding options specified
§  May or may not specify the number of processes to launch (-np)
§  If not specified, default to the number of cores available in the allocation
§  One process is mapped to each core in the system in a “by-core” style
§  Processes are not bound

•  …for backwards compatibility reasons :-(

Page 41: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

CLI: Level 2 (common)

•  Simple, common patterns for mapping and binding
§  Specify the mapping pattern with --map-by X (e.g., --map-by socket)
§  Specify the binding option with --bind-to Y (e.g., --bind-to core)
§  All of these options are translated to Level 3 options for processing by LAMA (the full list of X / Y values is shown later); see the example below
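
For example, a sketch of a Level 2 invocation (hello_world stands in for any MPI executable):

shell$ mpirun -np 16 --map-by socket --bind-to core hello_world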

Page 42: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

CLI: Level 3 (regular patterns)

•  LAMA regular process-layout patterns
§  For power users wanting something unique for their application
§  Four MCA run-time parameters (used together in the sketch below):
•  rmaps_lama_map: mapping process layout
•  rmaps_lama_bind: binding width
•  rmaps_lama_order: ordering of MCW ranks
•  rmaps_lama_mppr: maximum allowable number of processes per resource (oversubscription)
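
A sketch of setting these parameters on the mpirun command line (hello_world stands in for any MPI executable; the map, bind, and mppr values used here are explained on the following slides):

shell$ mpirun -np 16 --mca rmaps lama --mca rmaps_lama_map scbnh --mca rmaps_lama_bind 1c --mca rmaps_lama_mppr 1:c hello_world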

Page 43: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_map (map)

•  Takes as an argument the “process layout”
§  A series of nine tokens, allowing 9! (362,880) mapping permutations
§  The string gives LAMA's preferred iteration order:
•  innermost iteration specified first
•  outermost iteration specified last
§  Tokens used in this talk: n = server node, b = board, s = socket, c = core, h = hardware thread

Page 44: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Example system

2 servers (nodes), 4 sockets per server, 2 cores per socket, 2 PUs per core

[diagram: Node 0 and Node 1, each with Sockets 0-3, each socket with Cores 0-1, each core with hardware threads H0/H1]

Page 45: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_map (map)

•  map=scbnh (a.k.a., by socket, then by core)

[diagram: example system]

Step 1: Traverse sockets

Page 46: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_map (map)

•  map=scbnh (a.k.a., by socket, then by core)

[diagram: example system]

Step 2: Ran out of sockets, so now traverse cores

Page 47: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_map (map)

•  map=scbnh (a.k.a., by socket, then by core)

[diagram: example system]

Step 3: Now traverse boards (but there aren’t any)

Page 48: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_map (map)

•  map=scbnh (a.k.a., by socket, then by core)

[diagram: example system]

Step 4: Now traverse server nodes

Page 49: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_map (map)

•  map=scbnh (a.k.a., by socket, then by core)

[diagram: example system]

Step 5: After repeating s, c, and b on server node 2, traverse hardware threads

Page 50: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_bind (bind)

•  “Binding width” and layer
•  Example: bind=3c (3 cores)

[diagram: hwloc topology with each process bound to a span of three cores]

bind = 3c

Page 51: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_bind (bind)

•  “Binding width” and layer
•  Example: bind=2s (2 sockets)

[diagram: hwloc topology with each process bound to all PUs of two sockets]

bind = 2s

Page 52: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_bind (bind)

•  “Binding width” and layer
•  Example: bind=1L2 (all PUs under one L2 cache)

bind = 1L2

Page 53: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_bind (bind)

•  “Binding width” and layer
•  Example: bind=1N (all PUs in one NUMA locality)

bind = 1N

Page 54: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_order (order)

•  Selects which ranks are assigned to processes in MCW
•  There are other possible orderings, but no one has asked for them yet…

[diagrams: “natural” order for map-by-node (the default), and “sequential” order for any mapping]

Page 55: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

rmaps_lama_mppr (mppr)

•  mppr (“mip-per”) sets the Maximum number of allowable Processes Per Resource
§  A user-specified definition of oversubscription

•  Takes a comma-delimited list of <#:resource> (as sketched below)
§  1:c → at most one process per core
§  1:c,2:s → at most one process per core, and at most two processes per socket
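
For example, a sketch that allows two processes per core (one per hardware thread) for a 24-process job, matching the example a few slides ahead (hello_world stands in for any MPI executable):

shell$ mpirun -np 24 --mca rmaps lama --mca rmaps_lama_mppr 2:c --mca rmaps_lama_map scbnh hello_world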

Page 56: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPPR

§  1:c → at most one process per core

[diagram: hwloc topology with at most one process placed on each core, regardless of its two PUs]

Page 57: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

MPPR

§  1:c,2:s → at most one process per core and two processes per socket

[diagram: hwloc topology with processes on only two cores of each socket]

Page 58: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Level 2 to Level 3 chart

Page 59: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Remember the prior example?

•  -np 24 -mppr 2:c -map scbnh

[diagram: the 24 processes laid out on the example system by socket, then core, then node, then hardware thread, with up to two processes per core]

Page 60: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Same example, different mapping

•  -np 24 -mppr 2:c -map nbsch

[diagram: the same 24 processes with a different layout: mapping by node first spreads consecutive ranks round-robin across the two nodes]

Page 61: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Report bindings

•  Displays a prettyprint representation of the binding actually used for each process
§  Visual feedback is quite helpful when exploring

shell$ mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c --report-bindings hello_world
MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]

Page 62: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Feedback

•  Available in Open MPI v1.7.2 (and later)

•  Open questions to users:
§  Are more flexible ordering options useful?
§  What common mapping patterns are useful?
§  What additional features would you like to see?

Page 63: (Open) MPI, Parallel Computing, Life, the Universe, and Everything

Thank you