IBM HPC Development MPI update - spscicomp.org



IBM HPC Development: MPI update

Chulho Kim, SciComp 2007


Enhancements in PE 4.3.0 & 4.3.1

• MPI 1-sided improvements (October 2006)

• Selective collective enhancements (October 2006)

• AIX IB US enablement (July 2007)


MPI 1-sided improvements

MPI 1-sided test cases were provided by William Gropp and Rajeev Thakur from Argonne National Lab. These two test programs perform nearest-neighbor ghost-area data exchange:

fence2d.c - four MPI_Puts with MPI_Win_fence calls

lock2d.c - four MPI_Puts with MPI_Win_lock and MPI_Win_unlock calls

A single communication step for this set includes a synchronization call to start an epoch, an MPI_Put of the data to each of the task's four neighbors (4 MPI_Puts), and a synchronization call to end the epoch (a simplified sketch follows below).

Configuration: 64 tasks on 16 SQ IH nodes

Improvements seen: more than 10%, and sometimes up to 90%, reduction in the time taken to do the operation, especially for small messages (< 64K).
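For reference, a minimal sketch of the fence-synchronized pattern that fence2d.c exercises is shown below. It is not the Argonne test itself: the slice size N, the periodic 2-D process grid, and the placement of the four incoming slices in the window are illustrative assumptions. The lock2d.c variant would replace the two MPI_Win_fence calls with per-target MPI_Win_lock/MPI_Win_unlock pairs.

/* Illustrative fence-synchronized ghost exchange (not the Argonne test).
 * Each task puts one slice of N doubles to each of its four neighbors
 * on a periodic 2-D process grid inside a single fence epoch. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024                      /* illustrative slice size (doubles) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int left, right, up, down;
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    MPI_Cart_shift(cart, 1, 1, &up, &down);

    /* Window exposes room for the four incoming ghost slices. */
    double *win_buf = malloc(4 * N * sizeof(double));
    double *send_buf = calloc(N, sizeof(double));
    MPI_Win win;
    MPI_Win_create(win_buf, (MPI_Aint)(4 * N * sizeof(double)),
                   sizeof(double), MPI_INFO_NULL, cart, &win);

    /* One communication step: open the epoch, four MPI_Puts, close it. */
    MPI_Win_fence(0, win);
    MPI_Put(send_buf, N, MPI_DOUBLE, left,  0 * N, N, MPI_DOUBLE, win);
    MPI_Put(send_buf, N, MPI_DOUBLE, right, 1 * N, N, MPI_DOUBLE, win);
    MPI_Put(send_buf, N, MPI_DOUBLE, up,    2 * N, N, MPI_DOUBLE, win);
    MPI_Put(send_buf, N, MPI_DOUBLE, down,  3 * N, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(win_buf);
    free(send_buf);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}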


MPI 1-sided improvements

[Figure: fence2d, 64 tasks on 16 SQ nodes. X-axis: message size (4 to 65536 bytes); Y-axis: time in seconds (0 to 9). PE 4.3 (new code) vs. PE 4.2 (current code).]


MPI 1-sided improvements

[Figure: lock2d, 64 tasks on 16 SQ nodes. X-axis: message size (4 to 65536 bytes); Y-axis: time in seconds (0 to 3). PE 4.3 (new code) vs. PE 4.2 (current code).]


New algorithms in PE 4.3.0

• A set of high radix algorithms for:

  • MPI_Barrier,

  • small message (<= 2KB) MPI_Reduce,

  • small message (<= 2KB) MPI_Allreduce, and

  • small message (<= 2KB) MPI_Bcast.

• New parallel pipelining algorithm for large message (>= 256KB) MPI_Bcast.

• These new algorithms are implemented in IBM Parallel Environment 4.3.

• Performance comparisons are made between IBM PE 4.3 and IBM PE 4.2.2 (old algorithms).


Common shared memory optimization for SMP cluster

[Diagram: 8 SMP nodes n0-n7, each running 8 MPI tasks that share memory within the node (e.g. t0-t7 on n0, t32-t39 on n4).]

• One task is selected as node leader per SMP.
• Possible prolog step: the node leader gathers inputs from the tasks on the same SMP node and forms a partial result through shared memory.
• Inter-node communication steps among node leaders go through the interconnect.
• Possible epilog step: the node leader distributes results among the tasks on the same node through shared memory.
• The benefit from shared memory is only for the intra-node part; the available communication resources may not be fully utilized.

Example of MPI_Bcast with root=t0, using the common shared memory optimization (a code sketch follows below):
• 8 SMP nodes, each running 8 MPI tasks.
• The colors of the nodes represent the inter-node step after which the node has the broadcast message: Root, Step 0, Step 1, Step 2. Arrows show message flow.
• 3 inter-node steps are needed to complete the broadcast.
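As a rough illustration of the node-leader structure (not the PE implementation, which works on a shared-memory segment directly), the sketch below builds a per-node communicator and a leader communicator, runs the inter-node step among the leaders, and finishes with the intra-node epilog. It assumes the broadcast root is global rank 0, and it uses the MPI-3 call MPI_Comm_split_type purely for convenience; that call did not exist at the MPI level PE 4.3 supported.

/* Node-leader broadcast sketch.  Assumptions: root is global rank 0 and
 * is therefore the leader (local rank 0) of its own node; intra-node
 * traffic goes through MPI rather than a raw shared-memory segment. */
#include <mpi.h>

void node_leader_bcast0(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Comm node_comm;                      /* tasks sharing one SMP node */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    MPI_Comm leader_comm;                    /* one leader per node        */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    /* Inter-node steps: leaders exchange the message over the switch. */
    if (node_rank == 0)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Epilog step: each leader fans the result out inside its node
     * (done through shared memory in the real optimization). */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}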


High Radix Algorithm

[Diagram: 8 SMP nodes n0-n7; on n0, tasks t0-t7 and on n1, tasks t8-t15 communicate through shared memory within their node, while several tasks per node take part in the inter-node exchange.]

• k (k >= 1) tasks on each SMP participate in the inter-node communication.

• Possible shared memory step after each inter-node step to synchronize and reshuffle data.
• Better short message performance as long as the extra shared memory overhead and switch/adapter contention are low.
• 64-bit only; requires shared memory and the same number of tasks on each node.
• In this MPI_Bcast example, k = 7, so only one inter-node step is needed.
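To illustrate why a higher radix shortens the critical path, the sketch below implements a generic radix-(k+1) tree broadcast among individual tasks: every task that already holds the data forwards it to k more tasks per round, so roughly log base (k+1) of P rounds suffice instead of log base 2. This only demonstrates the radix idea; it does not model PE's combination of several communicating tasks per node with the shared-memory reshuffle steps described above, and the function name radix_bcast is hypothetical.

/* Generic radix-(k+1) tree broadcast: in each round, every task that
 * already holds the data forwards it to k more tasks.  Illustration
 * only; structure and names are not from PE. */
#include <mpi.h>

void radix_bcast(void *buf, int count, MPI_Datatype type,
                 int root, int k, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int q = k + 1;                             /* tree radix             */
    int vrank = (rank - root + size) % size;   /* rank relative to root  */

    /* Receive phase: every task except the root gets the data once,
     * from the parent that reaches it in the earliest possible round. */
    if (vrank != 0) {
        long recv_span = 1;                    /* q^r for the recv round */
        while (recv_span * q <= vrank)
            recv_span *= q;
        int parent_v = vrank % (int)recv_span;
        MPI_Recv(buf, count, type, (parent_v + root) % size, 0,
                 comm, MPI_STATUS_IGNORE);
    }

    /* Forward phase: in every later round, send to up to k children. */
    long span = 1;
    while (span <= vrank)                      /* skip rounds before we have data */
        span *= q;
    for (; span < size; span *= q) {
        for (int j = 1; j < q; j++) {
            long child_v = (long)vrank + j * span;
            if (child_v < size)
                MPI_Send(buf, count, type, (int)((child_v + root) % size),
                         0, comm);
        }
    }
}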


Parallel pipeline algorithm: MPI_Bcast

• Split large data into small slices and pipeline the slices along multiple broadcast trees in parallel and in a non-conflicting fashion.

• Optimal on any number of tasks (power of two or not).

• Communication scheduling is done through simple bit operations on the rank of a task; the schedule does not need to be cached in the communicator.

• Once the pipeline is established, each task does concurrent send and receive of data slices (a simplified sketch follows below).

• Paper accepted at EuroPVM/MPI 2007.
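A much-simplified sketch of the slicing/pipelining idea: it pipelines the slices along a single chain of tasks, using a nonblocking send so that each task can already be receiving slice s+1 while slice s is still being forwarded. The multi-tree, non-conflicting schedule of the actual PE 4.3 algorithm (and its bit-operation scheduling) is not reproduced here; chain_pipeline_bcast and its arguments are illustrative.

/* Pipelined broadcast along a single chain: rank 0 is the root, each
 * rank receives a slice from its predecessor and forwards it to its
 * successor with a nonblocking send, so send and receive of different
 * slices overlap.  Sketch only, not the PE multi-tree algorithm. */
#include <mpi.h>
#include <stdlib.h>

void chain_pipeline_bcast(char *buf, long nbytes, int slice_bytes,
                          MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int prev = rank - 1;
    int next = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;

    long nslices = (nbytes + slice_bytes - 1) / slice_bytes;
    MPI_Request *reqs = malloc(nslices * sizeof(MPI_Request));
    int nreq = 0;

    for (long s = 0; s < nslices; s++) {
        long off = s * slice_bytes;
        int len = (int)(nbytes - off < slice_bytes ? nbytes - off
                                                   : slice_bytes);
        if (rank > 0)                /* wait for slice s from upstream    */
            MPI_Recv(buf + off, len, MPI_BYTE, prev, 0, comm,
                     MPI_STATUS_IGNORE);
        if (next != MPI_PROC_NULL)   /* forward slice s without blocking, */
            MPI_Isend(buf + off, len, MPI_BYTE, next, 0, comm,
                      &reqs[nreq++]); /* so slice s+1 can be received     */
    }
    if (nreq > 0)
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}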


Parallel Pipelining Example

[Diagram: three broadcast pipelines over tasks t1-t7, each ordering the tasks differently; root t0 feeds slices s0, s1, s2, ... into them.]

• Broadcast among 8 tasks: t0, t1, ..., t7; t0 is the root of the broadcast.
• The message is split into slices: s0, s1, s2, ...
• t0 round-robins the slices to the three pipelines.


Benchmark config

• Micro benchmarks

  • on up to 32 P5 nodes and 8 tasks per node

  • AIX 5.3, IBM HPS (4 links), IBM RSCT/LAPI 2.4.3, and IBM LoadLeveler 3.4

• For each benchmark,

  • run using the old algorithm in PE 4.2.2 first,

  • then run with the new algorithm in PE 4.3.


Benchmark

• Loops of:

MPI_Barrier(MPI_COMM_WORLD);

Start_time = MPI_Wtime();

MPI_Barrier(MPI_COMM_WORLD); /* or other measured collectives */

End_time = MPI_Wtime();

MPI_Barrier(MPI_COMM_WORLD);
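Put together as a complete program, the measurement loop looks like the sketch below; the loop count is an arbitrary choice here and the measured collective is simply another MPI_Barrier.

/* Minimal micro-benchmark skeleton: barrier to line the tasks up,
 * timestamp, measured collective, timestamp, barrier to separate the
 * loops.  NLOOPS is arbitrary for this sketch. */
#include <mpi.h>

#define NLOOPS 1000

int main(int argc, char **argv)
{
    static double elapsed[NLOOPS];   /* this task's per-loop times */

    MPI_Init(&argc, &argv);

    for (int i = 0; i < NLOOPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start_time = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);          /* measured collective */
        double end_time = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        elapsed[i] = end_time - start_time;
    }

    /* elapsed[] is reduced across tasks and loops as described on the
     * Timing slide. */
    MPI_Finalize();
    return 0;
}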


Timing

• The global switch clock is used: MP_CLOCK_SOURCE=Switch.

• The time of the measured collective at any task for a particular loop is:

  • for MPI_Barrier and MPI_Allreduce: End_time of this task – Start_time of this task

  • for MPI_Bcast: End_time of this task – Start_time at the root task

  • for MPI_Reduce: End_time of the root task – Start_time of this task

• Then:
  1. For every loop, select the longest time reported by any task as the time of the loop.
  2. Select the time of the fastest loop.

• This filters out the impact of OS jitter and highlights the capability of the algorithm (a filtering sketch follows below).
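For the MPI_Barrier/MPI_Allreduce case, where each task uses its own start and end times, the two filtering steps can be written as below; elapsed[] is assumed to be the per-loop array from the previous sketch. The MPI_Bcast and MPI_Reduce variants would first have to combine the root's timestamp with the other tasks' timestamps, which is what the synchronized switch clock makes meaningful.

/* Filtering sketch: per loop, take the longest time reported by any
 * task; across loops, keep the fastest, which discards loops hit by
 * OS jitter. */
#include <mpi.h>
#include <stdlib.h>

double filtered_time(const double *elapsed, int nloops, MPI_Comm comm)
{
    double *loop_max = malloc(nloops * sizeof(double));

    /* Step 1: longest time reported by any task, loop by loop. */
    MPI_Allreduce(elapsed, loop_max, nloops, MPI_DOUBLE, MPI_MAX, comm);

    /* Step 2: time of the fastest loop. */
    double best = loop_max[0];
    for (int i = 1; i < nloops; i++)
        if (loop_max[i] < best)
            best = loop_max[i];

    free(loop_max);
    return best;
}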
