Page 1:

Parallel Application Scaling, Performance, and Efficiency

David Skinner

NERSC/LBL

Page 2:

Parallel Scaling of MPI Codes

A practical talk on using MPI, with a focus on:

• Distribution of work within a parallel program

• Placement of computation within a parallel computer

• Performance costs of different types of communication

• Understanding scaling performance terminology

Page 3:

Topics

• Application Scaling

• Load Balance

• Synchronization

• Simple stuff

• File I/O

Page 4:

Scale: Practical Importance

Time required to compute the N×N matrix product C = A*B.

Assuming you can address 64 GB from one task, can you wait a month?

How do you balance the computational goal vs. the compute resources?

Choose the right scale!
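A rough back-of-the-envelope makes the month concrete. Assuming 8-byte doubles and a sustained rate of about 10^8 flop/s for one task (the rate is an illustrative assumption, not a figure from the talk):

  Memory: 3 matrices × 8N² bytes ≤ 64 GB  →  N ≈ 5.2×10^4
  Work:   2N³ ≈ 2.8×10^14 flops
  Time:   2.8×10^14 / 10^8 flop/s ≈ 2.8×10^6 s ≈ 32 days

At that rate a single task really would take about a month.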

Page 5:

Let's jump to an example

• Sharks and Fish II: N² parallel force evaluation

• e.g. 4 CPUs evaluate forces for 125 fish

• Domain decomposition: each CPU is "in charge" of ~31 fish, but keeps a fairly recent copy of all the fishes' positions (replicated data)

• It is not possible to uniformly decompose problems in general, especially in many dimensions

• This toy problem is simple, has fine granularity, and is 2D (see the decomposition sketch below)

• Let's see how it scales

[Figure: 125 fish split across 4 CPUs as 31 + 31 + 31 + 32]
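A minimal sketch of that near-uniform split for a 1D list of fish (the helper name and the counts/displs convention are mine, not the talk's): the first n_fish mod nprocs ranks get one extra fish, so 125 over 4 gives 32, 31, 31, 31 (the slide's 31/31/31/32, up to ordering).

#include <stdio.h>

/* Split n_fish across nprocs ranks as evenly as possible. */
void decompose(int n_fish, int nprocs, int *counts, int *displs)
{
    int base  = n_fish / nprocs;    /* every rank gets at least this */
    int extra = n_fish % nprocs;    /* leftovers: one each to the first ranks */
    for (int r = 0, off = 0; r < nprocs; ++r) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = off;            /* where rank r's fish start */
        off += counts[r];
    }
}

int main(void)
{
    int counts[4], displs[4];
    decompose(125, 4, counts, displs);
    for (int r = 0; r < 4; ++r)
        printf("rank %d: %d fish at offset %d\n", r, counts[r], displs[r]);
    return 0;
}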

Page 6:

Sharks and Fish II: Program

Data:

n_fish (global)

my_fish (local)

fish_i = {x, y, …}

Dynamics:

F = ma

V = Σ 1/r_ij

dq/dt = p/m

dp/dt = -dV/dq

MPI_Allgatherv(my_fish_buf, len[rank], MPI_FishType, …);

for (i = 0; i < my_fish; ++i) {
    for (j = 0; j < n_fish; ++j) {
        if (i == j) continue;    /* i != j: no self-interaction */
        a_i += g * mass_j * (fish_i - fish_j) / r_ij;
    }
}

Move fish
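Fleshed out, the replicated-data step might look like the sketch below. This is a hedged reconstruction, not the talk's actual code: the Fish struct, eval_forces name, and the counts/displs arrays from the decomposition sketch above are illustrative, and fish_type stands in for MPI_FishType (it could be built with MPI_Type_contiguous(2, MPI_DOUBLE, …)).

#include <mpi.h>
#include <math.h>

typedef struct { double x, y; } Fish;

/* One force evaluation with replicated data: refresh everyone's copy of
   all positions, then compute accelerations only for this rank's slice. */
void eval_forces(Fish *fish, double *mass, Fish *acc,
                 int *counts, int *displs, int n_fish,
                 int rank, double g, MPI_Datatype fish_type)
{
    int my0 = displs[rank], my_n = counts[rank];

    /* Replicate: an in-place Allgatherv fills every rank's fish[] array. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   fish, counts, displs, fish_type, MPI_COMM_WORLD);

    for (int i = my0; i < my0 + my_n; ++i) {
        acc[i - my0].x = acc[i - my0].y = 0.0;
        for (int j = 0; j < n_fish; ++j) {
            if (j == i) continue;                  /* no self-force */
            double dx = fish[j].x - fish[i].x;
            double dy = fish[j].y - fish[i].y;
            double r  = sqrt(dx * dx + dy * dy);
            double f  = g * mass[j] / (r * r * r); /* 1/r² along the unit vector */
            acc[i - my0].x += f * dx;
            acc[i - my0].y += f * dy;
        }
    }
}

The Allgatherv keeps communication at O(n_fish) values per task per step, while the double loop is the O(n_fish²) compute that dominates the timings on the next page.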

Page 7:

Sharks and Fish II: How fast?

• 100 fish can move 1000 steps in:
  1 task: 5.459 s
  32 tasks: 2.756 s (a 1.98x speedup)

• 1000 fish can move 1000 steps in:
  1 task: 511.14 s
  32 tasks: 20.815 s (a 24.6x speedup)

• So what's the "best" way to run?
– How many fish do we really have?
– How large a computer (time) do we have?
– How quickly do we need the answer?

Page 8:

Scaling: Good 1st Step: Do runtimes make sense?

[Plot: runtime vs. number of fish, one curve for 1 task and one for 32 tasks]

Running fish_sim for 100-1000 fish on 1-32 CPUs, we see time ~ fish².

Page 9:

Scaling: Walltimes

Walltime is (all-)important, but let's look at some other scaling metrics.

[Plot: each line is a contour of the computations doable in a given time]

Page 10:

Scaling: terminology

• Scaling studies involve changing the degree of parallelism. Will we be changing the problem also?
– Strong scaling: fixed problem size
– Weak scaling: problem size grows with additional compute resources

• How do we measure success in parallel scaling?
– Speed up = Ts / Tp(n)
– Efficiency = Ts / (n * Tp(n))

Multiple definitions exist! (See the worked numbers below.)
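Plugging in the Sharks and Fish II timings from page 7, taking the one-task time as Ts (a common but not universal convention):

  1000 fish: Speed up = 511.14 / 20.815 ≈ 24.6; Efficiency = 24.6 / 32 ≈ 0.77
  100 fish:  Speed up = 5.459 / 2.756 ≈ 1.98;  Efficiency = 1.98 / 32 ≈ 0.06

The larger problem uses 32 tasks far more efficiently: there is more work per task to amortize the communication.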

Page 11:

Scaling: Speedups

Page 12:

Scaling: Efficiencies

Remarkably smooth! Often, algorithm and architecture make the efficiency landscape quite complex.

Page 13:

Scaling: Analysis

Why does efficiency drop?
– Serial code sections (Amdahl's law)
– Surface-to-volume effects (communication bound)
– Algorithm complexity or switching
– Communication protocol switching
– Scalability of the computer and interconnect

Whoa!
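For the first of these, the standard statement of Amdahl's law: if a fraction f of the runtime parallelizes perfectly and (1 - f) stays serial, then

  Speedup(n) = 1 / ((1 - f) + f/n)  ≤  1 / (1 - f)

so even f = 0.99 caps the speedup at 100, no matter how many tasks you add.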

Page 14:

Scaling: Analysis

• In general, changing problem size and concurrency exposes or removes compute resources; bottlenecks shift.

• In general, the first bottleneck wins.

• Scaling brings additional resources too:
– More CPUs (of course)
– More cache(s)
– More memory bandwidth, in some cases

Page 15:

Scaling: Superlinear Speedup

[Plot: speedup vs. # CPUs (OMP)]

Page 16:

Strong Scaling: Communication Bound

64 tasks: 52% comm;  192 tasks: 66% comm;  768 tasks: 79% comm

The MPI_Allreduce buffer size is 32 bytes.

Q: What resource is being depleted here?
A: Small-message latency.

1) Compute per task is decreasing
2) Synchronization rate is increasing
3) Surface:Volume ratio is increasing

Page 17:

Sharks and Atoms

At HPC centers like NERSC, fish are rarely modeled as point masses. The associated algorithms and their scalings are nonetheless of great practical importance for scientific problems.

Example: Particle Mesh Ewald, a materials-science computation:

600 s at 256-way, or 80 s at 1024-way

Page 18:

Topics

• Load Balance

• Synchronization

• Simple stuff

• File I/O

Now, instead of looking at the scaling of specific applications, let's look at general issues in parallel application scalability.

Page 19:

Load Balance : Application Cartoon

[Cartoon: a universal app's timeline, unbalanced vs. balanced; the gap is the time saved by load balance]

(Synchronization will be defined later.)

Page 20:

Load Balance : performance data

[Plot: MPI ranks sorted by total communication time; roughly 200 s at 64 tasks, 230 s at 960 tasks]

Page 21:

Load Balance: ~code

while (1) {
    do_flops(Ni);       /* compute phase; Ni varies across ranks */
    MPI_Alltoall();     /* exchange phase */
    MPI_Allreduce();    /* synchronizing collective */
}

(run at 64-way and 960-way concurrency)

Page 22:

Load Balance: real code

[Plot: per-rank time broken into Sync, Flops, and Exchange vs. MPI rank]

Page 23:

Load Balance : analysis

• The 64 slow tasks (those with more compute work) cause 30 seconds of extra "communication" across 960 tasks

• This leads to 28,800 CPU*seconds (8 CPU*hours) of unproductive computing (see the arithmetic below)

• All load imbalance requires is one slow task and a synchronizing collective!

• Pair problem size and concurrency well.

• Parallel computers allow you to waste time faster!
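The waste arithmetic, spelled out: a synchronizing collective makes the other 896 tasks wait ~30 s for the slow ones, so

  960 tasks × 30 s = 28,800 CPU*seconds = 28,800 / 3,600 = 8 CPU*hours

of capacity that computes nothing.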

Page 24:

Load Balance : FFT

Q: When is imbalance good? A: When it leads to a faster algorithm.

Page 25:

Load Balance: Summary

• Imbalance is most often a byproduct of data decomposition
• It must be addressed before further MPI tuning can happen
• Good software exists for graph partitioning / remeshing
• For regular grids, consider padding or contracting

Page 26:

Topics

• Load Balance

• Synchronization

• Simple stuff

• File I/O

Page 27:

Scaling of MPI_Barrier()

[Plot: MPI_Barrier time vs. concurrency spans four orders of magnitude]

Page 28:

Synchronization: terminology

MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
/* e.g. */ MPI_Allreduce();
T2 = MPI_Wtime() - T1;

• For a code running on N tasks, what is the distribution of the T2's?

• The average and width of this distribution tell us how synchronizing, e.g., MPI_Allreduce is on a given interconnect (HW & SW).

How synchronizing is MPI_Allreduce?
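A self-contained sketch of this measurement (my reconstruction, not the talk's code; reporting min/mean/max is one reasonable summary of the distribution):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double in = (double)rank, out;
    MPI_Barrier(MPI_COMM_WORLD);             /* line the tasks up */
    double t1 = MPI_Wtime();
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime() - t1;            /* this rank's T2 */

    /* Summarize the distribution of T2 across ranks. */
    double tmin, tmax, tsum;
    MPI_Reduce(&t2, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t2, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t2, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("T2 min %.2e  mean %.2e  max %.2e seconds\n",
               tmin, tsum / nprocs, tmax);

    MPI_Finalize();
    return 0;
}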

Page 29:

Synchronization : MPI Functions

Completion semantics of MPI functions:

• Local: leave based on local logic
– MPI_Comm_rank, MPI_Get_count

• Probably local: try to leave without messaging other tasks
– MPI_Isend/Irecv (the overlap sketch below exploits this)

• Partially synchronizing: leave after messaging M < N tasks
– MPI_Bcast, MPI_Reduce

• Fully synchronizing: leave only after everyone else enters
– MPI_Barrier, MPI_Allreduce
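The "probably local" behavior is what makes communication/computation overlap possible. A hedged sketch (the neighbor ranks, buffers, message count n, and do_flops are illustrative stand-ins):

MPI_Request reqs[2];
/* Post the receive, then the send; both try to return without blocking. */
MPI_Irecv(rbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
do_flops(Ni);           /* useful work while the messages are in flight */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* only now touch rbuf / reuse sbuf */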

Page 30:

seaborg.nersc.gov

• It's hard to discuss synchronization outside the context of a particular parallel computer

• MPI timings depend on HW, SW, and environment:
– How much of MPI is handled by the switch adapter?
– How big are the messaging buffers?
– How many thread locks per function?
– How noisy is the machine (today)?

• This is hard to model, so take an empirical approach based on an IBM SP, which is largely applicable to other clusters…

Page 31:

seaborg.nersc.gov basics

[Diagram: IBM SP; 16-way SMP NHII nodes with main memory, dual Colony switch planes (CSS0/CSS1), GPFS, and HPSS]

Resource        Speed    Size
Registers       3 ns     2560 B
L1 Cache        5 ns     32 KB
L2 Cache        45 ns    8 MB
Main Memory     300 ns   16 GB
Remote Memory   19 us    7 TB
GPFS            10 ms    50 TB
HPSS            5 s      9 PB

• 6080 dedicated CPUs, 96 shared login CPUs
• Hierarchy of caching; speeds are not balanced
• Bottleneck determined by the first depleted resource

Page 32:

MPI on the IBM SP

[Diagram: 16-way SMP NHII node with main memory, GPFS, Colony switch planes CSS0/CSS1, and HPSS]

• 2-4096-way concurrency
• MPI-1 and ~MPI-2
• GPFS-aware MPI-IO
• Thread safety
• Ranks on the same node bypass the switch

Page 33:

MPI: seaborg.nersc.gov

Intra and Inter Node Communication

MP_EUIDEVICE (fabric)   Bandwidth (MB/sec)   Latency (usec)
css0                    500 / 350            9 / 21
css1                    X                    X
csss                    500 / 350            9 / 21

(Paired entries are intra-node / inter-node values.)

• Lower latency can satisfy more syncs/sec
• What is the benefit of two adapters?
• This is for a single pair of tasks

Page 34:

Seaborg : point to point messaging

[Plot: intranode and internode point-to-point messaging between two 16-way SMP NHII nodes]

Switch bandwidth and latency are often stated in optimistic terms. The number and size of concurrent messages change things.

A fat-tree / crossbar switch helps hide this.

Page 35:

Inter-Node Bandwidth

[Plot: inter-node bandwidth vs. message size, for csss and css0]

• Tune message size to optimize throughput

• Aggregate messages when possible (a sketch follows)
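One way to aggregate, as a hedged sketch (the buffer size and array names are illustrative): pack several small payloads into one message and pay the per-message latency once instead of three times.

/* Sender: three small arrays, one message. */
char buf[65536];
int pos = 0;
MPI_Pack(a, na, MPI_DOUBLE, buf, sizeof(buf), &pos, MPI_COMM_WORLD);
MPI_Pack(b, nb, MPI_DOUBLE, buf, sizeof(buf), &pos, MPI_COMM_WORLD);
MPI_Pack(c, nc, MPI_INT,    buf, sizeof(buf), &pos, MPI_COMM_WORLD);
MPI_Send(buf, pos, MPI_PACKED, dest, 0, MPI_COMM_WORLD);

/* Receiver: unpack in the same order. */
MPI_Status status;
MPI_Recv(buf, sizeof(buf), MPI_PACKED, src, 0, MPI_COMM_WORLD, &status);
pos = 0;
MPI_Unpack(buf, sizeof(buf), &pos, a, na, MPI_DOUBLE, MPI_COMM_WORLD);
MPI_Unpack(buf, sizeof(buf), &pos, b, nb, MPI_DOUBLE, MPI_COMM_WORLD);
MPI_Unpack(buf, sizeof(buf), &pos, c, nc, MPI_INT,    MPI_COMM_WORLD);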

Page 36:

MPI Performance is often Hierarchical

Message size and task placement are key to performance.

[Plot: intra- vs. inter-node messaging performance]

Page 37:

MPI: Latency not always 1 or 2 numbers

The set of all possible latencies describes the interconnect geometry from the application's perspective.

Page 38:

Synchronization: measurement

MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
/* e.g. */ MPI_Allreduce();
T2 = MPI_Wtime() - T1;

How synchronizing is MPI_Allreduce?

For a code running on N tasks, what is the distribution of the T2's?

One could derive the level of synchronization from the MPI algorithms. Instead, let's just measure…

Page 39:

Synchronization: MPI Collectives

[Plot: distribution of collective timings at 2048 tasks]

Beyond load balance, there is a distribution of MPI timings intrinsic to the MPI call itself.

Page 40:

Synchronization: Architecture

[Plot: kernel process scheduling noise; t is the scheduling frequency; Unix: cron et al.]

…and noise comes from the machine itself.

Page 41:

Intrinsic Synchronization : Alltoall

Page 42:

Intrinsic Synchronization: Alltoall

Architecture makes a big difference!

Page 43:

This leads to variability in Execution Time

Page 44:

Synchronization : Summary

• As a programmer you can control:
– Which MPI calls you use (it's not required to use them all)
– Message sizes, and problem size (maybe)
– The temporal granularity of synchronization, i.e., where synchronizations occur

• Language writers and system architects control:
– How hard it is to do the above
– The intrinsic amount of noise in the machine

Page 45:

Topics

• Load Balance

• Synchronization

• Simple stuff

• File I/O

Page 46:

Simple Stuff

Parallel programs are easier to mess up than serial ones. Here are some common pitfalls.

Page 47:

What’s wrong here?

Page 48:

Is MPI_Barrier time bad? Probably. Is it avoidable?

~Three cases:
1) The stray / unknown / debug barrier
2) The barrier that is masking compute imbalance
3) Barriers used for I/O ordering

Often very easy to fix.

[Profile: time spent in MPI_Barrier]

Page 49:

Topics

• Load Balance

• Synchronization

• Simple stuff

• File I/O

Page 50:

Parallel File I/O : Strategies

[Diagram: strategies for moving data between MPI tasks and disk]

Some strategies fall down at scale.

Page 51:

Parallel File I/O: Metadata

• A parallel file system is great, but it is also another place to create contention

• Avoid unneeded disk I/O; know your file system

• Avoid file-per-task I/O strategies when running at scale (a shared-file sketch follows)
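One scale-friendly alternative, as a hedged sketch (the filename, nlocal, data buffer, and equal-slice offsets are illustrative, and assume every rank writes the same number of doubles): all ranks write slices of a single shared file with a collective MPI-IO call, so the file system sees one file instead of one per task.

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "out.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

/* Rank r owns a contiguous slice starting at r * nlocal doubles. */
MPI_Offset off = (MPI_Offset)rank * nlocal * sizeof(double);
MPI_File_write_at_all(fh, off, data, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);

MPI_File_close(&fh);

The collective (_all) form gives a GPFS-aware MPI-IO layer (page 32) the chance to combine and align the writes.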

Page 52:

Topics

• Load Balance

• Synchronization

• Simple stuff

• File I/O

Happy Scaling!

Page 53:

Other sources of information:

• MPI performance: http://www-unix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/

• Seaborg MPI scaling: http://www.nersc.gov/news/reports/technical/seaborg_scaling/

• MPI synchronization: Fabrizio Petrini, Darren J. Kerbyson, and Scott Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q", in Proc. SuperComputing, Phoenix, November 2003.

• Domain decomposition: http://www.ddm.org/ (also search for "space filling" and "decomposition", etc.)

• METIS: http://www-users.cs.umn.edu/~karypis/metis

Page 54:

Dynamical Load Balance: Motivation

[Plot: time per MPI rank, broken into Sync, Flops, and Exchange]
