High Performance Computing for Science and Engineering I
Fabian Wermelinger
Computational Science & Engineering Laboratory
Strong and Weak Scaling
OUTLINE
• Amdahl’s Law — Strong Scaling
  ‣ Fixed problem size
  ‣ How much does parallelism reduce the execution time of a problem?
• Gustafson’s Law — Weak Scaling
  ‣ Fixed execution time
  ‣ How much longer does the problem take without parallelism?
Amdahl’s Law — Strong Scaling Analysis
I wrote a shared memory code. How well does my code run in parallel?
Recall Amdahl’s Law from the first lecture:

$$S_p = \frac{1}{f + \frac{1 - f}{p}}$$

Speedup $S_p$ with $p$ processors.
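As a quick numerical illustration (not from the lecture), a minimal sketch that evaluates Amdahl’s Law for an assumed serial fraction of 10%:

```cpp
// Minimal sketch: evaluate Amdahl's Law S_p = 1 / (f + (1 - f) / p)
// for an assumed serial fraction f and a range of processor counts.
#include <cstdio>

double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main()
{
    const double f = 0.1;                     // assumed serial fraction (10%)
    const int procs[] = {1, 2, 4, 8, 16, 24};
    for (int p : procs)
        std::printf("p = %2d  S_p = %5.2f\n", p, amdahl_speedup(f, p));
    return 0;
}
```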
In a picture:

[Figure: timeline of serial execution (serial fraction $f$ of the code followed by parallel fraction $1 - f$) versus parallel execution with $p$ processors, where the parallel fraction shrinks to $(1 - f)/p$ while the serial fraction $f$ stays unchanged.]
Implicit assumptions in Amdahl’s Law:
• Fixed problem size
  ‣ Makes sense if p is relatively small
  ‣ Often we want to keep the execution time constant and increase the problem size (weak scaling)
• Negligible communication cost
  ‣ The number of processors p should be small
• All-or-none parallelism
  ‣ A more realistic model would be (see the sketch after this list):

$$S_p = \frac{1}{f_1 + \sum_{i=2}^{p} \frac{f_i}{i}} \qquad \text{with} \quad \sum_{i=1}^{p} f_i = 1$$

Problems for which those assumptions are reasonable can use this model for performance analysis. Such analysis is called Strong Scaling.
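A minimal sketch (not from the lecture) of the more realistic model above, assuming hypothetical fractions $f_i$ of the work that run with degree of parallelism $i$:

```cpp
// Sketch: generalized speedup S_p = 1 / (f_1 + sum_{i=2..p} f_i / i),
// where f[i-1] is the fraction of work running on i processors
// and the fractions sum to 1.
#include <cstdio>
#include <vector>

double generalized_speedup(const std::vector<double>& f)
{
    double denom = f[0]; // serial fraction f_1
    for (std::size_t i = 2; i <= f.size(); ++i)
        denom += f[i - 1] / static_cast<double>(i);
    return 1.0 / denom;
}

int main()
{
    // Hypothetical fractions: 5% serial, 15% two-way, 80% four-way parallel.
    const std::vector<double> f = {0.05, 0.15, 0.0, 0.80};
    std::printf("S_p = %.2f\n", generalized_speedup(f)); // ~3.08
    return 0;
}
```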
Recall: shared memory architectures cannot implement a large number of processors due to limitations on the memory bus as well as related cost issues. Communication cost using shared memory is still relatively low compared to distributed memory models.
Amdahl’s Law — Strong Scaling Analysis
Implication of fixed problem size:

Speed of a certain task: $\frac{w}{t}$, where $w$ is the associated work (problem size) and $t$ is the time needed to complete the work.

Speed for serial task: $w/t_1$
Speed for parallel task: $w/t_p$

Strong scaling speedup:

$$S_p = \frac{w/t_p}{w/t_1} = \frac{t_1}{t_p}$$
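For instance, with hypothetical timings (not from the slides):

```latex
t_1 = 12\,\mathrm{s}, \quad t_8 = 2\,\mathrm{s}
\quad\Rightarrow\quad
S_8 = \frac{t_1}{t_8} = \frac{12}{2} = 6
```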
Amdahl’s Law — Strong Scaling Analysis
Presenting Strong Scaling data:
[Figure: two plots. Left: execution time $t$ versus number of processors $p$, decreasing from $t_1$ at $p = 1$ to $t_p$. Right: strong speedup $S_p = t_1/t_p$ versus $p$, starting at 1 for $p = 1$.]
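A minimal sketch of how such data could be collected (assuming OpenMP and compilation with -fopenmp; the loop body is a placeholder workload):

```cpp
// Sketch of a strong scaling experiment: the problem size N stays fixed
// while the number of threads p varies. Speedup: S_p = t_1 / t_p.
// Compile with e.g.: g++ -O2 -fopenmp strong.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const std::size_t N = 1 << 24; // fixed problem size
    std::vector<double> a(N, 1.0), b(N, 2.0);
    const int procs[] = {1, 2, 4, 8, 16, 24};
    double t1 = 0.0;
    for (int p : procs) {
        omp_set_num_threads(p);
        const double start = omp_get_wtime();
#pragma omp parallel for
        for (std::size_t i = 0; i < N; ++i)
            a[i] += 3.0 * b[i]; // placeholder parallel workload
        const double tp = omp_get_wtime() - start;
        if (p == 1) t1 = tp;
        std::printf("p = %2d  t_p = %.4f s  S_p = %.2f\n", p, tp, t1 / tp);
    }
    return 0;
}
```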
Amdahl’s Law — Strong Scaling Analysis
Implication of the serial fraction $f$:

[Figure: speedup curves for serial fractions of 10%, 1%, and 0.1%. With a 10% serial fraction, the code is only about 6x faster with 24 threads…]

The serial fraction implies a performance upper bound:

$$\lim_{p \to \infty} S_p = \frac{1}{f}$$

Even with an infinite number of processors, this is the best we could do. Strong scaling analysis is very sensitive to the serial fraction. Communication overhead (e.g. synchronization) further degrades performance.
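A sketch checking this bound numerically (theoretical values only; measured speedups, like the ~6x above, are lower still because of overheads):

```cpp
// Sketch: Amdahl speedup at p = 24 and the p -> infinity limit (= 1/f)
// for the serial fractions quoted above. Theoretical values only.
#include <cstdio>

int main()
{
    const double fractions[] = {0.1, 0.01, 0.001};
    for (double f : fractions) {
        const double s24 = 1.0 / (f + (1.0 - f) / 24.0);
        std::printf("f = %.3f  S_24 = %6.2f  1/f = %6.0f\n", f, s24, 1.0 / f);
    }
    return 0;
}
```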
Amdahl’s Law — Strong Scaling Analysis
So are we doomed?
• If you want to squeeze more parallelism out of a fixed problem size, then yes!
• We are interested in running HPC applications on very large computers. The question we ask ourselves is: how much can I increase the size of my problem such that the execution time is the same as if I ran the problem with only one process (relative to a fixed problem size per parallel process)?
Gustafson’s Law — Weak Scaling Analysis
[Figure: Amdahl’s Law versus Gustafson’s Law timelines.]

Amdahl’s Law: fixed total problem size $f + (1 - f) = 1$; with $p$ processors the parallel part shrinks to $(1 - f)/p$ while the serial fraction $f$ remains. Question: how much faster am I with $p$ processors for a fixed problem size?

Gustafson’s Law: fixed problem size per process, $f + (1 - f) = 1$, so the total problem size grows to $f + (1 - f) + (1 - f) + \ldots = f + p(1 - f) > 1$. Question: how much longer does it take for a given workload when parallelism is absent? The workload per process is constant!
Gustafson’s Law — Weak Scaling Analysis

Speedup:

$$S_p = f + p(1 - f)$$

The serial fraction is unaffected by parallelization!
Main question: how well does the parallel fraction scale among $p$ processors?
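A worked example (hypothetical numbers): with $f = 0.1$ and $p = 24$,

```latex
S_{24} = f + p(1 - f) = 0.1 + 24 \times 0.9 = 21.7
```

far above what strong scaling allows for the same serial fraction, where $S_p$ can never exceed $1/f = 10$.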
Gustafson’s Law — Weak Scaling Analysis
Implication of fixed problem size per process:

Speed of a certain task: $\frac{w_1}{t}$, where $w_1$ is the associated work (problem size per process) and $t$ is the time needed to complete the work.

Speed for serial task: $w_1/t_1$
Speed for parallel task: $p\,w_1/t_p$

Weak scaling speedup:

$$S_p = \frac{p w_1/t_p}{w_1/t_1} = p \frac{t_1}{t_p}$$

Weak scaling efficiency:

$$E_w = \frac{S_p}{p} = \frac{t_1}{t_p}$$
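For instance, with hypothetical timings (not from the slides):

```latex
t_1 = 10\,\mathrm{s}, \quad t_{16} = 12.5\,\mathrm{s}
\quad\Rightarrow\quad
S_{16} = 16 \cdot \frac{10}{12.5} = 12.8, \qquad
E_w = \frac{10}{12.5} = 0.8
```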
Gustafson’s Law — Weak Scaling Analysis
Gustafson’s point of view (example shown for $p = 24$ processors):

[Figure: serial fraction $f$ followed by $p$ parallel blocks of size $1 - f$; problem size per process $f + (1 - f) = 1$, total problem size $f + (1 - f) + (1 - f) + \ldots = f + p(1 - f) > 1$.]

Gustafson’s Law scales relative to the parallel fraction of a code. The serial fraction does not affect the scaling.
Gustafson’s Law — Weak Scaling Analysis
Presenting Weak Scaling data:
[Figure: two plots. Left: execution time $t$ versus number of processes $p$, ideally constant from $t_1$ at $p = 1$ to $t_p$. Right: weak scaling efficiency $E_w = t_1/t_p$ versus $p$, ideally constant at 1.]
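A minimal sketch of how such data could be collected (assuming OpenMP and -fopenmp; the per-thread workload is a placeholder):

```cpp
// Sketch of a weak scaling experiment: the work per thread is fixed,
// so the total problem size grows with the thread count p.
// Weak scaling efficiency: E_w = t_1 / t_p.
#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const std::size_t n_per_thread = 1 << 22; // fixed work per thread
    const int procs[] = {1, 2, 4, 8, 16, 24};
    double t1 = 0.0;
    for (int p : procs) {
        const std::size_t N = n_per_thread * p; // total size grows with p
        std::vector<double> a(N, 1.0), b(N, 2.0);
        omp_set_num_threads(p);
        const double start = omp_get_wtime();
#pragma omp parallel for
        for (std::size_t i = 0; i < N; ++i)
            a[i] += 3.0 * b[i]; // placeholder parallel workload
        const double tp = omp_get_wtime() - start;
        if (p == 1) t1 = tp;
        std::printf("p = %2d  t_p = %.4f s  E_w = %.2f\n", p, tp, t1 / tp);
    }
    return 0;
}
```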