Lecture 6: Performance of Multiprocessor Systems


Speedup

$$\text{Speedup} = \frac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \frac{T_1}{T_p}$$

$t_s$: time for the serial part of the algorithm

$t_p$: time for the parallelizable part of the algorithm

$$T_1 = t_s + t_p$$

$$T_p = t_s + \frac{t_p}{p}$$

$$\text{Speedup}(p) = \frac{t_s + t_p}{t_s + t_p / p}$$

[Figure: Speedup(p) versus p, compared with the ideal speedup]
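The model Tp = ts + tp/p can be tried out numerically. The following is a minimal sketch, not part of the original slides; the function name `speedup` and the 10 s / 90 s workload are assumptions chosen only for illustration.

```python
def speedup(ts: float, tp: float, p: int) -> float:
    """Speedup(p) = (ts + tp) / (ts + tp / p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    time_on_1 = ts + tp        # T1: serial part plus parallelizable part
    time_on_p = ts + tp / p    # Tp: only the parallelizable part is divided by p
    return time_on_1 / time_on_p


if __name__ == "__main__":
    # Hypothetical workload: 10 s serial, 90 s parallelizable.
    for p in (1, 2, 4, 8, 16):
        print(p, round(speedup(10.0, 90.0, p), 3))
```

With this assumed workload the speedup grows much more slowly than p, because the serial 10 s is never divided.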

Amdahl’s Law

If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.

$$t_s = \frac{1}{s} T_1$$

$$t_p = \left(1 - \frac{1}{s}\right) T_1$$

$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}$$

$$\lim_{p \to \infty} \text{Speedup}(p) = s$$
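To see the bound numerically, here is a small sketch of Amdahl's Law in terms of the serial fraction 1/s, where T1 cancels out of the ratio. The function name `amdahl_speedup` and the choice s = 10 are assumptions for illustration only.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Speedup(p) = 1 / (1/s + (1 - 1/s) / p), with serial fraction 1/s."""
    serial = 1.0 / s
    return 1.0 / (serial + (1.0 - serial) / p)


if __name__ == "__main__":
    s = 10.0  # assumed: 10% of the single-processor time is serial
    for p in (1, 10, 100, 1000, 1_000_000):
        # The values approach s from below as p grows.
        print(p, round(amdahl_speedup(s, p), 4))
```

However large p becomes, the printed speedup stays below s, which is the limit stated by the law.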

Speedup

Superlinear speedup

Speedup(p) > p: superlinear speedup

Reasons:

Increased cache size

Random algorithms

Parallel algorithm

Speedup

$$\text{Speedup} = \frac{T_1}{T_p}$$

Relative speedup: T1 is the single-processor execution time of the parallel algorithm.

Absolute speedup: T1 is the execution time of the best sequential algorithm on one processor.

Efficiency

$$\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p} = \frac{T_1}{p \times T_p} \le 1$$
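Efficiency, T1 / (p x Tp), can be computed directly from measured times. This is a minimal sketch; the function name `efficiency` and the timings below are hypothetical and not taken from the slides.

```python
def efficiency(time_on_1: float, time_on_p: float, p: int) -> float:
    """Efficiency(p) = T1 / (p * Tp), which equals Speedup(p) / p."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return time_on_1 / (p * time_on_p)


if __name__ == "__main__":
    # Hypothetical timings: 100 s on one processor, 32.5 s on four.
    print(round(efficiency(100.0, 32.5, 4), 3))
    # A perfectly parallel run (25 s on four processors) gives efficiency 1.
    print(efficiency(100.0, 25.0, 4))
```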


Gustafson’s Law

[Figure: work and time for p = 1, 2, 3, 4 processors under the fixed-size and fixed-time models. Work is divided into serial work (ws) and parallelizable work (wp); time is divided into serial time (ts) and parallel time (tp or tp/p).]

Gustafson’s Law: Scaled Speedup (Fixed-time Speedup)

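$$T_p = t_s + t_p$$

$$T_1 = t_s + p \cdot t_p$$

If the sequential component of an algorithm is 1/s of the program’s execution time on p processors, then

$$t_s = \frac{1}{s} T_p$$

$$t_p = \left(1 - \frac{1}{s}\right) T_p$$

$$\text{Speedup}(p) = \frac{T_1}{T_p} = \frac{1}{s} + \left(1 - \frac{1}{s}\right) p$$

$$\lim_{p \to \infty} \text{Speedup}(p) = \infty$$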
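The contrast between the scaled speedup 1/s + (1 - 1/s)p and the Amdahl bound can be checked with a few numbers. This is an illustrative sketch only; the function names and the choice s = 10 are assumptions. Note that 1/s is a fraction of Tp in the scaled case and a fraction of T1 in the Amdahl case, so the two rows describe different workloads.

```python
def scaled_speedup(s: float, p: int) -> float:
    """Gustafson: Speedup(p) = 1/s + (1 - 1/s) * p, with 1/s a fraction of Tp."""
    serial = 1.0 / s
    return serial + (1.0 - serial) * p


def amdahl_speedup(s: float, p: int) -> float:
    """Amdahl: Speedup(p) = 1 / (1/s + (1 - 1/s) / p), with 1/s a fraction of T1."""
    serial = 1.0 / s
    return 1.0 / (serial + (1.0 - serial) / p)


if __name__ == "__main__":
    s = 10.0  # assumed serial fraction of 10%
    for p in (1, 10, 100, 1000):
        print(p, round(scaled_speedup(s, p), 2), round(amdahl_speedup(s, p), 2))
```

The scaled speedup grows linearly in p, while the fixed-size speedup levels off below s.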

Sizeup

$$\text{Sizeup} = \frac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$$

$w_s$: serial work

$w_p$: parallelizable work

$w_p'$: scaled parallelizable work, $w_p' = p \cdot w_p$

$$\text{Sizeup} = \frac{w_s + w_p'}{w_s + w_p} = \frac{w_s + p \cdot w_p}{w_s + w_p}$$
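As a quick check of Sizeup = (ws + p.wp) / (ws + wp), the sketch below evaluates it for a made-up workload. The function name `sizeup` and the work units are assumptions used only for illustration.

```python
def sizeup(ws: float, wp: float, p: int) -> float:
    """Sizeup = (ws + p * wp) / (ws + wp): scaled work relative to original work."""
    scaled_parallel_work = p * wp   # wp' in the notation of the slides
    return (ws + scaled_parallel_work) / (ws + wp)


if __name__ == "__main__":
    # Hypothetical workload: 10 units of serial work, 90 units of parallelizable work.
    for p in (1, 2, 4, 8):
        print(p, round(sizeup(10.0, 90.0, p), 2))
```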

Roofline Performance Model

Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory.

$$\text{Arithmetic intensity} = \frac{\text{floating-point operations}}{\text{number of data bytes}} = \text{FLOPs/Byte}$$

$$\text{Attainable GFLOPs/second} = \min\left(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance}\right)$$

Roofline Performance Model

Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).

For multicore chips, peak performance is the collective performance of all the cores on the chip. For systems with more than one chip, multiply the peak per chip by the number of chips.

Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second).

The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as

Peak memory bandwidth x Arithmetic intensity

(bytes/second) x (FLOPs/byte) ==> FLOPs/second

Roofline sets an upper bound on performance

Roofline of a computer does not vary by benchmark kernel
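The roofline bound, min(peak memory bandwidth x arithmetic intensity, peak floating-point performance), is easy to evaluate for a range of arithmetic intensities. The sketch below assumes a hypothetical machine with 100 GFLOPs/second peak performance and 25 GB/second peak memory bandwidth; these numbers and the function name are illustrative and do not describe any real system.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbytes: float, intensity: float) -> float:
    """Attainable GFLOPs/s = min(peak bandwidth * arithmetic intensity, peak FP performance)."""
    memory_bound = peak_bw_gbytes * intensity  # (GB/s) x (FLOPs/byte) = GFLOPs/s
    return min(memory_bound, peak_gflops)


if __name__ == "__main__":
    PEAK_GFLOPS = 100.0  # hypothetical peak floating-point performance
    PEAK_BW = 25.0       # hypothetical peak memory bandwidth in GB/s
    for ai in (0.25, 1.0, 4.0, 16.0):
        print(ai, attainable_gflops(PEAK_GFLOPS, PEAK_BW, ai))
```

For this assumed machine, kernels with arithmetic intensity below 4 FLOPs/byte are limited by memory bandwidth, and kernels above it are limited by peak floating-point performance.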

Stream Benchmark

A synthetic benchmark

Measures the performance of long vector operations

The vector operations have no temporal locality and access arrays that are larger than the cache size

http://www.cs.virginia.edu/stream/ref.html
