
Lecture 6: Performance of Multiprocessor Systems

Jan 01, 2016


Robert Butler

Page 1: Lecture 6: Performance of Multiprocessor Systems

Lecture 6: Performance of Multiprocessor Systems

Page 2: Lecture 6: Performance of Multiprocessor Systems

Speedup

$$\text{Speedup} = \frac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \frac{T_1}{T_p}$$

$t_s$: time for the serial part of the algorithm

$t_p$: time for the parallelizable part of the algorithm

$$T_1 = t_s + t_p$$

$$T_p = t_s + \frac{t_p}{p}$$

$$\text{Speedup}(p) = \frac{t_s + t_p}{t_s + t_p/p}$$

[Figure: Speedup versus p, with the ideal speedup line.]
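The speedup formula on this slide can be checked numerically. The following is a minimal sketch in Python; the function name `speedup` and the timings (10 s serial, 90 s parallelizable) are assumptions chosen for illustration and are not from the lecture.

```python
def speedup(t_s: float, t_p: float, p: int) -> float:
    """Speedup(p) = (t_s + t_p) / (t_s + t_p / p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    t_1 = t_s + t_p            # execution time on 1 processor
    t_on_p = t_s + t_p / p     # execution time on p processors
    return t_1 / t_on_p


# Hypothetical workload: 10 s serial part, 90 s parallelizable part.
for p in (1, 2, 4, 8, 16):
    print(p, round(speedup(10.0, 90.0, p), 2))
```

With these assumed timings the speedup grows more slowly than p, because the serial part is not divided among the processors.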

Page 3: Lecture 6: Performance of Multiprocessor Systems

Amdahl’s Law

If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.

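$$t_s = \frac{1}{s}\,T_1$$

$$t_p = \left(1 - \frac{1}{s}\right) T_1$$

$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}$$

$$\lim_{p \to \infty} \text{Speedup}(p) = s$$

[Figure: Speedup versus p, with the level s marked.]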
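To see the limit numerically, here is a small Python sketch of Amdahl's Law. The function name `amdahl_speedup` and the value s = 20 (a 5% serial fraction) are assumptions made for the example, not values from the lecture.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Speedup(p) when the serial part is 1/s of T_1 (T_1 cancels out)."""
    serial_fraction = 1.0 / s
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


s = 20.0  # hypothetical: serial part is 1/20 of the single-processor time
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(s, p), 3))
```

As p grows, the printed values approach s = 20 from below but never reach it.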

Page 4: Lecture 6: Performance of Multiprocessor Systems

Speedup

Page 5: Lecture 6: Performance of Multiprocessor Systems

Speedup

Page 6: Lecture 6: Performance of Multiprocessor Systems

Speedup

Page 7: Lecture 6: Performance of Multiprocessor Systems

Superlinear speedup

$\text{Speedup}(p) > p$: superlinear speedup

Reasons:
- Increased cache size
- Random algorithms
- Parallel algorithm

Page 8: Lecture 6: Performance of Multiprocessor Systems

Speedup

$$\text{Speedup} = \frac{T_1}{T_p}$$

Relative speedup: $T_1$ is the single-processor execution time of the parallel algorithm.

Absolute speedup: $T_1$ is the execution time of the best serial algorithm on one processor.

Page 9: Lecture 6: Performance of Multiprocessor Systems

Efficiency

$$\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p} = \frac{T_1}{p \times T_p} \le 1$$

[Figure: Efficiency versus p, with the level 1 marked.]
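As a quick illustration of the efficiency formula, the sketch below computes $T_1 / (p \times T_p)$ in Python. The function name `efficiency` and the timings (10 s serial, 90 s parallelizable) are hypothetical.

```python
def efficiency(t_1: float, t_on_p: float, p: int) -> float:
    """Efficiency(p) = T_1 / (p * T_p)."""
    return t_1 / (p * t_on_p)


# Hypothetical workload: 10 s serial part, 90 s parallelizable part.
t_s, t_par = 10.0, 90.0
t_1 = t_s + t_par
for p in (1, 2, 4, 8, 16):
    t_on_p = t_s + t_par / p   # execution time on p processors
    print(p, round(efficiency(t_1, t_on_p, p), 3))
```

With a nonzero serial part, efficiency falls as processors are added, since each extra processor shortens only the parallelizable part.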

Page 10: Lecture 6: Performance of Multiprocessor Systems

Amdahl’s Law

If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.

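$$t_s = \frac{1}{s}\,T_1$$

$$t_p = \left(1 - \frac{1}{s}\right) T_1$$

$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}$$

$$\lim_{p \to \infty} \text{Speedup}(p) = s$$

[Figure: Speedup versus p, with the level s marked.]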

Page 11: Lecture 6: Performance of Multiprocessor Systems

Gustafson’s Law

[Figure: bar charts of work (ws, wp) and time (ts, tp) against the number of processors p = 1 to 4, in two panels: "Fixed size" and "Fixed time".]

Page 12: Lecture 6: Performance of Multiprocessor Systems

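Gustafson’s Law: Scaled Speedup (Fixed-time Speedup)

$$T_p = t_s + t_p$$

$$T_1 = t_s + p \cdot t_p$$

If the sequential component of an algorithm is 1/s of the program’s execution time on p processors, then

$$t_s = \frac{1}{s}\,T_p$$

$$t_p = \left(1 - \frac{1}{s}\right) T_p$$

$$\text{Speedup}(p) = \frac{T_1}{T_p} = \frac{1}{s} + \left(1 - \frac{1}{s}\right) p$$

$$\lim_{p \to \infty} \text{Speedup}(p) = \infty$$

[Figure: Speedup versus p, with the ideal speedup line.]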
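The scaled speedup formula is linear in p, which a short Python sketch makes visible. The function name `scaled_speedup` and the value s = 20 are assumptions for illustration only.

```python
def scaled_speedup(s: float, p: int) -> float:
    """Speedup(p) = 1/s + (1 - 1/s) * p, with 1/s the serial fraction of T_p."""
    serial_fraction = 1.0 / s
    return serial_fraction + (1.0 - serial_fraction) * p


s = 20.0  # hypothetical: serial part is 1/20 of the time on p processors
for p in (1, 10, 100, 1000):
    print(p, round(scaled_speedup(s, p), 2))
```

Unlike the fixed-size case, the values keep growing with p, because the parallelizable work is scaled up to fill the same execution time.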

Page 13: Lecture 6: Performance of Multiprocessor Systems

Sizeup

$$\text{Sizeup} = \frac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$$

$w_s$: serial work

$w_p$: parallelizable work

$w_p'$: scaled parallelizable work, $w_p' = p \cdot w_p$

$$\text{Sizeup} = \frac{w_s + w_p'}{w_s + w_p} = \frac{w_s + p \cdot w_p}{w_s + w_p}$$
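The sizeup ratio $(w_s + p \cdot w_p) / (w_s + w_p)$ can be evaluated directly. This is a minimal Python sketch; the function name `sizeup` and the work units (10 serial, 90 parallelizable) are hypothetical.

```python
def sizeup(w_s: float, w_p: float, p: int) -> float:
    """Sizeup = (w_s + p * w_p) / (w_s + w_p)."""
    scaled_parallel_work = p * w_p   # w_p' in the slide's notation
    return (w_s + scaled_parallel_work) / (w_s + w_p)


# Hypothetical workload: 10 units of serial work, 90 units of parallelizable work.
for p in (1, 2, 4, 8):
    print(p, sizeup(10.0, 90.0, p))
```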

Page 14: Lecture 6: Performance of Multiprocessor Systems

Roofline Performance Model

Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory.

$$\text{Arithmetic intensity} = \frac{\text{floating-point operations}}{\text{number of data bytes}} \quad (\text{FLOPs/byte})$$
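As a worked example of the ratio, the Python sketch below estimates the arithmetic intensity of a vector update `y[i] = a * x[i] + y[i]` on 8-byte doubles. The kernel, the function name `arithmetic_intensity`, and the traffic count are assumptions: the count takes every element to be read from or written to main memory exactly once and ignores cache effects.

```python
def arithmetic_intensity(flops: float, bytes_accessed: float) -> float:
    """Arithmetic intensity in FLOPs per byte of main-memory traffic."""
    return flops / bytes_accessed


n = 1_000_000                 # hypothetical vector length
flops = 2 * n                 # one multiply and one add per element
bytes_accessed = 3 * 8 * n    # read x[i], read y[i], write y[i], 8 bytes each

print(arithmetic_intensity(flops, bytes_accessed))  # 2/24 FLOPs/byte, about 0.083
```

Under these assumptions the intensity does not depend on n, since both the operation count and the traffic grow linearly with the vector length.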

Page 15: Lecture 6: Performance of Multiprocessor Systems

Roofline Performance Model

$$\text{Attainable GFLOPs/second} = \min\left(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance}\right)$$
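The min formula translates directly into code. The following Python sketch assumes a hypothetical machine with 100 GFLOPs/second peak floating-point performance and 20 GB/second peak memory bandwidth; neither figure comes from the lecture, and the function name `attainable_gflops` is illustrative.

```python
def attainable_gflops(peak_gflops: float,
                      peak_bandwidth_gbytes: float,
                      intensity: float) -> float:
    """min(peak memory bandwidth * arithmetic intensity, peak FP performance)."""
    memory_bound = peak_bandwidth_gbytes * intensity   # GB/s * FLOPs/byte = GFLOPs/s
    return min(memory_bound, peak_gflops)


# Hypothetical machine: 100 GFLOPs/s peak, 20 GB/s memory bandwidth.
PEAK_GFLOPS = 100.0
PEAK_BANDWIDTH = 20.0
for intensity in (0.125, 0.5, 2.0, 5.0, 8.0):
    print(intensity, attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH, intensity))
```

For this assumed machine the two limits meet at 5 FLOPs/byte: kernels below that intensity are limited by memory bandwidth, and kernels above it are limited by peak floating-point performance.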

Page 16: Lecture 6: Performance of Multiprocessor Systems

Roofline Performance Model

Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).

For multicore chips, peak performance is the collective performance of all the cores on the chip. For a system with several chips, multiply the peak per chip by the number of chips.

Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second).

The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as

$$\text{Peak memory bandwidth} \times \text{Arithmetic intensity}$$

$$(\text{bytes/second}) \times (\text{FLOPs/byte}) \Rightarrow \text{FLOPs/second}$$

Page 17: Lecture 6: Performance of Multiprocessor Systems

Roofline Performance Model

Roofline sets an upper bound on performance

The roofline of a computer does not vary by benchmark kernel

Page 18: Lecture 6: Performance of Multiprocessor Systems

Stream Benchmark

A synthetic benchmark

Measures the performance of long vector operations

They have no temporal locality and they access arrays that are larger than the cache size

http://www.cs.virginia.edu/stream/ref.html