Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Parallel Architectures & Performance Analysis
Parallel computer: multiple-processor system supporting parallel programming.
Three principal types of architecture:
- Vector computers, in particular processor arrays
- Shared memory multiprocessors
  - Specially designed and manufactured systems
- Distributed memory multicomputers
  - Message-passing systems readily formed from a cluster of workstations
Parallel Architectures and Performance Analysis – Slide 2
Parallel Computers
Vector computer: instruction set includes operations on vectors as well as scalars (see the sketch below).
Two ways to implement vector computers:
- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements
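As a concrete illustration (a sketch of mine, not from the slides; plain C shown only for the semantics): a vector computer can issue an element-wise loop like the one below as a single vector instruction rather than n separate scalar instructions.

/* Element-wise vector addition c = a + b: on a vector computer this
   whole loop corresponds to one vector add instruction (or a few, if
   the vectors must be processed in fixed-length chunks). */
void vector_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}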
Parallel Architectures and Performance Analysis – Slide 3
Type 1: Vector Computers
- Natural way to extend the single-processor model
- Have multiple processors connected to multiple memory modules such that each processor can access any memory module
- So-called shared memory configuration (a code sketch follows):
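A minimal shared-memory sketch (assuming OpenMP, which the slides do not mention): every processor works on the same array through the single shared address space.

#include <omp.h>

/* Shared memory: all threads see the same array a[] through the common
   address space; the pragma divides the loop iterations among the
   processors. Compile with OpenMP enabled, e.g. gcc -fopenmp. */
void scale(double *a, int n, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}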
Parallel Architectures and Performance Analysis – Slide 4
Type 2: Shared Memory Multiprocessor Systems
Parallel Architectures and Performance Analysis – Slide 5
Ex: Quad Pentium Shared Memory Multiprocessor
Type 2: Distributed Multiprocessor
- Distribute primary memory among processors
- Increase aggregate memory bandwidth and lower average memory access time
- Allow greater number of processors
- Also called non-uniform memory access (NUMA) multiprocessor
Parallel Architectures and Performance Analysis – Slide 6
Fundamental Types of Shared Memory Multiprocessor
Parallel Architectures and Performance Analysis – Slide 7
Distributed Multiprocessor
Complete computers connected through an interconnection network
Parallel Architectures and Performance Analysis – Slide 8
Type 3: Message-Passing Multicomputers
Distributed memory multiple-CPU computer
- Same address on different processors refers to different physical memory locations
- Processors interact through message passing (see the sketch below)
- Commercial multicomputers
- Commodity clusters
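A minimal message-passing sketch (assuming MPI, one common implementation; run with at least two processes, e.g. mpirun -np 2): since the processors share no memory, data moves only through explicit send/receive pairs, and the same variable x lives in different physical memory on each process.

#include <mpi.h>
#include <stdio.h>

/* Message passing: process 0 sends a value to process 1. */
int main(int argc, char *argv[])
{
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", x);
    }
    MPI_Finalize();
    return 0;
}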
Parallel Architectures and Performance Analysis – Slide 9
Multicomputers
Parallel Architectures and Performance Analysis – Slide 10
Asymmetrical Multicomputer
Parallel Architectures and Performance Analysis – Slide 11
Symmetrical Multicomputer
Parallel Architectures and Performance Analysis – Slide 12
ParPar Cluster: A Mixed Model
Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.
Also important are number of processors, number of programs which can be executed, and the memory structure.
Parallel Architectures and Performance Analysis – Slide 13
Alternate System: Flynn’s Taxonomy
Parallel Architectures and Performance Analysis – Slide 14
Flynn’s Taxonomy: SISD (cont.)
[Diagram: SISD organization. A control unit issues control signals to a single arithmetic processor; an instruction and data stream flows from memory, and results flow back.]
Parallel Architectures and Performance Analysis – Slide 15
Flynn’s Taxonomy: SIMD (cont.)
[Diagram: SIMD organization. One control unit broadcasts a single control signal to processing elements PE 1 through PE n, each of which operates on its own data stream.]
Parallel Architectures and Performance Analysis – Slide 16
Flynn’s Taxonomy: MISD (cont.)
[Diagram: MISD organization. Control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, all of which operate on a single shared data stream.]
Parallel Architectures and Performance Analysis – Slide 17
MISD Architectures (cont.)
Serial execution of two processes with 4 stages each: time to execute is T = 8t, where t is the time to execute one stage.
Pipelined execution of the same two processes: T = 5t.
[Diagram: in serial execution, stages S1 through S4 of the first process are followed by S1 through S4 of the second; in pipelined execution the second process starts one stage behind the first, so their stages overlap.]
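More generally (a standard pipelining result, not stated explicitly on the slide), n processes of k stages each complete in
$$T = (k + n - 1)\,t,$$
so here k = 4 and n = 2 give T = 5t, versus T = nkt = 8t for serial execution.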
Parallel Architectures and Performance Analysis – Slide 18
Flynn’s Taxonomy: MIMD (cont.)
[Diagram: MIMD organization. Control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, each of which operates on its own data stream.]
Multiple Program Multiple Data (MPMD) Structure
Within the MIMD classification, which is our main concern, each processor has its own program to execute.
Parallel Architectures and Performance Analysis – Slide 19
Two MIMD Structures: MPMD
Single Program Multiple Data (SPMD) Structure
A single source program is written, and each processor executes its own copy of this program, independently of the others and not in synchronism.
The source program can be constructed so that parts of the program are executed by certain computers and not others, depending on the identity of the computer.
The software equivalent of SIMD; can perform SIMD calculations on MIMD hardware (see the sketch below).
Parallel Architectures and Performance Analysis – Slide 20
Two MIMD Structures: SPMD
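A minimal SPMD sketch (again assuming MPI; the two task routines are hypothetical, included only to make the example complete): every process runs the same source program and branches on its own identity, its rank, to decide which part to execute.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical task routines, defined here only for completeness. */
static void coordinate_work(void) { printf("rank 0: coordinating\n"); }
static void do_work(int rank)     { printf("rank %d: working\n", rank); }

/* SPMD: one program, many processes; each branches on its rank. */
int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        coordinate_work();
    else
        do_work(rank);
    MPI_Finalize();
    return 0;
}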
Architectures
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical
Flynn’s Taxonomy
- SISD, SIMD, MISD, MIMD (MPMD, SPMD)
Parallel Architectures and Performance Analysis – Slide 21
Topic 1 Summary
A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.
The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.
Parallel Architectures and Performance Analysis – Slide 22
Topic 2: Performance Measures and Analysis
The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.
The speedup factor of a parallel computation utilizing p processors is defined as the following ratio:
In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.
Parallel Architectures and Performance Analysis – Slide 23
Speedup Factor
$$S(p) = \frac{\text{execution time using one processor}}{\text{execution time using a multiprocessor with } p \text{ processors}} = \frac{T_s}{T_p}$$
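A quick worked example (numbers invented for illustration): if a program takes $T_s = 80$ seconds on one processor and $T_p = 20$ seconds on a multiprocessor, then
$$S(p) = \frac{80}{20} = 4.$$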
Speedup factor can also be cast in terms of computational steps:
Maximum speedup is (usually) p with p processors (linear speedup).
Parallel Architectures and Performance Analysis – Slide 24
Speedup Factor (cont.)
$$S(p) = \frac{\text{number of computational steps using one processor}}{\text{number of computational steps using } p \text{ processors}}$$
Given a problem of size n on p processors, let
- σ(n) = inherently sequential computations
- φ(n) = potentially parallel computations
- κ(n,p) = communication operations
Then:
Parallel Architectures and Performance Analysis – Slide 25
Execution Time Components
$$S(p) = \frac{T_s}{T_p} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}$$
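As a minimal sketch of this model in code (the argument values are assumptions supplied by the caller, not anything from the slides):

/* Upper bound on speedup from the three execution-time components:
   sigma = inherently sequential time, phi = potentially parallel time,
   kappa = communication time on p processors. */
double predicted_speedup(double sigma, double phi, double kappa, int p)
{
    return (sigma + phi) / (sigma + phi / p + kappa);
}

For instance, predicted_speedup(4.0, 96.0, 2.0, 8) evaluates to 100/18, roughly 5.6.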
Parallel Architectures and Performance Analysis – Slide 26
Speedup Plot
[Figure: speedup versus number of processors. The computation-time component falls as φ(n)/p while the communication-time component κ(n,p) grows, so the speedup curve levels off and bends over (“elbowing out”) as the number of processors increases.]
The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system:
Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.
Parallel Architectures and Performance Analysis – Slide 27
Efficiency
$$E(p) = \frac{S(p)}{p} = \frac{\text{execution time using one processor}}{p \times \text{execution time using } p \text{ processors}} = \frac{T_s}{p\,T_p}$$
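Continuing the earlier invented numbers: with $T_s = 80$ s, $T_p = 20$ s and $p = 8$,
$$E(8) = \frac{80}{8 \times 20} = 0.5,$$
i.e. on average each processing element does useful work only half the time.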
Since E = S(p)/p, by what we did earlier:
- Since all terms are positive, E > 0.
- Furthermore, since the denominator is larger than the numerator, E < 1.
Parallel Architectures and Performance Analysis – Slide 28
Analysis of Efficiency
$$E(p) \le \frac{\sigma(n) + \varphi(n)}{p\,\sigma(n) + \varphi(n) + p\,\kappa(n,p)}$$
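For example (values assumed for illustration): with $\sigma(n) = 4$, $\varphi(n) = 96$, $\kappa(n,p) = 2$ and $p = 8$,
$$E \le \frac{100}{8 \cdot 4 + 96 + 8 \cdot 2} = \frac{100}{144} \approx 0.69,$$
consistent with the speedup of about 5.6 on 8 processors computed above.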
Parallel Architectures and Performance Analysis – Slide 29
Maximum Speedup: Amdahl’s Law
As before, except that the communication term can be dropped: since the communication time must be non-trivial, removing it only raises the bound.
Let f represent the inherently sequential portion of the computation; then
Parallel Architectures and Performance Analysis – Slide 30
Amdahl’s Law (cont.)
$$S(p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}$$
$$f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}$$
Limitations
- Ignores communication time
- Overestimates speedup achievable
Amdahl Effect
- Typically κ(n,p) has lower complexity than φ(n)/p
- So as n increases, φ(n)/p dominates κ(n,p)
- Thus as n increases, speedup increases
Parallel Architectures and Performance Analysis – Slide 31
Amdahl’s Law (cont.)
$$S(p) \le \frac{1}{f + (1 - f)/p}$$
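A worked example (f chosen for illustration): if 10% of the computation is inherently sequential (f = 0.1), then on p = 8 processors
$$S(8) \le \frac{1}{0.1 + 0.9/8} \approx 4.7,$$
and no matter how many processors are added, $S(p) \le 1/f = 10$.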
As before
Let s represent the fraction of time spent in parallel computation performing inherently sequential operations; then
Parallel Architectures and Performance Analysis – Slide 32
Gustafson-Barsis’ Law
$$S(p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}$$
$$s = \frac{\sigma(n)}{\sigma(n) + \varphi(n)/p}$$
Then
Parallel Architectures and Performance Analysis – Slide 33
Gustafson-Barsis’ Law (cont.)
$$S(p) \le s + p(1 - s) = p + (1 - p)s$$
- Begin with parallel execution time instead of sequential time
- Estimate the sequential execution time to solve the same problem
- Problem size is an increasing function of p
- Predicts scaled speedup
Parallel Architectures and Performance Analysis – Slide 34
Gustafson-Barsis’ Law (cont.)
$$S(p) \le p + (1 - p)s$$
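A worked example (s chosen for illustration): if s = 0.1 of the time on p = 8 processors is spent on inherently sequential operations, the scaled speedup is
$$S(8) \le 8 + (1 - 8)(0.1) = 7.3.$$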
Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time
Both overestimate speedup or scaled speedup achievable
[Photos: Gene Amdahl and John L. Gustafson]
Parallel Architectures and Performance Analysis – Slide 35
Limitations
- Performance terms: speedup, efficiency
- Model of speedup: serial, parallel and communication components
- What prevents linear speedup?
  - Serial and communication operations
  - Process start-up
  - Imbalanced workloads
  - Architectural limitations
- Analyzing parallel performance
  - Amdahl’s Law
  - Gustafson-Barsis’ Law
Parallel Architectures and Performance Analysis – Slide 36
Topic 2 Summary
Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn
Revision history: last updated 7/28/2011.
Parallel Architectures and Performance Analysis – Slide 37
End Credits