Page 1
THE UNIVERSITY OF TEXAS AT DALLAS Erik Jonsson School of Engineering and Computer Science
© C. D. Cantrell (05/1999)
PARALLEL PROCESSING
• Numerical problems and algorithms
• Impact of Amdahl’s law
• Parallel computer architectures
. Memory architectures
. Interconnection networks
• Flynn’s taxonomy (SIMD, MIMD, etc.)
• Shared-memory multiprocessors
• Distributed-memory, message-passing multicomputers
Page 2
COMPUTATIONAL PROBLEMS
• Numerical simulation has become as important as experimentation
. Permits exploration of a much larger region in parameter space
. All-numerical designs
• The most CPU-intensive application areas include:
. Computational fluid dynamics
. Computational electromagnetics
. Molecular modeling
• Typical problem sizes (in double-precision floating-point operations) and execution times (in CPU-days at 10^9 FLOPS):

    Problem size    Execution time (CPU-days)
    10^14              1.16
    10^15             11.57
    10^16            115.7
    10^17           1157
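The table entries follow from dividing the operation count by the sustained rate; a minimal check in Python (assuming the sustained 10^9 FLOPS stated above):

```python
# Execution time in CPU-days for a given operation count at a
# sustained rate of 1e9 double-precision FLOPS (as in the table).
SECONDS_PER_DAY = 86_400
RATE_FLOPS = 1e9

def cpu_days(operations: float) -> float:
    """Execution time in CPU-days at the assumed sustained rate."""
    return operations / RATE_FLOPS / SECONDS_PER_DAY

for exponent in (14, 15, 16, 17):
    print(f"1e{exponent}: {cpu_days(10**exponent):8.2f} CPU-days")
```

For 10^14 operations this gives 10^5 seconds, i.e. about 1.16 CPU-days, matching the first row.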
Page 3
NUMERICAL ALGORITHMS
• Most of the important CPU-intensive applications involve the solution of a system of coupled partial differential equations
• Initial-value problems
. Time-dependent wave propagation (electromagnetics, geophysics, optical networking)
. Time-dependent dynamics (fluid flow, mechanical engineering)
• Boundary-value problems
. Typically require solution of very large systems of linear equations
. Quasi-static electromagnetic problems
. Static mechanical engineering (finite element analysis)
. Eigenvalue problems (molecular energy surfaces)
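As a toy instance of the boundary-value case, the one-dimensional Poisson problem u″(x) = f(x) with u(0) = u(1) = 0 discretizes into a tridiagonal linear system; a minimal sketch (pure Python, Thomas algorithm; the grid size and right-hand side below are illustrative, not from the slides):

```python
# Discretize u''(x) = f(x), u(0) = u(1) = 0, on n interior points:
# (u[i-1] - 2*u[i] + u[i+1]) / h**2 = f(x_i)  ->  a tridiagonal system.

def solve_poisson_1d(f, n):
    """Solve the tridiagonal system with the Thomas algorithm."""
    h = 1.0 / (n + 1)
    b = [f((i + 1) * h) for i in range(n)]
    diag = [-2.0 / h**2] * n      # main diagonal of tridiag(1, -2, 1)/h^2
    off = 1.0 / h**2              # constant off-diagonal entry
    # Forward elimination
    for i in range(1, n):
        m = off / diag[i - 1]
        diag[i] -= m * off
        b[i] -= m * b[i - 1]
    # Back substitution
    u = [0.0] * n
    u[-1] = b[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (b[i] - off * u[i + 1]) / diag[i]
    return u

# Check against the exact solution of u'' = -2: u(x) = x(1 - x)
u = solve_poisson_1d(lambda x: -2.0, 99)
print(u[49])  # value at x = 0.5; exact answer is 0.25
```

Realistic 2-D and 3-D problems produce the "very large" sparse systems mentioned above, where direct elimination gives way to iterative solvers.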
Page 4
PROPAGATION EQUATIONS FOR 4-WAVE MIXING
• Paraxial wave equation:

\[
\left( \frac{\partial}{\partial z'} + \frac{i\beta_2}{2}\frac{\partial^2}{\partial t'^2} - \frac{\beta_3}{6}\frac{\partial^3}{\partial t'^3} \right) F_n
= -\frac{\alpha}{2} F_n + \frac{16\pi^2 i\,\omega_n \chi^{(3)}}{n_{0,n}^2\, c^2 A_e} \sum_{k,l,m} \mu_{nklm}\, D_{|k|,|l|,|m|}\, F_k \bar{F}_l F_m\, e^{i\,\Delta\beta_{nklm}\, z'}
\]
• Two regimes to study:
. Three strong waves (F_k, F_l, F_m) generate nine weak waves (F_n)
◦ Useful for estimating crosstalk among channels
. Parametric coupling of three strong waves
◦ Leads to coherent amplification of some channels and de-amplification of others
Page 5
© D. M. Hollenbeck and C. D. Cantrell (05/1999)
PULSE PROPAGATION IN AN OPTICAL FIBER (1)
[Figure: surface plot of power (mW, 0–2) vs. frequency (–0.3 to 0.3) and propagation distance (0–40 km)]
Page 6
PULSE PROPAGATION IN AN OPTICAL FIBER (2)
[Figure: surface plot of power (mW, 20–60) vs. time (–3.4 to 3.4 ps) and propagation distance (0–20 km)]
Page 7
[Figure: four processors, each with one or more levels of cache, connected over a single bus to a shared main memory and I/O system]
A centralized, shared-memory multiprocessor
Page 8
[Figure: three two-machine stacks (application, language run-time system, operating system, hardware), showing shared memory implemented at the hardware layer, the operating-system layer, or the language run-time-system layer]
Layers where shared memory can be implemented
Page 9
[Figure: multiple nodes, each a processor + cache with its own memory and I/O, connected by an interconnection network]
A distributed-memory multicomputer
Page 10
Interconnection topologies
[Figure: example topologies — star, full interconnection, tree, ring, grid, double torus, cube, 4-D hypercube]
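Hypercube topologies in particular are convenient to reason about in code: label each node with a d-bit integer, and two nodes are neighbors exactly when their labels differ in one bit. A minimal sketch (helper names are illustrative):

```python
def hypercube_neighbors(node: int, d: int):
    """Neighbors of a node in a d-dimensional hypercube: flip one bit."""
    return [node ^ (1 << i) for i in range(d)]

def hypercube_distance(a: int, b: int) -> int:
    """Minimal hop count = Hamming distance between node labels."""
    return bin(a ^ b).count("1")

# A 4-D hypercube has 16 nodes, each with 4 neighbors, and diameter 4.
d = 4
assert all(len(hypercube_neighbors(n, d)) == d for n in range(2**d))
diameter = max(hypercube_distance(a, b)
               for a in range(2**d) for b in range(2**d))
print(diameter)  # 4
```

The logarithmic diameter (d hops for 2^d nodes) is what makes the hypercube attractive compared with a ring or grid of the same size.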
Page 11
[Figure: a square grid of four four-port switches (A–D) connecting CPU 1 and CPU 2; the front, middle, and end of a packet are spread across the input and output ports of successive switches as it is routed]
Interconnection network consisting of a 4-switch square grid
Page 12
Deadlock in a circuit-switched interconnection network
[Figure: four four-port switches (A–D) with input ports and output buffers; CPUs 1–4 each hold part of a circuit while waiting for a port held by another, so no packet can advance]
Page 13
Two-Hypernode Convex Exemplar System
[Figure: hypernode 0 and hypernode 1, each containing functional blocks and an I/O interface, connected by the CTI ring]
Page 14
Machine           Communication    Interconnection     Processor    Typical remote
                  mechanism        network             count        memory access time
SPARCCenter       Shared memory    Bus                 ≤ 20         1 µs
SGI Challenge     Shared memory    Bus                 ≤ 36         1 µs
Cray T3D          Shared memory    3D torus            32–2048      1 µs
Convex Exemplar   Shared memory    Crossbar + ring     8–64         2 µs
KSR-1             Shared memory    Hierarchical ring   32–256       2–6 µs
CM-5              Message passing  Fat tree            32–1024      10 µs
Intel Paragon     Message passing  2D mesh             32–2048      10–30 µs
IBM SP-2          Message passing  Multistage switch   2–512        30–100 µs

Typical access times to retrieve a word from a remote memory
Page 15
Application   Scaling of        Scaling of                  Scaling of computation-
              computation       communication               to-communication
FFT           (n log n)/p       n/p                         log n
LU            n/p               √n/√p                       √n/√p
Barnes        (n log n)/p       approximately √n(log n)/√p  approximately √n/√p
Ocean         n/p               √n/√p                       √n/√p

Scaling of computation, of communication, and of the ratio are critical factors in determining performance on parallel machines
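For the FFT row, for instance, dividing the computation term by the communication term gives a ratio of log n that is independent of the processor count p. A quick numeric check (a sketch, using the per-processor costs from the table):

```python
import math

def fft_ratio(n: int, p: int) -> float:
    """Computation-to-communication ratio for the FFT row of the table."""
    computation = n * math.log2(n) / p   # (n log n) / p
    communication = n / p                # n / p
    return computation / communication   # = log2(n), independent of p

print(fft_ratio(2**20, 64))   # 20.0 for n = 2^20, any p
```

Because the ratio grows only with n and not with p, adding processors at a fixed problem size makes the FFT progressively more communication-bound.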
Page 16
[Figure: a bus-based system of CPUs, which saturates as processors are added, contrasted with a system whose interconnect grows with the number of CPUs]
Scalable vs. non-scalable systems
Page 17
Amdahl's law for parallel processing
[Figure: total execution time T divided into an inherently sequential part (fraction f, run with 1 CPU active, time fT) and a potentially parallelizable part (fraction 1 – f, run with n CPUs active, time (1 – f)T/n)]
Page 18
AMDAHL'S LAW FOR PARALLEL COMPUTING
[Figure: surface plot of efficiency (0–1.0) vs. log2 n (n = number of processors, 0–10) and f = parallelizable fraction (0–1)]
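With f taken as the parallelizable fraction, Amdahl's law gives speedup 1 / ((1 − f) + f/n) on n processors, and efficiency is speedup divided by n. A minimal numeric sketch:

```python
def speedup(f: float, n: int) -> float:
    """Amdahl's law: f = parallelizable fraction, n = processor count."""
    return 1.0 / ((1.0 - f) + f / n)

def efficiency(f: float, n: int) -> float:
    """Fraction of the n processors' capacity actually used."""
    return speedup(f, n) / n

# Even 95%-parallelizable code saturates quickly as n grows:
for n in (1, 16, 256, 1024):
    print(n, round(speedup(0.95, n), 2), round(efficiency(0.95, n), 3))
# speedup is bounded above by 1 / (1 - f) = 20 no matter how large n gets
```

This bound is exactly why the efficiency surface in the plot collapses toward zero at large processor counts unless f is very close to 1.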
Page 19
THREADS
• Threads are the software equivalent of hardware functional units
• Properties of threads:
. Exist in user process space or kernel space
◦ User threads are faster to launch than processes
. Mappings required for execution: user thread → lightweight process → kernel thread → CPU
• Reference: Bil Lewis and Daniel J. Berg, Threads Primer (SunSoft/Prentice Hall, 1996)
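A minimal sketch of threads cooperating through a shared work queue, using Python's threading and queue modules (illustrative only; note that in CPython the global interpreter lock limits CPU-bound speedup, so threads mainly help I/O-bound work):

```python
import threading
import queue

# Worker threads pull independent tasks from a shared queue.
tasks = queue.Queue()
results = []
results_lock = threading.Lock()   # protects the shared results list

def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work
            break
        with results_lock:        # synchronize access to shared state
            results.append(item * item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(10):
    tasks.put(i)
for _ in threads:                 # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()
print(sorted(results))            # squares of 0..9
```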
Page 20
[Figure: processes P1–P9 arranged in several styles — a pipeline, phases separated by synchronization points, a divide-and-conquer tree, and workers drawing from a shared work queue]
Computational paradigms
Page 21
[Figure: a bus-based snooping multiprocessor — each processor has a cache (tag and data) plus a duplicate snoop tag, all attached to a single bus with memory and I/O]
Page 22
a. Cache state transitions using signals from the processor
[Figure: states Invalid (not a valid cache block), Read Only (clean), and Read/Write (dirty); transitions are driven by processor read misses, processor writes (hit or miss, sending an invalidate on a hit), and processor write misses, with the dirty block written back to memory when it is replaced]

b. Cache state transitions using signals from the bus
[Figure: an invalidate or another processor's write miss for this block (seen on the bus) makes the block Invalid; another processor's read miss or write miss for this block (seen on the bus) causes the old block to be written back]
Page 23
The MESI cache coherence protocol
[Figure: six bus snapshots of three CPUs and memory caching block A]
(a) CPU 1 reads block A — CPU 1's copy is Exclusive
(b) CPU 2 reads block A — both copies become Shared
(c) CPU 2 writes block A — CPU 2's copy becomes Modified; CPU 1's copy is invalidated
(d) CPU 3 reads block A — CPU 2 supplies the block; both copies become Shared
(e) CPU 2 writes block A — CPU 2's copy becomes Modified again
(f) CPU 1 writes block A — CPU 1's copy becomes Modified; CPU 2's copy is invalidated
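The six-step sequence in the figure can be checked with a toy state machine; a minimal sketch (deliberately simplified: one block, three CPUs, bus snooping reduced to function calls, write-back traffic omitted):

```python
# Toy model of MESI state changes for one block cached by three CPUs.
# States: M(odified), E(xclusive), S(hared), I(nvalid).
state = {1: "I", 2: "I", 3: "I"}

def read(cpu):
    if state[cpu] != "I":
        return                           # read hit: no state change
    others = [c for c in state if c != cpu and state[c] != "I"]
    for c in others:                     # snoopers supply data, go Shared
        state[c] = "S"
    state[cpu] = "S" if others else "E"  # alone in the caches -> Exclusive

def write(cpu):
    for c in state:                      # invalidate all other copies
        if c != cpu:
            state[c] = "I"
    state[cpu] = "M"

read(1);  assert state == {1: "E", 2: "I", 3: "I"}   # (a)
read(2);  assert state == {1: "S", 2: "S", 3: "I"}   # (b)
write(2); assert state == {1: "I", 2: "M", 3: "I"}   # (c)
read(3);  assert state == {1: "I", 2: "S", 3: "S"}   # (d)
write(2); assert state == {1: "I", 2: "M", 3: "I"}   # (e)
write(1); assert state == {1: "M", 2: "I", 3: "I"}   # (f)
```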
Page 24
Acquiring and releasing a lock with an atomic swap:
1. Load the lock variable; if it is not unlocked (≠ 0), keep spinning on the load
2. Try to lock the variable using swap: read the lock variable and set it to the locked value (1)
3. If the swap did not return 0, another processor got the lock first: go back to step 1
4. Begin the update of the shared data
5. Finish the update of the shared data
6. Unlock: set the lock variable to 0
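The flowchart amounts to a test-and-test-and-set spinlock. A minimal sketch in Python (the AtomicSwap class is an illustrative stand-in for a hardware swap/exchange instruction; a real spinlock would not be written this way):

```python
import threading

class AtomicSwap:
    """Illustrative stand-in for a hardware atomic swap instruction."""
    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()   # simulates hardware atomicity

    def load(self):
        return self._value

    def swap(self, new):
        with self._guard:
            old = self._value
            self._value = new
            return old

    def store(self, value):
        self._value = value

lock = AtomicSwap()
counter = 0                              # the shared data being updated

def locked_increment(iterations):
    global counter
    for _ in range(iterations):
        while True:
            while lock.load() != 0:      # step 1: spin on a plain load
                pass
            if lock.swap(1) == 0:        # steps 2-3: atomic swap, 0 = success
                break
        counter += 1                     # steps 4-5: critical section
        lock.store(0)                    # step 6: unlock

threads = [threading.Thread(target=locked_increment, args=(1000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

Spinning on the plain load before attempting the swap keeps the lock variable in the local cache while it is held, instead of generating bus traffic on every retry.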
Page 25
Instruction   Data
streams       streams    Name   Examples
1             1          SISD   Classical von Neumann machine
1             Multiple   SIMD   Vector supercomputer, array processor
Multiple      1          MISD   Arguably none
Multiple      Multiple   MIMD   Multiprocessor, multicomputer

Flynn's taxonomy of parallel computers
Page 26
Parallel computer architectures
. SISD (von Neumann)
. SIMD
◦ Vector processor
◦ Array processor
. MISD (?)
. MIMD
◦ Multiprocessors (shared memory): UMA (bus, switched), COMA, NUMA (CC-NUMA, NC-NUMA)
◦ Multicomputers (message passing): MPP (grid, hypercube), COW

A taxonomy of parallel computers