Page 1
ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ([email protected])
ΗΜΥ 656: Advanced Computer Architecture
Spring Semester 2007
Lecture 7: Parallel Computer Systems
Ack: Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler et al., Morgan Kaufmann
Page 2
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
  – Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  – Data access, communication, and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  – Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?
Page 3
Why Study Parallel Architecture?
• Role of a computer architect:
  – To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.
• Parallelism:
  – Provides an alternative to a faster clock for performance
  – Applies at all levels of system design
  – Is a fascinating perspective from which to view architecture
  – Is increasingly central in information processing
Page 4
Why Study it Today?
• History: diverse and innovative organizational structures, often tied to novel programming models
• Rapidly maturing under strong technological constraints
  – The “killer micro” is ubiquitous
  – Laptops and supercomputers are fundamentally similar!
  – Technological trends cause diverse approaches to converge
• Technological trends make parallel computing inevitable
  – In the mainstream
• Need to understand fundamental principles and design tradeoffs, not just taxonomies
  – Naming, ordering, replication, communication performance
Page 5
Inevitability of Parallel Computing
• Application demands: our insatiable need for cycles
  – Scientific computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, TP, ...
• Technology trends
  – Number of transistors on chip growing rapidly
  – Clock rates expected to go up only slowly
• Architecture trends
  – Instruction-level parallelism valuable but limited
  – Coarser-level parallelism, as in MPs, the most viable approach
• Economics
• Current trends:
  – Today’s microprocessors have multiprocessor support
  – Servers and even PCs becoming MP: Sun, SGI, COMPAQ, Dell, ...
  – Tomorrow’s microprocessors are multiprocessors
Page 6
Application Trends
• Demand for cycles fuels advances in hardware, and vice versa
  – Cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder: most demanding applications
• Range of performance demands
  – Need range of system performance with progressively increasing cost
  – Platform pyramid
• Goal of applications in using parallel machines: speedup

  Speedup(p processors) = Performance(p processors) / Performance(1 processor)

• For a fixed problem size (input data set), performance = 1/time, so

  Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors)
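A tiny C illustration of the fixed-problem-size definition above (the timings are hypothetical, purely to show the arithmetic):

    #include <stdio.h>

    /* Speedup(p) = Time(1 processor) / Time(p processors), fixed problem size. */
    static double speedup(double time_1, double time_p) {
        return time_1 / time_p;
    }

    int main(void) {
        /* hypothetical timings: 120 s on 1 processor, 20 s on 8 processors */
        printf("speedup on 8 processors: %.1f\n", speedup(120.0, 20.0)); /* 6.0 */
        return 0;
    }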
Page 7
Scientific Computing Demand
Page 8
Summary of Application Trends
• Transition to parallel computing has occurred for scientific and engineering computing
• Rapid progress in commercial computing
  – Database and transaction processing, as well as financial
  – Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improved throughput on sequential workloads
  – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase
Page 9
Technology Trends
• Commodity microprocessors have caught up with supercomputers.
[Figure: Performance (log scale) versus year, 1965–1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Page 10
Architectural Trends
• Architecture translates technology’s gifts into performance and capability
• Resolves the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  – Helps build intuition about design issues of parallel machines
  – Shows fundamental role of parallelism even in “sequential” computers
• Four generations of architectural history: tube, transistor, IC, VLSI
  – Here we focus only on the VLSI generation
• Greatest delineation in VLSI has been in the type of parallelism exploited
Page 11
Arch. Trends: Exploiting Parallelism
• Greatest trend in the VLSI generation is increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32 bits
    • adoption of 64-bit now under way, 128-bit far off (not a performance issue)
    • great inflection point when 32-bit micro and cache fit on a chip
  – Mid 80s to mid 90s: instruction-level parallelism
    • pipelining and simple instruction sets, plus compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out-of-order execution, speculation, prediction
      – to deal with control transfer and latency problems
  – Next step: thread-level parallelism
Page 12
Phases in VLSI Generation
• How good is instruction-level parallelism?
• Thread-level parallelism needed in microprocessors?
[Figure: Transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970–2005, showing the bit-level, instruction-level, and thread-level (?) parallelism phases, with chips from the i4004 through the i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]
Page 13
Architectural Trends: ILP
• Reported speedups for superscalar processors:
  – Horst, Harris, and Jardine [1990]: 1.37
  – Wang and Wu [1988]: 1.70
  – Smith, Johnson, and Horowitz [1989]: 2.30
  – Murakami et al. [1989]: 2.55
  – Chang et al. [1991]: 2.90
  – Jouppi and Wall [1989]: 3.20
  – Lee, Kwok, and Briggs [1991]: 3.50
  – Wall [1991]: 5
  – Melvin and Patt [1991]: 8
  – Butler et al. [1991]: 17+
• Large variance due to differences in
  – application domain investigated (numerical versus non-numerical)
  – capabilities of the processor modeled
Page 14
ILP Ideal Potential
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – but real caches and non-zero miss latencies
[Figure: Left, fraction of total cycles (%) versus number of instructions issued (0 to 6+); right, speedup versus instructions issued per cycle (0 to 15).]
Page 15
Results of ILP Studies
• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• Recent studies show that for more parallelism, one must look across threads
[Figure: Speedups (1x–4x) for Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, and Melvin_91, comparing one branch unit with real prediction against perfect branch prediction.]
Page 16
Architectural Trends: Bus-based MPs
• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers
[Figure: Number of processors in fully configured commercial shared-memory systems, 1984–1998, for machines including the Sequent B8000 and B2100, Symmetry81 and Symmetry21, Power, SS690MP 120/140, SS10, SS20, SS1000, SS1000E, SE10, SE30, SE60, SE70, SC2000, SC2000E, AS2100, AS8400, HP K400, SGI PowerSeries, SGI Challenge, SGI PowerChallenge/XL, CRAY CS6400, P-Pro, Sun E6000, and Sun E10000.]
Page 17
Bus Bandwidth
[Figure: Shared bus bandwidth (MB/s, log scale from 10 to 100,000) versus year, 1984–1998, for the same bus-based systems as the previous figure.]
Page 18
Economics
• Commodity microprocessors are not only fast but CHEAP
• Development cost is tens of millions of dollars ($5M–$100M typical)
• BUT, many more are sold compared to supercomputers
  – Crucial to take advantage of the investment, and use the commodity building block
  – Exotic parallel architectures are no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs a commodity
• Desktop: a few smaller processors versus one larger one?
  – Multiprocessor on a chip
Page 19
History
• Historically, parallel architectures were tied to programming models
  – Divergent architectures, with no predictable pattern of growth
[Figure: Application software and system software atop divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays.]
• Uncertainty of direction paralyzed parallel software development!
Page 20
Today
• Extension of “computer architecture” to support communication and cooperation
  – OLD: Instruction Set Architecture
  – NEW: Communication Architecture
• Defines
  – Critical abstractions, boundaries, and primitives (interfaces)
  – Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries, and OS are important bridges today
Page 21
History
• “Mainframe” approach:
  – Motivated by multiprogramming
  – Extends crossbar used for memory bandwidth and I/O
  – Originally, processor cost limited it to small scale
    • later, cost of crossbar
  – Bandwidth scales with p
  – High incremental cost; use multistage instead
• “Minicomputer” approach:
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for uniprocessor
  – Bus is bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost
[Diagrams: a crossbar connecting processors, memories, and I/O (mainframe organization); a shared bus with per-processor caches (minicomputer organization).]
Page 22
Example: Intel Pentium Pro Quad
– All coherence and multiprocessing glue in processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth
[Diagram: four P-Pro modules (CPU, 256-KB L2 $, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges to PCI buses with PCI I/O cards.]
Page 23
Example: SUN Enterprise
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
[Diagram: CPU/mem cards (two CPUs with L2 $ and a memory controller) and I/O cards (SBUS slots, 2 FiberChannel, 100bT, SCSI) joined by a bus interface/switch to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
Page 24
Scaling Up
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
  • latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  • Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?
[Diagrams: the “dance hall” organization (processors with caches on one side of the network, memories on the other) versus distributed memory (a memory attached to each processor node).]
Page 25
Example: Cray T3E
– Scales up to 1024 processors, 480 MB/s links
– Memory controller generates communication request for nonlocal references
– No hardware mechanism for coherence (SGI Origin etc. provide this)
[Diagram: each node couples a processor and cache with a memory controller/NI and local memory to a switch in a 3-D (X, Y, Z) network, with external I/O.]
Page 26
Message Passing Architectures
• Complete computer as building block, including I/O
  – Communication via explicit I/O operations
• Programming model:
  – directly access only private address space (local memory)
  – communicate via explicit messages (send/receive)
• High-level block diagram similar to distributed-memory SAS
  – But comm. integrated at I/O level; need not put into memory system
  – Like networks of workstations (clusters), but tighter integration
  – Easier to build than scalable SAS
• Programming model further from basic hardware ops
  – Library or OS intervention
Page 27
Message Passing Abstraction
– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into
– Memory-to-memory copy, but need to name processes
– Optional tag on send and matching rule on receive
– User process names local data and entities in process/tag space too
– In simplest form, the send/recv match achieves a pairwise synch event
  • Other variants too
– Many overheads: copying, buffer management, protection
[Diagram: process P executes Send X, Q, t and process Q executes Receive Y, P, t; the match copies address X in P’s local address space to address Y in Q’s local address space.]
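As a concrete rendering of this abstraction, here is a minimal C sketch using MPI (one realization of the model; the tag plays the matching role of t above):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int tag = 7;          /* matching tag, the "t" in the slide */
        double x = 3.14, y = 0.0;

        if (rank == 0) {
            /* process P: send buffer X to process Q (rank 1) with tag t */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* process Q: receive into storage Y from P (rank 0), matching tag t */
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %f\n", y);
        }
        MPI_Finalize();
        return 0;
    }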
Page 28
Evolution of Message Passing
• Early machines: FIFO on each link
  – Hardware close to programming model
    • synchronous ops
  – Replaced by DMA, enabling non-blocking ops
    • Buffered by system at destination until recv
• Diminishing role of topology
  – Store-and-forward routing: topology important
  – Introduction of pipelined routing made it less so
  – Cost is in node-network interface
  – Simplifies programming
[Diagram: a 3-D hypercube with nodes labeled 000 through 111.]
Page 29
Example: IBM SP-2
– Made out of essentially complete RS6000 workstations
– Network interface integrated in I/O bus (bandwidth limited by I/O bus)
[Diagram: each SP-2 node is a Power 2 CPU with L2 $ on a memory bus to a memory controller and 4-way interleaved DRAM; the NIC (with i860 and DMA) sits on the MicroChannel I/O bus; nodes connect through a general interconnection network formed from 8-port switches.]
Page 30
Example: Intel Paragon
[Diagram: each Paragon node has two i860 processors with L1 $ on a 64-bit, 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and an NI with DMA and driver; nodes attach to every switch of a 2-D grid network with 8-bit, 175 MHz, bidirectional links. Shown: Sandia’s Intel Paragon XP/S-based supercomputer.]
Page 31
Toward Architectural Convergence
• Evolution and role of software have blurred the boundary
  – Send/recv supported on SAS machines via buffers
  – Can construct global address space on MP using hashing
  – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
  – Tighter NI integration even for MP (low-latency, high-bandwidth)
  – At lower level, even hardware SAS passes hardware messages
• Even clusters of workstations/SMPs are parallel systems
  – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
  – Nodes connected by general network and communication assists
  – Implementations also converging, at least in high-end machines
Page 32
Data Parallel Systems
• Programming model:
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element
• Architectural model:
  – Array of many simple, cheap processors with little memory each
    • Processors don’t sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
• Original motivation:
  – Matches simple differential equation solvers
  – Centralizes high cost of instruction fetch and sequencing
[Diagram: a control processor broadcasting instructions to a 2-D grid of PEs.]
Page 33
Application of Data Parallelism
– Each PE contains an employee record with his/her salary

    If salary > 100K then
        salary = salary * 1.05
    else
        salary = salary * 1.10

– Logically, the whole operation is a single step (see the C sketch after this slide)
– Some processors are enabled for the arithmetic operation, others disabled
• Other examples:
  – Finite differences, linear algebra, ...
  – Document searching, graphics, image processing, ...
• Some recent machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – MasPar MP-1 and MP-2
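Rendered as a sequential C loop for clarity (a sketch of the semantics only; on an SIMD machine all elements update in one logical step, with the comparison enabling or disabling each PE):

    /* Each iteration corresponds to one PE's record; the comparison
       decides which PEs are enabled for which multiply. */
    void adjust_salaries(int n, double *salary) {
        for (int i = 0; i < n; i++)
            salary[i] *= (salary[i] > 100000.0) ? 1.05 : 1.10;
    }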
Page 34
Parallel Computer Architectures
• Flynn’s Classification
• Legacy Parallel Computers
• Current Parallel Computers
• Trends in Supercomputers
• Converged Parallel Computer Architecture
Page 35
Parallel Architectures Tied to Programming Models
[Figure: Application software and system software atop divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays.]
• Divergent architectures, no predictable pattern of growth
• Uncertainty of direction paralyzed parallel software development!
Page 36
Flynn’s Classification
• Based on notions of instruction and data streams (1972)
  – SISD (Single Instruction stream over a Single Data stream)
  – SIMD (Single Instruction stream over Multiple Data streams)
  – MISD (Multiple Instruction streams over a Single Data stream)
  – MIMD (Multiple Instruction streams over Multiple Data streams)
• Popularity
  – MIMD > SIMD > MISD
Page 37
SISD (Single Instruction Stream over a Single Data Stream)
• SISD
  – Conventional sequential machines
[Diagram: the instruction stream (IS) flows from CU to PU; the data stream (DS) flows between PU and MU; I/O attaches at the CU. IS: Instruction Stream, DS: Data Stream, CU: Control Unit, PU: Processing Unit, MU: Memory Unit.]
Page 38
SIMD (Single Instruction Stream over Multiple Data Streams)
• SIMD
  – Vector computers
  – Special-purpose computations
[Diagram: SIMD architecture with distributed memory: one CU broadcasts the instruction stream to PE1 ... PEn, each with its own local memory LM1 ... LMn and data stream; the program is loaded from the host, and data sets are loaded from the host. PE: Processing Element, LM: Local Memory.]
Page 39
MISD (Multiple Instruction Streams over a Single Data Stream)
• MISD
  – Processor arrays, systolic arrays
  – Special-purpose computations
[Diagram: MISD architecture (the systolic array): CU1 ... CUn each issue an instruction stream to PU1 ... PUn, while a single data stream passes from one PU to the next; a memory holds program and data, with I/O on the data stream.]
Page 40
MIMD (Multiple Instruction Streams over Multiple Data Streams)
• MIMD
  – General-purpose parallel computers
[Diagram: MIMD architecture with shared memory: CU1/PU1 ... CUn/PUn, each with its own instruction stream, data stream, and I/O, all accessing a shared memory.]
Page 41
Dataflow Architectures
• Represent computation as a graph of essential dependences
  – Logical processor at each node, activated by availability of operands
  – Message (tokens) carrying tag of next instruction sent to next processor
  – Tag compared with others in matching store; a match fires execution
[Figure: dataflow graph for a = (b + 1) × (b − c), d = c × e, f = a × d; machine pipeline: network → token store → waiting/matching → instruction fetch (program store) → execute → form token → network, with a token queue feeding back.]
Page 42
Convergence: General Parallel Architecture
• Node: processor(s), memory system, plus communication assist
  – Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within a framework
  – Integration of assist with node, what operations, how efficiently...
[Diagram: a generic modern multiprocessor: nodes of processor + cache + memory with a communication assist (CA), attached to a scalable network.]
Page 43
Data Flow vs. Control Flow
• Control-flow computer
  – Program control is explicitly directed by the instruction flow in the program
  – Basic components
    • PC (Program Counter)
    • Shared memory
    • Control sequencer
Page 44
Data Flow vs. Control Flow (Cont’d)
• Advantages
  – Full control
  – Complex data and control structures are easily implemented
• Disadvantages
  – Less efficient
  – Difficult to program
  – Difficult to prevent runtime errors
Page 45
Data Flow vs. Control Flow (Cont’d)
• Data-flow computer
  – Execution of instructions is driven by data availability
  – Basic components
    • Data held directly inside instructions
    • Data availability check unit
    • Token matching unit
    • Chain reaction of asynchronous instruction executions
Page 46
Data Flow vs. Control Flow (Cont’d)
• Advantages
  – Very high potential for parallelism
  – High throughput
  – Free from side effects
• Disadvantages
  – Time lost waiting for unneeded arguments
  – High control overhead
  – Difficult to manipulate data structures
Page 47
Dataflow Machine (Manchester Dataflow Computer)
• First actual hardware implementation
[Diagram: tokens from the host pass through an I/O switch into a token queue, a matching unit (with overflow unit), and an instruction store, then to function units func1 ... funck over the network and back. Token: <data, tag, dest, marker>; Match: <tag, dest>.]
Page 48
Execution on Control Flow Machines
[Figure: sequential execution of operations a1/b1/c1 through a4/b4/c4 on a uniprocessor, taking 24 cycles.]
• Assume all the external inputs are available before entering the do loop
• Latencies: + : 1 cycle, * : 2 cycles, / : 3 cycles
Page 49
Execution on a Data Flow Machine
[Figure: data-driven execution of the same computation on a 4-processor dataflow computer in 9 cycles, versus parallel execution on a shared-memory 4-processor system in 7 cycles, where s1 = b1+b2, t1 = s1+b3, s2 = b3+b4, t2 = s1+s2.]
Page 50
Problems
• Excessive copying of large data structures in dataflow operations
  – I-structure: a tagged memory unit for overlapped usage by the producer and consumer
• Retreat from pure dataflow approach (shared memory)
• Handling complex data structures
• Chain reaction control is difficult to implement
  – Complexity of matching store and memory units
• Exposes too much parallelism (?)
Page 51
Convergence of Dataflow Machines
• Converged to use conventional processors and memory
  – Support for large, dynamic set of threads to map to processors
  – Operations have locality across them, useful to group together
  – Typically shared address space as well
  – But separation of programming model from hardware (like data-parallel)
Page 52
Contributions
• Integration of communication with thread (handler) generation
• Tightly integrated communication and fine-grained synchronization
  – Each instruction represents a synchronization operation
  – Absorbs communication latency and minimizes losses due to synchronization waits
Page 53
Systolic Architectures
– Replace single processor with an array of regular processing elements
– Orchestrate data flow for high throughput with less memory access
[Diagram: a conventional organization (memory feeding a single PE) versus a systolic array (memory feeding a chain of PEs).]
• Different from pipelining:
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern
Page 54
Systolic Arrays (Cont)
• Example: systolic array for 1-D convolution

  y(i) = sum over j = 1..k of w(j) * x(i - j)

[Diagram: weights W(1) ... W(k) held in the PEs; x values x(i+1), x(i), x(i-1), ..., x(i-k) stream one way while partial results y(i), y(i+1), ..., y(i+k), y(i+k+1) stream the other.]
– Practical realizations (e.g. iWARP) use quite general processors
  • Enable variety of algorithms on same hardware
– But dedicated interconnect channels
  • Data transfer directly from register to register across channel
– Specialized, and same problems as SIMD
  • General-purpose systems work well for same algorithms (locality etc.)
Page 55
Systolic Architectures
• Orchestrate data flow for high throughput with less memory access
• Different from pipelining
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD
  – Each PE may do something different
• Initial motivation
  – VLSI enables inexpensive special-purpose chips
  – Represent algorithms directly by chips connected in regular pattern
Page 56
Systolic Architectures
[Diagram: conventional organization (memory and a single PE) versus a systolic array (memory and a chain of PEs).]
• Replace a processing element (PE) with an array of PEs without increasing I/O bandwidth
Page 57
Two Communication Styles
[Diagram: memory communication: CPUs exchange data through their local memories; systolic communication: CPUs pass data directly to one another, each still having a local memory.]
Page 58
Characteristics
• Practical realizations (e.g. Intel iWARP) use quite general processors
  – Enable variety of algorithms on same hardware
• But dedicated interconnect channels
  – Data transfer directly from register to register across channel
• Specialized, and same problems as SIMD
  – General-purpose systems work well for same algorithms (locality etc.)
Page 59
Matrix Multiplication

  y(i) = sum over j = 1..n of a(i,j) * x(j),  i = 1, ..., n

Recursive algorithm:

    for i = 1 to n
        y(i,0) = 0
        for j = 1 to n
            y(i,0) = y(i,0) + a(i,j) * x(j,0)

Use the following PE:
[Diagram: a PE holding weight w and state x, with inputs xin, yin and outputs xout, yout, computing: xout = x; x = xin; yout = yin + w * xin.]
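A C sketch of one PE step as defined above (the names PE and pe_step are illustrative, not from the slides):

    /* State held in each PE: its weight w and the latched input x. */
    typedef struct { double w; double x; } PE;

    /* One clock step: pass the previously latched x along, latch the
       new input, and emit the updated partial sum. */
    static double pe_step(PE *pe, double x_in, double y_in, double *x_out) {
        *x_out = pe->x;              /* xout = x             */
        pe->x  = x_in;               /* x = xin              */
        return y_in + pe->w * x_in;  /* yout = yin + w * xin */
    }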
Page 60
Systolic Array Representation of Matrix Multiplication
[Diagram: x1 ... x4 stream into a linear array of four PEs while y1 ... y4, each initialized to 0, accumulate as they pass through.]
Page 61
Example of Convolution

  y(i) = w1 * x(i) + w2 * x(i+1) + w3 * x(i+2) + w4 * x(i+3),  y(i) initialized as 0

[Diagram: weights w4, w3, w2, w1 held in the PEs; the x stream (x1, x3, x5, x7 interleaved with x2, x4, x6, x8) flows one way while y1, y2, y3 accumulate the other way. Each PE computes: xout = x; x = xin; yout = yin + w * xin.]
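For reference, the same convolution written directly in C (a sketch; the systolic array computes it without the explicit inner loop, by streaming x and y through the PEs):

    /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3), 0-based;
       the caller must supply x with at least n+3 elements. */
    void conv4(int n, const double w[4], const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;                 /* y(i) is initialized as 0 */
            for (int j = 0; j < 4; j++)
                y[i] += w[j] * x[i + j];
        }
    }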
Page 62
Data Parallel Systems
• Programming model
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element
Page 63
Data Parallel Systems (Cont’d)
• SIMD architectural model
  – Array of many simple, cheap processors with little memory each
    • Processors don’t sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
Page 64
Evolution & Convergence
• Popular due to cost savings of centralized sequencer
  – Replaced by vector processors in the mid-70s
  – Revived in the mid-80s when a 32-bit datapath slice fit on a chip
  – Old machines
    • Thinking Machines CM-1, CM-2
    • MasPar MP-1 and MP-2
Page 65
Evolution & Convergence (Cont’d)
• Drawbacks
  – Low applicability
  – Simple, regular applications can do well anyway on MIMD
• Convergence: SPMD (Single Program Multiple Data)
  – Fast global synchronization is needed
Page 66
Vector Processors
• Merits of vector processors
  – Very deep pipeline without data hazards
    • The computation of each result is independent of the computation of previous results
  – Instruction bandwidth requirement is reduced
    • A vector instruction specifies a great deal of work
  – Control hazards are nonexistent
    • A vector instruction represents an entire loop
    • No loop branch
Page 67
Vector Processors (Cont’d)
  – The high latency of initiating a main memory access is amortized
    • A single access is initiated for the entire vector rather than a single word
  – Known access pattern
  – Interleaved memory banks
• Vector operations are faster than a sequence of scalar operations on the same number of data items!
Page 68
Vector Programming Example (Y = a * X + Y)

RISC machine:

          LD    F0, a
          ADDI  R4, Rx, #512    ; last address to load (8 * 64)
    Loop: LD    F2, 0(Rx)       ; load X(i)
          MULTD F2, F0, F2      ; a x X(i)
          LD    F4, 0(Ry)       ; load Y(i)
          ADDD  F4, F2, F4      ; a x X(i) + Y(i)
          SD    F4, 0(Ry)       ; store into Y(i)
          ADDI  Rx, Rx, #8      ; increment index to X
          ADDI  Ry, Ry, #8      ; increment index to Y
          SUB   R20, R4, Rx     ; compute bound
          BNZ   R20, Loop       ; check if done (repeats 64 times)
Page 69
Vector Programming Example (Cont’d) (Y = a * X + Y)

Vector machine:

    LD     F0, a        ; load scalar a
    LV     V1, Rx       ; load vector X
    MULTSV V2, F0, V1   ; vector-scalar multiply
    LV     V3, Ry       ; load vector Y
    ADDV   V4, V2, V3   ; add
    SV     Ry, V4       ; store the result

6 instructions (low instruction bandwidth)
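The same DAXPY computation in plain C, to make the contrast concrete (each iteration is independent, which is exactly what the six vector instructions above exploit):

    /* Y = a*X + Y over n elements (n = 64 in the example above). */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }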
Page 70
Basic Vector Architecture
• Vector-register processor
  – All vector operations except load and store are among the vector registers
  – The major vector computers
• Memory-memory vector processor
  – All vector operations are memory to memory
  – The first vector computer
Page 71
A Vector-Register Architecture (DLXV)
[Diagram: main memory feeds a vector load-store unit; vector registers and scalar registers connect through crossbars to pipelined functional units such as FP add/subtract.]
Page 72
Vector Machines

  Machine        | Registers | Elements per register | Load/Store units | Functional units
  ---------------|-----------|-----------------------|------------------|-----------------
  CRAY-1         | 8         | 64                    | 1                | 6
  CRAY-2         | 8         | 64                    | 1                | 5
  CRAY X-MP      | 8         | 64                    | 2 Ld / 1 St      | 8
  CRAY Y-MP      | 8         | 64                    | 2 Ld / 1 St      | 8
  CRAY C-90      | 8         | 128                   | 4                | 8
  NEC SX/2       | 8 + 8192  | 256                   | 8                | 16
  NEC SX/4       | 8 + 8192  | 256                   | 8                | 16
  Fujitsu VP200  | 8 - 256   | 32 - 1024             | 2                | 3
  Hitachi S820   | 32        | 256                   | 4                | 4
  Convex C-1     | 8         | 128                   | 1                | 4
Page 73
Convoy
• Convoy
  – The set of vector instructions that could begin execution in one clock period
  – The instructions in a convoy must not contain any structural or data hazards
• Chime
  – An approximate measure of execution time for a vector sequence (roughly one chime per convoy)
Page 74
Convoy Example

    LV     V1, Rx       ; load vector X
    MULTSV V2, F0, V1   ; vector-scalar multiply
    LV     V3, Ry       ; load vector Y
    ADDV   V4, V2, V3   ; add
    SV     Ry, V4       ; store the result

Convoys:
  1. LV
  2. MULTSV  LV
  3. ADDV
  4. SV

Chime = 4
Page 75
Strip Mining
• Strip mining: when a vector is longer than the vector registers, segment the long vector into fixed-length segments (size = MVL, the maximum vector length).
• Total execution time:

  Tn = ceil(n / MVL) * (Tloop + Tstart) + n * Tchime

  Ex) T200 = ceil(200/64) * (15 + 49) + 200 * 4 = 4 * 64 + 800 = 1056 cycles
Page 76
Strip Mining Example

Original loop:

    do 10 i = 1, n
10      Y(i) = a * X(i) + Y(i)

Strip-mined version:

    low = 1
    VL = (n mod MVL)             ; find the odd-size piece
    do 1 j = 0, (n/MVL)          ; outer loop
        do 10 i = low, low+VL-1  ; runs for length VL
            Y(i) = a*X(i) + Y(i) ; main operation
10      continue
        low = low + VL           ; start of next vector
        VL = MVL                 ; reset the length to max
1   continue
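The same strip-mined loop in C (a sketch with 0-based indexing; MVL = 64 as in the T200 example; the name daxpy_strips is illustrative):

    #define MVL 64  /* maximum vector length */

    /* Strip-mined Y = a*X + Y: the first strip handles the odd-size
       piece (n mod MVL); every later strip is a full MVL elements. */
    void daxpy_strips(int n, double a, const double *x, double *y) {
        int low = 0;
        int vl = n % MVL;                        /* find the odd-size piece */
        for (int j = 0; j <= n / MVL; j++) {     /* outer loop */
            for (int i = low; i < low + vl; i++)
                y[i] = a * x[i] + y[i];          /* main operation */
            low += vl;                           /* start of next strip */
            vl = MVL;                            /* reset the length to max */
        }
    }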
Page 77
Performance Enhancement Techniques
• Chaining
• Conditional execution
• Sparse matrices
Page 78
Chaining
• Chaining
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available
  – Permits the operations to be scheduled in the same convoy and reduces the number of chimes required
Page 79
Chaining Example

  Unit                 | Start-up overhead
  ---------------------|------------------
  Load and store unit  | 12 cycles
  Multiply unit        | 7 cycles
  Add unit             | 6 cycles

Vector sequence:

    MULTV V1, V2, V3
    ADDV  V4, V1, V5

Unchained, the ADDV must wait for the entire MULTV to complete:
  Total = 7 + 64 + 6 + 64 = 141 cycles

Chained, the ADDV starts as soon as the first MULTV result is available:
  Total = 7 + 64 + 6 = 77 cycles
Page 80
Conditional Execution
• Vector mask register
  – Any vector instruction executed operates only on the vector elements whose corresponding entries in the vector-mask register are 1
  – Requires execution time even when the condition is not satisfied
• The elimination of a branch and the associated control dependences can make a conditional instruction faster
Page 81
Conditional Execution Example

    do 10 i = 1, 64
        if (A(i) .ne. 0) then
            A(i) = A(i) - B(i)
        endif
10  continue

    LV    V1, Ra      ; load vector A into V1
    LV    V2, Rb      ; load vector B
    LD    F0, #0      ; load FP zero into F0
    SNESV F0, V1      ; sets the VM to 1 if V1(i) != F0
    SUBV  V1, V1, V2  ; subtract under the VM
    CVM               ; set the VM to all 1s
    SV    Ra, V1      ; store the result in A
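A scalar C rendering of the masked sequence (a sketch; SNESV builds the mask and SUBV applies it to all 64 elements at once, without a branch per element):

    /* A(i) = A(i) - B(i) only where A(i) != 0, i.e. where the mask is 1. */
    void masked_sub(int n, double *a, const double *b) {
        for (int i = 0; i < n; i++) {
            int vm = (a[i] != 0.0);   /* SNESV: set mask element    */
            if (vm)
                a[i] = a[i] - b[i];   /* SUBV under the vector mask */
        }
    }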
Page 82
Sparse Matrix
• Sparse matrix
  – Only a small number of elements are non-zero
• Scatter-gather operations using index vectors
  – Move between a dense representation and the normal representation of a sparse matrix
  – Gather
    • Build a dense representation
  – Scatter
    • Return to the sparse representation
Page 83
Sparse Matrix Example

    do 10 i = 1, n
10      A(K(i)) = A(K(i)) + C(M(i))

    LV   Vk, Rk        ; load K
    LVI  Va, (Ra+Vk)   ; load A(K(i)) - gather
    LV   Vm, Rm        ; load M
    LVI  Vc, (Rc+Vm)   ; load C(M(i)) - gather
    ADDV Va, Va, Vc    ; add them
    SVI  (Ra+Vk), Va   ; store A(K(i)) - scatter

LVI : Load Vector Indexed
SVI : Store Vector Indexed
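The equivalent scalar C loop (a sketch: the indexed loads and stores that LVI/SVI perform as gathers and scatters are here just explicit subscripting):

    /* A(K(i)) = A(K(i)) + C(M(i)) with 0-based index vectors k and m. */
    void sparse_add(int n, double *a, const double *c,
                    const int *k, const int *m) {
        for (int i = 0; i < n; i++)
            a[k[i]] += c[m[i]];  /* gather a[k[i]] and c[m[i]], add, scatter */
    }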
Page 84
Current Parallel Computer Architectures

MIMD
• Multiprocessor (shared address space)
  – PVP
  – SMP
  – DSM
• Multicomputer (message passing)
  – MPP
  – Constellation
  – Cluster
Page 85
Programming Models
• What does the programmer use in coding applications?
• Specifies communication and synchronization
• Classification
  – Uniprocessor model: von Neumann model
  – Multiprogramming: no communication or synchronization at program level
    • (ex) CAD
  – Shared address space
    • Symmetric multiprocessor model
    • CC-NUMA model
  – Message passing
  – Data parallel
Page 86
Communication Architecture
• User/system interface
  – Communication primitives exposed at user level realize the programming model
• Implementation
  – How to implement primitives: HW or OS
  – How optimized are they?
  – Network structure
Page 87
Communication Architecture (Cont’d)
• Goals
  – Performance and cost
  – Programmability
  – Scalability
  – Broad applicability
Page 88
Shared Address Space Architecture
• Any processor can directly reference any memory location
  – Communication occurs implicitly through loads and stores
• Natural extension of the uniprocessor model
  – Location transparency
  – Good throughput on multiprogrammed workloads
• OS uses shared memory to coordinate processes
• Shared memory multiprocessors
  – SMP: every processor has equal access to the shared memory, the I/O devices, and the OS services. UMA architecture
  – NUMA: distributed shared memory
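A minimal pthreads sketch of this model (illustrative, not from the slides): the threads communicate implicitly through ordinary loads and stores to a shared location, with a lock used only for synchronization.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared location: every thread reaches it with ordinary loads/stores. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* synchronization is explicit   */
            counter++;                    /* communication is a load+store */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* 400000 */
        return 0;
    }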
Page 89
Shared Address Space Model
[Figure: virtual address spaces of processes P0 ... Pn, each with a private portion and a shared portion mapping to common physical addresses in the machine physical address space; a store by one process to a shared address is seen by a load in another.]
Page 90
SAS History
• “Mainframe” approach
  – Motivated by multiprogramming
  – Extends crossbar used for memory bandwidth and I/O
  – Bandwidth scales with p
  – High incremental cost; use multistage instead
Page 91
Crossbar Switch
[Diagram: a crossbar connecting four memories with two processor+cache units and their I/O ports.]
Page 92
Minicomputer
• “Minicomputer” approach
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for uniprocessor
  – Bus is bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost
Page 93
Bus Connection
[Diagram: four memories and two processor+cache units with I/O, all sharing a single bus.]
Page 94
Scaling Up
• Problem is interconnect: cost (crossbar) or bandwidth (bus)
• Dance-hall: bandwidth still scalable, but lower cost than crossbar
  – Latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
  – Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
• Caching shared (particularly nonlocal) data?
Page 95
Organizations
[Diagrams: “dance hall” organization: processor+cache nodes on one side of the network, memories on the other; distributed memory organization: each processor+cache node has its own memory, all joined by the network.]
Page 96
Multiprocessors (Shared Address Space Architecture)
• PVP (Parallel Vector Processor)
  – A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
  – A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
  – Similar to SMP
  – The memory is physically distributed among nodes
Page 97
PVP (Parallel Vector Processor)
[Diagram: vector processors (VP) connected through a crossbar switch to shared memories (SM). VP: Vector Processor, SM: Shared Memory.]
Page 98
SMP (Symmetric Multi-Processor)
[Diagram: microprocessor+cache units (P/C) connected through a bus or crossbar switch to shared memories (SM). P/C: Microprocessor and Cache.]
Page 99
DSM (Distributed Shared Memory)
[Diagram: nodes of P/C with local memory (LM), NIC, and cache directory (DIR) on a memory bus (MB), joined by a custom-designed network. DIR: Cache Directory.]
Page 100
MPP (Massively Parallel Processing)
[Diagram: nodes of P/C, local memory (LM), and NIC on a memory bus (MB), joined by a custom-designed network. MB: Memory Bus, NIC: Network Interface Circuitry.]
Page 101
Cluster
[Diagram: nodes of P/C and memory (M) on a memory bus, with a bridge to an I/O bus (IOB) holding the NIC and a local disk (LD), joined by a commodity network (Ethernet, ATM, Myrinet, VIA). LD: Local Disk, IOB: I/O Bus.]
Page 102
Constellation
[Diagram: each node is an SMP with >= 16 P/Cs, shared memories, a hub, an I/O controller (IOC), NIC, and local disk (LD); nodes are joined by a custom or commodity network. IOC: I/O Controller.]
Page 103
Trend in Parallel Computer Architectures
[Figure: number of HPCs (0–400) per year, 1997–2002, for MPPs, constellations, clusters, and SMPs.]
Page 104
Food-Chain of High-Performance Computers
Page 105
Converged Architecture of Current Supercomputers
[Diagram: multiple multiprocessor nodes, each with four processors sharing a memory, joined by an interconnection network.]