Feb 05, 2016
9/18/2006 ELEG652-06F 1
Topic 2
Vector Processing & Vector Architectures
Lasciate Ogne Speranza, Voi Ch’Intrate
(Dante’s Inferno)
Reading List
• Slides: Topic2x
• Henn&Patt: Appendix G
• Other assigned readings from homework and classes
Vector Architectures
Types: Register-Register Archs, Memory-Memory Archs
Vector Arch Components:
• Vector Register Banks: capable of holding n vector elements each, plus two extra registers (vector length and vector mask)
• Vector Functional Units: fully pipelined, with hazard detection (structural and data)
• Vector Load-Store Unit
• A Scalar Unit: a set of registers, FUs and CUs
An Intro to DLXV
• A simplified vector architecture
• Consists of one lane per functional unit
– Lane: a parallel pipeline within a functional unit; the number of lanes determines how many vector elements the unit can process in parallel
• Loosely based on the Cray-1 architecture and ISA
• An extension of the DLX ISA for vector architectures
DLXV Configuration
• Vector Registers
– Eight vector regs, 64 elements each
– Two read ports and one write port per register
– Sixteen read ports and eight write ports in total
• Vector Functional Units
– Five functional units
• Vector Load and Store Unit
– A bandwidth of 1 word per cycle
– Doubles as the scalar load / store unit
• A set of scalar registers
– 32 general and 32 FP regs
A Vector / Register Arch
[Block diagram: main memory feeds the vector load & store unit; FP add, FP multiply, FP divide, logical, and integer functional units connect to the scalar register file and the vector register file.]
Advantages
• A single vector instruction specifies a lot of work
• No data hazards within a vector instruction
– No need to check for data hazards inside vector instructions
– Parallelism inside the vector operation: a deep pipeline or an array of processing elements
• Known access pattern
– Latency is paid only once per vector (pipelined loading)
– Memory addresses can be mapped to memory modules to reduce contention
• Reduction in code size and simplification of hazards
– Loop-related control hazards are eliminated
DAXPY: DLX Code
Y = a * X + Y

      LD    F0, a         ; load scalar a
      ADDI  R4, Rx, #512  ; last address to load
Loop: LD    F2, 0(Rx)     ; load X(i)
      MULTD F2, F0, F2    ; a x X(i)
      LD    F4, 0(Ry)     ; load Y(i)
      ADDD  F4, F2, F4    ; a x X(i) + Y(i)
      SD    F4, 0(Ry)     ; store into Y(i)
      ADDI  Rx, Rx, #8    ; increment index to X
      ADDI  Ry, Ry, #8    ; increment index to Y
      SUB   R20, R4, Rx   ; compute bound
      BNZ   R20, Loop     ; check if done

The index-update, bound, and branch instructions (ADDI, SUB, BNZ) are loop overhead: index calculation and branching.
DAXPY: DLXV Code
Y = a * X + Y

LD     F0, a      ; load scalar a
LV     V1, Rx     ; load vector X
MULTSV V2, F0, V1 ; vector-scalar multiply
LV     V3, Ry     ; load vector Y
ADDV   V4, V2, V3 ; add
SV     Ry, V4     ; store the result

Instruction bandwidth for 64 elements:
DLX code: 578 instructions
DLXV code: 6 instructions
Dead Time
The time it takes for the pipeline to become ready for the next vector instruction.
Issues
• Vector Length Control
– Application vector lengths are rarely equal to, or even a multiple of, the hardware vector length
• Vector Stride
– Accesses to a vector may not be consecutive in memory
• Solutions: two special registers
– One for the vector length, up to a maximum vector length (MVL)
– One for the vector mask
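The effect of the two registers can be modeled in scalar C: the vector length register bounds the trip count, and the mask register predicates each element. This is an illustrative sketch; vadd_masked and MVL are our names, not DLXV mnemonics:

```c
#include <assert.h>
#include <stdint.h>

#define MVL 64   /* maximum hardware vector length (64 in DLXV) */

/* Masked vector add: only the first vl elements participate, and only
   those whose mask bit is set are written. */
static void vadd_masked(double *dst, const double *a, const double *b,
                        int vl, uint64_t mask)
{
    if (vl > MVL) vl = MVL;            /* hardware limit */
    for (int i = 0; i < vl; ++i)
        if (mask & (1ULL << i))
            dst[i] = a[i] + b[i];
}
```

A real vector unit would do this in one instruction, with the loop body replicated across lanes.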
Vector Length Control
An example:

for(i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];

Question: Assume the maximum hardware vector length is MVL, which may be less than n. How should we perform this computation?
Vector Length Control: Strip Mining

Original Code:

for(i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];

Strip-Mined Code:

Low = 0;
VL = n % MVL;               /* odd-sized first strip */
if(VL == 0) VL = MVL;
for(j = 0; j < (n + MVL - 1) / MVL; ++j){
    for(i = Low; i < Low + VL; ++i)
        y[i] = a * x[i] + y[i];
    Low += VL;
    VL = MVL;               /* all later strips are full length */
}
Vector Length Control: Strip Mining

For a vector of arbitrary length n, let M = n mod MVL.

Value of j:  0          1              2                    ...  n/MVL
Value of i:  0 .. M-1   M .. M+MVL-1   M+MVL .. M+2*MVL-1   ...  n-MVL .. n-1

The vector length control register takes values similar to the Vector Length variable (VL) in the C code: M for the first strip, MVL for all the rest.
Vector Stride

Matrix multiply code:

for(i = 0; i < n; ++i)
    for(j = 0; j < n; ++j){
        c[i][j] = 0.0;
        for(k = 0; k < n; ++k)
            c[i][j] += a[i][k] * b[k][j];
    }

How do we vectorize this code? How does stride work here? In C, arrays are stored in memory row-wise, so a and c are accessed consecutively. How about b, which is traversed down a column?
Vector Stride
• Stride for a and c: 1
– Also called unit stride
• Stride for b: n elements
• Use special instructions
– Load and store with stride (LVWS, SVWS)
• With memory banks, strides can complicate access patterns == contention
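For a row-major n-by-n array, the strides above can be checked with a flat-index calculation; a small sketch (the helper names are ours):

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of element (i,j) in a row-major n-by-n array. */
static size_t flat(size_t n, size_t i, size_t j) { return i * n + j; }

/* Stride (in elements) between consecutive accesses along k. */
static size_t stride_a(size_t n, size_t i, size_t k)   /* a[i][k], k varies */
{ return flat(n, i, k + 1) - flat(n, i, k); }

static size_t stride_b(size_t n, size_t k, size_t j)   /* b[k][j], k varies */
{ return flat(n, k + 1, j) - flat(n, k, j); }
```

For n = 64 and 8-byte doubles, b's stride of 64 elements is 512 bytes per access: exactly the case LVWS/SVWS exist for.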
Vector Stride & Memory Systems

Example.
Memory system: eight memory banks, busy time 6 cycles, memory latency 12 cycles.
Vector: 64 elements, strides of 1 and 32.

Time to complete a load:
Stride 1: 12 + 64 = 76 cycles, or about 1.2 cycles per element
Stride 32: 12 + 1 + 6 * 63 = 391 cycles, or about 6.1 cycles per element
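These counts can be reproduced with a small bank model: with stride 32 every access maps to the same bank (32 mod 8 = 0), so each access after the first must wait out the 6-cycle busy time. A hedged sketch; the one-request-per-cycle issue model is our assumption, not a Cray specification:

```c
#include <assert.h>

/* Model: one request issued per cycle (starting at cycle 1) unless the
   target bank is still busy; each word arrives `latency` cycles after
   its request is issued. Returns the cycle when the last word arrives. */
static long load_cycles(long n, long stride, long banks, long busy, long latency)
{
    long next_free[64] = {0};   /* cycle at which each bank is free again */
    long slot = 1, finish = 0;
    for (long i = 0; i < n; ++i) {
        long bank = (i * stride) % banks;
        long issue = slot > next_free[bank] ? slot : next_free[bank];
        next_free[bank] = issue + busy;
        if (issue + latency > finish) finish = issue + latency;
        slot = issue + 1;
    }
    return finish;
}
```

load_cycles(64, 1, 8, 6, 12) gives 76 and load_cycles(64, 32, 8, 6, 12) gives 391, matching the numbers above.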
Cray-1
“The World’s Most Expensive Love Seat...”
Picture courtesy of Cray. Original Source: Cray-1 Computer System Hardware Reference Manual
Cray-1 Data Sheet
• Designed by: Seymour Cray
• Price: 5 to 8.8 million dollars
• Units Shipped: 85
• Technology: SIMD, deeply pipelined functional units
• Performance: up to 160 MFLOPS
• Date Released: 1976
• Best Known for: being the computer that made the term “supercomputer” mainstream
Architectural Components
• Computation Section: registers, functional units, instruction buffers
• Memory Section: from 0.25 to 1 million 64-bit words
• I/O Section: 12 input channels and 12 output channels; MCU; mass storage subsystem; front-end computers / I/O stations; peripheral equipment
The Cray-1 Architecture
Computation section: vector components, scalar components, and address & instruction calculation components.
Architecture Features
• 64-bit word
• 12.5 nanosecond clock period
• 2’s complement arithmetic
• Scalar and vector processing modes
• Twelve functional units
• Eight 24-bit address (A) registers
• Sixty-four 24-bit intermediate address (B) registers
• Eight 64-bit scalar (S) registers
• Sixty-four 64-bit intermediate scalar (T) registers
• Eight 64-element vector (V) registers, 64 bits per element
• Four instruction buffers of 64 16-bit parcels each
• Integer and floating-point arithmetic
• 128 instruction codes
Architecture Features
• Up to 1,048,576 words of memory
– 64 data bits and 8 error-correction bits (SECDED)
• Eight or sixteen banks of 65,536 words each
• Bank busy time: 4 cycles
• Transfer rate:
– B, T, V registers: one word per cycle
– A, S registers: one word per two cycles
– Instruction buffers: four words per clock cycle
• Twelve input and twelve output channels
• Lost-data detection
• Channel groups
– Each contains either six input or six output channels
– Served equally by memory (scanned every 4 cycles)
– Priority resolved within the group
– Sixteen data bits, 3 control bits, and 4 parity bits per channel
Original Source: Cray 1 Computer System Hardware Reference Manual
Register-Register Architecture
• All ALU operands are in registers
• Registers are specialized by function (A, B, T, etc.), thus avoiding conflicts
• Transfers between memory and registers are treated differently from ALU operations
• A RISC-like idea
• Effective use of the Cray-1 requires careful planning to exploit its register resources
– 4 Kbytes of high-speed registers
Registers
• Primary registers: directly addressable by the functional units; named V (vector), S (scalar) and A (address)
• Intermediate registers: used as buffers for the functional units; named T (scalar transport) and B (address buffering)
Data path: M <-> IR <-> PR <-> FU (memory, intermediate registers, primary registers, functional units)
Registers
• Memory access time: 11 cycles
• Register access time: 1 ~ 2 cycles
• Primary registers:
– Address regs: 8 x 24 bits
– Scalar regs: 8 x 64 bits
– Vector regs: 8 x 64 words
• Intermediate registers:
– B regs: 64 x 24 bits
– T regs: 64 x 64 bits
• Special registers:
– Vector Length register: 0 <= VL <= 64
– Vector Mask register: 64 bits
• Total size: 4,888 bytes
Instruction Format
A parcel = 16 bits. An instruction word is 16 bits (one parcel) or 32 bits (two parcels), according to type.

One-parcel instruction (arithmetic / logical instruction word), bit fields 4 | 3 | 3 | 3 | 3:
op code, result register, operand register, operand register

Two-parcel instruction (memory instruction word), bit fields 4 | 3 | 3 | 22:
op code, address index register, result register, 22-bit address
Functional Unit Pipelines

Functional pipeline                    Register usage   Pipeline delay (clock periods)
Address add unit                       A                2
Address multiply unit                  A                6
Scalar add unit                        S                3
Scalar shift unit                      S                2 or 3
Scalar logical unit                    S                1
Population/leading-zero count unit     S                3
Vector add unit                        V or S           3
Vector shift unit                      V or S           4
Vector logical unit                    V or S           2
Floating-point add unit                S and V          6
Floating-point multiply unit           S and V          7
Reciprocal approximation unit          S and V          14
Vector Units
• Vector addition / subtraction
– Functional unit delay: 3
• Vector logical unit
– Boolean ops between the 64-bit elements of the vectors: AND, XOR, OR, MERGE, MASK GENERATION
– Functional unit delay: 2
• Vector shift
– Shifts the 64-bit (or 128-bit) elements of a vector
– Functional unit delay: 4
Instruction Set
• 128 Instructions
• Ten Vector Types
• Thirteen Scalar Types
• Three Addressing Modes
Implementation Philosophy
• Instruction processing
– Instruction buffering: four instruction buffers of 64 16-bit parcels each
• Memory hierarchy
– Memory banks, T and B register banks
• Register and functional unit reservation
– Example: for vector ops, the register operands, the register result and the FU are marked as reserved
• Vector processing
Instruction Processing
“Issue one instruction per cycle”
• 4 buffers x 64 parcels
• 16- or 32-bit instructions
• Instruction parcel pre-fetch
• Branches within a buffer
• 4 inst/cycle fetched into the LRU I-buffer
Special resources: the P, NIP, CIP and LIP registers
Reservations
• Vector operands, results and the functional unit are marked reserved
• The vector result reservation is lifted when the chain slot time has passed
– Chain slot: functional unit delay plus two clock cycles

Examples:
V1 = V2 * V3 ; V4 = V5 + V6 : independent (different units, no shared registers)
V1 = V2 * V3 ; V4 = V5 + V2 : the second cannot begin until the first finishes (V2 is reserved)
V1 = V2 * V3 ; V4 = V5 * V6 : ditto (the multiply unit is reserved)
Vector Instructions in the Cray-1
[Figure:
(a) Type 1: Vi <- Vj op Vk (vector-vector, element by element)
(b) Type 2: Vi <- Sj op Vk (a scalar operand combined with each element of Vk)
(c) Type 3: Vi <- memory (vector load)
(d) Type 4: memory <- Vj (vector store)]
Vector Loops
• Long vectors with N > 64 are sectioned
• Each pass through the vector loop processes 64 elements
• Remainder handling is “transparent” to the programmer
Vector Chaining
• Similar to the internal forwarding techniques of the IBM 360/91
• A “linking process” that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe
• Chaining allows a dependent operation to be issued as soon as the first result of its producer becomes available
• Registers / functional units must be properly reserved
• Limited by the number of vector registers and functional units
• Chains of from 2 to 5 operations
Chaining Example
Mem -> V0 (memory fetch)
V0 + V1 -> V2 (vector add)
V2 < A3 -> V3 (left shift)
V3 ^ V4 -> V5 (logical product)
[Figure: the memory-fetch, vector-add, shift and logical-product pipes form a chain; each unit begins as soon as the first element of its operand arrives (its chain slot), so fetching and adding proceed concurrently.]
Multipipeline Chaining: SAXPY Code
Y(1:N) = A x X(1:N) + Y(1:N)
[Figure: limited chaining using only one memory-access pipe in the Cray-1: Load X, the scalar multiply (S*), Load Y, the vector add and Store Y must share the single read/write port, so the chain is repeatedly broken. Complete chaining using three memory-access pipes in the Cray X-MP: with two read ports (Load X, Load Y) and one write port (Store Y), the multiply and add pipes chain with all three memory accesses.]
Cray-1 Performance
• 3 to 160 MFLOPS
– Depending on the application and on programming skill
• Scalar performance: 12 MFLOPS
• Vector dot product: 22 MFLOPS
• Peak performance: 153 MFLOPS
Cray X-MP Data Sheet
• Designed by: Steve Chen
• Price: 15 million dollars
• Units Shipped: N/A
• Technology: SIMD, deeply pipelined functional units
• Performance: up to 200 MFLOPS (for a single CPU)
• Date Released: 1982
• Best Known for: being the successor of the Cray-1 and the first parallel vector computer from Cray Research

An NSA CRAY X-MP/24, on exhibit at the National Cryptologic Museum.
Cray X-MP
• Multiprocessor / multiprocessing
• 8 times the Cray-1 memory bandwidth
• Clock: 9.5 ns, later 8.5 ns
• Scalar speedup*: 1.25 ~ 2.5
• Throughput*: 2.5 ~ 5 times
*Speedup with respect to the Cray-1
Irregular Vector Ops
• Scatter: use an index vector to scatter another vector’s elements across memory
– X[A[i]] = B[i]
• Gather: the reverse operation of scatter
– X[i] = B[C[i]]
• Compress
– Using a vector mask, pack the selected elements of a vector
• No single instruction for these before 1984
– Poor performance: 2.5 MFLOPS
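In scalar C, the three operations are the following loops (the function names are ours):

```c
#include <assert.h>

/* Scalar models of the three irregular vector operations. */
static void gather(double *x, const double *b, const int *c, int n)
{
    for (int i = 0; i < n; ++i) x[i] = b[c[i]];      /* X[i] = B[C[i]] */
}

static void scatter(double *x, const int *a, const double *b, int n)
{
    for (int i = 0; i < n; ++i) x[a[i]] = b[i];      /* X[A[i]] = B[i] */
}

/* Pack the elements of src selected by mask into dst; returns the new length. */
static int compress(double *dst, const double *src,
                    const unsigned char *mask, int n)
{
    int k = 0;
    for (int i = 0; i < n; ++i)
        if (mask[i]) dst[k++] = src[i];
    return k;
}
```

With the data of the gather example on a later slide (A = {200, 300, 400, 500, 600, 700, 100, 250, 350}, V0 = {4, 2, 7, 0}), gather produces V1 = {600, 400, 250, 200}.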
Gather Operation
V1[i] = A[V0[i]], with VL = 4 and A0 = 0x100 (base address of A)
Index vector: V0 = {4, 2, 7, 0}
Memory contents / addresses (A[0..8] at 0x100, 0x140, 0x180, 0x1C0, 0x200, 0x240, 0x280, 0x2C0, 0x300):
{200, 300, 400, 500, 600, 700, 100, 250, 350}
Result: V1 = {600, 400, 250, 200}
Example: V1[2] = A[V0[2]] = A[7] = 250
Scatter Operation
A[V0[i]] = V1[i], with VL = 4 and A0 = 0x100
V0 = {4, 2, 7, 0}, V1 = {200, 300, 400, 500}
After the scatter: A[4] = 200 (address 0x200), A[2] = 300 (address 0x180), A[7] = 400 (address 0x2C0), A[0] = 500 (address 0x100); all other locations are unchanged.
Example: A[V0[0]] = V1[0], i.e. A[4] (at address 0x200) = 200
Vector Compression Operation
V1 = Compress(V0, VM, Z), with VL = 14 and mask VM = 01011001 1101 ...
V0 = {0, -10, 5, -15, 0, 0, 24, -7, 13, 0, -17, 0, 0, ...}
The elements of V0 selected by the mask (indices 1, 3, 4, 7, 8, 9, 11, ...) are packed contiguously into V1.
Characteristics of Several Vector Architectures
The VMIPS Vector Instructions
A MIPS ISA extended to support vector instructions; equivalent to DLXV.
Multiple Lanes
Vectorizing Compilers
Performance Analysis of Vector Architectures
Topic 2a
Serial, Parallel and Pipelined
z = x + y through a 4-stage adder:
• Serial: one operation occupies all four stages before the next begins: 1 result per 4 cycles
• Pipeline (overlap): successive operations overlap in the stages: 1 result per cycle
• Array (replicate): N replicated units operate at once: N results per cycle
Generic Performance Formula (R. Hockney & C. Jesshope, 1981)

    t = (n + n1/2) / r∞

n: the vector length
r∞: the MFLOPS performance of the architecture with an infinite-length vector
n1/2: the vector length needed to achieve half of the peak performance
Serial Architecture
Parameters: s = number of stages, l = time per stage.
Each result must pass through all s stages with no overlap, so

    t_serial = s * l * n

Matching the generic formula t = (n + n1/2) / r∞:

    r∞ = 1 / (s * l),   n1/2 = 0
Pipeline Architecture
Parameters: s = number of stages, l = time per stage.
The first result pays the full start-up; after that, one result leaves the pipeline every l time units:

    t_pipeline = (s + n - 1) * l = l * (n + (s - 1))

Matching t = (n + n1/2) / r∞:

    r∞ = 1 / l,   n1/2 = s - 1

The initial penalty is (s - 1) * l; the n elements come out of the pipeline, one per stage time, after this penalty has been paid.
Thus …
• The asymptotic performance parameter (r∞)
– The memory-limited peak performance factor: a scale factor applied to the performance of a particular architecture implemented with a particular technology. Equivalent to a serial processor with a 100% cache miss rate.
• The n1/2 parameter
– The amount of parallelism that is present in a given architecture
– Determined by a combination of vector unit startup and vector unit latency
The n1/2 Range
n1/2 runs from 0 (a serial machine) to ∞ (an infinite array of processors).
The relative performance of different algorithms on a computer is determined by the value of n1/2 (matching problem parallelism with architecture parallelism).
Vector Length vs. Vector Performance
Plotting t against n for t = (n + n1/2) / r∞ gives a straight line with slope 1/r∞; the line crosses the n axis at -n1/2.
Pseudo code: Asymptotic and n1/2 Calculation

overhead = -clock();
overhead += clock();
for(N = 0; N < NMAX; ++N){
    time[N] = -clock();
    for(i = 0; i < N; ++i)
        A[i] = B[i] * C[i];
    time[N] += clock();
    time[N] = time[N] - overhead;
}
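Given the measured times, r∞ and n1/2 fall out of a least-squares line fit, since t(n) = (1/r∞) * n + (n1/2/r∞): the slope gives r∞ and the intercept-to-slope ratio gives n1/2. A sketch (fit_rinf_nhalf is our name):

```c
#include <assert.h>
#include <math.h>

/* Least-squares fit of t = a*n + b over m samples;
   then r_inf = 1/a and n_half = b/a. */
static void fit_rinf_nhalf(const double *n, const double *t, int m,
                           double *r_inf, double *n_half)
{
    double sn = 0, st = 0, snn = 0, snt = 0;
    for (int i = 0; i < m; ++i) {
        sn += n[i]; st += t[i];
        snn += n[i] * n[i]; snt += n[i] * t[i];
    }
    double a = (m * snt - sn * st) / (m * snn - sn * sn);
    double b = (st - a * sn) / m;
    *r_inf = 1.0 / a;      /* asymptotic rate */
    *n_half = b / a;       /* half-performance vector length */
}
```

In practice one would fit over many vector lengths to average out timing noise from clock() granularity.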
Parameters of Several Parallel Architectures

Computer                n1/2      r∞ (MFLOPS)   nv
CRAY-1                  10-20     80            1.5-3
BSP                     25-150    50            1-8
2-pipe CDC CYBER 205    100       100           11
1-pipe TI ASC           30        12            7
CDC STAR-100            150       25            12
(64 x 64) ICL DAP       2048      16            5
Another Example: The Chaining Effect
Assume m vector operations, unchained. Assume all the s_i are the same and all the l_i are the same.

    t_m^(i) = (s_i + n - 1) * l_i

So

    t_m = sum_i t_m^(i) = m * (s + n - 1) * l = (m * l) * (n + (s - 1))

Thus

    r∞ = 1 / (m * l),   n1/2 = s - 1
Another Example: The Chaining Effect
Assume the m vector operations are chained: they behave as one long pipeline with sum_i s_i = m * s stages.

    t_m = (m * s + n - 1) * l = l * (n + (m * s - 1))

Thus

    r∞ = 1 / l,   n1/2 = m * s - 1
Explanation
Unchained: Vop 1, Vop 2 and Vop 3 run one after another, each costing l * (n + s - 1); the total is 3 * l * (n + s - 1).
Chained: the three Vops overlap as one long pipeline, costing l * (n + 3s - 1) in total.
Compared with the unchained case, chaining m operations usually means an increase in r∞ and in n1/2 by a factor of m.
Topic 2b
Memory System Design in Vector Machines
Multi-Port Memory System
A vector addition is possible when two vectors (data streams) are processed by vector hardware, be it a pipelined adder or a processor array: streams A and B flow in, and stream C = A + B flows out.
The bandwidth challenge: memory must feed both input streams and absorb the output stream at the vector unit’s rate.
Memory in Vector Architectures
• Increase bandwidth
– Make “wider” memory: wider busses
– Create several memory banks
• Interleaved memory banks
– Several memory banks
– Shared resources
– Broadcast of memory accesses
– Interleaving factor
– Number of banks >= bank busy time
• Independent memory banks
– The same idea as interleaved memory
– Independent resources
[Figure: a pipelined adder fed with streams A and B from eight memory modules, producing stream C = A + B.]
However …
• Vector machines prefer independent memory banks
– Multiple loads or stores per clock
• Memory bank cycle time > CPU cycle time
– Striding
– Parallel processors
Bandwidth Requirement
• Assume d is the number of cycles needed to access a module
• Then BW1 = 1/d words per cycle
– It means that after d cycles you get one word
• Restriction: most vector memory systems require at least one word per cycle!
• Imagine a requirement of 3 words per cycle (two reads and one write) of total bandwidth (TBW)
• Then, total number of modules = TBW / BW1
– For a total of 3 / (1/d) = 3d modules
Bandwidth Example
The Cray T90
Clock cycle: 2.167 ns
Largest configuration: 32 processors
Maximum memory requests per cycle per processor: 6 (four loads and two stores)
SRAM access time: 15 ns

Busy time: 15 / 2.167 = 6.9, about 7 cycles
Memory requests: 6 * 32 = 192 in-flight requests per cycle
Number of banks: TBW / BW1 = 192 * 7 = 1344

Historical note: early Cray T90 configurations supported only up to 1024 memory banks.
Bandwidth vs. Latency
If each element is in a different memory module, does this guarantee maximum bandwidth? NO.
For each module, with an initiation rate r and a latency of d cycles to access a word:

    r * d <= 100%, so r <= 1/d

For m modules:

    r_m <= m/d, so BW <= m/d
An Example
• How should arrays A, B and W be laid out in memory so that the memory can sustain one operation per cycle for W = A + B?
• If the elements of W, A and B all start at module 0, conflicts will result!
The Memory Accesses
Assume: the pipelined adder has 4 stages, memory cycle time = 2.
[Timing table, with all three arrays starting at module 0: the reads of A and B and the write of W contend for the same modules. A[0] is read in the first two cycles, but B[0] must wait for the same modules, and W[0], once computed, cannot be written back until module 0 is free again, at cycle 13.]
Another Memory Layout

Module:  M0    M1    M2    M3    M4    M5    M6    M7
         A[0]  A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]
         B[6]  B[7]  B[0]  B[1]  B[2]  B[3]  B[4]  B[5]
         W[4]  W[5]  W[6]  W[7]  W[0]  W[1]  W[2]  W[3]
Memory Access
Assume: the pipelined adder has 4 stages, memory cycle time = 2.
[Timing table for the skewed layout: A[0] and B[0] are read from different modules and become ready in the same cycle; W[0] is written to yet another module as soon as it is ready, so the adder sustains one result per cycle with no bank conflicts.]
2-D Array Layout
With the straightforward layout of an 8 x 8 array a over eight modules, module Mj holds column j: a(0,j) through a(7,j).

Row and Column Access
• Column-per-module layout (Mj holds a(0,j) .. a(7,j)): a row access touches all eight modules: good for row vectors.
• Row-per-module layout (Mi holds a(i,0) .. a(i,7)): a column access touches all eight modules: good for column vectors.
Each layout serializes the other kind of access within a single module.
What to do?
• Use algorithms that access rows or columns exclusively
– Possible for some (i.e. LU decomposition) but not for others
• Skew the matrix
Row and Column Layout (skewed)
Element a(i,j) is placed in module (i + j) mod 8: each row is rotated one module relative to the previous one, so M0 holds a(0,0), a(1,7), a(2,6), ..., a(7,1).
Row stride: 1. Column stride: 9. Both row and column accesses now spread over all eight modules.
Problem: Diagonals!!!!! (a(i,i) falls in module 2i mod 8, hitting only the even-numbered modules)
Vector Access Specs
• Initial address
– Determines the association between addresses and memory banks (for independent memory banks)
• Number of elements
– Needed for vector length control and for possible multiple accesses to memory
• Precision of the elements
– Vector or floating point (if the unit is shared)
• Stride
– Needed for memory address generation and memory bank detection
Tricks for Strided Access
• Suppose array A is stored consecutively across M memory modules
• Assume V is a subset of A accessed with stride S
• Then M consecutive accesses to V fall into

    M / GCD(S, M)

memory modules
Our Example
Number of memory modules (M): 8
Row stride is 1: GCD(8, 1) = 1
Column stride is 9: GCD(8, 9) = 1
In general, if M = 2^k, select S = M + 1 (odd); this guarantees GCD(M, S) = 1
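The rule is easy to check in code (gcd and modules_hit are our names):

```c
#include <assert.h>

/* Greatest common divisor by the Euclidean algorithm. */
static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

/* Distinct modules hit by M consecutive accesses at stride S. */
static int modules_hit(int M, int S) { return M / gcd(S, M); }
```

modules_hit(8, 1) and modules_hit(8, 9) both give 8 (all modules used), while modules_hit(8, 32) gives 1: the pathological stride-32 case from the earlier bank-conflict example.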
Another Approach
• Let M be prime
– Then GCD(M, S) = 1 for all S < M
• Example: the Burroughs Scientific Processor (BSP) [Kuck & Budnick 71]
– 16 PEs
– 17 memory modules
– 2 alignment networks
The Burroughs Scientific Processor
[Diagram: 16 processors (P) connect through an input alignment network to 17 memory modules (M), and back through an output alignment network.]
Special Note and Other Approaches
• If memory time were one cycle (i.e. d = 1), consecutive accesses would cause no problem (not the case at all)
• Another solution to reduce contention
– RANDOMIZE!!!!
– Generate a (pseudo-)random bank in which to save each word
– Used by the Tera MTA
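The idea can be sketched by hashing the word address into a bank number; the mixer below (the 64-bit Murmur3 finalizer) is an arbitrary illustrative choice, not the Tera MTA's actual scheme:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bank randomization: hash the word address so that regular
   strides no longer pile onto a single bank. */
static unsigned bank_of(uint64_t addr, unsigned banks)
{
    addr ^= addr >> 33;               /* 64-bit mix (Murmur3 finalizer) */
    addr *= 0xff51afd7ed558ccdULL;
    addr ^= addr >> 33;
    addr *= 0xc4ceb9fe1a85ec53ULL;
    addr ^= addr >> 33;
    return (unsigned)(addr % banks);
}
```

With 8 banks, the bad stride-32 address pattern 0, 32, 64, ... always lands in bank 0 under the plain mapping addr mod 8, but spreads over several banks under bank_of. The trade-off is that spatial locality within a bank is destroyed, which is why this fits a latency-tolerant machine like the MTA.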
Questions?? Comments??
Bibliography
• Stork, Christian. “Exploring the Tera MTA by Example.” May 10, 2000.
• Cray-1 Computer System Hardware Reference Manual. Cray Research.
• Koopman, Philip. “17. Vector Performance.” Lecture at Carnegie Mellon University, November 9, 1998.
Side Note 2: Gordon Bell’s Rules for Supercomputers
• Pay attention to performance
• Orchestrate your resources
• Scalars are bottlenecks
– Plan your scalar section very carefully
• Use existing (affordable) technology to provide peak vector performance
– Vector performance: bandwidth and vector register capacity
– Rule of thumb: at least two results per cycle
• Do not ignore expensive ops
– Divisions are uncommon, but they are used!!!!
• Design to market
– Create something to brag about
Side Note 2: Gordon Bell’s Rules for Supercomputers
• 10 ~ 7
– Support around two extra bits of addressing every 3 years (10 years ~= 7 bits)
• Increase productivity
• Stand on the shoulders of giants
• Pipeline your design
– Design for one generation, then the next, and then the next…
• Leave enough slack time in your schedule for the unexpected
– Beware of Murphy’s Laws!!!!