Page 1: Topic 2

9/18/2006 ELEG652-06F 1

Topic 2

Vector Processing & Vector Architectures

"Lasciate Ogne Speranza, Voi Ch’Intrate" ("Abandon all hope, ye who enter"), Dante’s Inferno

Page 2: Topic 2

9/18/2006 ELEG652-06F 2

Reading List

• Slides: Topic2x

• Henn&Patt: Appendix G

• Other assigned readings from homework and classes

Page 3: Topic 2

9/18/2006 ELEG652-06F 3

Vector Architectures

Types:

  Memory-Memory Archs
  Register-Register Archs

Vector Arch Components:

  Vector Register Banks: capable of holding n vector elements each, plus two extra registers (vector length and vector mask)

  Vector Functional Units: fully pipelined, with hazard detection (structural and data)

  Vector Load-Store Unit

  A Scalar Unit: a set of registers, FUs and CUs

Page 4: Topic 2

9/18/2006 ELEG652-06F 4

An Intro to DLXV

• A simplified vector architecture

• Consists of one lane per functional unit
  – Lane: the number of vector element operations that a functional unit can execute in parallel each cycle

• Loosely based on Cray 1 arch and ISA

• Extension of DLX ISA for vector arch

Page 5: Topic 2

9/18/2006 ELEG652-06F 5

DLXV Configuration

• Vector Registers
  – Eight vector regs / 64 elements each
  – Two read ports and one write port per register
  – Sixteen read ports and eight write ports in total

• Vector Functional Units
  – Five functional units

• Vector Load and Store Unit
  – A bandwidth of 1 word per cycle
  – Doubles as the scalar load / store unit

• A set of scalar registers
  – 32 general and 32 FP regs

Page 6: Topic 2

9/18/2006 ELEG652-06F 6

A Vector / Register Arch

[Figure: block diagram of a vector register architecture. Main Memory connects through the Vector Load & Store unit to a Vector Register File and a Scalar Register File, which feed the functional units: FP Add, FP Multiply, FP Divide, Logical and Integer.]

Page 7: Topic 2

9/18/2006 ELEG652-06F 7

Advantages

• A single vector instruction specifies a lot of work
• No data hazards
  – No need to check for data hazards inside vector instructions
  – Parallelism inside the vector operation
• Deep pipeline or array of processing elements
• Known access pattern
  – Latency only paid once per vector (pipelined loading)
  – Memory addresses can be mapped to memory modules to reduce contention
• Reduction in code size and simplification of hazards
  – Control hazards from loop branches are eliminated

Page 8: Topic 2

9/18/2006 ELEG652-06F 8

DAXPY: DLX Code

Y = a * X + Y

      LD    F0, a          ; load scalar a
      ADDI  R4, Rx, #512   ; last address to load
Loop: LD    F2, 0(Rx)      ; load X(i)
      MULTD F2, F0, F2     ; a x X(i)
      LD    F4, 0(Ry)      ; load Y(i)
      ADDD  F4, F2, F4     ; a x X(i) + Y(i)
      SD    F4, 0(Ry)      ; store into Y(i)
      ADDI  Rx, Rx, #8     ; increment index to X
      ADDI  Ry, Ry, #8     ; increment index to Y
      SUB   R20, R4, Rx    ; compute bound
      BNZ   R20, Loop      ; check if done

The ADDI, SUB and BNZ instructions (shown in bold on the original slide) are part of the loop index calculation and branching overhead.

Page 9: Topic 2

9/18/2006 ELEG652-06F 9

DAXPY: DLXV Code

Y = a * X + Y

LD     F0, a       ; load scalar a
LV     V1, Rx      ; load vector X
MULTSV V2, F0, V1  ; vector-scalar multiply
LV     V3, Ry      ; load vector Y
ADDV   V4, V2, V3  ; add
SV     Ry, V4      ; store the result

Instruction count [bandwidth] for 64 elements:

  DLX code:  578 instructions
  DLXV code:   6 instructions

Page 10: Topic 2

9/18/2006 ELEG652-06F 10

Dead Time

The time it takes for the pipeline to become ready for the next vector instruction

Page 11: Topic 2

9/18/2006 ELEG652-06F 11

Issues

• Vector Length Control
  – Program vector lengths are often not equal to, or a multiple of, the hardware vector length
• Vector Stride
  – Accesses to vector elements may not be consecutive
• Solutions:
  – Two special registers
    • One for the vector length, up to a maximum vector length (MVL)
    • One for the vector mask

Page 12: Topic 2

9/18/2006 ELEG652-06F 12

Vector Length Control

An Example:

for (i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];

Question: Assume the maximum hardware vector length is MVL, which may be less than n. How should we perform this computation?

Page 13: Topic 2

9/18/2006 ELEG652-06F 13

Vector Length Control: Strip Mining

Strip Mined Code

Low = 0;
VL  = n % MVL;                        /* first strip: the odd-size remainder */
for (j = 0; j <= n / MVL; ++j) {
    for (i = Low; i < Low + VL; ++i)
        y[i] = a * x[i] + y[i];
    Low += VL;
    VL = MVL;                         /* all later strips are full length */
}

Original Code

for (i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
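A quick way to see which strips this produces is the minimal sketch below (my own illustration; n = 200 and MVL = 64 are made-up example values, not from the slides):

#include <stdio.h>

int main(void) {
    int n = 200, MVL = 64;          /* example values */
    int Low = 0, VL = n % MVL;      /* first strip: 200 % 64 = 8 elements */
    for (int j = 0; j <= n / MVL; ++j) {
        printf("strip %d: elements %d..%d (%d elements)\n", j, Low, Low + VL - 1, VL);
        Low += VL;
        VL = MVL;                   /* remaining strips: 64 elements each */
    }
    return 0;
}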

Page 14: Topic 2

9/18/2006 ELEG652-06F 14

Vector Length Control: Strip Mining

Value of j:   0         1             2                    3                    ...   n/MVL
Value of i:   0..M-1    M..M+MVL-1    M+MVL..M+2*MVL-1     M+2*MVL..M+3*MVL-1   ...   n-MVL..n-1

For a vector of arbitrary length, M = n % MVL (the length of the first, partial strip).

The vector length control register takes values similar to the Vector Length variable (VL) in the C code.

Page 15: Topic 2

9/18/2006 ELEG652-06F 15

Vector Stride

for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j) {
        c[i][j] = 0.0;
        for (k = 0; k < n; ++k)
            c[i][j] += a[i][k] * b[k][j];
    }

Matrix Multiply Code

How do we vectorize this code?

How does stride work here?

Consider that in C the arrays are stored in memory row-wise (row-major order). Therefore a and c are accessed with unit stride. How about b?

Page 16: Topic 2

9/18/2006 ELEG652-06F 16

Vector Stride

• Stride for a and c: 1
  – Also called unit stride
• Stride for b: n elements
• Use special instructions
  – Load and store with stride (LVWS, SVWS)
• Memory banks: strides can complicate access patterns, causing contention
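As a small illustration of what the non-unit stride means in memory terms (a sketch of my own; the array size N and the column index are arbitrary example values), walking down one column of a row-major b touches elements that are N doubles apart, which is exactly the pattern a strided load such as LVWS has to handle:

#include <stdio.h>
#define N 8

static double b[N][N];

int main(void) {
    int j = 3;                                 /* arbitrary column */
    for (int k = 0; k < N; ++k)                /* walk down column j of b */
        printf("b[%d][%d] is element %ld of the array (stride %d)\n",
               k, j, (long)(&b[k][j] - &b[0][0]), N);
    return 0;
}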

Page 17: Topic 2

9/18/2006 ELEG652-06F 17

Vector Stride & Memory Systems

Example

Memory System

Eight memory banks
Busy time: 6 cycles
Memory latency: 12 cycles

Vector

64 elements
Stride of 1 and 32

Time to Complete a Load

12 + 64 = 76 cycles for stride 1, or 1.2 cycles per element

12 + 1 + 6 * 63 = 391 cycles for stride 32, or 6.1 cycles per element
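The sketch below restates the slide's arithmetic as a tiny simulation (my own illustration, under the simple assumption that one element request can issue per cycle unless its target bank is still busy):

#include <stdio.h>

int main(void) {
    const int banks = 8, busy = 6, latency = 12, n = 64;
    const int strides[] = {1, 32};
    for (int s = 0; s < 2; ++s) {
        int stride = strides[s];
        long last_free[8] = {0};              /* cycle at which each bank becomes free */
        long issue = 1, finish = 0;
        for (int i = 0; i < n; ++i) {
            int bank = (int)(((long)i * stride) % banks);
            if (issue < last_free[bank])
                issue = last_free[bank];      /* wait for the bank to free up */
            last_free[bank] = issue + busy;   /* bank stays busy for `busy` cycles */
            finish = issue + latency;         /* this element's word arrives */
            ++issue;                          /* next request can issue next cycle */
        }
        printf("stride %2d: %ld cycles (%.1f per element)\n",
               stride, finish, (double)finish / n);
    }
    return 0;
}

Running it reproduces the two cases above: 76 cycles for stride 1 and 391 cycles for stride 32.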

Page 18: Topic 2

9/18/2006 ELEG652-06F 18

Cray-1: "The World's Most Expensive Love Seat..."

Picture courtesy of Cray. Original source: Cray 1 Computer System Hardware Reference Manual.

Page 19: Topic 2

9/18/2006 ELEG652-06F 19

Cray 1 Data Sheet

• Designed by: Seymour Cray
• Price: 5 to 8.8 million dollars
• Units Shipped: 85
• Technology: SIMD, deeply pipelined functional units
• Performance: up to 160 MFLOPS
• Date Released: 1976
• Best Known for: the computer that made the term "supercomputer" mainstream

Page 20: Topic 2

9/18/2006 ELEG652-06F 20

Architectural Components

Computation Section
  Registers, functional units, instruction buffers

Memory Section
  From 0.25 to 1 million 64-bit words

I/O Section
  12 input channels, 12 output channels
  MCU, mass storage subsystem, front-end computers / I/O stations, peripheral equipment

Page 21: Topic 2

9/18/2006 ELEG652-06F 21

The Cray-1 Architecture

[Figure: the Cray-1 computation section, with its vector components, scalar components, and address & instruction calculation components.]

Page 22: Topic 2

9/18/2006 ELEG652-06F 22

Architecture Features

• 64-bit word
• 12.5 nanosecond clock period
• 2's complement arithmetic
• Scalar and vector processing modes
• Twelve functional units
• Eight 24-bit address (A) registers
• Sixty-four 24-bit intermediate address (B) registers
• Eight 64-bit scalar (S) registers
• Sixty-four 64-bit intermediate scalar (T) registers
• Eight 64-element vector (V) registers, 64 bits per element
• Four instruction buffers of 64 16-bit parcels each
• Integer and floating-point arithmetic
• 128 instruction codes

Page 23: Topic 2

9/18/2006 ELEG652-06F 23

Architecture Features

• Up to 1,048,576 words of memory
  – 64 data bits and 8 error correction bits per word
• Eight or sixteen banks of 65,536 words each
• Bank busy time: 4 cycles
• Transfer rate:
  – B, T, V registers: one word per cycle
  – A, S registers: one word per two cycles
  – Instruction buffers: four words per clock cycle
• SEC-DED (single error correction, double error detection)
• Twelve input and twelve output channels
• Lost-data detection
• Channel group
  – Contains either six input or six output channels
  – Served equally by memory (scanned every 4 cycles)
  – Priority resolved within the groups
  – Sixteen data bits, 3 control bits per channel, and 4 parity bits

Original Source: Cray 1 Computer System Hardware Reference Manual

Page 24: Topic 2

9/18/2006 ELEG652-06F 24

Register-Register Architecture

• All ALU operands are in registers
• Registers are specialized by function (A, B, T, etc.), thus avoiding conflicts
• Transfers between memory and registers are treated differently than ALU operations
• RISC-based idea
• Effective use of the Cray-1 requires careful planning to exploit its register resources
  – 4 Kbytes of high-speed registers

Page 25: Topic 2

9/18/2006 ELEG652-06F 25

Registers

Primary Registers
  – Directly addressable by the functional units
  – Named V (vector), S (scalar) and A (address)

Intermediate Registers
  – Used as buffers for the functional units
  – Named T (scalar transport) and B (address buffering)

M <-> IR <-> PR <-> FU  (memory, intermediate registers, primary registers, functional units)

Page 26: Topic 2

9/18/2006 ELEG652-06F 26

Registers

• Memory access time: 11 cycles
• Register access time: 1 ~ 2 cycles
• Primary registers:
  – Address regs: 8 x 24 bits
  – Scalar regs: 8 x 64 bits
  – Vector regs: 8 x 64 words
• Intermediate registers:
  – B regs: 64 x 24 bits
  – T regs: 64 x 64 bits
• Special registers:
  – Vector Length register: 0 <= VL <= 64
  – Vector Mask register: 64 bits
• Total size: 4,888 bytes

Page 27: Topic 2

9/18/2006 ELEG652-06F 27

Instruction Format

A parcel: 16 bits

Instruction word: 16 bits (one parcel) or 32 bits (two parcels), according to type

A one-parcel instruction (arithmetic/logical instruction word) has fields of 4, 3, 3, 3 and 3 bits: op code, result register and operand registers.

A two-parcel instruction (memory instruction word) has fields of 4, 3, 3 and 22 bits: op code, address index register, result register and address.

Page 28: Topic 2

9/18/2006 ELEG652-06F 28

Functional Unit Pipelines

Functional pipelines                    Register usage   Pipeline delays (clock periods)

Address functional units
  Address add unit                      A                2
  Address multiply unit                 A                6

Scalar functional units
  Scalar add unit                       S                3
  Scalar shift unit                     S                2 or 3
  Scalar logical unit                   S                1
  Population/leading-zero count unit    S                3

Vector functional units
  Vector add unit                       V or S           3
  Vector shift unit                     V or S           4
  Vector logical unit                   V or S           2

Floating-point functional units
  Floating-point add unit               S and V          6
  Floating-point multiply unit          S and V          7
  Reciprocal approximation unit         S and V          14

Page 29: Topic 2

9/18/2006 ELEG652-06F 29

Vector Units

• Vector Addition / Subtraction
  – Functional unit delay: 3
• Vector Logical Unit
  – Boolean ops between 64-bit elements of the vectors: AND, XOR, OR, MERGE, MASK GENERATION
  – Functional unit delay: 2
• Vector Shift
  – Shifts values of a 64-bit (or 128-bit) element of a vector
  – Functional unit delay: 4

Page 30: Topic 2

9/18/2006 ELEG652-06F 30

Instruction Set

• 128 Instructions

• Ten Vector Types

• Thirteen Scalar Types

• Three Addressing Modes

Page 31: Topic 2

9/18/2006 ELEG652-06F 31

Implementation Philosophy

• Instruction Processing
  – Instruction buffering: four instruction buffers of 64 16-bit parcels each
• Memory Hierarchy
  – Memory banks, T and B register banks
• Register and Functional Unit Reservation
  – Example: for vector ops, the register operands, register result and FU are marked as reserved
• Vector Processing

Page 32: Topic 2

9/18/2006 ELEG652-06F 32

Instruction Processing

“Issue one instruction per cycle”

• 4 buffers x 64 words
• 16- or 32-bit instructions
• Instruction parcel pre-fetch
• Branch in buffer
• 4 instructions/cycle fetched to the LRU I-buffer

Special Resources

P, NIP, CIP and LIP (program counter and the next / current / lower instruction parcel registers)

Page 33: Topic 2

9/18/2006 ELEG652-06F 33

Reservations

• Vector operands, results and the functional unit are marked reserved
• The vector result reservation is lifted when the chain slot time has passed
  – Chain slot: functional unit delay plus two clock cycles

Examples:

  V1 = V2 * V3
  V4 = V5 + V6    Independent

  V1 = V2 * V3
  V4 = V5 + V2    Second instruction cannot begin until the first is finished (V2 is reserved)

  V1 = V2 * V3
  V4 = V5 * V6    Ditto (the multiply unit is reserved)

Page 34: Topic 2

9/18/2006 ELEG652-06F 34

V. Instructions in the Cray-1

[Figure: (a) Type 1 vector instruction: Vi <- Vj op Vk, element by element over the first n elements of two vector registers. (b) Type 2 vector instruction: Vi <- Sj op Vk, a scalar register operand combined with each of the first n elements of a vector register.]

Page 35: Topic 2

9/18/2006 ELEG652-06F 35

V. Instructions in the Cray-1

[Figure: (c) Type 3 vector instruction: a vector load from memory into Vi. (d) Type 4 vector instruction: a vector store from Vj to memory.]

Page 36: Topic 2

9/18/2006 ELEG652-06F 36

Vector Loops

• Long vectors with N > 64 are sectioned

• Each time through a vector loop 64 elements are processed

• Remainder handling is "transparent" to the programmer

Page 37: Topic 2

9/18/2006 ELEG652-06F 37

V. Chaining

• Internal forwarding technique of the 360/91
• A "linking process" that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe
• Chaining allows operations to be issued as soon as the first result becomes available
• Registers / functional units must be properly reserved
• Limited by the number of vector registers and functional units
• From 2 to 5

Page 38: Topic 2

9/18/2006 ELEG652-06F 38

Chaining Example

[Figure: a chain of four pipelines (memory fetch pipe, vector add pipe, shift pipe, logical product pipe) working on vector registers V0 through V5:]

  Mem     -> V0   (memory fetch)
  V0 + V1 -> V2   (vector add)
  V2 < A3 -> V3   (left shift)
  V3 ^ V4 -> V5   (logical product)

[Timing diagram: fetching and adding overlap; each pipe starts one chain slot after the pipe feeding it.]

Page 39: Topic 2

9/18/2006 ELEG652-06F 39

Multipipeline Chaining: SAXPY Code

[Figure: limited chaining, using only one memory-access pipe, in the Cray-1 versus complete chaining, using three memory-access pipes (load X, load Y, store Y), in the Cray X-MP. The chain is: load X into V1, load Y into V3, scalar multiply S x V1 into V2, vector add V2 + V3 into V4, and store V4 back to Y.]

Page 40: Topic 2

9/18/2006 ELEG652-06F 40

Cray 1 Performance

• 3 to 160 MFLOPS
  – Depending on application and programming skill

• Scalar Performance: 12 MFLOPS

• Vector Dot Product: 22 MFLOPS

• Peak Performance: 153 MFLOPS

Page 41: Topic 2

9/18/2006 ELEG652-06F 41

Cray X-MP Data Sheet

• Designed by: Steve Chen
• Price: 15 million dollars
• Units Shipped: N/A
• Technology: SIMD, deeply pipelined functional units
• Performance: up to 200 MFLOPS (for a single CPU)
• Date Released: 1982
• Best Known for: the successor of the Cray-1 and the first parallel vector computer from Cray Research

An NSA CRAY X-MP/24, on exhibit at the National Cryptologic Museum.

Page 42: Topic 2

9/18/2006 ELEG652-06F 42

Cray X-MP

• Multiprocessor/multiprocessing

• 8 times the Cray-1 memory bandwidth

• Clock: 9.5 ns - 8.5 ns

• Scalar speedup*: 1.25 ~ 2.5

• Throughput*: 2.5 ~ 5 times

*Speedup with respect to Cray 1

Page 43: Topic 2

9/18/2006 ELEG652-06F 43

Irregular Vector Ops

• Scatter: use a vector to scatter another vector's elements across memory
  – X[A[i]] = B[i]

• Gather: the reverse operation of scatter
  – X[i] = B[C[i]]

• Compress
  – Using a vector mask, compress a vector

• No single instruction to do these before 1984
  – Poor performance: 2.5 MFLOPS
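A minimal scalar sketch of the three operations' semantics (plain C loops standing in for the vector instructions; all array names are placeholders of my own):

#include <stddef.h>

/* Scatter: X[A[i]] = B[i] */
void scatter(double *X, const size_t *A, const double *B, size_t n) {
    for (size_t i = 0; i < n; ++i)
        X[A[i]] = B[i];
}

/* Gather: X[i] = B[C[i]] */
void gather(double *X, const double *B, const size_t *C, size_t n) {
    for (size_t i = 0; i < n; ++i)
        X[i] = B[C[i]];
}

/* Compress: keep only the elements whose mask bit is set; returns the new length */
size_t compress(double *dst, const double *src, const unsigned char *mask, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

int main(void) {
    double src[5] = {1, 2, 3, 4, 5}, dst[5];
    unsigned char mask[5] = {0, 1, 1, 0, 1};
    size_t k = compress(dst, src, mask, 5);   /* dst holds {2, 3, 5}, k == 3 */
    (void)k;
    return 0;
}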

Page 44: Topic 2

9/18/2006 ELEG652-06F 44

Gather Operation

V1[i] = A[V0[i]]

VL = 4      A0 = 100 (base address of A)

[Figure: memory holds A = (200, 300, 400, 500, 600, 700, 100, 250, 350) at addresses 100, 140, 180, 1C0, 200, 240, 280, 2C0, 300. With index vector V0 = (4, 2, 7, 0), the gather loads A[4], A[2], A[7], A[0] and produces V1 = (600, 400, 250, 200).]

Example: V1[2] = A[V0[2]] = A[7] = 250

Page 45: Topic 2

9/18/2006 ELEG652-06F 45

Scatter Operation

A[V0[i]] = V1[i]

VL = 4      A0 = 100 (base address of A)

[Figure: with index vector V0 = (4, 2, 7, 0) and data vector V1 = (200, 300, 400, 500), the scatter writes A[4] = 200, A[2] = 300, A[7] = 400 and A[0] = 500; all other memory locations are left untouched.]

Example: A[V0[0]] = V1[0], i.e. A[4] = 200 (the word at address 0x200 becomes 200)

Page 46: Topic 2

9/18/2006 ELEG652-06F 46

Vector Compression Operation

VL = 14     VM = 01011001 1101 ...

V1 = Compress(V0, VM, Z)

[Figure: the compress operation reads the vector mask VM and packs the elements of V0 whose mask bit is set (here positions 1, 3, 4, 7, 8, 9 and 11) into consecutive positions of V1.]

Page 47: Topic 2

9/18/2006 ELEG652-06F 47

Characteristics of Several Vector Architectures

Page 48: Topic 2

9/18/2006 ELEG652-06F 48

The VMIPS Vector Instructions

A MIPS ISA extended to support Vector Instructions. The same as DLXV

Page 49: Topic 2

9/18/2006 ELEG652-06F 49

Multiple Lanes

Page 50: Topic 2

9/18/2006 ELEG652-06F 50

Vectorizing Compilers

Page 51: Topic 2

9/18/2006 ELEG652-06F 51

Performance Analysis of Vector Architectures

Topic 2a

Page 52: Topic 2

9/18/2006 ELEG652-06F 52

Serial, Parallel and Pipelines

z = x + y

[Figure: three organizations of a four-stage operation z = x + y.
  Serial: one operand pair (x, y) flows through all four stages before the next pair enters, giving 1 result per 4 cycles.
  Pipeline (overlap): successive operand pairs occupy successive stages, giving 1 result per cycle.
  Array (replicate): the four-stage unit is replicated, giving N results per cycle.]

Page 53: Topic 2

9/18/2006 ELEG652-06F 53

Generic Performance Formula (R. Hockney & C. Jesshope 81)

t = (n + n1/2) / r∞

Where:
  r∞   : MFLOPS performance of the architecture with an infinite length vector
  n1/2 : the vector length needed to achieve half of the peak performance
  n    : vector length

Page 54: Topic 2

9/18/2006 ELEG652-06F 54

Serial Architecture

Generic formula:   t = (n + n1/2) / r∞

Serial case:

  t = s * l * n
  r∞ = 1 / (s * l)
  n1/2 = 0

Parameters:
  s : number of stages
  l : time per stage
  s * l : start-up time (paid for every element, since there is no overlap)

[Figure: in the serial organization each operand pair (x, y) passes through all s stages before the next pair enters.]

Page 55: Topic 2

9/18/2006 ELEG652-06F 55

Pipeline Architecture

Generic formula:   t = (n + n1/2) / r∞

Pipeline case:

  t = l * (s + n - 1) = l * [(s - 1) + n]
  r∞ = 1 / l
  n1/2 = s - 1

Parameters:
  s : number of stages
  l : time per stage
  s * l : the initial penalty (start-up); n elements come out of the pipeline, one every l, after the initial penalty has been paid
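A quick numeric check of these formulas (values chosen for illustration, not taken from the slides): with s = 4 stages, l = 1 cycle and n = 64 elements, t = 1 * (4 + 64 - 1) = 67 cycles, r∞ = 1 result per cycle, and n1/2 = 3, i.e. a vector of only 3 elements already runs at half of the peak rate.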

Page 56: Topic 2

9/18/2006 ELEG652-06F 56

Thus …

• The asymptotic performance parameter (r∞)
  – Memory-limited peak performance factor. A scale factor applied to the performance of a particular architecture implemented with a particular technology. Equivalent to a serial processor with a 100% cache miss rate.

• The n1/2 parameter
  – The amount of parallelism that is present in a given architecture.
  – Determined by a combination of vector unit start-up and vector unit latency.

Page 57: Topic 2

9/18/2006 ELEG652-06F 57

The N Half Range

n1/2 ranges from 0 (a serial machine) to ∞ (an infinite array of processors).

The relative performance of different algorithms on a computer is determined by the value of n1/2 (matching problem parallelism with architecture parallelism).

Page 58: Topic 2

9/18/2006 ELEG652-06F 58

Vector Length vs. Vector Performance

[Figure: plot of execution time t against vector length n for t = (n + n1/2) / r∞; the line has slope 1/r∞ and crosses the n axis at -n1/2.]

Page 59: Topic 2

9/18/2006 ELEG652-06F 59

Pseudo Code: Asymptotic and N-half Calculation

overhead = -clock();
overhead += clock();                /* cost of the timer calls themselves */
for (N = 0; N < NMAX; ++N) {
    time[N] = -clock();
    for (i = 0; i < N; ++i)
        A[i] = B[i] * C[i];
    time[N] += clock();
    time[N] = time[N] - overhead;   /* subtract the timing overhead */
}
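One way to turn those measurements into the two parameters (a sketch of my own, not from the slides) is a least-squares straight-line fit of time[N] against N: the model t = (n + n1/2) / r∞ is linear in n, its slope is 1/r∞ and its intercept is n1/2/r∞.

#include <stdio.h>

/* Fit time[N] ~ a*N + b over N = 1..nmax-1, then r_inf = 1/a and n_half = b/a. */
static void fit_hockney(const double *time, int nmax, double *r_inf, double *n_half) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    int cnt = 0;
    for (int n = 1; n < nmax; ++n) {
        sx += n; sy += time[n]; sxx += (double)n * n; sxy += n * time[n];
        ++cnt;
    }
    double a = (cnt * sxy - sx * sy) / (cnt * sxx - sx * sx);  /* slope     */
    double b = (sy - a * sx) / cnt;                            /* intercept */
    *r_inf  = 1.0 / a;
    *n_half = b / a;
}

int main(void) {
    /* synthetic timings generated from r_inf = 100, n_half = 20 (made-up values) */
    enum { NMAX = 256 };
    double t[NMAX], r_inf, n_half;
    for (int n = 0; n < NMAX; ++n)
        t[n] = (n + 20.0) / 100.0;
    fit_hockney(t, NMAX, &r_inf, &n_half);
    printf("r_inf = %.1f, n_half = %.1f\n", r_inf, n_half);
    return 0;
}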

Page 60: Topic 2

9/18/2006 ELEG652-06F 60

Parameters of Several Parallel Architectures

Computer                  n1/2      r∞      nv

CRAY-1                    10-20     80      1.5-3
BSP                       25-150    50      1-8
2-pipe CDC CYBER 205      100       100     11
1-pipe TI ASC             30        12      7
CDC STAR 100              150       25      12
(64 x 64) ICL DAP         2048      16      5

Page 61: Topic 2

9/18/2006 ELEG652-06F 61

Another Example: The Chaining Effect

Assume m vector operations, unchained:

  t_m = sum over i = 1..m of  l_i * (s_i + n - 1)

Assume that all the s_i are the same (= s); the same goes for the l_i (= l). Thus

  t_m = m * l * (s + n - 1) = m * t_1

So

  n1/2 = s - 1
  r∞  = 1 / l

Page 62: Topic 2

9/18/2006 ELEG652-06F 62

Another Example: The Chaining Effect

Assume m vector operations, chained:

  t_m = l * [ (sum over i = 1..m of s_i) + n - 1 ]

Thus, with all s_i = s and all l_i = l,

  t_m = l * (m * s + n - 1)

So

  n1/2 = m * s - 1
  r∞  = m / l

Page 63: Topic 2

9/18/2006 ELEG652-06F 63

Explanation

Unchained: the vector operations (Vop 1, Vop 2, Vop 3) run back to back, each taking l * (s + n - 1), for a total of 3 * l * (s + n - 1).

Chained: each operation starts as soon as the first result of its predecessor emerges (one start-up later), so the total time is roughly l * (3s + n - 1).

Usually this means an increase in r∞ and n1/2 by a factor of m.

Page 64: Topic 2

9/18/2006 ELEG652-06F 64

Topic 2b

Memory System Design in Vector Machines

Page 65: Topic 2

9/18/2006 ELEG652-06F 65

Multi Port Memory System

[Figure: two data streams A and B feeding either a pipelined adder or a processor array, producing stream C = A + B.]

A vector addition is performed when two vectors (data streams) are processed by vector hardware, be it a pipelined adder or a processor array.

Bandwidth Challenge

Page 66: Topic 2

9/18/2006 ELEG652-06F 66

Memory in Vector Architectures

• Increase bandwidth
  – Make "wider" memory (wider busses)
  – Create several memory banks
• Interleaved memory banks
  – Several memory banks sharing resources
  – Broadcast of memory accesses
  – Interleaving factor
  – Number of banks >= bank busy time
• Independent memory banks
  – The same idea as interleaved memory, but with independent resources

[Figure: a pipelined adder fed by streams A and B from several memory banks (M), producing stream C = A + B back to the banks.]
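A minimal sketch of the low-order interleaving idea (my own illustration; the word size and bank count are assumptions, not from the slides): with a power-of-two number of banks, the bank number comes from the low-order bits of the word address, so consecutive words land in consecutive banks.

#include <stdio.h>

int main(void) {
    const unsigned nbanks = 8;                 /* assumed: 8-way interleaving */
    const unsigned wordsize = 8;               /* assumed: 8-byte words */
    for (unsigned addr = 0; addr < 10 * wordsize; addr += wordsize) {
        unsigned word   = addr / wordsize;
        unsigned bank   = word % nbanks;       /* low-order interleaving */
        unsigned offset = word / nbanks;       /* word index within the bank */
        printf("addr %3u -> bank %u, offset %u\n", addr, bank, offset);
    }
    return 0;
}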

Page 67: Topic 2

9/18/2006 ELEG652-06F 67

However …

• Vector machines prefer independent memory banks, because of:
  – Multiple loads or stores per clock
  – Memory bank cycle time > CPU cycle time
  – Striding
  – Parallel processors sharing the memory

Page 68: Topic 2

9/18/2006 ELEG652-06F 68

Bandwidth Requirement

• Assume d is the number of cycles needed to access a module
• Then BW1 (one module's bandwidth) equals 1/d words per cycle
  – i.e., after d cycles you get a word
• Restriction: most [vector] memory systems require at least one word per cycle!
• Imagine a requirement of 3 words per cycle (two reads and one write) of total bandwidth (TBW)
• Then the total number of modules = TBW / BW1 = 3 / (1/d) = 3d modules
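A tiny helper restating that calculation (a sketch of my own; the d = 4 used in main is just an example value):

#include <stdio.h>

/* Modules needed so that TBW words per cycle can be sustained when one
   module delivers 1/d words per cycle: TBW / (1/d) = TBW * d. */
static unsigned modules_needed(unsigned tbw_words_per_cycle, unsigned d_cycles) {
    return tbw_words_per_cycle * d_cycles;
}

int main(void) {
    printf("%u modules\n", modules_needed(3, 4));   /* 3 words/cycle, d = 4 -> 12 modules */
    return 0;
}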

Page 69: Topic 2

9/18/2006 ELEG652-06F 69

Bandwidth Example

The Cray T90
  Clock cycle: 2.167 ns
  Largest configuration: 32 processors
  Maximum memory requests per cycle: 6 per processor (four loads and two stores)
  SRAM access time: 15 ns

Busy time: 15 / 2.167 = 6.9 ~ 7 cycles
Memory requests: 6 * 32 = 192 in-flight requests
Number of banks: TBW / BW1 = 192 * 7 = 1344

Historical note: early Cray T90 configurations supported only up to 1024 memory banks.

Page 70: Topic 2

9/18/2006 ELEG652-06F 70

Bandwidth vs. Latency

If each element is in a different memory module, does this guarantee maximum bandwidth? NO.

For each module:
  an initiation rate r
  a latency of d cycles to access a word
  r * d <= 100%, so r <= 1/d

Across m modules: r_m <= m/d, so BW <= m/d words per cycle.

Page 71: Topic 2

9/18/2006 ELEG652-06F 71

An Example

• How should arrays A, B and W be laid out in memory so that the memory can sustain one operation per cycle for the computation W = A + B?

• If the elements of W, A and B all start at module 0, conflicts will result!

Page 72: Topic 2

9/18/2006 ELEG652-06F 72

The Memory Accesses

Assume: the pipelined adder has 4 stages and the memory cycle time is 2.

[Timing table: with A, B and W all starting at module 0, the reads of A[i] and B[i] and the write-back of W[i] contend for the same modules. A[0] becomes ready, B[0] becomes ready only some cycles later, and W[0], although already computed, cannot be written back until module 0 is free; only then is W[0] written back. The memory cannot sustain one result per cycle.]

Page 73: Topic 2

9/18/2006 ELEG652-06F 73

Another Memory Layout

Module 0:  A[0]  B[6]  W[4]
Module 1:  A[1]  B[7]  W[5]
Module 2:  A[2]  B[0]  W[6]
Module 3:  A[3]  B[1]  W[7]
Module 4:  A[4]  B[2]  W[0]
Module 5:  A[5]  B[3]  W[1]
Module 6:  A[6]  B[4]  W[2]
Module 7:  A[7]  B[5]  W[3]

Page 74: Topic 2

9/18/2006 ELEG652-06F 74

Memory Access

Assume: the pipelined adder has 4 stages and the memory cycle time is 2.

[Timing table: with the skewed layout of the previous slide, the reads of A and B and the writes of W hit different modules, so both A[0] and B[0] are ready at the same time and W[0] is written back as soon as it is computed. The memory sustains one result per cycle.]

Page 75: Topic 2

9/18/2006 ELEG652-06F 75

2-D Array Layout

[Figure: an 8 x 8 array a stored across eight memory modules M0-M7, with element a(i,j) placed in module j, so each module holds one column of the array.]

Page 76: Topic 2

9/18/2006 ELEG652-06F 76

Row and Column Access

[Figure: two straightforward layouts of the 8 x 8 array across modules M0-M7. Placing a(i,j) in module j makes each module hold one column, which is good for row vectors. Placing a(i,j) in module i makes each module hold one row, which is good for column vectors.]

Good for Row Vectors

Good for Column Vectors

Page 77: Topic 2

9/18/2006 ELEG652-06F 77

What to do?

• Use algorithms that
  – Access rows or columns exclusively
  – Possible for some (e.g. LU decomposition) but not for others

• Skew the matrix

Page 78: Topic 2

9/18/2006 ELEG652-06F 78

Row and Column Layout

Row stride: 1
Column stride: ?
Problem: diagonals!

[Figure: the 8 x 8 array stored skewed across modules M0-M7: row i is rotated by i positions, so a(i,j) sits in module (i + j) mod 8, and both a full row and a full column touch each module exactly once.]
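A small sketch of that skewing rule (my restatement of the figure; 8 modules assumed), showing that a row and a column of the skewed array each touch every module exactly once:

#include <stdio.h>

#define M 8   /* number of memory modules */

static unsigned module_of(unsigned i, unsigned j) {
    return (i + j) % M;            /* skewed placement of a(i,j) */
}

int main(void) {
    printf("row 3 :");             /* fixed i, varying j */
    for (unsigned j = 0; j < M; ++j) printf(" M%u", module_of(3, j));
    printf("\ncol 5 :");           /* fixed j, varying i */
    for (unsigned i = 0; i < M; ++i) printf(" M%u", module_of(i, 5));
    printf("\n");
    return 0;
}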

Page 79: Topic 2

9/18/2006 ELEG652-06F 79

Vector Access Specs

• Initial address
  – Association between addresses and memory banks (for independent memory banks)

• Number of elements
  – Needed for vector length control and for possible multiple accesses to memory

• Precision of the elements
  – Vector or floating point (if shared)

• Stride
  – Memory access and memory bank detection

Page 80: Topic 2

9/18/2006 ELEG652-06F 80

Tricks for Strided Access

• If array A is stored consecutively across M memory modules

• Assume that V is a subset of A and it has an access stride of S

• Then M consecutive accesses to V will fall into

    M / GCD(S, M)

  memory modules
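A small sketch of that rule (my own illustration; the module count and strides are example values): the number of distinct modules hit by M consecutive accesses is M / gcd(S, M), so a stride that shares a large factor with M concentrates traffic on few modules.

#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    const unsigned M = 8;                      /* memory modules */
    const unsigned strides[] = {1, 9, 32};     /* example strides */
    for (int k = 0; k < 3; ++k) {
        unsigned S = strides[k];
        printf("stride %2u -> %u distinct modules\n", S, M / gcd(S, M));
    }
    return 0;
}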

Page 81: Topic 2

9/18/2006 ELEG652-06F 81

Our Example

Number of memory modules (M): 8

Row stride is 1, so GCD(M, S) = 1

Column stride is 9, so GCD(M, S) = 1

In general, if M = 2^k, select S = M + 1 (odd); this guarantees GCD(M, S) = 1.

Page 82: Topic 2

9/18/2006 ELEG652-06F 82

Another Approach

• Let M be prime
  – Then GCD(M, S) = 1 for all S < M

• Example: the Burroughs Scientific Processor (BSP) [Kuck & Budnik 71]
  – 16 PEs
  – 17 memory modules
  – 2 alignment networks

Page 83: Topic 2

9/18/2006 ELEG652-06F 83

The Burroughs Scientific Processor

[Figure: the BSP's 16 processors (P) and 17 memory modules (M), connected through an input alignment network and an output alignment network (17-to-16 and 16-to-17 crossings).]

Page 84: Topic 2

9/18/2006 ELEG652-06F 84

Special Note and Other Approaches

• If memory time is one cycle (i.e. d = 1), then consecutive accesses cause no problem (not the case at all in practice)

• Another solution to reduce contention
  – RANDOMIZE!
  – Generate a random bank in which to save a word
  – Used by the Tera MTA

Page 85: Topic 2

9/18/2006 ELEG652-06F 85

Questions?? Comments??

Page 86: Topic 2

9/18/2006 ELEG652-06F 86

Bibliography

• Stork, Christian. "Exploring the Tera MTA by Example." May 10, 2000.

• Cray-1 Computer System Hardware Reference Manual.

• Koopman, Philip. "17. Vector Performance." Lecture at Carnegie Mellon University, November 9, 1998.

Page 87: Topic 2

9/18/2006 ELEG652-06F 87

Side Note 2: Gordon Bell's Rules for Supercomputers

• Pay attention to performance
• Orchestrate your resources
• Scalars are bottlenecks
  – Plan your scalar section very carefully
• Use existing (affordable) technology to provide peak vector performance
  – Vector performance: bandwidth and vector register capacity
  – Rule of thumb: at least two results per cycle
• Do not ignore expensive ops
  – Divisions are uncommon, but they are used!
• Design to market
  – Create something to brag about

Page 88: Topic 2

9/18/2006 ELEG652-06F 88

Side Note 2: Gordon Bell's Rules for Supercomputers (continued)

• 10 ~ 7
  – Support around two extra bits of addressing per 3 years (10 years ~ 7 bits)
• Increase productivity
• Stand on the shoulders of giants
• Pipeline your design
  – Design for one generation, then the next, and then the next...
• Leave enough slack in your schedule for the unexpected
  – Beware of Murphy's Laws!