CS252 Graduate Computer Architecture Lecture 12 Vector Processing (Con’t) Branch Prediction

CS252 Graduate Computer Architecture

Lecture 12

Vector Processing (Con’t), Branch Prediction

John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley

http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252

3/5/2007 cs252-S07, Lecture 12 2

Review: Vector Programming Model

[Figure: vector programming model. Scalar registers r0 to r15; vector registers v0 to v15, each holding elements [0] to [VLRMAX-1]; Vector Length Register (VLR). Vector arithmetic instructions (ADDV v3, v1, v2) add elements [0] to [VLR-1] of v1 and v2 into v3. Vector load and store instructions (LV v1, r1, r2) move data between memory and a vector register using base r1 and stride r2.]


Review: Vector Unit Structure

[Figure: vector unit structure. Labels: lane, functional unit, vector registers (one lane holds elements 0, 4, 8, …), memory subsystem.]





Review: Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, “Stripmining”

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63     # N mod 64
      MTC1   VLR, R1       # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     # Multiply by 8
      DADDU  RA, RA, R2    # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       # Reset full length
      BGTZ   N, loop       # Any more to do?

[Figure: arrays A, B, and C added piecewise: a remainder piece followed by 64-element pieces.]
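The stripmining pattern above can be sketched in plain C. This is only an illustration under stated assumptions: the function name `vadd_stripmined` and the constant `MVL` are hypothetical, and the inner scalar loop stands in for one LV/LV/ADDV.D/SV sequence executed with the current vector length.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, matching the slide's 64 */

/* Computes c[i] = a[i] + b[i] for i in [0, n) in strips of at most MVL
 * elements. The first strip covers the n mod MVL remainder and all later
 * strips are full length. Returns the number of non-empty strips. */
size_t vadd_stripmined(const double *a, const double *b, double *c, size_t n)
{
    size_t vl = n % MVL;   /* remainder first, like ANDI R1, N, 63 */
    size_t i = 0;
    size_t strips = 0;

    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    while (i < n) {
        if (vl > 0) {
            /* Stand-in for one vector load/load/add/store group. */
            for (size_t e = 0; e < vl; e++)
                c[i + e] = a[i + e] + b[i + e];
            i += vl;
            strips++;
        }
        vl = MVL;          /* reset to full length for later strips */
    }
    return strips;
}
```

For N = 100 this runs one 36-element strip followed by one 64-element strip.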


Vector Reductions

Problem: Loop-carried dependence on reduction variables

sum = 0;
for (i=0; i<N; i++)
    sum += A[i];   # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction

# Rearrange as:
sum[0:VL-1] = 0                     # Vector of VL partial sums
for(i=0; i<N; i+=VL)                # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1];     # Vector sum
# Now have VL partial sums in one vector register
do {
    VL = VL/2;                      # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1]   # Halve no. of partials
} while (VL>1)
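A scalar C sketch of the same rearrangement may help make the two phases concrete. The name `vector_sum` and the constant `VLEN` are assumptions made for illustration; the inner loops stand in for single vector instructions. Because floating-point addition is not associative, the result can differ in the last bits from the sequential loop.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8  /* assumed vector length; a power of two so halving works */

/* Sums n doubles with VLEN partial sums, then a binary-tree reduction. */
double vector_sum(const double *a, size_t n)
{
    double partial[VLEN] = {0.0};
    size_t i, e, vl;

    assert(n == 0 || a != NULL);

    /* Phase 1: stripmined accumulation, lane e sums a[e], a[e+VLEN], ... */
    for (i = 0; i < n; i += VLEN)
        for (e = 0; e < VLEN && i + e < n; e++)
            partial[e] += a[i + e];

    /* Phase 2: each step halves the number of partial sums. */
    for (vl = VLEN / 2; vl >= 1; vl /= 2)
        for (e = 0; e < vl; e++)
            partial[e] += partial[e + vl];

    return partial[0];
}
```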


Novel Matrix Multiply Solution

• Consider the following:

/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    sum = 0;
    for (t=0; t<k; t++)
      sum += a[i][t] * b[t][j];
    c[i][j] = sum;
  }
}

• Do you need to do a bunch of reductions? NO!
  – Calculate multiple independent sums within one vector register
  – You can vectorize the j loop to perform 32 dot-products at the same time (Assume Max Vector Length is 32)
• Show it in C source code, but can imagine the assembly vector instructions from it


Optimized Vector Example

/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++) {
  for (j=0; j<n; j+=32) {              /* Step j 32 at a time. */
    sum[0:31] = 0;                     /* Init vector reg to zeros. */
    for (t=0; t<k; t++) {
      a_scalar = a[i][t];              /* Get scalar */
      b_vector[0:31] = b[t][j:j+31];   /* Get vector */
      /* Do a vector-scalar multiply. */
      prod[0:31] = b_vector[0:31]*a_scalar;
      /* Vector-vector add into results. */
      sum[0:31] += prod[0:31];
    }
    /* Unit-stride store of vector of results. */
    c[i][j:j+31] = sum[0:31];
  }
}
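The vector-slice notation in this example is not legal C, so here is a compilable sketch of the same idea. It assumes flat row-major arrays and a hypothetical function name `matmul_vec_j`; the inner loops over `e` stand in for single vector instructions, and the last strip is shortened when n is not a multiple of the vector length.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 32  /* assumed maximum vector length, as on the slide */

/* c[m][n] = a[m][k] * b[k][n], with all matrices stored row-major. */
void matmul_vec_j(const double *a, const double *b, double *c,
                  size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j += MVL) {
            size_t vl = (n - j < MVL) ? n - j : MVL;  /* short last strip */
            double sum[MVL] = {0.0};                  /* vector of partial dot-products */

            for (size_t t = 0; t < k; t++) {
                double a_scalar = a[i * k + t];
                /* Vector-scalar multiply, then vector-vector add. */
                for (size_t e = 0; e < vl; e++)
                    sum[e] += b[t * n + j + e] * a_scalar;
            }
            /* Unit-stride store of the results. */
            for (size_t e = 0; e < vl; e++)
                c[i * n + j + e] = sum[e];
        }
    }
}
```

Each element of `sum` is an independent dot-product, so no cross-element reduction is needed.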


How to Pick Vector Length?

• Longer good because:
  1) Hide vector startup
  2) Lower instruction bandwidth
  3) Tiled access to memory reduces scalar processor memory bandwidth needs
  4) If max length of app. is known to be < max vector length, no strip mining overhead
  5) Better spatial locality for memory access

• Longer not much help because:
  1) Diminishing returns on overhead savings as keep doubling number of elements
  2) Need natural app. vector length to match physical register length, or no help (lots of short vectors in modern codes!)


How to Pick Number of Vector Registers?

• More Vector Registers:
  1) Reduces vector register “spills” (save/restore)
    » 20% reduction to 16 registers for su2cor and tomcatv
    » 40% reduction to 32 registers for tomcatv
    » others 10%-15%
  2) Aggressive scheduling of vector instructions: better compiling to take advantage of ILP

• Fewer:
  1) Fewer bits in instruction format (usually 3 fields)
  2) Easier implementation


Context switch overhead: Huge amounts of state!

• Extra dirty bit per processor
  – If vector registers not written, don’t need to save on context switch
• Extra valid bit per vector register, cleared on process start
  – Don’t need to restore on context switch until needed


Exception handling: External Interrupts?

• If external exception, can just put pseudo-op into pipeline and wait for all vector ops to complete
  – Alternatively, can wait for scalar unit to complete and begin working on exception code assuming that vector unit will not cause exception and interrupt code does not use vector unit


Exception handling: Arithmetic Exceptions

• Arithmetic traps harder
• Precise interrupts => large performance loss!
• Alternative model: arithmetic exceptions set vector flag registers, 1 flag bit per element
• Software inserts trap barrier instructions to check the flag bits as needed
• IEEE Floating Point requires 5 flag bits


Exception handling: Page Faults

• Page Faults must be precise
• Instruction Page Faults not a problem
  – Could just wait for active instructions to drain
  – Also, scalar core runs page-fault code anyway
• Data Page Faults harder
• Option 1: Save/restore internal vector unit state
  – Freeze pipeline, dump vector state
  – Perform needed ops
  – Restore state and continue vector pipeline
• Option 2: Expand memory pipeline to check addresses before sending to memory + memory buffer between address check and registers
  – Multiple queues to transfer from memory buffer to registers; check last address in queues before loading 1st element from buffer.
  – Per Address Instruction Queue (PAIQ) which sends to TLB and memory while in parallel going to Address Check Instruction Queue (ACIQ)
  – When it passes checks, instruction goes to Committed Instruction Queue (CIQ) to be there when data returns.
  – On page fault, only save instructions in PAIQ and ACIQ


Multimedia Extensions

• Very short vectors added to existing ISAs for micros
• Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors


“Vector” for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler


MMX Instructions

• Move 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare = , > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <–> 16b, 16b <–> 8b
  – Pack saturates (set to max) if number is too large
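To make "saturate (set to max) if overflow" concrete, here is a portable C sketch of an unsigned saturating add on eight 8-bit lanes packed into a 64-bit word. It only models the semantics (similar in spirit to the MMX PADDUSB instruction) and does not use MMX itself; the function name `add_u8x8_saturate` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Adds eight unsigned 8-bit lanes of x and y, clamping each lane to 255
 * instead of letting it wrap around. */
uint64_t add_u8x8_saturate(uint64_t x, uint64_t y)
{
    uint64_t result = 0;

    for (int lane = 0; lane < 8; lane++) {
        unsigned a = (unsigned)((x >> (8 * lane)) & 0xFF);
        unsigned b = (unsigned)((y >> (8 * lane)) & 0xFF);
        unsigned s = a + b;

        if (s > 0xFF)
            s = 0xFF;            /* saturate: set to max on overflow */
        assert(s <= 0xFF);       /* lane result fits in 8 bits */
        result |= (uint64_t)s << (8 * lane);
    }
    return result;
}
```

With wrapping arithmetic a bright pixel value such as 0xF0 plus 0x20 would become 0x10; with saturation it stays at 0xFF, which is why multimedia code favors it.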


Administrivia

• Exam: Wednesday 3/14
  Location: TBA
  TIME: 5:30 - 8:30
• This info is on the Lecture page (has been)
• Meet at LaVal’s afterwards for Pizza and Beverages
• CS252 Project proposal due by Monday 3/5
  – Need two people/project (although can justify three for right project)
  – Complete Research project in 8 weeks
    » Typically investigate hypothesis by building an artifact and measuring it against a “base case”
    » Generate conference-length paper/give oral presentation
    » Often, can lead to an actual publication.


Operation & Instruction Count: RISC v. Vector Processor
(from F. Quintana, U. Barcelona.)

Spec92fp     Operations (Millions)       Instructions (M)
Program      RISC   Vector   R / V       RISC   Vector   R / V
swim256       115       95    1.1x        115      0.8    142x
hydro2d        58       40    1.4x         58      0.8     71x
nasa7          69       41    1.7x         69      2.2     31x
su2cor         51       35    1.4x         51      1.8     29x
tomcatv        15       10    1.4x         15      1.3     11x
wave5          27       25    1.1x         27      7.2      4x
mdljdp2        32       52    0.6x         32     15.8      2x

Vector reduces ops by 1.2X, instructions by 20X


Common Vector Metrics

• R∞: MFLOPS rate on an infinite-length vector
  – vector “speed of light”
  – Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  – (Rn is the MFLOPS rate for a vector of length n)
• N1/2: The vector length needed to reach one-half of R∞
  – a good measure of the impact of start-up
• Nv: The vector length needed to make vector mode faster than scalar mode
  – measures both start-up and speed of scalars relative to vectors, quality of connection of scalar unit to vector unit
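As a rough illustration of how these metrics relate, assume a simple linear timing model that is not taken from the slide: a vector operation of length n costs a fixed start-up time T_start plus a per-element time t_e, with one floating-point operation per element. Under that assumption the half-performance length is just the start-up time measured in element times:

```latex
\begin{align*}
T_n &= T_{\text{start}} + n\,t_e \\
R_n &= \frac{n}{T_n}, \qquad
R_\infty = \lim_{n\to\infty} R_n = \frac{1}{t_e} \\
R_{N_{1/2}} = \tfrac{1}{2} R_\infty
  &\;\Longrightarrow\; 2\,N_{1/2}\,t_e = T_{\text{start}} + N_{1/2}\,t_e
  \;\Longrightarrow\; N_{1/2} = \frac{T_{\text{start}}}{t_e}
\end{align*}
```

For example, a start-up of 12 clocks at one element per clock would give N1/2 = 12 under this model.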


Vector Execution Time

• Time = f(vector length, data dependencies, struct. hazards)
• Initiation rate: rate that FU consumes vector elements (= number of lanes; usually 1 or 2 on Cray T-90)
• Convoy: set of vector instructions that can begin execution in same clock (no struct. or data hazards)
• Chime: approx. time for a vector operation
• m convoys take m chimes; if each vector length is n, then they take approx. m x n clock cycles (ignores overhead; good approximation for long vectors)

1: LV   V1,Rx      ;load vector X
2: MULV V2,F0,V1   ;vector-scalar mult.
   LV   V3,Ry      ;load vector Y
3: ADDV V4,V2,V3   ;add
4: SV   Ry,V4      ;store the result

4 convoys, 1 lane, VL=64
=> 4 x 64 = 256 clocks (or 4 clocks per result)
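The chime approximation can be written as a one-line estimate. The sketch below is an illustration only: the function name `chime_estimate` is hypothetical, and dividing the vector length by the number of lanes is an assumption that extends the slide's one-lane example to several lanes.

```c
#include <assert.h>

/* Approximate clock count for `convoys` convoys on vectors of length `vl`
 * with `lanes` lanes, ignoring start-up overhead. */
unsigned long chime_estimate(unsigned long convoys, unsigned long vl,
                             unsigned long lanes)
{
    unsigned long chime;

    assert(lanes > 0);
    chime = (vl + lanes - 1) / lanes;  /* clocks per convoy, rounded up */
    return convoys * chime;
}
```

With 4 convoys, VL = 64, and 1 lane this gives the 256 clocks quoted above.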


Memory operations

• Load/store operations move groups of data between registers and memory
• Three types of addressing
  – Unit stride
    » Contiguous block of information in memory
    » Fastest: always possible to optimize this
  – Non-unit (constant) stride
    » Harder to optimize memory system for all possible strides
    » Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    » Vector equivalent of register indirect
    » Good for sparse arrays of data
    » Increases number of programs that vectorize


Interleaved Memory Layout

• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for vector operation is latency of single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for: 2, 4

[Figure: a vector processor connected to eight unpipelined DRAM banks, selected by Addr mod 8 = 0 through 7.]
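Why strides of 2 and 4 are bad with 8 banks can be checked with a small calculation. Assuming the bank is chosen as address mod number of banks, a long strided access touches only banks / gcd(banks, stride) distinct banks. The function names below are hypothetical and the sketch counts banks only; it does not model timing.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm; gcd_u(a, 0) returns a. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks touched by a long access with the given
 * word stride, when bank = address mod banks. */
unsigned banks_touched(unsigned banks, unsigned stride)
{
    assert(banks > 0);
    return banks / gcd_u(banks, stride % banks);
}
```

With 8 banks, strides 1 and 3 reach all 8 banks, stride 2 reaches 4, stride 4 reaches 2, and stride 8 hits a single bank. With a prime bank count such as 7, every stride that is not a multiple of 7 reaches all banks.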


How to get full bandwidth for Unit Stride?

• Memory system must sustain (# lanes x word) / clock
• No. memory banks > memory latency to avoid stalls
  – m banks => m words per memory latency of l clocks
  – if m < l, then gap in memory pipeline:
      clock:  0  …  l  l+1  l+2  …  l+m-1  l+m  …  2l
      word:  --  …  0   1    2   …   m-1    --  …   m
  – may have 1024 banks in SRAM
• If desired throughput greater than one word per cycle
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks/weird numbers of banks good to support more strides at full bandwidth
  – can read paper on how to do prime number of banks efficiently


Avoiding Bank Conflicts

• Lots of banks

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is multiple of 128, conflict on word accesses
• SW: loop interchange or declaring array not power of 2 (“array padding”)
• HW: Prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of banks
  – modulo & divide per memory access with prime no. banks?
  – address within bank = address mod number words in bank
  – bank number? easy if 2^N words per bank


Fast Bank Number

• Chinese Remainder Theorem
  As long as two sets of integers a_i and b_i follow these rules

  $b_i = x \bmod a_i, \quad 0 \le b_i < a_i, \quad 0 \le x < a_0 \times a_1 \times a_2 \times \cdots$

  and that a_i and a_j are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
  – bank number = b_0, number of banks = a_0 (= 3 in example)
  – address within bank = b_1, number of words in bank = a_1 (= 8 in example)
  – N word address 0 to N-1, prime no. banks, words power of 2

                         Seq. Interleaved     Modulo Interleaved
Bank Number:               0     1     2        0     1     2
Address within Bank: 0     0     1     2        0    16     8
                     1     3     4     5        9     1    17
                     2     6     7     8       18    10     2
                     3     9    10    11        3    19    11
                     4    12    13    14       12     4    20
                     5    15    16    17       21    13     5
                     6    18    19    20        6    22    14
                     7    21    22    23       15     7    23
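The modulo-interleaved column of the table can be reproduced with a few lines of C. This is a sketch under the example's parameters (3 banks, 8 words per bank); the type and function names are hypothetical. The point is that both fields come from a modulo, and the one by a power of two is a plain bit mask, so no divide is needed.

```c
#include <assert.h>

#define NUM_BANKS      3  /* prime, as in the example */
#define WORDS_PER_BANK 8  /* power of two, so "mod 8" is a bit mask */

struct bank_addr {
    unsigned bank;    /* b_0 = x mod a_0 */
    unsigned offset;  /* b_1 = x mod a_1 */
};

/* Modulo-interleaved mapping of a word address to (bank, offset). */
struct bank_addr map_address(unsigned addr)
{
    struct bank_addr loc;

    assert(addr < NUM_BANKS * WORDS_PER_BANK);
    loc.bank = addr % NUM_BANKS;
    loc.offset = addr & (WORDS_PER_BANK - 1);
    return loc;
}

/* Returns 1 if every address 0..23 lands in a distinct (bank, offset)
 * slot, which is what the Chinese Remainder Theorem guarantees here. */
int mapping_is_unambiguous(void)
{
    int used[NUM_BANKS][WORDS_PER_BANK] = {{0}};

    for (unsigned addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
        struct bank_addr loc = map_address(addr);
        if (used[loc.bank][loc.offset])
            return 0;
        used[loc.bank][loc.offset] = 1;
    }
    return 1;
}
```

For instance, address 16 maps to bank 1, offset 0, matching the first row of the modulo-interleaved table.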


Vectors Are Inexpensive

Scalar
• N ops per cycle => O(N²) circuitry
• HP PA-8000
  • 4-way issue
  • reorder buffer: 850K transistors
    • incl. 6,720 5-bit register number comparators

Vector
• N ops per cycle => O(N + εN²) circuitry
• T0 vector micro
  • 24 ops per cycle
  • 730K transistors total
    • only 23 5-bit register number comparators
  • No floating point


Vectors Lower Power

Single-issue Scalar
• One instruction fetch, decode, dispatch per operation
• Arbitrary register accesses, adds area and power
• Loop unrolling and software pipelining for high performance increases instruction cache footprint
• All data passes through cache; waste power if no temporal locality
• One TLB lookup per load or store
• Off-chip access in whole cache lines

Vector
• One inst fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance, less power in instruction cache misses
• Bypass cache
• One TLB lookup per group of loads or stores
• Move only necessary data across chip boundary


Superscalar Energy Efficiency Even Worse

Superscalar
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy

Vector
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired:
  – Whether to use vector mask or compress/expand for conditionals


Vector Applications

Limited to scientific computing?
• Multimedia Processing (compress., graphics, audio synth, image proc.)
• Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
• Lossy Compression (JPEG, MPEG video and audio)
• Lossless Compression (Zero removal, RLE, Differencing, LZW)
• Cryptography (RSA, DES/IDEA, SHA/MD5)
• Speech and handwriting recognition
• Operating systems/Networking (memcpy, memset, parity, checksum)
• Databases (hash/join, data mining, image/video serving)
• Language run-time support (stdlib, garbage collection)
• even SPECint95


Older Vector Machines

Machine      Year   Clock     Regs    Elements   FUs   LSUs
Cray 1       1976    80 MHz   8       64          6    1
Cray XMP     1983   120 MHz   8       64          8    2 L, 1 S
Cray YMP     1988   166 MHz   8       64          8    2 L, 1 S
Cray C-90    1991   240 MHz   8       128         8    4
Cray T-90    1996   455 MHz   8       128         8    4
Conv. C-1    1984    10 MHz   8       128         4    1
Conv. C-4    1994   133 MHz   16      128         3    1
Fuj. VP200   1982   133 MHz   8-256   32-1024     3    2
Fuj. VP300   1996   100 MHz   8-256   32-1024     3    2
NEC SX/2     1984   160 MHz   8+8K    256+var    16    8
NEC SX/3     1995   400 MHz   8+8K    256+var    16    8


Newer Vector Computers

• Cray X1
  – MIPS like ISA + Vector in CMOS
• NEC Earth Simulator
  – Fastest computer in world for 3 years; 40 TFLOPS
  – 640 CMOS vector nodes


Key Architectural Features of X1

New vector instruction set architecture (ISA)
  – Much larger register set (32x64 vector, 64+64 scalar)
  – 64- and 32-bit memory and IEEE arithmetic
  – Based on 25 years of experience compiling with Cray1 ISA

Decoupled Execution
  – Scalar unit runs ahead of vector unit, doing addressing and control
  – Hardware dynamically unrolls loops, and issues multiple loops concurrently
  – Special sync operations keep pipeline full, even across barriers
  Allows the processor to perform well on short nested loops

Scalable, distributed shared memory (DSM) architecture
  – Memory hierarchy: caches, local memory, remote memory
  – Low latency, load/store access to entire machine (tens of TBs)
  – Processors support 1000’s of outstanding refs with flexible addressing
  – Very high bandwidth network
  – Coherence protocol, addressing and synchronization optimized for DM


Cray X1E Mid-life Enhancement

• Technology refresh of the X1 (0.13 µm)
  – ~50% faster processors
  – Scalar performance enhancements
  – Doubling processor density
  – Modest increase in memory system bandwidth
  – Same interconnect and I/O
• Machine upgradeable
  – Can replace Cray X1 nodes with X1E nodes


ESS – configuration of a general purpose supercomputer

1) Processor Nodes (PN): The total number of processor nodes is 640. Each processor node consists of eight vector processors of 8 GFLOPS each and 16 GB of shared memory. Therefore, the total number of processors is 5,120, and the total peak performance and main memory of the system are 40 TFLOPS and 10 TB, respectively. Two nodes are installed in one cabinet, whose size is 40”x56”x80”. 16 nodes form a cluster. Power consumption per cabinet is approximately 20 KVA.

2) Interconnection Network (IN): Each node is coupled together with more than 83,000 copper cables via single-stage crossbar switches of 16GB/s x2 (Load + Store). The total length of the cables is approximately 1,800 miles.

3) Hard Disk. Raid disks are used for the system. The capacities are 450 TB for the systems operations and 250 TB for users.

4) Mass Storage system: 12 Automatic Cartridge Systems (STK PowderHorn9310); total storage capacity is approximately 1.6 PB.

From Horst D. Simon, NERSC/LBNL, May 15, 2002, “ESS Rapid Response Meeting”


Earth Simulator


Earth Simulator Building


ESS – complete system installed 4/1/2002


Vector Summary

• Vector is alternative model for exploiting ILP
• If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – With virtual address translation and caching
• Will multimedia popularity revive vector architectures?
