CS252 Graduate Computer Architecture
Lecture 12
Vector Processing (Cont'd), Branch Prediction
John Kubiatowicz, Electrical Engineering and Computer Sciences
Review: Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, “Stripmining”

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63      # N mod 64
      MTC1   VLR, R1        # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3      # Multiply by 8
      DADDU  RA, RA, R2     # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1       # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1        # Reset full length
      BGTZ   N, loop        # Any more to do?

[Figure: arrays A and B added into C, split into a remainder piece and 64-element pieces]
3/5/2007 cs252-S07, Lecture 12 5
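The strip-mined loop above can be mimicked in plain C. This is a minimal sketch rather than the lecture's code: the names MVL, vadd_chunk, and stripmine_add are illustrative assumptions, and the inner scalar loop stands in for one group of vector instructions executed with the current vector length.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, as in the slide */

/* Stands in for LV / LV / ADDV.D / SV executed with vector length vl. */
void vadd_chunk(double *c, const double *a, const double *b, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* C[i] = A[i] + B[i] for i in [0, n): the remainder chunk goes first,
   then full MVL-sized chunks. Returns the number of non-empty chunks. */
size_t stripmine_add(double *c, const double *a, const double *b, size_t n)
{
    size_t vl = n % MVL;   /* ANDI R1, N, 63: length of the remainder piece */
    size_t pos = 0;        /* elements processed so far */
    size_t chunks = 0;

    while (pos < n) {
        if (vl > 0) {
            vadd_chunk(c + pos, a + pos, b + pos, vl);
            chunks++;
        }
        pos += vl;         /* bump pointers by the chunk just processed */
        vl = MVL;          /* every later chunk uses the full length */
    }
    return chunks;
}
```

With n = 130 the sketch issues one 2-element chunk followed by two 64-element chunks, matching the remainder-first order of the assembly.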
Vector Reductions
Problem: Loop-carried dependence on reduction variables

sum = 0;
for (i=0; i<N; i++)
    sum += A[i];   # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction

# Rearrange as:
sum[0:VL-1] = 0                     # Vector of VL partial sums
for (i=0; i<N; i+=VL)               # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1];     # Vector sum
# Now have VL partial sums in one vector register
do {
    VL = VL/2;                      # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1]   # Halve no. of partials
} while (VL>1)
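The rearranged reduction can be written out as ordinary C to make the two phases explicit. This is a sketch under assumptions: VLEN and vector_sum are illustrative names, VLEN is taken to be a power of two so the halving step works, and leftover elements are folded into the low lanes (the slide does not say how a partial last chunk is handled). Because floating-point addition is not associative, the result may differ in the last bits from the sequential loop, which is why the slide says "if possible".

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8  /* assumed vector length; must be a power of two */

/* Sum a[0..n-1] using VLEN partial sums followed by a binary-tree combine. */
double vector_sum(const double *a, size_t n)
{
    double partial[VLEN] = {0.0};
    size_t i = 0;

    assert(a != NULL || n == 0);

    /* Stripmine: each pass adds one VLEN-sized chunk into the partial sums. */
    for (; i + VLEN <= n; i += VLEN)
        for (size_t k = 0; k < VLEN; k++)   /* one vector add */
            partial[k] += a[i + k];

    /* Leftover elements (n not a multiple of VLEN) go into the low lanes. */
    for (size_t k = 0; i < n; i++, k++)
        partial[k] += a[i];

    /* Tree reduction: halve the active length until one sum remains. */
    for (size_t vl = VLEN / 2; vl >= 1; vl /= 2)
        for (size_t k = 0; k < vl; k++)
            partial[k] += partial[k + vl];

    return partial[0];
}
```

The tree phase takes log2(VLEN) vector adds, here three, regardless of N.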
Novel Matrix Multiply Solution
• Consider the following:

/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    sum = 0;
    for (t=0; t<k; t++)
      sum += a[i][t] * b[t][j];
    c[i][j] = sum;
  }
}

• Do you need to do a bunch of reductions? NO!
  – Calculate multiple independent sums within one vector register
  – You can vectorize the j loop to perform 32 dot-products at the same time (assume max vector length is 32)
• Shown in C source code, but you can imagine the assembly vector instructions from it
Optimized Vector Example
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++) {
  for (j=0; j<n; j+=32) {   /* Step j 32 at a time. */
    sum[0:31] = 0;          /* Init vector reg to zeros. */
    for (t=0; t<k; t++) {
      a_scalar = a[i][t];              /* Get scalar */
      b_vector[0:31] = b[t][j:j+31];   /* Get vector */
      /* Do a vector-scalar multiply. */
      prod[0:31] = b_vector[0:31]*a_scalar;
      /* Vector-vector add into results. */
      sum[0:31] += prod[0:31];
    }
    /* Unit-stride store of vector of results. */
    c[i][j:j+31] = sum[0:31];
  }
}
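The vectorized j loop above uses slice notation that is not real C. A runnable approximation is sketched here, with some assumptions that are not in the slides: matrices are stored row-major in flat arrays, the function name matmul_vec_j is illustrative, and a short final chunk is handled when n is not a multiple of 32. The loops over e each stand in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 32  /* assumed maximum vector length, as on the slide */

/* c[m][n] = a[m][k] * b[k][n], all stored row-major in flat arrays. */
void matmul_vec_j(double *c, const double *a, const double *b,
                  size_t m, size_t k, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j += MVL) {
            size_t vl = (n - j < MVL) ? n - j : MVL;  /* last chunk may be short */
            double sum[MVL] = {0.0};                  /* up to 32 independent dot products */

            for (size_t t = 0; t < k; t++) {
                double a_scalar = a[i * k + t];           /* get scalar */
                const double *b_vector = &b[t * n + j];   /* unit-stride row slice */
                for (size_t e = 0; e < vl; e++)           /* vector-scalar multiply, then add */
                    sum[e] += b_vector[e] * a_scalar;
            }
            for (size_t e = 0; e < vl; e++)               /* unit-stride store of results */
                c[i * n + j + e] = sum[e];
        }
    }
}
```

No reduction step appears anywhere: each lane of sum accumulates its own dot product, so the loop-carried dependence stays inside a lane.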
How to Pick Vector Length?
• Longer is good because:
  1) Hides vector startup
  2) Lowers instruction bandwidth
  3) Tiled access to memory reduces scalar processor memory bandwidth needs
  4) If the max length of the app. is known to be < max vector length, no strip mining overhead
  5) Better spatial locality for memory access
• Longer is not much help because:
  1) Diminishing returns on overhead savings as the number of elements keeps doubling
  2) The natural app. vector length must match the physical register length, or no help (lots of short vectors in modern codes!)
How to Pick Number of Vector Registers?
• More vector registers:
  1) Reduces vector register “spills” (save/restore)
     » 20% reduction to 16 registers for su2cor and tomcatv
     » 40% reduction to 32 registers for tomcatv
     » others 10%-15%
  2) Aggressive scheduling of vector instructions: better compiling to take advantage of ILP
• Fewer:
  1) Fewer bits in instruction format (usually 3 fields)
  2) Easier implementation
Context switch overhead: Huge amounts of state!
• Extra dirty bit per processor
  – If vector registers not written, don’t need to save on context switch
• Extra valid bit per vector register, cleared on process start
  – Don’t need to restore on context switch until needed
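One way to read the dirty-bit and valid-bit scheme is as lazy save and lazy restore. The sketch below is an interpretation under assumptions, not the mechanism of any particular machine: the struct layouts, the register counts, and the function names (switch_in, vreg_read, vreg_fill, switch_out) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NREGS 8    /* assumed number of vector registers */
#define MVL   64   /* assumed elements per register */

struct vector_unit {
    double regs[NREGS][MVL];
    bool dirty;          /* set on any vector register write since switch-in */
    bool valid[NREGS];   /* register currently holds the running process's value */
};

struct process {
    double saved[NREGS][MVL];   /* in-memory image of the vector registers */
};

/* Switch-in copies nothing: it only clears the bits. */
void switch_in(struct vector_unit *vu)
{
    vu->dirty = false;
    memset(vu->valid, 0, sizeof vu->valid);
}

/* The first read of a register after switch-in restores it on demand. */
const double *vreg_read(struct vector_unit *vu, const struct process *p, int r)
{
    assert(r >= 0 && r < NREGS);
    if (!vu->valid[r]) {
        memcpy(vu->regs[r], p->saved[r], sizeof vu->regs[r]);
        vu->valid[r] = true;
    }
    return vu->regs[r];
}

/* A write that fills the whole register needs no restore first. */
void vreg_fill(struct vector_unit *vu, int r, double value)
{
    assert(r >= 0 && r < NREGS);
    for (int i = 0; i < MVL; i++)
        vu->regs[r][i] = value;
    vu->valid[r] = true;
    vu->dirty = true;
}

/* Switch-out saves nothing unless the dirty bit is set.
   Returns the number of registers copied back to memory. */
int switch_out(struct vector_unit *vu, struct process *p)
{
    int copied = 0;
    if (vu->dirty) {
        for (int r = 0; r < NREGS; r++) {
            if (vu->valid[r]) {
                memcpy(p->saved[r], vu->regs[r], sizeof vu->regs[r]);
                copied++;
            }
        }
    }
    return copied;
}
```

A process that never touches the vector unit pays for neither direction, and one that only reads a register pays for the restore of that register alone.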
Exception handling: External Interrupts?
• If external exception, can just put pseudo-op into pipeline and wait for all vector ops to complete
  – Alternatively, can wait for scalar unit to complete and begin working on exception code, assuming that vector unit will not cause exception and interrupt code does not use vector unit
Exception handling: Arithmetic Exceptions
• Arithmetic traps harder
• Precise interrupts => large performance loss!
• Alternative model: arithmetic exceptions set vector flag registers, 1 flag bit per element
• Software inserts trap barrier instructions to check the flag bits as needed
• IEEE Floating Point requires 5 flag bits
Exception handling: Page Faults
• Page Faults must be precise
• Instruction Page Faults not a problem
  – Could just wait for active instructions to drain
  – Also, scalar core runs page-fault code anyway
• Data Page Faults harder
• Option 1: Save/restore internal vector unit state
  – Freeze pipeline, dump vector state
  – Perform needed ops
  – Restore state and continue vector pipeline
• Option 2: Expand memory pipeline to check addresses before sending to memory + memory buffer between address check and registers
  – Multiple queues to transfer from memory buffer to registers; check last address in queues before loading 1st element from buffer
  – Per Address Instruction Queue (PAIQ), which sends to TLB and memory while in parallel going to Address Check Instruction Queue (ACIQ)
  – When it passes checks, instruction goes to Committed Instruction Queue (CIQ) to be there when data returns
  – On page fault, only save instructions in PAIQ and ACIQ
Multimedia Extensions
• Very short vectors added to existing ISAs for micros
• Usually 64-bit registers split into 2x32b or 4x16b or

3: ADDV V4,V2,V3 ;add
4: SV Ry,V4 ;store the result
Memory operations
• Load/store operations move groups of data between registers and memory
• Three types of addressing
  – Unit stride
    » Contiguous block of information in memory
    » Fastest: always possible to optimize this
  – Non-unit (constant) stride
    » Harder to optimize memory system for all possible strides
    » Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    » Vector equivalent of register indirect
    » Good for sparse arrays of data
    » Increases number of programs that vectorize
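The three addressing types differ only in how the address of element i is formed. The following sketch spells that out in scalar C; the function names load_strided, load_gather, and store_scatter are illustrative assumptions, and each loop stands in for a single vector memory instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Constant stride: element i comes from mem[i * stride].
   stride == 1 is the unit-stride case. */
void load_strided(double *v, const double *mem, size_t stride, size_t vl)
{
    assert(stride >= 1);
    for (size_t i = 0; i < vl; i++)
        v[i] = mem[i * stride];
}

/* Indexed load (gather): element i comes from mem[index[i]]. */
void load_gather(double *v, const double *mem, const size_t *index, size_t vl)
{
    assert(index != NULL || vl == 0);
    for (size_t i = 0; i < vl; i++)
        v[i] = mem[index[i]];
}

/* Indexed store (scatter): element i goes to mem[index[i]]. */
void store_scatter(double *mem, const size_t *index, const double *v, size_t vl)
{
    assert(index != NULL || vl == 0);
    for (size_t i = 0; i < vl; i++)
        mem[index[i]] = v[i];
}
```

For a 4x4 row-major matrix, a row is a unit-stride load, a column is a stride-4 load, and the nonzeros of a sparse row are a gather through an index vector.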
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for vector operation is latency of single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for: 2, 4
[Figure: vector processor connected to eight unpipelined DRAM banks, selected by Addr Mod 8 = 0 through 7]
How to get full bandwidth for Unit Stride?
• Memory system must sustain (# lanes x word) / clock
• No. memory banks > memory latency to avoid stalls
  – m banks => m words per memory latency of l clocks
  – if m < l, then gap in memory pipeline:
      clock: 0  … l   l+1  l+2 … l+m-1  l+m … 2l
      word:  -- … 0   1    2   … m-1    --  … m
  – may have 1024 banks in SRAM
• If desired throughput greater than one word per cycle
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks/weird numbers of banks good to support more strides at full bandwidth
  – can read paper on how to do prime number of banks efficiently
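The gap in the memory pipeline can be reproduced with a small timing model. This is a simplified sketch with assumptions of its own: one request issues per clock, a bank is busy for exactly the latency after accepting a request, and the name unit_stride_finish is illustrative. In this idealized model a bank count equal to the latency is already enough to avoid stalls, whereas the slide asks for strictly more.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 1024

/* Clock at which the last of n unit-stride words arrives, given `banks`
   interleaved banks that are each busy for `latency` clocks per access.
   One request can issue per clock, and only to a bank that is free. */
unsigned long unit_stride_finish(size_t n, size_t banks, unsigned long latency)
{
    unsigned long bank_free[MAX_BANKS] = {0};  /* clock at which each bank is idle again */
    unsigned long issue = 0;                   /* issue clock of the current word */

    assert(n > 0 && banks > 0 && banks <= MAX_BANKS && latency > 0);

    for (size_t i = 0; i < n; i++) {
        size_t b = i % banks;                  /* unit stride walks the banks in order */
        if (i > 0)
            issue += 1;                        /* at most one request per clock */
        if (issue < bank_free[b])
            issue = bank_free[b];              /* stall until the bank is free */
        bank_free[b] = issue + latency;
    }
    return issue + latency;                    /* data returns one latency after issue */
}
```

With 4 banks and a latency of 8 clocks, word 4 cannot issue until clock 8 and arrives at clock 16, which is the "word m at clock 2l" entry in the timeline above.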
Avoiding Bank Conflicts
• Lots of banks
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is multiple of 128, conflict on word accesses
• SW: loop interchange or declaring array not power of 2 (“array padding”)
• HW: Prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with prime no. banks?
  – address within bank = address mod number words in bank
  – bank number? easy if 2^N words per bank
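The effect of array padding and of a prime bank count can be checked by counting how many distinct banks a column walk touches. The sketch below assumes one array element per memory word and uses the mapping bank number = address mod number of banks; the name banks_touched is illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 1024

/* Number of distinct banks touched while walking down one column of a
   row-major array: `rows` accesses that are `row_words` words apart. */
size_t banks_touched(size_t rows, size_t row_words, size_t banks)
{
    unsigned char seen[MAX_BANKS] = {0};
    size_t distinct = 0;

    assert(banks > 0 && banks <= MAX_BANKS);

    for (size_t i = 0; i < rows; i++) {
        size_t bank = (i * row_words) % banks;  /* bank number = address mod number of banks */
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

For x[256][512] with 128 banks every access in a column lands in the same bank. Padding the row to 513 words spreads the column over all 128 banks, and keeping 512-word rows but using 127 banks spreads it over all 127.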
• Chinese Remainder Theorem
As long as two sets of integers a_i and b_i follow these rules
$b_i = x \bmod a_i, \quad 0 \le b_i < a_i, \quad 0 \le x < a_0 \times a_1 \times a_2 \times \dots$
and a_i and a_j are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
  – bank number = b_0, number of banks = a_0 (= 3 in example)
  – address within bank = b_1, number of words in bank = a_1 (= 8 in example)
  – N word address 0 to N-1, prime no. banks, words power of 2
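The mapping for the 3-bank, 8-words-per-bank example can be checked exhaustively. This sketch uses assumed names (NUM_BANKS, WORDS_PER_BANK, map_address, mapping_is_one_to_one); it computes both coordinates with a modulo, as the Chinese Remainder Theorem permits, and verifies that no two of the 24 addresses share a slot.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BANKS      3   /* a0 in the example: a prime number of banks */
#define WORDS_PER_BANK 8   /* a1 in the example: a power of two */

struct location {
    unsigned bank;     /* b0 */
    unsigned offset;   /* b1, the address within the bank */
};

struct location map_address(unsigned addr)
{
    struct location loc;
    assert(addr < NUM_BANKS * WORDS_PER_BANK);
    loc.bank = addr % NUM_BANKS;                 /* b0 = x mod a0 */
    loc.offset = addr & (WORDS_PER_BANK - 1);    /* b1 = x mod a1, a bit mask, no divide */
    return loc;
}

/* Returns 1 if every address in 0..N-1 maps to its own (bank, offset) slot. */
int mapping_is_one_to_one(void)
{
    unsigned char used[NUM_BANKS][WORDS_PER_BANK] = {{0}};
    for (unsigned addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
        struct location loc = map_address(addr);
        if (used[loc.bank][loc.offset])
            return 0;
        used[loc.bank][loc.offset] = 1;
    }
    return 1;
}
```

Because 3 and 8 are co-prime, the pair (address mod 3, address mod 8) identifies each of the 24 addresses uniquely, so the only true modulo is by the small prime bank count.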
• NEC Earth Simulator
  – Fastest computer in world for 3 years; 40 TFLOPS
  – 640 CMOS vector nodes
Key Architectural Features of X1
New vector instruction set architecture (ISA)
  – Much larger register set (32x64 vector, 64+64 scalar)
  – 64- and 32-bit memory and IEEE arithmetic
  – Based on 25 years of experience compiling with Cray1 ISA
Decoupled Execution
  – Scalar unit runs ahead of vector unit, doing addressing and control
  – Hardware dynamically unrolls loops, and issues multiple loops concurrently
  – Special sync operations keep pipeline full, even across barriers
  Allows the processor to perform well on short nested loops

  – Low latency, load/store access to entire machine (tens of TBs)
  – Processors support 1000’s of outstanding refs with flexible addressing
  – Very high bandwidth network
  – Coherence protocol, addressing and synchronization optimized for DM
Cray X1E Mid-life Enhancement
• Technology refresh of the X1 (0.13 µm)
  – ~50% faster processors
  – Scalar performance enhancements
  – Doubling processor density
  – Modest increase in memory system bandwidth
  – Same interconnect and I/O
• Machine upgradeable
  – Can replace Cray X1 nodes with X1E nodes
ESS – configuration of a general purpose supercomputer
1) Processor Nodes (PN): The total number of processor nodes is 640. Each processor node consists of eight vector processors of 8 GFLOPS each and 16 GB of shared memory. Therefore, the total number of processors is 5,120, and the total peak performance and main memory of the system are 40 TFLOPS and 10 TB, respectively. Two nodes are installed in one cabinet, whose size is 40”x56”x80”. 16 nodes form a cluster. Power consumption per cabinet is approximately 20 kVA.
2) Interconnection Network (IN): The nodes are coupled together with more than 83,000 copper cables via single-stage crossbar switches of 16 GB/s x2 (Load + Store). The total length of the cables is approximately 1,800 miles.
3) Hard Disk: RAID disks are used for the system. The capacities are 450 TB for system operations and 250 TB for users.
4) Mass Storage System: 12 Automatic Cartridge Systems (STK PowderHorn9310); total storage capacity is approximately 1.6 PB.
From Horst D. Simon, NERSC/LBNL, May 15, 2002, “ESS Rapid Response Meeting”
Earth Simulator
Earth Simulator Building
ESS – complete system installed 4/1/2002
Vector Summary
• Vector is alternative model for exploiting ILP
• If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – With virtual address translation and caching
• Will multimedia popularity revive vector architectures?