UC Regents Spring 2014 © UCB CS 152 L22: GPU + SIMD + Vectors
2014-4-15
John Lazzaro (not a prof - “John” is always OK)
CS 152 Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Lecture 22 -- GPU + SIMD + Vectors I
Today: Architecture for data parallelism
The Landscape: Three chips that deliver TeraOps/s in 2014, and how they differ.
GK110: nVidia’s flagship Kepler GPU, customized for compute applications.
Short Break
E5-2600v2: Stretching the Xeon server approach for compute-intensive apps.
Sony/IBM Playstation PS3 Cell Chip - Released 2006
Sony PS3 Cell Processor SPE Floating-Point
Single-Instruction Multiple-Data (SIMD): four 32-bit lanes.
4 single-precision multiply-adds issue in lockstep (SIMD) per cycle. 6-cycle latency (in blue).
6 gamer SPEs, 3.2 GHz clock --> 150 GigaOps/s
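The 150 GigaOps/s figure can be reproduced with a short calculation. This is a sketch, assuming each multiply-add counts as 2 operations (the slide does not say so explicitly, but the total implies it). The function name `peak_gigaops` is made up for illustration.

```python
def peak_gigaops(units, lanes, ops_per_lane, clock_ghz):
    """Peak throughput in GigaOps/s: units x lanes x ops per lane per cycle x GHz."""
    return units * lanes * ops_per_lane * clock_ghz

# 6 SPEs, 4 single-precision lanes per SPE, 3.2 GHz clock (from the slide).
# Assumption: one multiply-add is counted as 2 operations.
cell_peak = peak_gigaops(units=6, lanes=4, ops_per_lane=2, clock_ghz=3.2)
print(f"Cell SPE peak: {cell_peak:.1f} GigaOps/s")
```

The result lands slightly above 150, which the slide rounds down.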
Sony PS3 Cell Processor SPE Floating-Point
In the 1970s, a big part of a computer architecture class would be learning how to build units like this.
Top-down (f.p. format) && Bottom-up (logic design)
Sony PS3 Cell Processor SPE Floating-Point
The PS3 ceded ground to Xbox not because it was underpowered, but because it was hard to program.
Today, the formats are standards (IEEE f.p.) and the bottom-up is now “EE.”
Architects focus on how to organize floating point units into programmable machines for application domains.
2014: TeraOps/Sec Chips
Intel E5-2600v2: 12-core Xeon Ivy Bridge
0.52 TeraOps/s
12 cores @ 2.7 GHz. Each core can issue 16 single-precision operations per cycle.
$2,600 per chip
Haswell: 1.04 TeraOps/s
nVidia GPU: Kepler GK 110
5.12 TeraOps/s
2880 MACs @ 889 MHz (single-precision multiply-adds)
$999: GTX Titan Black with 6GB GDDR5 (and 1 GPU)
XC7VX980T: Xilinx Virtex 7 with the most DSP blocks.
5.14 TeraOps/s
3600 MACs @ 714 MHz. Comparable to single-precision floating-point.
$16,824 per chip
Typical application: Medical imaging scanners, for first stage of processing after the A/D converters.
(die photo of a related part)
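The three headline numbers follow from the unit counts and clocks on the slides. The sketch below recomputes them, assuming a multiply-add (MAC) counts as 2 operations. The dollars-per-TeraOp column is a derived illustration rather than a slide figure, and it mixes chip-only prices with the Titan Black card price.

```python
# Peak ops/cycle and clock (GHz) taken from the slides; prices as quoted there.
chips = {
    "Xeon E5-2600v2 (Ivy Bridge)": {"ops_per_cycle": 12 * 16, "ghz": 2.7, "price": 2600},
    "Kepler GK110 (Titan Black)": {"ops_per_cycle": 2880 * 2, "ghz": 0.889, "price": 999},
    "Virtex-7 XC7VX980T": {"ops_per_cycle": 3600 * 2, "ghz": 0.714, "price": 16824},
}

def teraops(chip):
    """Peak TeraOps/s = ops per cycle x clock in GHz / 1000."""
    return chip["ops_per_cycle"] * chip["ghz"] / 1000.0

def dollars_per_teraop(chip):
    """Derived metric (not on the slides): list price divided by peak TeraOps/s."""
    return chip["price"] / teraops(chip)

for name, chip in chips.items():
    print(f"{name}: {teraops(chip):.2f} TeraOps/s, ${dollars_per_teraop(chip):,.0f} per TeraOp/s")
```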
Intel E5-2600v2
12 cores @ 2.7 GHz. Each core can issue 16 single-precision ops/cycle.
How?
Haswell cores issue 32/cycle.
Die closeup of one Sandy Bridge core
Advanced Vector Extension (AVX) unit
Smaller than L3 cache, but larger than L2 cache. Relative area has increased in Haswell.
Programmer's Model: AVX
IA-32 Nehalem
8 128-bit registers
Each register holds 4 IEEE single-precision floats
The programmer's model has many variants, which we will introduce in the slides that follow.
Example AVX Opcode
VMULPS XMM4, XMM2, XMM3
XMM4 = XMM2 * XMM3 (op = *)
Multiply two 4-element vectors of single-precision floats, element by element.
New issue every cycle. 5-cycle latency (Haswell).
Aside from its use of a special register set, VMULPS executes like a normal IA-32 instruction.
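The element-by-element semantics can be emulated in a few lines of Python. This is only a model of what the instruction computes, not of how the hardware executes it. The helper names `to_f32` and `vmulps` are made up for illustration, and single-precision rounding is approximated by a round trip through `struct`.

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def vmulps(src1, src2):
    """Emulate the result of VMULPS: lane i of the result is src1[i] * src2[i]."""
    if len(src1) != len(src2):
        raise ValueError("both source vectors must have the same number of lanes")
    return [to_f32(to_f32(a) * to_f32(b)) for a, b in zip(src1, src2)]

xmm2 = [1.0, 2.0, 3.0, 4.0]
xmm3 = [0.5, 0.5, 2.0, 2.0]
xmm4 = vmulps(xmm2, xmm3)
print(xmm4)  # [0.5, 1.0, 6.0, 8.0]
```

All four products are computed by one instruction. No lane depends on any other lane.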
Sandy Bridge, Haswell
Sandy Bridge extends register set to 256 bits: vectors are twice the size.
x86-64 (Intel 64) AVX/AVX2 has 16 registers (IA-32: 8)
Haswell adds 3-operand instructions: a*b + c, fused multiply-add (FMA)
2 EX units with FMA --> 2X increase in ops/cycle
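The 16 and 32 ops/cycle figures can be derived from register width and unit count. This sketch assumes that a Sandy Bridge or Ivy Bridge core pairs one 256-bit multiply unit with one 256-bit add unit (1 op per lane each), and that each of Haswell's two FMA units delivers 2 ops per lane. The function name is illustrative.

```python
def ops_per_cycle(register_bits, element_bits, units, ops_per_lane):
    """Peak floating-point ops per cycle for one core."""
    lanes = register_bits // element_bits
    return lanes * units * ops_per_lane

# Assumption: Ivy Bridge issues one vector multiply and one vector add per cycle.
ivy_bridge = ops_per_cycle(256, 32, units=2, ops_per_lane=1)
# Haswell: two FMA units, each lane does a multiply and an add.
haswell = ops_per_cycle(256, 32, units=2, ops_per_lane=2)

print(ivy_bridge, haswell)  # 16 32
for name, per_core in (("Ivy Bridge", ivy_bridge), ("Haswell", haswell)):
    # 12 cores at 2.7 GHz, as on the slides.
    print(f"{name}: {12 * 2.7 * per_core / 1000:.2f} TeraOps/s")
```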
OoO Issue: Haswell (2013)
Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for loads, stores and book-keeping.
Haswell has two copies of the FMA engine, on separate ports.
AVX: Not just single-precision floating-point
AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc ...
256-bit version -> double-precision vectors of length 4
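The same register bits can be read under several element types. A rough illustration using Python's `struct` module: the 16-byte buffer stands in for a 128-bit register, and the little-endian layout is an assumption chosen to match x86.

```python
import struct

# 16 bytes = one 128-bit register, filled here with four single-precision floats.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

as_floats = struct.unpack("<4f", raw)    # 4 x 32-bit floats
as_doubles = struct.unpack("<2d", raw)   # 2 x 64-bit doubles (same bits, different meaning)
as_int8 = struct.unpack("<16b", raw)     # 16 x 8-bit signed integers

def lanes(register_bits, element_bits):
    """Number of elements that fit in one register."""
    return register_bits // element_bits

print(len(raw) * 8, "bits:", len(as_floats), "floats,", len(as_doubles), "doubles,", len(as_int8), "int8s")
print("256-bit register holds", lanes(256, 64), "doubles")
```

The instruction variant, not the register, decides which interpretation applies.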
Exception Model
MXCSR: AVX condition codes register
Floating-point exceptions: Always a contentious issue in ISA design ...
Exception Handling
Use MXCSR to configure AVX to halt program for divide by zero, etc ...
Or, configure AVX for “show must go on” semantics: on error, results are set to +Inf, -Inf, NaN, ...
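The two policies can be contrasted with a small emulation of a vector divide. This is a sketch only. The `trap_divide_by_zero` flag is a hypothetical stand-in for the corresponding MXCSR configuration, and the function models results rather than the actual trap mechanism.

```python
import math

def vdivps(num, den, trap_divide_by_zero):
    """Emulate a vector divide under the two exception policies described above."""
    out = []
    for a, b in zip(num, den):
        if b == 0.0:
            if trap_divide_by_zero:
                # "Halt the program" policy.
                raise FloatingPointError("divide by zero")
            # "Show must go on" policy: produce an IEEE special value instead.
            if a == 0.0 or math.isnan(a):
                out.append(math.nan)
            else:
                sign = math.copysign(1.0, a) * math.copysign(1.0, b)
                out.append(math.copysign(math.inf, sign))
        else:
            out.append(a / b)
    return out

result = vdivps([1.0, -1.0, 0.0, 8.0], [0.0, 0.0, 0.0, 2.0], trap_divide_by_zero=False)
print(result)  # [inf, -inf, nan, 4.0]
```

With `trap_divide_by_zero=True`, the same call raises on the first zero divisor and the remaining lanes are never produced.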
Data moves
AVX register file reads pass through permute and shuffle networks in both “X” and “Y” dimensions.
Many AVX instructions rely on this feature ...
Pure data move opcode. Or, part of a math opcode.
Permutes over 2 sets of 4 fields of one vector.
Arbitrary data alignment
Shuffling two vectors.
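Two of these data moves can be modeled on plain Python lists. The functions below are loose illustrations of the idea. They do not reproduce the encoding or exact semantics of any particular AVX opcode, and their names are invented.

```python
def permute_within_lanes(vec, pattern, lane_size=4):
    """Apply the same permutation pattern to each group of lane_size fields."""
    out = []
    for base in range(0, len(vec), lane_size):
        lane = vec[base:base + lane_size]
        out.extend(lane[i] for i in pattern)
    return out

def shuffle_two(a, b, select):
    """Build a result from two vectors; select is a list of (source, index) pairs,
    where source 0 means a and source 1 means b."""
    sources = (a, b)
    return [sources[s][i] for s, i in select]

# Reverse the fields inside each of the 2 sets of 4.
print(permute_within_lanes([0, 1, 2, 3, 4, 5, 6, 7], [3, 2, 1, 0]))
# -> [3, 2, 1, 0, 7, 6, 5, 4]

# Interleave the low halves of two vectors.
print(shuffle_two([10, 11, 12, 13], [20, 21, 22, 23], [(0, 0), (1, 0), (0, 1), (1, 1)]))
# -> [10, 20, 11, 21]
```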
Memory System
Gather: Reading non-unit-stride memory locations into arbitrary positions in an AVX register, while minimizing redundant reads.
Values in memory. Specified indices. Final result.
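A gather can be sketched as an indexed read that touches each distinct memory location only once. The code is an illustration of the behavior described above, with memory modeled as a Python list. The read counter is a made-up way to show the "minimizing redundant reads" part.

```python
def gather(memory, base, indices):
    """Return (values, distinct_reads): values[i] = memory[base + indices[i]]."""
    fetched = {}  # address -> value, so repeated indices reuse the first read
    values = []
    for index in indices:
        address = base + index
        if address not in fetched:
            fetched[address] = memory[address]  # the only place memory is read
        values.append(fetched[address])
    return values, len(fetched)

memory = list(range(100, 132))          # stand-in for values in memory
indices = [3, 3, 17, 0, 9, 9, 9, 30]    # specified indices, non-unit stride, with repeats
values, reads = gather(memory, 0, indices)
print(values)  # [103, 103, 117, 100, 109, 109, 109, 130]
print(reads)   # 5
```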
Positive observations ...
Best for applications that are a good fit for Xeon’s memory system: Large on-chip caches, up to a TeraByte of DRAM, but only moderate bandwidth requirements to DRAM.
Applications that do “a lot of everything” -- integer, random-access loads/stores, string ops -- gain access to a significant fraction of a TeraOp/s of floating point, with no context switching.
If you’re planning on experimenting with GPUs, you need a Xeon server anyway ... aside from $$$, why not buy a high-core-count variant?
Negative observations ...
AVX changes each generation, in a backward compatible way, to add the latest features.
AVX is difficult for compilers. Ideally, someone has written a library of hand-crafted AVX assembly code that does exactly what you want.
Two FMA units per core (50% of issue width) is probably the limit. So, scaling vector size or scaling core count are the only upgrade paths.
0.52 TeraOp/s (Ivy Bridge) << 5.12 TeraOp/s (GK110)
And $2700 (chip only) >> $999 (Titan Black card).
59.6 GB/s << 336 GB/s (memory bandwidth)
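Putting numbers on "<<" and ">>": the ratios below use only the figures quoted in the comparison. The variable names are illustrative.

```python
# Figures as quoted in the comparison above.
xeon = {"teraops": 0.52, "price": 2700, "dram_gb_per_s": 59.6}
gk110 = {"teraops": 5.12, "price": 999, "dram_gb_per_s": 336}

compute_ratio = gk110["teraops"] / xeon["teraops"]                # GPU advantage in peak ops
price_ratio = xeon["price"] / gk110["price"]                      # Xeon chip vs whole GPU card
bandwidth_ratio = gk110["dram_gb_per_s"] / xeon["dram_gb_per_s"]  # GPU advantage in DRAM bandwidth

print(f"peak compute: {compute_ratio:.1f}X")
print(f"price: {price_ratio:.1f}X")
print(f"DRAM bandwidth: {bandwidth_ratio:.1f}X")
```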
Break
nVidia GPU
The granularity of SMX cores (15 per die) matches the Xeon core count (12 per die).
Kepler GK 110 SMX core (28 nm)
Sandy Bridge core (32 nm)
889 MHz GK 110 SMX core vs 2.7 GHz Haswell core
1024-bit SIMD vectors: 4X more than Haswell. 32 single-precision floats or 16 double-precision floats.
Execution units per SMX: 6 single-precision, 2 double-precision, special ops, memory ops.
Execution units vs. Haswell: 3X (single-precision), 1X (double-precision)
Clock speed vs Ivy Bridge Xeon: 3X slower
Net: 4X single-precision, 1.33X double-precision
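The 4X and 1.33X figures are the product of the three ratios listed above. A sketch using the slide's rounded factors, assuming the Haswell baseline is two 256-bit FMA units that serve both precisions. The last line cross-checks the 2880 MAC count quoted for the whole GK110 die.

```python
from fractions import Fraction

width_ratio = Fraction(1024, 256)   # SMX vector width vs Haswell: 4X
sp_unit_ratio = Fraction(6, 2)      # single-precision units: 6 vs 2 -> 3X
dp_unit_ratio = Fraction(2, 2)      # double-precision units: 2 vs 2 -> 1X
clock_ratio = Fraction(1, 3)        # SMX clock is about 3X slower (889 MHz vs 2.7 GHz)

sp_per_core = width_ratio * sp_unit_ratio * clock_ratio
dp_per_core = width_ratio * dp_unit_ratio * clock_ratio

print(f"single-precision: {float(sp_per_core):.2f}X per core")  # 4.00X
print(f"double-precision: {float(dp_per_core):.2f}X per core")  # 1.33X

# Cross-check: 15 SMX cores x 6 single-precision units x 32 lanes per unit.
total_macs = 15 * 6 * 32
print(total_macs)  # 2880
```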
Organization: Multi-threaded like Niagara
Thread scheduler
2048 registers in total. Several programmer models available. Largest model has 256 registers per thread, supporting 8 active threads.
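The trade-off between registers per thread and active threads is a simple division of the register file. In this sketch only the 256-register model comes from the slide. The smaller register counts are hypothetical values added to show the trend.

```python
TOTAL_REGISTERS = 2048  # from the slide

def active_threads(total_registers, registers_per_thread):
    """Number of threads whose register sets fit in the register file at once."""
    return total_registers // registers_per_thread

# 256 is the largest model named on the slide; 128 and 64 are hypothetical.
for regs in (256, 128, 64):
    print(f"{regs} registers/thread -> {active_threads(TOTAL_REGISTERS, regs)} active threads")
```

Fewer registers per thread leaves room for more threads, which gives the scheduler more candidates to issue from each cycle.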
Organization: Multi-threaded, In-order
Thread scheduler
The SIMD math units live here
Each cycle, 3 threads can issue 2 in-order instructions.
Bandwidth to DRAM is 5.6X Xeon Ivy Bridge.
But, DRAM limited to 6GB, and all caches are small compared to Xeon.
nVidia GPU: Kepler GK 110
5.12 TeraOps/s
2880 MACs @ 889 MHz (single-precision multiply-adds)
$999: GTX Titan Black with 6GB GDDR5 (and 1 GPU)
On Thursday
To be continued ...
Have fun in section!