GP-GPUs

Christos Kozyrakis
http://cs316.stanford.edu
CS316 – Fall 2014 – Lecture 12
2
Announcements

- Recommended reading
  - J. Hennessy & D. Patterson, Computer Architecture, chapter 4
  - D. Patterson & J. Hennessy, Computer Organization, appendix A
    - Written by J. Nickolls & D. Kirk from Nvidia
    - Watch out for institutional bias
  - Both available in the engineering library
- Credits: Tor Aamodt, UBC
  - Some slides from his tutorial on GPU architectures
- Reminders
  - HW2 and project
3
Reminder: Advantages of Vector ISAs

- Compact: a single instruction defines N operations
  - Amortizes the cost of instruction fetch/decode/issue
  - Also reduces the frequency of branches
- Parallel: the N operations are (data) parallel
  - No dependencies
  - No need for complex hardware to detect parallelism (similar to VLIW)
  - Can execute in parallel assuming N parallel datapaths
- Expressive: memory operations describe patterns
  - Continuous or regular memory access patterns
  - Can prefetch or accelerate using wide/multi-banked memory
  - Can amortize the high latency of the 1st element over a long sequential pattern
4
Intel Xeon Phi (aka Knights Corner)

- Vector (512b, 4 lanes) + multi-threaded (4x) + multi-core (>60)
  - But in-order, 2-way issue, and 1.1GHz
  - Why?

[Figure: Knights Corner core block diagram. Four in-order threads (T0-T3 instruction pointers), L1 TLB with a 32KB code cache and a 32KB data cache, L2 TLB and TLB miss handler, decode/uCode at 16B/cycle (2 IPC), two pipes (Pipe 0 and Pipe 1) covering the x87 and scalar register files with ALU 0/ALU 1 and the VPU register file with the 512b SIMD VPU, an HWP, L2 control, and a 512KB L2 cache connected to the on-die interconnect.]
5
Vector Unit Design

- Vector ISA
  - 32 vector registers (512b), 8 mask registers, scatter/gather
- Microarchitecture features
  - Fast read from L1, numeric type conversion on register read, ...

[Figure: Vector unit pipeline. Core stages PPF PF D0 D1 D2 E WB, with additional vector stages (VC1, VC2, V1-V4) before writeback; VPU register file with 3 read / 1 write ports; mask register file; scatter/gather support on the LD/ST path; EMU; vector ALUs, 16-wide x 32-bit and 8-wide x 64-bit, with fused multiply-add.]
6
Vector Functional Units

- 16-wide SP SIMD, 8-wide DP SIMD

[Figure: Vector lane layout. Sixteen SP lanes (SP 0 to SP 15) are paired into eight DP lanes (DP 0 to DP 7), with the multiplier circuit shared between SP and DP, fed from register files RF0-RF3.]
7
Gather/Scatter

- Gather/scatter takes advantage of cache locality
- Gather instruction loop (from the slide):

  gather-prime
  loop:  gather-step
         jump-mask-not-zero loop

  (a C-style sketch of this loop follows the figure below)

[Figure: Gather/scatter engine. A base address (scalar register) is added to each index from a vector register (Index0-Index7 -> Addr0-Addr7). A find-first circuit picks a pending address, one access is sent to the TLB/D-cache per step, and the matching mask-register bits are cleared. The machine takes advantage of cache-line locality.]
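A minimal C-style sketch (valid as CUDA host code) of the gather loop above. gather_step() is a hypothetical helper rather than a real Xeon Phi intrinsic, and the 64-byte line size is an assumption: each step services every pending element that falls in one cache line and clears its mask bits, and the loop repeats until the mask is zero.

```cuda
#include <stdint.h>

#define VLEN       8            /* 8 elements in this illustration */
#define LINE_BYTES 64           /* assumed cache-line size */

/* One "gather-step": pick the first pending element, access its cache line,
   and complete every pending element that hits the same line. */
static uint32_t gather_step(float *dst, const float *base,
                            const int32_t *index, uint32_t mask) {
    int first = -1;
    for (int i = 0; i < VLEN; i++)
        if (mask & (1u << i)) { first = i; break; }
    if (first < 0) return 0;                          /* nothing left to do */

    uintptr_t line = (uintptr_t)(base + index[first]) / LINE_BYTES;
    for (int i = first; i < VLEN; i++) {
        if (!(mask & (1u << i))) continue;
        if ((uintptr_t)(base + index[i]) / LINE_BYTES != line) continue;
        dst[i] = base[index[i]];                      /* element hits same line */
        mask &= ~(1u << i);                           /* clear its mask bit */
    }
    return mask;                                      /* remaining work */
}

/* "loop: gather-step; jump-mask-not-zero loop" */
void gather(float *dst, const float *base, const int32_t *index, uint32_t mask) {
    while (mask)
        mask = gather_step(dst, base, index, mask);
}
```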
8
Graphics Processors (GPUs)
9
GPUs Timeline

- Until the mid 90s
  - VGA controllers used to accelerate some display functions
- Mid 90s to mid 00s
  - Fixed-function graphics accelerators for the OpenGL and DirectX APIs
  - Some GP-GPU capabilities built on top of these interfaces
  - 3D graphics: triangle setup & rasterization, texture mapping & shading
- Modern GPUs
  - Programmable multiprocessors optimized for data-parallel ops
  - OpenGL/DirectX and general-purpose languages (CUDA, OpenCL, ...)
  - Some fixed-function hardware (texture, raster ops, ...)
  - Often integrated on the same chip with a multi-core CPU (why?)
  - Otherwise attached as a PCIe-based accelerator
10
Our Focus Today

- GPUs as programmable multi-core chips
  - Hardware architecture and software model
- A good way to think of GPUs
  - Multi-core chips, where every core is a threaded SIMD/vector core
  - Not 100% accurate, but good enough as a model for SW developers
  - For the graphics view of the world, refer to the graphics courses
- Nvidia-biased lecture
  - They tend to be more open about their architecture
  - Some notes on ATI/AMD towards the end
11
GPU Thread Model: Software View

- Single instruction, multiple threads (SIMT)
- Each thread has local memory
- Parallel threads are packed into blocks
  - Access to per-block shared memory
  - Can synchronize with a barrier
- Grids include independent groups (thread blocks)
  - May execute concurrently
12
Code Example: SAXPY (C code vs. CUDA code)

- The CUDA code launches 256 threads per block (see the sketch below)
  - Thread = 1 iteration of the scalar loop (1 element operation in the vector code)
  - Block = body of the vectorized loop (with VL = 256 in this example)
  - Grid = the vectorizable loop (multiple iterations of the vectorized loop body)
- Moves parallelization from the compiler to the programmer
  - Hopefully the program is written once but scales to many chips
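The C and CUDA code panes of the slide are not reproduced in this extraction; the sketch below is a standard SAXPY implementation consistent with the description above (one thread per loop iteration, launched in blocks of 256 threads).

```cuda
#include <cuda_runtime.h>

// Scalar C version: the loop a programmer would write for a CPU.
void saxpy_serial(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// CUDA version: the loop body becomes the kernel; the loop index becomes the
// global thread id.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard: n need not be a multiple of 256
        y[i] = a * x[i] + y[i];
}

// Launch with 256 threads per block; the grid covers all n elements.
void saxpy(int n, float a, const float *d_x, float *d_y) {
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);
}
```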
13
GPU Microarchitecture (10,000 feet)

- Single-instruction, multiple-threads (SIMT)

[Figure: Top-level GPU organization. Several SIMT core clusters, each containing multiple SIMT cores, connect through an interconnection network to memory partitions, each with its own GDDR3/GDDR5 off-chip DRAM channel.]
14
Example GPU Architecture: NVIDIA Tesla

[Figure: Tesla chip diagram. Each streaming multiprocessor contains 8 streaming processors.]
15
Example GPU Architecture: Nvidia Kepler GK110

- 15 SMX processors
- Shared L2 cache
- 6 memory controllers
- 1 TFLOPS double precision
- HW-based thread scheduling

Whitepaper excerpt (An Overview of the GK110 Kepler Architecture): Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:
- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
- Hardware support throughout the design to enable new programming model capabilities

[Figure: Kepler GK110 full-chip block diagram]
16
Streaming Multiprocessor (SMX)

- The core
  - Multithreaded
  - Data parallel
- Capabilities
  - 64K registers
  - 192 simple cores (int and SP FPU)
  - 64 DP FPUs
  - 32 LSUs, 32 SFUs
- Scheduling
  - 4 warp schedulers
  - 2-way dispatch per warp

Whitepaper excerpt (Streaming Multiprocessor (SMX) Architecture): Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

[Figure: SMX block diagram with 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).]
17
SIMT Execution Model

- The programmer sees MIMD threads (scalar)
- The GPU HW bundles threads into warps and runs them in lockstep on vector-like hardware (SIMD)
A: v = foo[tid.x];
B: if (v < 10)
C: v = 0;
else
D: v = 10;
E: w = bar[tid.x]+v;
Example: foo[] = {4,8,12,16}, so threads T1 and T2 take the "then" path (C) while T3 and T4 take the "else" path (D):

  Time  A: T1 T2 T3 T4
        B: T1 T2 T3 T4
        C: T1 T2 -- --
        D: -- -- T3 T4
        E: T1 T2 T3 T4
18
Instruction & Thread Scheduling: Where Threads Meet Data Parallelism

- In theory, all threads can be independent
  - HW implements zero-overhead switching
- For efficiency, 32 threads are packed into warps
  - Warp: a set of parallel threads that execute the same instruction
  - Warp = a thread of vector instructions
  - Warps introduce data parallelism
- 1 warp instruction keeps the cores busy for multiple cycles
  - Individual threads may be inactive
    - Because they branched differently, or due to predication
    - This is the equivalent of conditional execution
  - Loss of efficiency if the code is not data parallel
- SW thread blocks are mapped to warps
  - When HW resources are available
19
Inside a SIMT Core

- SIMT front end / SIMD datapath
- Fine-grained multithreading
  - Interleave warp execution to hide latency
  - Register values of all threads stay in the core

[Figure: SIMT core. A SIMT front end (fetch, decode, schedule, branch) feeds a SIMD datapath with its register file; the memory subsystem (shared memory, L1 D$, texture $, constant $) attaches to the interconnection network.]
20
Inside an "NVIDIA-style" SIMT Core

- Three decoupled warp schedulers
- Scoreboard
- Large register file
- Multiple SIMD functional units

[Figure: SIMT core pipeline. The SIMT front end (fetch, I-cache, decode, I-buffer, scoreboard, issue, and the SIMT stack with branch target PC, valid bits, predicate/active mask, and done (warp ID) signals) feeds the SIMD datapath (operand collector, ALUs, MEM) through three schedulers.]
21
Fetch + Decode

- Arbitrate the I-cache among warps
  - A cache miss is handled by fetching again later
- A fetched instruction is decoded and then stored in the I-buffer
  - 1 or more entries per warp
  - Only warps with vacant entries are considered in fetch

[Figure: Fetch and decode path. Per-warp PCs (PC 1, PC 2, PC 3) arbitrate for the I-cache; decoded instructions fill per-warp I-buffer entries (Inst. W1, W2, W3) with valid (v) and ready (r) bits, which are checked against the scoreboard and selected by the issue arbiter.]
22
Instruction Issue

- Select a warp and issue an instruction from its I-buffer for execution
  - Scheduling: Greedy-Then-Oldest (GTO); see the sketch below
  - GT200 / later Fermi / Kepler: allow dual issue (superscalar)
  - Fermi: odd/even scheduler
- To avoid stalling the pipeline, an instruction might be kept in the I-buffer until it is known that it can complete (replay)
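A small host-side sketch (compiles as CUDA host code) of the Greedy-Then-Oldest policy named above; the data structure and ready/age bookkeeping are illustrative assumptions, not the actual scheduler logic.

```cuda
#include <cstdint>

struct WarpState {
    bool     ready;   // has a ready instruction in its I-buffer
    uint64_t age;     // smaller = older (e.g., time the warp was scheduled)
};

int pick_warp_gto(const WarpState *w, int nwarps, int last_issued) {
    if (last_issued >= 0 && w[last_issued].ready)
        return last_issued;                         // greedy: stay on the same warp
    int oldest = -1;
    for (int i = 0; i < nwarps; i++)                // then: oldest ready warp
        if (w[i].ready && (oldest < 0 || w[i].age < w[oldest].age))
            oldest = i;
    return oldest;                                  // -1: nothing to issue this cycle
}
```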
23
In-Order Scoreboard

- Check for RAW and WAW hazards
  - Instructions reserve registers at issue
  - Release them at writeback
  - Implementation?
- Flag instructions with hazards as not ready in the I-buffer, so they are not considered by the scheduler
- Track up to 6 registers per warp (out of 128)
  - 6-entry bitvector per I-buffer entry: 1 bit per register dependency
  - Look up the source operands and set the bitvector in the I-buffer; as results are written back per warp, clear the corresponding bits (see the sketch below)
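A minimal host-side sketch (illustrative, not the actual hardware) of the per-warp scoreboard just described: up to 6 reserved destination registers per warp, and a 6-bit dependency mask per I-buffer entry.

```cuda
#include <cstdint>

struct WarpScoreboard {
    static const int SLOTS = 6;          // registers tracked per warp
    int  reg[SLOTS];                     // register ids with writes in flight
    bool valid[SLOTS];

    // At issue: reserve the instruction's destination register.
    // Returns the slot index, or -1 if all 6 slots are busy (stall issue).
    int reserve(int r) {
        for (int s = 0; s < SLOTS; s++)
            if (!valid[s]) { valid[s] = true; reg[s] = r; return s; }
        return -1;
    }

    // At decode: build the dependency bitvector for an instruction's source
    // and destination registers (RAW and WAW hazards). A non-zero mask means
    // the I-buffer entry is flagged "not ready".
    uint8_t dep_mask(const int *regs, int nregs) const {
        uint8_t mask = 0;
        for (int s = 0; s < SLOTS; s++)
            if (valid[s])
                for (int i = 0; i < nregs; i++)
                    if (regs[i] == reg[s]) mask |= (uint8_t)(1u << s);
        return mask;
    }

    // At writeback: release the slot; the corresponding bit is cleared in
    // every I-buffer entry of this warp.
    void release(int slot) { valid[slot] = false; }
};
```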
24
SIMT & Branches

[Figure: SIMT stack example. Control-flow graph of basic blocks with active masks: A/1111 -> B/1111 -> {C/1001, D/0110} -> E/1111 -> G/1111; all four threads of the warp share a common PC until the branch in B diverges them. A per-warp stack of (Reconv. PC, Next PC, Active Mask) entries tracks the divergence: after B, the stack holds entries for E (1111), D (0110), and C (1001) with C at the top of stack (TOS); the warp executes C, then D, reconverges at E with mask 1111, and finally executes G.]
25
Tracking Branch Divergence

- Similar to vector processors, but the masks are handled internally
  - No explicit mask register
  - Per-warp stack that stores the PCs and masks for the "not taken" paths
- On a conditional branch
  - Push the current mask onto the stack
  - Push the mask and PC for the "not taken" path
  - Set the mask for the "taken" path and execute
- At the end of the "taken" path
  - Pop the mask and PC for the "not taken" path and execute
- At the end of the "not taken" path
  - Pop the original mask from before the branch instruction
- If a mask is all zeros, the block is skipped (see the sketch below)
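A host-side sketch of the per-warp stack described above. The entry layout and method names are assumptions for illustration, not vendor code; the taken path is pushed last so it executes first, and reaching a reconvergence PC pops the stack.

```cuda
#include <cstdint>
#include <vector>

struct StackEntry { uint32_t reconv_pc, next_pc, active_mask; };

struct SimtStack {
    std::vector<StackEntry> stack;

    SimtStack(uint32_t start_pc, uint32_t full_mask) {
        stack.push_back({/*reconv_pc=*/0xFFFFFFFFu, start_pc, full_mask});
    }

    // On a divergent conditional branch: the current entry resumes at the
    // join point, and the "not taken" and "taken" paths are pushed
    // (taken last, so it executes first). All-zero masks are skipped.
    void diverge(uint32_t reconv_pc,
                 uint32_t taken_pc, uint32_t taken_mask,
                 uint32_t not_taken_pc, uint32_t not_taken_mask) {
        stack.back().next_pc = reconv_pc;
        if (not_taken_mask) stack.push_back({reconv_pc, not_taken_pc, not_taken_mask});
        if (taken_mask)     stack.push_back({reconv_pc, taken_pc, taken_mask});
    }

    // When the executing path reaches its reconvergence PC, pop it and
    // continue with the next entry (the other path, then the pre-branch mask).
    void reconverge() { stack.pop_back(); }

    uint32_t pc()   const { return stack.back().next_pc; }
    uint32_t mask() const { return stack.back().active_mask; }
};
```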
26
Register File

- 32 warps x 32 threads per warp x 16 32-bit registers per thread = 64KB register file
  - Needing "4 ports" (e.g., an FMA's 3 reads + 1 write) would greatly increase area
- Alternative: a banked, single-ported register file
  - Conflicts are avoided using an arbitrator + operand collector (a small issue window); see the sketch below
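A quick check of the arithmetic above, plus a toy illustration of why a banked, single-ported file needs an arbitrator and operand collector; the 4-bank count and the register-to-bank mapping are assumptions for illustration only.

```cuda
#include <cstdio>

int main() {
    const int warps = 32, threads = 32, regs = 16, bytes = 4;
    printf("RF size = %d KB\n", warps * threads * regs * bytes / 1024);   // 64 KB

    // With, say, 4 single-ported banks, an FMA's 3 source reads + 1 write can
    // complete in one cycle only if the operands fall in different banks.
    // Otherwise the arbitrator serializes the accesses and the operand
    // collector buffers values until the instruction has gathered all of them.
    const int banks = 4;
    const int fma_regs[4] = {1, 2, 3, 5};            // hypothetical r1*r2+r3 -> r5
    for (int i = 0; i < 4; i++)
        printf("r%d -> bank %d\n", fma_regs[i], fma_regs[i] % banks);
    return 0;
}
```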
27
Lost in Translation: Vector vs. GPU

[Table comparing vector-architecture and GPU terminology; not reproduced in this extraction.]
28
Lost in Translation: GPU -> Vector

- From Computer Architecture, 5th edition, by J. Hennessy and D. Patterson
29
Memory Hierarchy

- Each SMX has 64KB of on-chip memory
  - Split between shared memory and L1 cache
  - 16/48, 32/32, or 48/16 KB splits
  - 256B per access
- 48KB read-only data cache
  - Unified address
- 1.5MB shared L2
  - Supports synchronization operations
  - atomicCAS, atomicAdd, ...
- R/W memories use ECC
- RO memories use parity

Whitepaper excerpt (Kepler Memory Subsystem – L1, L2, ECC): Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache: In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

48 KB Read-Only Data Cache: In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.
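From the programmer's side, these structures are reachable through standard CUDA mechanisms on Kepler-class (compute capability 3.5) parts; a small hedged sketch using cudaFuncSetCacheConfig for the shared-memory/L1 split and __ldg() for the read-only data cache (the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(const float * __restrict__ in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * __ldg(&in[i]);   // load routed through the read-only cache
}

void configure_and_launch(const float *d_in, float *d_out, int n) {
    // Request the 48KB-shared / 16KB-L1 split for this kernel (other options:
    // cudaFuncCachePreferL1, cudaFuncCachePreferEqual).
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);
    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
}
```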
30
Thread Synchronization

- Barrier synchronization within a thread block
  - Tracking is simplified by grouping threads into warps
  - A counter tracks the number of threads that have arrived at the barrier
- Atomic operations to L2/global memory
  - Atomic read-modify-write (add, min, max, and, or, xor)
  - Atomic exchange or compare-and-swap
  - They are tied to the L2 latency
- See the CUDA sketch below for both mechanisms
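A short CUDA sketch of both mechanisms: __syncthreads() barriers within a thread block and an atomicAdd() to global memory (serviced at the L2). It assumes a launch with 256 threads per block.

```cuda
#include <cuda_runtime.h>

__global__ void block_sum(const float *in, float *total, int n) {
    __shared__ float partial[256];                  // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // barrier: whole block arrived

    // Tree reduction within the block; a barrier separates every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);               // atomic RMW to global memory
}
```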
31
Hardware Multi-core Scheduling

- A HW unit schedules grids onto the SMXs
  - Priority-based scheduling
  - 32 active grids
  - More can be queued/paused
- Grids launched by CPU or GPU
  - Work from multiple CPU cores

[Figure caption, whitepaper: The redesigned Kepler HOST-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.]

Whitepaper excerpt (NVIDIA GPUDirect™): When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:
- Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
- Significantly improved MPI_Send/MPI_Recv efficiency between the GPU and other nodes in a network
- Eliminates CPU bandwidth and latency bottlenecks
- Works with a variety of 3rd-party network, capture, and storage devices
32
Discussion

- How do we get data in and out of a GPU?
  - Challenges?
  - Solutions?
- How would you connect two GPUs?
- How would you connect 10 GPUs?
- Do GPUs need caches?
33
AMD/ATI GPUs

- Source: 2012 Hot Chips talk on the Radeon HD 7970
  - Available at hotchips.org

[Embedded Hot Chips slide: AMD Radeon HD 7970 Architecture. Graphics Core Next (GCN): 4.3 billion 28nm transistors.]
34
AMD/ATI GPUs: Graphics Core Next

- Memory system
  - 16KB I$ per 4 CUs
  - 32KB R/W D$ per CU
  - 32KB scalar D$ per 4 CUs
  - 768KB R/W shared L2
  - 64KB shared memory for synchronization
- 6 GDDR5 interfaces
  - 264GB/sec
- ECC protection

[Embedded Hot Chips slide: AMD Radeon HD 7970 Architecture, Graphics Core Next (GCN). 384-bit GDDR5 at 264GB/sec; unified R/W cache hierarchy; 768KB R/W L2 cache; 16KB R/W L1 per CU; 16KB instruction cache (I$) per 4 CUs; 32KB scalar data cache (K$) per 4 CUs.]
35
AMD/ATI GPUs: GCN Compute

- Multithreaded (multiple kernels)
- Vector + wide issue
  - 4-way issue
  - 16-element vectors

[Embedded Hot Chips slide: GCN Compute Unit. The basic GPU building block of the unified shader system, with a new instruction set architecture: non-VLIW; vector unit + scalar co-processor; distributed programmable scheduler; unstructured flow control, function calls, recursion, exception support; un-typed, typed, and image memory operations. Each compute unit can execute instructions from multiple kernels simultaneously; designed for programming simplicity, high utilization, and high throughput with multi-tasking. Blocks: branch & message unit, scalar unit, vector units (4x SIMD-16), vector registers (4x 64KB), texture filter units (4), local data share (64KB), L1 cache (16KB), scheduler, texture fetch, load/store units (16), scalar registers (4KB).]
36
AMD/ATI GPUs: GCN Compute (CU architecture)

- Multithreaded (multiple kernels)
- Vector + wide issue
  - 4-way issue, 16-element vectors

[Embedded Hot Chips slide: GCN Compute Unit (CU) Architecture. Input data: PC/state/vector register/scalar register. Instruction fetch arbitration into a 32KB instruction L1 shared by 4 CUs (backed by the R/W L2), then instruction arbitration to per-SIMD PC and instruction buffers (SIMD0-SIMD3). Scalar decode feeds a scalar unit (integer ALU, 8KB registers) with a 16KB scalar read-only L1 shared by 4 CUs (request arbiter, message bus). Vector decode feeds four MP vector ALUs, each with 64KB of registers; vector memory decode reaches the 16KB R/W data L1 and the R/W L2; LDS decode reaches the 64KB LDS memory; export/GDS decode drives the export bus; plus a branch & message unit.]

http://developer.amd.com/afds/assets/presentations/2620_final.pdf
37
AMD/ATI GPUs: Local Data Share Access

- 32-bank software-managed structure
- High bandwidth for sequential and indexed patterns
- Support for synchronization (barriers)

[Embedded Hot Chips slide: Local Data Share memory architecture. 64 KB, 32-bank shared memory. Direct mode: interpolation at rate, or 1 broadcast read of 32/16/8 bits. Index mode: 64 dwords per 2 clocks, servicing 2 waves per 4 clocks. Advantages: low latency and a bandwidth amplifier for lower power; software-managed cache; software consistency/coherency within a thread group via the hardware barrier.]
38
AMD/ATI GPUs: Cache Hierarchy

- L2 is coherent
- Relaxed consistency model

[Embedded Hot Chips slide: R/W cache hierarchy. L1: 16KB read/write, write-through caches, 64 bytes per CU per clock. L2: read/write cache partitions (64KB or 128KB), write-back, 64 bytes per partition per clock. Each CU has 256KB of registers and a 64KB local data share. A 16KB instruction cache (I$) and a 32KB scalar data cache (K$) are shared per 4 CUs, with L2 backing. A global data share (GDS) facilitates synchronization between CUs. 64b dual-channel memory controllers connect the L2 partitions to DRAM.]
39
Summary

- GPUs
  - Massively parallel processors
  - Data parallelism, threading, multi-core
  - Getting more general-purpose every day
- The driving force for gaming and HPC