Computer Graphics and Imaging, UC Berkeley CS184: GPUs (based on Spring 2019 Lec 23)
Page 1: (based on Spring 2019 Lec 23) - CS184

Computer Graphics and Imaging UC Berkeley CS184

GPUs (based on Spring 2019 Lec 23)

Page 2: (based on Spring 2019 Lec 23) - CS184

Unreal Engine Kite Demo (Epic Games 2015)

Goal: Highly Complex 3D Scenes in Realtime
• Complex vertex and fragment shader computations
• 100’s of thousands to millions of triangles in a scene
• High resolution (2-4 megapixel + supersampling)
• 30-60 frames per second (even higher for VR)

Page 3: (based on Spring 2019 Lec 23) - CS184

Unreal Engine 5 (2020)

Page 4: (based on Spring 2019 Lec 23) - CS184

A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

How much compute is this?

4 multiply-adds & 1 texture fetch

4K = 8 MPixels x 5x overdraw = 40 MPixels/frame x 60 Hz = 2.4 GPixels/sec

~10 GFLOPS, ~10 GB/sec

A real game is 10s to 100s of times more!
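To make the arithmetic explicit, here is the same estimate written out as a tiny host-side program (a minimal sketch: counting each multiply-add as one operation and charging about 4 bytes of texture traffic per fragment are assumptions chosen to reproduce the slide's ballpark figures).

// Back-of-envelope check of the slide's numbers (host-only C++/CUDA code).
// The 4K resolution, 5x overdraw, and 60 Hz come from the slide; the
// per-fragment op and byte counts below are assumptions.
#include <cstdio>

int main() {
    const double pixels      = 3840.0 * 2160.0;          // "4K = 8 MPixels"
    const double overdraw    = 5.0;                       // fragments shaded per pixel
    const double hz          = 60.0;
    const double frags_per_s = pixels * overdraw * hz;    // ~2.4 GPixels/sec

    const double ops_per_frag   = 4.0;  // 4 multiply-adds, counted as one op each
    const double bytes_per_frag = 4.0;  // assumed texture traffic per fragment

    printf("shading rate: %.2f GPixels/s\n", frags_per_s / 1e9);
    printf("compute:      %.1f GFLOPS\n",    frags_per_s * ops_per_frag / 1e9);
    printf("bandwidth:    %.1f GB/s\n",      frags_per_s * bytes_per_frag / 1e9);
    return 0;
}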

Page 5: (based on Spring 2019 Lec 23) - CS184

Today

• Key ideas/motivation behind GPU architecture
• High-performance graphics
• RTX

Page 6: (based on Spring 2019 Lec 23) - CS184

Part 1: throughput processing

Three key concepts behind how modern GPU processing cores run code

Knowing these concepts will help you:
1. Understand space of GPU core (and throughput CPU core) designs
2. Understand how “GPU” cores do (and don’t!) differ from “CPU” cores
3. Optimize shaders/compute kernels
4. Establish intuition: what workloads might benefit from the design of these architectures?

Page 7: (based on Spring 2019 Lec 23) - CS184

What’s in a GPU?

(Diagram: eight Shader Cores and four Tex units, alongside fixed-function blocks: Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor.)

A GPU is a heterogeneous chip multi-processor (highly tuned for graphics)

Page 8: (based on Spring 2019 Lec 23) - CS184

A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Shader programming model: fragments are processed independently, but there is no explicit parallel programming.

Key architectural ideas: how can we exploit parallelism to run faster?
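One way to see that implicit parallelism: the same per-fragment logic written as a CUDA kernel with one thread per fragment (a sketch under assumptions, not the graphics pipeline's actual shader path; the flat input/output arrays, the nearest-texel lookup, and names like FragIn are illustrative).

#include <cuda_runtime.h>

struct FragIn  { float3 norm; float2 uv; };
struct FragOut { float4 color; };

// One thread shades one fragment; how threads are grouped and scheduled is
// the hardware's business, not the program's. Texture sampling is reduced to
// a nearest-texel array read to keep the sketch self-contained.
__global__ void diffuseShadeAll(const FragIn* in, FragOut* out, int n,
                                const float3* tex, int texW, int texH,
                                float3 lightDir) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int tx = min(texW - 1, max(0, (int)(in[i].uv.x * texW)));
    int ty = min(texH - 1, max(0, (int)(in[i].uv.y * texH)));
    float3 kd = tex[ty * texW + tx];

    float ndotl = lightDir.x * in[i].norm.x +
                  lightDir.y * in[i].norm.y +
                  lightDir.z * in[i].norm.z;
    ndotl = fminf(fmaxf(ndotl, 0.0f), 1.0f);

    out[i].color = make_float4(kd.x * ndotl, kd.y * ndotl, kd.z * ndotl, 1.0f);
}

// Launch example: diffuseShadeAll<<<(n + 255) / 256, 256>>>(in, out, n, tex, texW, texH, lightDir);

Each thread reads like scalar code; the three ideas that follow describe how the hardware actually runs thousands of these at once.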

Page 9: (based on Spring 2019 Lec 23) - CS184

Compile shader

<diffuseShader>:
sample r0, v4, t0, s0
mul    r3, v0, cb0[0]
madd   r3, v1, cb0[1], r3
madd   r3, v2, cb0[2], r3
clmp   r3, r3, l(0.0), l(1.0)
mul    o0, r0, r3
mul    o1, r1, r3
mul    o2, r2, r3
mov    o3, l(1.0)

1 unshaded fragment input record → 1 shaded fragment output record

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Page 10: (based on Spring 2019 Lec 23) - CS184

Execute shader

<diffuseShader>:
sample r0, v4, t0, s0
mul    r3, v0, cb0[0]
madd   r3, v1, cb0[1], r3
madd   r3, v2, cb0[2], r3
clmp   r3, r3, l(0.0), l(1.0)
mul    o0, r0, r3
mul    o1, r1, r3
mul    o2, r2, r3
mov    o3, l(1.0)

(Diagram: a simple core with Fetch/Decode, an ALU (Execute), and an Execution Context, stepping through the listing one instruction at a time.)

Pages 11-16: (based on Spring 2019 Lec 23) - CS184

Execute shader

(These six slides repeat the listing and core diagram from Page 10, advancing one instruction per slide until the shader completes.)

Page 17: (based on Spring 2019 Lec 23) - CS184

“CPU-style” cores

(Diagram: one core with Fetch/Decode, ALU (Execute), and Execution Context, surrounded by a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.)

Page 18: (based on Spring 2019 Lec 23) - CS184

Slimming down

(Diagram: the same core reduced to just Fetch/Decode, ALU (Execute), and Execution Context.)

Idea #1: Remove components that help a single instruction stream run fast

Page 19: (based on Spring 2019 Lec 23) - CS184

Two cores (two fragments in parallel)

(Diagram: two copies of the simple core, each running the compiled diffuseShader instruction stream; one shades fragment 1, the other fragment 2.)

Page 20: (based on Spring 2019 Lec 23) - CS184

Four cores (four fragments in parallel)

(Diagram: four copies of the simple core.)

Page 21: (based on Spring 2019 Lec 23) - CS184

Sixteen cores (sixteen fragments in parallel)

16 cores = 16 simultaneous instruction streams

Page 22: (based on Spring 2019 Lec 23) - CS184

Instruction stream sharing

But, many fragments should be able to share an instruction stream!

(The compiled diffuseShader listing from Page 9 is shown again; every fragment runs this same instruction sequence.)

Page 23: (based on Spring 2019 Lec 23) - CS184

Recall: simple processing core

(Diagram: Fetch/Decode, ALU (Execute), Execution Context.)

Page 24: (based on Spring 2019 Lec 23) - CS184

Add ALUs

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs

SIMD processing

(Diagram: one Fetch/Decode unit feeding ALUs 1-8, each with its own context (Ctx), plus Shared Ctx Data.)

Page 25: (based on Spring 2019 Lec 23) - CS184

Modifying the shader

(Diagram: the SIMD core from Page 24.)

Original compiled shader: processes one fragment using scalar ops on scalar registers

<diffuseShader>:
sample r0, v4, t0, s0
mul    r3, v0, cb0[0]
madd   r3, v1, cb0[1], r3
madd   r3, v2, cb0[2], r3
clmp   r3, r3, l(0.0), l(1.0)
mul    o0, r0, r3
mul    o1, r1, r3
mul    o2, r2, r3
mov    o3, l(1.0)

Page 26: (based on Spring 2019 Lec 23) - CS184

Modifying the shader

New compiled shader: processes eight fragments using vector ops on vector registers

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul    vec_r3, vec_v0, cb0[0]
VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul    vec_o0, vec_r0, vec_r3
VEC8_mul    vec_o1, vec_r1, vec_r3
VEC8_mul    vec_o2, vec_r2, vec_r3
VEC8_mov    o3, l(1.0)
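The same transformation in toy host-side code rather than GPU ISA (a sketch: Vec8, the lane loop, and the use of the light-direction components in place of the cb0[] constants are illustrative assumptions; a real compiler emits genuine vector instructions as in the listing above).

// The scalar body (one fragment) versus the 8-wide body (eight fragments per
// "instruction"): a toy illustration of the VEC8 transformation.
struct Vec8 { float v[8]; };   // one "vector register": 8 fragments' worth of a value

float ndotl_scalar(float nx, float ny, float nz,
                   float lx, float ly, float lz) {
    float r = nx * lx;                              // mul  r3, v0, cb0[0]
    r = ny * ly + r;                                // madd r3, v1, cb0[1], r3
    r = nz * lz + r;                                // madd r3, v2, cb0[2], r3
    return r < 0.f ? 0.f : (r > 1.f ? 1.f : r);     // clmp r3, r3, l(0.0), l(1.0)
}

Vec8 ndotl_vec8(Vec8 nx, Vec8 ny, Vec8 nz,
                float lx, float ly, float lz) {
    Vec8 r;
    for (int i = 0; i < 8; ++i) {                   // each VEC8_ op applies the same
        float t = nx.v[i] * lx;                     // arithmetic to all 8 lanes
        t = ny.v[i] * ly + t;
        t = nz.v[i] * lz + t;
        r.v[i] = t < 0.f ? 0.f : (t > 1.f ? 1.f : t);
    }
    return r;
}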

Page 27: (based on Spring 2019 Lec 23) - CS184

Modifying the shader

(Diagram: the same VEC8 listing as Page 26; ALUs 1-8 each process one of fragments 1-8 per vector instruction.)

Page 28: (based on Spring 2019 Lec 23) - CS184

128 fragments in parallel

16 cores = 128 ALUs, 16 simultaneous instruction streams

Page 29: (based on Spring 2019 Lec 23) - CS184

128 [ ] in parallel, where [ ] can be: vertices, fragments, primitives, OpenCL work items, CUDA threads

Page 30: (based on Spring 2019 Lec 23) - CS184

But what about branches?

(Diagram: time on the vertical axis, ALUs 1-8 across the top, each running one of fragments 1-8.)

<unconditional shader code>

if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}

<resume unconditional shader code>

Pages 31-33: (based on Spring 2019 Lec 23) - CS184

But what about branches?

(These slides repeat the code from Page 30, now with a per-lane mask across the eight ALUs: T T T F F F F F. While the "then" side runs, only the T lanes do useful work; while the "else" side runs, only the F lanes do.)

Not all ALUs do useful work! Worst case: 1/8 peak performance
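A toy model of what the hardware does with that branch: evaluate the condition per lane, then run each side over all lanes with the inactive ones masked off (a sketch of predicated SIMD execution in general, not any particular GPU's mechanism; the eight-lane arrays stand in for the eight ALUs).

#include <cmath>

const int LANES = 8;   // eight fragments execute in lockstep

void shadeWithBranch(float* x, float exp_, float Ks, float Ka, float* refl) {
    bool mask[LANES];
    for (int i = 0; i < LANES; ++i) mask[i] = (x[i] > 0.0f);   // evaluate the condition

    // "Then" side: the instructions issue for every lane, but results are kept
    // only where the mask is true.
    for (int i = 0; i < LANES; ++i) {
        if (mask[i]) {
            float y = powf(x[i], exp_);
            y *= Ks;
            refl[i] = y + Ka;
        }
    }
    // "Else" side: the same ALUs run it again for the complementary lanes.
    for (int i = 0; i < LANES; ++i) {
        if (!mask[i]) {
            x[i]    = 0.0f;
            refl[i] = Ka;
        }
    }
    // Total time = time(then) + time(else); with the slide's T T T F F F F F
    // mask, at most 3 of 8 lanes do useful work while the "then" side runs.
}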

Page 34: (based on Spring 2019 Lec 23) - CS184

Clarification

Option 1: explicit vector instructions
  x86 SSE, Intel Larrabee

Option 2: scalar instructions, implicit HW vectorization
  HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
  NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures (“wavefronts”)

SIMD processing does not imply SIMD instructions

In practice: 16 to 64 fragments share an instruction stream.

Page 35: (based on Spring 2019 Lec 23) - CS184

Stalls!

Texture access latency = 100’s to 1000’s of cycles

We’ve removed the fancy caches and logic that helps avoid stalls.

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Page 36: (based on Spring 2019 Lec 23) - CS184

But we have LOTS of independent fragments.

Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
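In CUDA terms, Idea #3 is why a kernel is launched with far more threads than there are ALUs: while some groups of threads wait on memory, others issue instructions. A minimal sketch (the kernel, names, and sizes are illustrative; the data-dependent load stands in for a high-latency operation such as a texture fetch).

#include <cuda_runtime.h>

// Each thread does a dependent load followed by a little math. No single
// thread can hide the load's latency; the core hides it by switching among
// the many groups of threads resident on it.
__global__ void gatherScale(const float* src, const int* idx, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = src[idx[i]];       // high-latency, data-dependent load
    dst[i] = v * 2.0f + 1.0f;    // cheap math that must wait for the load
}

// Launching many more threads than ALUs gives every core plenty of groups to
// interleave:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;   // n in the millions
//   gatherScale<<<blocks, threads>>>(src, idx, dst, n);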

Page 37: (based on Spring 2019 Lec 23) - CS184

Hiding shader stalls

(Diagram: the SIMD core from before, running one group of fragments, Frag 1 … 8, over time.)

Page 38: (based on Spring 2019 Lec 23) - CS184

Hiding shader stalls

(Diagram: the core now holds four groups of fragments as separate contexts: group 1 = Frag 1 … 8, group 2 = Frag 9 … 16, group 3 = Frag 17 … 24, group 4 = Frag 25 … 32.)

Page 39: (based on Spring 2019 Lec 23) - CS184

Hiding shader stalls

(Diagram: group 1 stalls on a texture fetch; the core switches to group 2, which is runnable.)

Page 40: (based on Spring 2019 Lec 23) - CS184

Hiding shader stalls

(Diagram: each group stalls in turn; the core keeps switching to whichever group is runnable.)

Page 41: (based on Spring 2019 Lec 23) - CS184

Throughput!

(Diagram: all four groups start, stall, resume, and finish in turn; each group's stall is covered by running the others, so the ALUs stay busy.)

Increase run time of one group to increase throughput of many groups

Page 42: (based on Spring 2019 Lec 23) - CS184

Storing contexts

(Diagram: the SIMD core with its on-chip pool of context storage: 128 KB.)

Page 43: (based on Spring 2019 Lec 23) - CS184

Eighteen small contexts

(Diagram: the 128 KB pool divided into eighteen small contexts.)

(maximal latency hiding)

Page 44: (based on Spring 2019 Lec 23) - CS184

Twelve medium contexts

(Diagram: the pool divided into twelve medium contexts.)

Page 45: (based on Spring 2019 Lec 23) - CS184

Four large contexts

(Diagram: the pool divided into four large contexts, numbered 1-4.)

(low latency hiding ability)
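The trade-off across these three slides is a single division: the more state each group needs, the fewer groups fit in the fixed pool, and the less latency can be hidden. A sketch using the 128 KB pool from the earlier slide (the per-group context sizes are assumptions, picked to reproduce the 18 / 12 / 4 split).

#include <cstdio>

int main() {
    const int pool_bytes = 128 * 1024;                          // context storage per core
    const int sizes[]    = { 7 * 1024, 10 * 1024, 32 * 1024 };  // assumed small/medium/large contexts
    for (int s : sizes)
        printf("%2d KB per group -> %2d resident groups to switch among\n",
               s / 1024, pool_bytes / s);
    return 0;
}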

Page 46: (based on Spring 2019 Lec 23) - CS184

Our chip

16 cores

8 mul-add ALUs per core (128 total)

16 simultaneous instruction streams

64 concurrent (but interleaved) instruction streams

512 concurrent fragments

= 256 GFLOPs (@ 1GHz)
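The 256 GFLOPs figure is the usual peak-throughput product, with each multiply-add counted as two floating-point operations (only the numbers from the slide go in; the 2-FLOPs-per-mul-add convention is the standard one).

#include <cstdio>

int main() {
    const double cores         = 16;
    const double alus_per_core = 8;     // mul-add ALUs
    const double flops_per_op  = 2;     // one multiply-add = 2 FLOPs
    const double clock_hz      = 1e9;   // 1 GHz
    printf("peak = %.0f GFLOPS\n",
           cores * alus_per_core * flops_per_op * clock_hz / 1e9);   // 256
    return 0;
}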

Page 47: (based on Spring 2019 Lec 23) - CS184

Our “enthusiast” chip

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)

Page 48: (based on Spring 2019 Lec 23) - CS184

NVIDIA GeForce GTX 1080

NVIDIA-speak: 2560 stream processors (“CUDA cores”), “SIMT execution”

Generic speak: 20 cores, 4 groups of 32 SIMD functional units per core

Page 49: (based on Spring 2019 Lec 23) - CS184

NVIDIA GeForce GTX 1080 “core”

= SIMD function unit, control shared across 32 units (1 MUL-ADD per clock)

“Shared” memory (96 KB)

Execution contexts (registers) (256 KB)

• Groups of 32 [fragments/vertices/CUDA threads] share an instruction stream

• Up to 64 groups are simultaneously interleaved

• Up to 2,048 individual contexts can be stored

Source: NVIDIA Pascal tuning guide

(Diagram: the core's Fetch/Decode units, SIMD function units, shared memory, and register file.)
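The same bookkeeping for this core: 64 groups of 32 threads gives the 2,048 resident contexts, and spreading the 256 KB register file over them bounds the register budget per thread (a sketch; the registers-per-thread line is my inference from the two sizes, not a number on the slide).

#include <cstdio>

int main() {
    const int warp_size     = 32;          // threads sharing one instruction stream
    const int max_warps     = 64;          // interleaved groups per core
    const int regfile_bytes = 256 * 1024;  // execution contexts (registers)
    const int contexts      = warp_size * max_warps;               // 2048
    printf("resident contexts per core: %d\n", contexts);
    printf("register bytes per context: %d (= %d 4-byte registers)\n",
           regfile_bytes / contexts, regfile_bytes / contexts / 4);
    return 0;
}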

Page 50: (based on Spring 2019 Lec 23) - CS184

NVIDIA GeForce GTX 1080

There are 20 of these things on the GTX 1080

That’s 40,960 fragments!

(Or 40,960 “CUDA threads”)

Page 51: (based on Spring 2019 Lec 23) - CS184

Summary: three key ideas to exploit parallelism for performance

1. Use many “slimmed down cores” to run in parallel

2. Pack cores full of ALUs (by sharing instruction stream across groups of fragments)
   Option 1: Explicit SIMD vector instructions
   Option 2: Implicit sharing managed by hardware

3. Avoid latency stalls by interleaving execution of many groups of fragments
   When one group stalls, work on another group

Page 52: (based on Spring 2019 Lec 23) - CS184


Page 53: (based on Spring 2019 Lec 23) - CS184

Pathtracing vs. Rasterization?

• Global Illumination and other rich visual effects are “free” with path tracing
  • Included by nature
• Many many years of graphics engineering work poured into techniques for simulating those effects
  • Reflections, shadows, etc.
  • “Baked” maps (light maps)
  • Screen-space effects

Page 54: (based on Spring 2019 Lec 23) - CS184

Why is pathtracing “hard” for GPUs?

• Recursion is hard (if not basically impossible)
  • Also, variable # of bounces is bad for SIMD
  • Could implement as iterative (fixed # of bounces); see the sketch below
• Memory coherency is basically non-existent
  • You cast a ray into the scene → it can hit basically anything, bounce basically anywhere → it can hit basically anything, …
• When rasterizing triangles you know exact information about the triangle you’re trying to shade
• From one bounce to the next, no way to know which shader should be called next
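The "iterative, fixed # of bounces" idea from the list above, as a control-flow sketch of a per-pixel kernel (everything named here, Ray, Hit, cameraRay, intersectBVH, shadeHit, nextBounce, is a hypothetical helper used only to show the loop structure, and float3 arithmetic is assumed to come from a helper header such as the CUDA samples' helper_math.h).

// One thread traces one camera ray with a fixed bounce budget: the recursion
// of a CPU path tracer becomes a loop. Ray, Hit, cameraRay, intersectBVH,
// shadeHit, and nextBounce are hypothetical helpers, not a real API.
__global__ void pathTrace(float3* image, int w, int h, int maxBounces) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    Ray ray = cameraRay(x, y, w, h);
    float3 color = make_float3(0.f, 0.f, 0.f);

    for (int bounce = 0; bounce < maxBounces; ++bounce) {
        Hit hit;
        if (!intersectBVH(ray, &hit))        // each lane's ray can hit anything,
            break;                           // anywhere: divergence + incoherent reads
        color = color + shadeHit(hit, ray);  // hypothetical direct-lighting term
        ray   = nextBounce(hit, ray);        // hypothetical BRDF sample
    }
    image[y * w + x] = color;
}

Even in this form, lanes whose rays miss early sit idle while their neighbors keep bouncing, which is exactly the divergence problem described above.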

Page 55: (based on Spring 2019 Lec 23) - CS184

BVH Traversal

• Pathtracing without a BVH is misery
  • As we know from Project 3
• But even with the BVH, computing ray intersections takes up primary time
• BVH traversal with shader operations = ???
  • Thousands of instructions needed
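What "BVH traversal with shader operations" means concretely: a loop like the sketch below runs for every ray, which is why it costs thousands of instructions per ray and why RTX-class hardware moves it into dedicated units. The node layout and the intersectAABB / intersectLeaf helpers (and the Ray / Hit types) are hypothetical; the explicit-stack traversal itself is the same idea as in Project 3.

// Per-ray BVH traversal with an explicit stack (no recursion). The node
// layout is a simplification; Ray, Hit, intersectAABB, and intersectLeaf are
// hypothetical helpers.
struct BVHNode { float3 bmin, bmax; int left, right; bool isLeaf; };

__device__ bool traverseBVH(const BVHNode* nodes, const Ray& ray, Hit* hit) {
    int  stack[64];          // indices of nodes still to visit
    int  top   = 0;
    bool found = false;
    stack[top++] = 0;        // start at the root

    while (top > 0) {
        int n = stack[--top];
        if (!intersectAABB(nodes[n].bmin, nodes[n].bmax, ray))
            continue;                                    // prune this subtree
        if (nodes[n].isLeaf) {
            found |= intersectLeaf(nodes[n], ray, hit);  // test the leaf's triangles
        } else {
            stack[top++] = nodes[n].left;                // push children; a real tracer
            stack[top++] = nodes[n].right;               // would visit the nearer one first
        }
    }
    return found;
}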

Page 56: (based on Spring 2019 Lec 23) - CS184


Source: NVIDIA

Page 57: (based on Spring 2019 Lec 23) - CS184


Source: NVIDIA

Page 58: (based on Spring 2019 Lec 23) - CS184


Quake II RTX

(All?) games marketed as supporting RTX actually use raytracing only for particular effects. Exception: Quake II RTX (open source!)

Source: NVIDIA

Page 59: (based on Spring 2019 Lec 23) - CS184


Source: NVIDIA

Page 60: (based on Spring 2019 Lec 23) - CS184

Ren Ng, CS184/284A

Acknowledgments

Many thanks to Jonathan Ragan-Kelley for slide materials and lecture content.

Credit for slides and contributions to Kayvon Fatahalian, Kurt Akeley, Solomon Boulos, Mike Doggett, Pat Hanrahan, Mike Houston, Jeremy Sugerman.