Deep Dive: Asynchronous Compute
Stephan Hodes, Developer Technology Engineer, AMD
Alex Dunn, Developer Technology Engineer, NVIDIA

Apr 17, 2020


Page 2: Joint Session

AMD
● Graphics Core Next (GCN)
● Compute Unit (CU)
● Wavefronts

NVIDIA
● Maxwell, Pascal
● Streaming Multiprocessor (SM)
● Warps

Page 3: Terminology

Asynchronous: Not independent; async work shares HW

Work Pairing: Items of GPU work that execute simultaneously

Async. Tax: Overhead cost associated with asynchronous compute

Page 4: Async Compute

More Performance

Page 5: Queue Fundamentals

3 Queue Types:
● Copy/DMA Queue
● Compute Queue
● Graphics Queue

All run asynchronously!

[Figure: queue lanes labeled 3D, COMPUTE, and COPY]

Page 6: General Advice

● Always profile!
  ● Can make or break perf
● Maintain non-async paths
  ● Profile async on/off
  ● Some HW won’t support async
● ‘Member hyper-threading?
  ● Similar rules apply
  ● Avoid throttling shared HW resources

[Figure: queue lanes labeled 3D, COMPUTE, and COPY]

Page 7: Regime Pairing

Good Pairing
Graphics: Shadow Render (Geometry limited)
Compute: Light culling (ALU heavy)

Poor Pairing
Graphics: G-Buffer (Bandwidth limited)
Compute: SSAO (Bandwidth limited)

(Technique pairing doesn’t have to be 1-to-1)

Page 8: Topics

Problem/Solution Format

● Resource Contention - AMD
● Descriptor heaps - NVIDIA
● Synchronization models
● Avoiding “async-compute tax”

[Icon] - Red Flags

Page 9: Hardware Details - AMD

● 4 SIMD per CU
● Up to 10 Wavefronts scheduled per SIMD
● Accomplishes latency hiding
● Graphics and Compute can execute simultaneously on the same CU
● Graphics workloads usually have priority over Compute

Page 10: Resource Contention - AMD

Problem: Per-SIMD resources are shared between Wavefronts

SIMD executes Wavefronts (of different shaders)
● Occupancy limited by
  ● # of registers
  ● Amount of LDS
  ● Other limits may apply…
● Wavefronts contend for caches

Page 11: Resource Contention - AMD

● Keep an eye on vector register (VGPR) count
● Beware of cache thrashing!
  ● Try limiting occupancy by allocating dummy LDS

GCN VGPR Count   <=24   28   32   36   40   48   64   84   <=128   >128
Max Waves/SIMD     10    9    8    7    6    5    4    3       2      1
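The VGPR table can be expressed as a small lookup, which is convenient when checking how a shader's register count caps occupancy. This is a minimal C++ sketch rather than vendor code: `OccupancyStep`, `maxWavesPerSimd`, and `maxWavesPerCu` are hypothetical names, and the per-CU figure only multiplies by the 4 SIMDs per CU stated in the hardware details. LDS usage and other limits can reduce occupancy further.

```cpp
#include <array>
#include <cassert>

// One row of the table: shaders using at most maxVgprs registers
// can have wavesPerSimd wavefronts resident on a SIMD.
struct OccupancyStep {
    int maxVgprs;
    int wavesPerSimd;
};

// Values copied from the GCN VGPR table on this page.
constexpr std::array<OccupancyStep, 9> kGcnVgprSteps{{
    {24, 10}, {28, 9}, {32, 8}, {36, 7}, {40, 6},
    {48, 5},  {64, 4}, {84, 3}, {128, 2},
}};

constexpr int kSimdsPerCu = 4;  // "4 SIMD per CU"

// Maximum wavefronts per SIMD as limited by VGPR count alone.
inline int maxWavesPerSimd(int vgprCount) {
    assert(vgprCount >= 0);
    for (const OccupancyStep& step : kGcnVgprSteps) {
        if (vgprCount <= step.maxVgprs) {
            return step.wavesPerSimd;
        }
    }
    return 1;  // more than 128 VGPRs
}

// Upper bound on wavefronts per CU, ignoring LDS and other limits.
inline int maxWavesPerCu(int vgprCount) {
    return kSimdsPerCu * maxWavesPerSimd(vgprCount);
}
```

For example, a shader that grows from 84 to 85 VGPRs drops from 3 to 2 waves per SIMD under this table.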

Page 12: Hardware Details - NVIDIA

Maxwell: Static SM partitioning
Pascal: Dynamic SM partitioning

• Compute scheduled breadth first over SMs
• Compute workloads have priority over graphics
• Driver heuristic controls SM distribution

[Figure: SM distribution over time; legend: Idle, Graphics, Compute]

Page 13: Descriptor Heap - NVIDIA

Problem: HW only has one descriptor heap; applications can create many

Switching descriptor heap could be a hazard (on current HW)
● GPU must drain work before switching heaps
● Applies to CBV/SRV/UAV and Sampler heaps
● (Redundant changes are filtered)
● D3D: Must call SetDescriptorHeaps per CL!

Page 14: Descriptor Heap - NVIDIA

Avoid hazard if total # descriptors (all heaps) < pool size

Driver sub-allocates descriptor heaps from large pool

Pool sizes (Kepler+):
● CBV/UAV/SRV = 1048576
● Sampler = 2048 + 2032 static + 16 driver owned
● NB. [1048575|4095] [0xFFFFF|0xFFF] (packed into 32-bit)
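The pool-size rule can be checked at heap-creation time. The sketch below is illustrative only: `fitsInPool` and `packIndices` are hypothetical helpers, not driver or D3D12 API. The constants come from this page, with 2048 taken as the application-visible sampler share. The bit layout in `packIndices` (view index in the upper 20 bits, sampler index in the lower 12) is an assumption that merely shows how a 0xFFFFF index and a 0xFFF index fill 32 bits together.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pool sizes quoted for Kepler and later.
constexpr std::uint32_t kViewPoolSize = 1048576;  // CBV/UAV/SRV
constexpr std::uint32_t kSamplerPoolSize = 2048;  // application share of sampler pool

// True when the descriptors of all heaps of one type stay below the
// pool size, the condition given for avoiding the heap-switch hazard.
inline bool fitsInPool(const std::vector<std::uint32_t>& heapSizes,
                       std::uint32_t poolSize) {
    std::uint64_t total = 0;
    for (std::uint32_t size : heapSizes) {
        total += size;
    }
    return total < poolSize;
}

// Assumed layout: 20-bit view index above a 12-bit sampler index.
inline std::uint32_t packIndices(std::uint32_t viewIndex,
                                 std::uint32_t samplerIndex) {
    assert(viewIndex <= 0xFFFFFu && samplerIndex <= 0xFFFu);
    return (viewIndex << 12) | samplerIndex;
}
```

Two CBV/SRV/UAV heaps of 500000 descriptors each pass the check, while a single heap of 1048576 does not.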

Page 15: Synchronization

GPU synchronization models to consider:
● Fire-and-forget
● Handshake

CPU also has a part to play
● ExecuteCommandLists (ECLs) schedules GPU work
● Gaps between ECLs on CPU can translate to GPU

Page 16: Fire-and-Forget (Sync.)

● Start of work is synchronized via fences

[Figure: GPU timeline with fence Signal and Wait; pairing labeled Good Pairing]

Page 17: Fire-and-Forget (Sync.)

● Start of work is synchronized via fences
● But some workloads vary frame-to-frame
● Variance leads to undesired work pairing
● Overall frame time suffers, because bad pairing hurts performance

[Figure: GPU timeline with fence Signal and Wait; pairings labeled Bad and Good]

Page 18: CPU Latency (Sync.)

● Similar situation: CPU plays a role here

[Figure: GPU timeline with fence Signal and Wait; pairing labeled Good Pairing]

Page 19: CPU Latency (Sync.)

● Similar situation: CPU plays a role here
● Game introduces latency on the CPU between ECLs
● Latency translates to GPU
● Leads to undesired work pairing, etc.

[Figure: GPU timeline with fence Signal and Wait, a Latency gap, and pairings labeled Bad and Good]

Page 20: Handshake (Sync.)

● Synchronize begin and end of work pairing
● Ensures pairing determinism
● Might miss some asynchronous opportunity (HW manageable)
● Future proof your code!

[Figure: GPU timeline with two Signal/Wait pairs bracketing the paired work]
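The trade-off between the two models can be illustrated with a toy timing model. This is a deliberately simplified sketch, not a description of real GPU scheduling: it assumes one graphics task and one compute task that start together, and `PairingResult`, `fireAndForget`, and `handshake` are hypothetical names. "Spill" is compute time that runs past its intended graphics partner and so overlaps whatever graphics work comes next.

```cpp
#include <algorithm>
#include <cassert>

struct PairingResult {
    double spillMs;         // compute overlapping graphics work it was not paired with
    double graphicsIdleMs;  // time the graphics queue spends waiting on compute
};

// Fire-and-forget: a fence gates only the start of the pair. If compute
// runs longer than its partner, the tail overlaps the next graphics work.
inline PairingResult fireAndForget(double graphicsMs, double computeMs) {
    assert(graphicsMs >= 0.0 && computeMs >= 0.0);
    return {std::max(0.0, computeMs - graphicsMs), 0.0};
}

// Handshake: a second fence gates the end of the pair. Pairing is
// deterministic, but graphics may idle until compute finishes.
inline PairingResult handshake(double graphicsMs, double computeMs) {
    assert(graphicsMs >= 0.0 && computeMs >= 0.0);
    return {0.0, std::max(0.0, computeMs - graphicsMs)};
}
```

With a 2 ms graphics task and a 3 ms compute task, this model gives 1 ms of unintended overlap under fire-and-forget and 1 ms of graphics idle time under handshake, which mirrors the pros and cons of each model.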

Page 21: Synchronization - Advice

CPU isn’t innocent, keep an eye on it

Two GPU synchronization models:
● Fire-and-Forget
  ● Cons: Nondeterministic regime pairing
  ● Pros: Less synchronization == more immediate performance (best case scenario)
● Handshake
  ● Cons: Additional synchronization might cost performance
  ● Pros: Regime pairing determinism (all the time)

Synchronize for determinism (as well as correctness)

Page 22: Async. Tax

Overhead cost associated with asynchronous compute
● Quantified by: [AC-Off (ms)] / [Serialized AC-On (ms)] %
  ● Serialize manually via graphics API
● Can easily knock out AC gains!
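The ratio can be computed directly from two frame-time measurements. The sketch below follows the formula as written; `asyncTaxRatioPercent` and `asyncTaxMs` are hypothetical helper names. One plausible reading, stated here as an assumption, is that 100% means the serialized async path costs the same as the non-async path, and values below 100% indicate overhead that the overlap must win back.

```cpp
#include <cassert>

// [AC-Off (ms)] / [Serialized AC-On (ms)], expressed as a percentage.
inline double asyncTaxRatioPercent(double acOffMs, double serializedAcOnMs) {
    assert(acOffMs > 0.0 && serializedAcOnMs > 0.0);
    return acOffMs / serializedAcOnMs * 100.0;
}

// The same comparison as an absolute overhead in milliseconds.
inline double asyncTaxMs(double acOffMs, double serializedAcOnMs) {
    assert(acOffMs > 0.0 && serializedAcOnMs > 0.0);
    return serializedAcOnMs - acOffMs;
}
```

For instance, a 12 ms frame with async compute off and a 16 ms frame with the async path manually serialized gives a ratio of 75% and an overhead of 4 ms.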

Page 23: Async. Tax - Root Cause

CPU:
● Additional CPU work organizing/scheduling async tasks
● Synchronization/ExecuteCommandLists overhead

GPU:
● Synchronization overhead
● A difference in work ordering between AC-On/Off
● Different shaders used between AC-On/Off paths
● Additional barriers (cross-queue synchronization)

Page 24: Async. Tax - Advice

First: determine if CPU or GPU is the bottleneck (GPUView)

CPU:
● Count API calls per frame, compare AC-On/Off for differences
● Measure differences through per-thread profiling

GPU:
● Compare GPU cost of shaders for AC-On/Off
● Inspect difference contributors
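Comparing per-frame API call counts between the two paths amounts to a simple difference of counters. The following is a minimal sketch that assumes the application already records call counts by name; `CallCounts` and `callCountDelta` are hypothetical, and the call names used in the example are only sample keys.

```cpp
#include <cassert>
#include <map>
#include <string>

using CallCounts = std::map<std::string, int>;

// Returns AC-On minus AC-Off call counts, keeping only calls whose
// count differs between the two paths.
inline CallCounts callCountDelta(const CallCounts& acOff,
                                 const CallCounts& acOn) {
    CallCounts delta;
    for (const auto& entry : acOn) {
        assert(entry.second >= 0);
        delta[entry.first] += entry.second;
    }
    for (const auto& entry : acOff) {
        assert(entry.second >= 0);
        delta[entry.first] -= entry.second;
    }
    for (auto it = delta.begin(); it != delta.end();) {
        if (it->second == 0) {
            it = delta.erase(it);  // identical in both paths
        } else {
            ++it;
        }
    }
    return delta;
}
```

A positive entry means the async path issues more of that call per frame, which is a starting point for the per-thread profiling mentioned above.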

Page 25: Tools

● API Timestamps: Time frames with async compute enabled and disabled
● GPUView: (see following pages)
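GPU timestamps are raw tick counts, so they need converting before AC-On and AC-Off timings can be compared. In D3D12 the tick rate comes from `ID3D12CommandQueue::GetTimestampFrequency`, which reports ticks per second. The helper below is a hypothetical, API-independent sketch of that conversion; it assumes the begin and end ticks were already read back from a timestamp query.

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of GPU timestamp ticks to elapsed milliseconds.
// ticksPerSecond is the queue's timestamp frequency.
inline double ticksToMs(std::uint64_t beginTicks, std::uint64_t endTicks,
                        std::uint64_t ticksPerSecond) {
    assert(endTicks >= beginTicks && ticksPerSecond > 0);
    return static_cast<double>(endTicks - beginTicks) * 1000.0 /
           static_cast<double>(ticksPerSecond);
}
```

With a 25 MHz timestamp frequency, a span of 25000 ticks corresponds to 1 ms.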

Page 26: GPU View #1

• Using 3D, Compute, Copy
• Frame boundaries @ Flip Queue packets
• Compute overlapping graphics per-frame

[Figure: GPUView capture with GRAPHICS, COMPUTE, and COPY queue lanes]

Page 27: GPU View #2 - Markers

NB. Open with Ctrl+E

Description
• Time: GPU accurate
• DataSize: size in bytes of Data
• Data: Event name emitted

PIXBeginEvent/PIXEndEvent
• Byte Array ASCII/Unicode
• Manual step

Page 28: GPU View #3 - Events

CPU Timeline:
ID3D12Fence::Signal
• DxgKrnl - SignalSynchronizationObjectFromCpu
ID3D12Fence::SetEventOnCompletion (CPU-side wait)
• DxgKrnl - WaitForSynchronizationObjectFromCpu

GPU Timeline:
ID3D12CommandQueue::Signal
• DxgKrnl - SignalSynchronizationObjectFromGpu
ID3D12CommandQueue::Wait
• DxgKrnl - WaitForSynchronizationObjectFromGpu