Transcript
Page 1:

Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Andreas Moshovos

University of Toronto

Demystifying GPU Microarchitecture through Microbenchmarking

Henry Wong

Page 2:

GPUs

Graphics Processing Units
Increasingly programmable
10x arithmetic and memory bandwidth vs. CPU
Commodity hardware
Requires 1000-way parallelization

NVIDIA CUDA
Write GPU threads in a C variant
Compiled into native instructions and run in parallel on the GPU

Page 3:

How Well Do We Know the GPU?

How much do we know?
We know enough to write programs
We know basic performance tuning:
Caches exist
Approximate memory speeds
Instruction throughput

Why do we want to know more?
Detailed optimization
Compiler optimizations, e.g., Ocelot
Performance modeling, e.g., GPGPU-Sim

Page 4:

Outline

Background on CUDA
Pipeline latency and throughput
Branch divergence
Syncthreads barrier synchronization
Memory hierarchies:
Cache and TLB organization
Cache sharing

...more in the paper

Page 5:

CUDA Software Model

Grid contains Blocks; Blocks contain Threads

Threads branch independently
Threads inside a Block may interact
Blocks are independent

Page 6:

CUDA Blocks and SMs

Blocks are assigned to SMs

Software Hardware

Page 7:

CUDA Hardware Model

SM: Streaming Multiprocessor
Warp: group of 32 threads, executed like a vector
TPC: Thread Processing Cluster
Interconnect
Memory hierarchy

Page 8:

Goal and Methodology

Aim to discover microarchitecture beyond documentation

Microbenchmarks:
Measure code timing and behaviour
Infer microarchitecture

Measure time using the clock() function (two-clock-cycle resolution)

Page 9:

Arithmetic Pipeline: Methodology

Objective: measure instruction latency and throughput
Latency: one thread
Throughput: many threads

Measure runtime of the microbenchmark core
Avoid the compiler optimizing away the code
Discard the first iteration: cold instruction cache

for (i = 0; i < 2; i++) {
    start_time = clock();
    t1 += t2;
    t2 += t1;
    ...
    t1 += t2;
    stop_time = clock();
}

Page 10:

Arithmetic Pipeline: Results

Three types of arithmetic units:
SP: 24 cycles, 8 ops/clk
SFU: 28 cycles, 2 ops/clk
DPU: 48 cycles, 1 op/clk

Peak SM throughput 11.2 ops/clk: MUL or MAD + MUL

Page 11:

SIMT Control Flow

Warps run in lock-step, but threads can branch independently: branch divergence

Taken and fall-through paths are serialized
Taken path (usually the "else" side) executed first
Fall-through path pushed onto a stack

__shared__ int sharedvar = 0;

while (sharedvar != tid)
    ;

sharedvar++;

Page 12:

Barrier Synchronization

Syncthreads synchronizes warps, not threads:
Does not resync a diverged warp
Will sync with other warps

if (warp0) {
    if (tid < 16) {
        shared_array[tid] = tid;
        __syncthreads();                    // barrier 1
    } else {
        __syncthreads();                    // barrier 2
        out[tid] = shared_array[tid % 16];
    }
}

if (warp1) {
    __syncthreads();
    __syncthreads();
}

Page 13:

Texture Memory

Average latency vs. memory footprint, stride accesses:
5 KB L1 cache, 20-way
256 KB L2 cache, 8-way

Page 14:

Texture L1

5 KB, 32-byte lines, 8 cache sets, 20-way set associative
L2 is located across a non-uniform interconnect

5120 bytes / (256 bytes/way) = 20 ways

Page 15:

Constant Cache Sharing

Three levels of constant cache
Two Blocks accessing constant memory

L1: per-SM
L2: per-TPC
L3: global

Page 16:

Constant and Instruction Cache Sharing

Two different types of accesses
L2 and L3 are both constant and instruction caches

Page 17:

Global Memory TLBs

8 MB L1, 512 KB line, 16-way
32 MB L2, 4 KB line, 8-way

L2 non-trivial to measure


Page 18:

Conclusions

Microbenchmarking reveals undocumented microarchitecture features

Three arithmetic unit types: latency and throughput
Branch divergence using a stack
Barrier synchronization on warps
Memory hierarchy:
Cache organization and sharing
TLBs

Applications:
Measuring other architectures and microarchitectural features
GPU code optimization and performance modeling

Page 19:

Questions...

Page 20:

Filling the SP Pipeline

6 warps (24 clocks, 192 "threads") should fill the pipeline
2 warps if sufficient instruction-level independence (not shown)

24-cycle SP pipeline latency
Fair scheduling on average

Page 21:

Instruction Fetch?

Capture a burst of timings at each iteration, then average
One thread measures timing
Other warps thrash one line of the L1 instruction cache

Instructions are fetched from L1 in groups of 64 bytes