Peking University Center for Energy-efficient Computing and Applications
Performance Optimization for GPUs
Yun (Eric) Liang, 梁云
Center for Energy-efficient Computing and Applications (CECA)
School of EECS, Peking University, China
Why GPUs?
Yun (Eric) Liang @ Peking University 2 9/23/2016
Massive Parallelism, Computing Power
[Figure: Graphics Processing Units — many Streaming Multiprocessors (SMs) per chip. Source: NVIDIA Inc.]
Applications of GPUs
• Embedded systems: NVIDIA Tegra series, Samsung Exynos, Qualcomm Snapdragon
• Supercomputing systems
– Titan (Oak Ridge National Lab): Cray XK7, Opteron 6274 16C @ 2.200 GHz, NVIDIA K20x
– Piz Daint (CSCS, Switzerland): Cray XC30, Xeon E5-2670 8C @ 2.600 GHz, NVIDIA K20x
Ubiquitous GPU Computing
Augmented Reality, Electronic Design Automation, Biology, 3D Graphics Rendering, Finance, Deep Learning
GPU Performance Optimization
Performance tuning is difficult
• Many architecture, compiler, and application parameters
GPU kernel development
• A heavy-lifting task
Research Summary
Heterogeneous System: Programming Model, Compilation, and Run-time System
• Applications: MapReduce (TPDS’14, BigData’13), SpMV (CGO’15), LTE (PACT’15)
• Register allocation (MICRO’15, ASPDAC’16)
• Multitasking (TPDS’15, DATE’16)
• Cache bypassing (HPCA’15, ICCAD’13, TCAD’15)
• Divergence and power (IPDPS’12, DAC’14, TCAD’16)
• High-level synthesis (FPGA’13, DAC’13, FCCM’14, TCAD’16)
• Tools (DAC’16)
• Memory (TCAD’15)
• Real-time (DAC’13)
On-chip Storage in GPUs
[Figure: warps within an SM share the on-chip storage — L1 cache, shared memory, and register file]
“Coordinated Static and Dynamic Cache Bypassing on GPUs”, International Symposium on High Performance Computer Architecture (HPCA), February 2015
Challenge for Cache: Massive Parallelism
[Chart: number of active threads across benchmarks, Fermi GTX 480]
16 KB cache: 10–20 bytes per thread; 48 KB cache: 30–80 bytes per thread
Challenge for Cache: Low Cache Hit Rate
[Chart: L1 cache hit rate (0–100%) across benchmarks on Fermi GTX 480, for 16 KB and 48 KB L1 configurations]
Challenges: Resource Congestion Stalls
[Figure: memory pipeline — memory requests are coalesced and probe the L1 cache; hits return data, while misses allocate Miss Status Holding Registers (MSHRs); when MSHRs or other resources run out, the memory stage stalls]
Cache Bypassing on GPUs
[Figure: with L1 bypassing, coalesced cache-line requests can skip the L1 cache — bypassed requests go directly to the L2 cache (and to off-chip memory on an L2 miss) and return data without allocating L1 lines or MSHR entries]
System Overview
[Figure: system overview]
• Static cache bypassing (compile time): each global load (ld.global) is classified as good, bad, or medium and compiled to a ca load (cache in L1), a cg load (bypass L1), or a cm load (left to the run-time), respectively
• Dynamic cache bypassing (run time): thread blocks are partitioned into cache thread blocks and bypass thread blocks to maintain the thread-level parallelism
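The compile-time classification above can be sketched as a mapping from an estimated reuse score to a PTX-style cache operator. The function name, scores, and thresholds below are hypothetical illustrations, not the paper's actual algorithm:

```python
def choose_cache_operator(reuse_score, good=0.6, bad=0.2):
    """Map an estimated locality score to a PTX-style cache operator:
    high reuse -> 'ca' (cache in L1), low reuse -> 'cg' (bypass L1),
    in between -> 'cm' (left to the dynamic, run-time scheme)."""
    if reuse_score >= good:
        return "ca"
    if reuse_score <= bad:
        return "cg"
    return "cm"

# toy per-load reuse scores for three hypothetical global loads
plan = {name: choose_cache_operator(score)
        for name, score in {"ld1": 0.8, "ld2": 0.1, "ld3": 0.4}.items()}
```

Only the "medium" loads are left for the dynamic scheme, which keeps the run-time decision cheap.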
Performance Model
• Definition: Traffic Reduction Graph G(V, E)
– v ∈ V: global load instructions
– e ∈ E: reuses between instructions
– weighted by L2 cache traffic: weight(v_i), weight(e_i,j)
• Selecting the loads to cache is formulated as a max-clique problem
[Figure: example graph with vertices V1–V5]
“An Efficient Compiler for Cache Bypassing on GPUs”, International Conference on Computer Aided Design (ICCAD), 2013
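The max-clique formulation can be illustrated with a brute-force max-weight clique search over a toy traffic-reduction graph. All vertex and edge weights below are made up; this is only a sketch, feasible because a single kernel has few global loads:

```python
from itertools import combinations

def clique_weight(nodes, vw, ew):
    """Total L2 traffic saved by caching this set: vertex weights plus
    the reuse-edge weights among them."""
    w = sum(vw[v] for v in nodes)
    w += sum(ew.get(frozenset(p), 0) for p in combinations(nodes, 2))
    return w

def max_weight_clique(vertices, vw, ew):
    """Exhaustive search: fine for the handful of loads in one kernel."""
    best, best_w = set(), 0
    for r in range(1, len(vertices) + 1):
        for cand in combinations(vertices, r):
            # a clique needs every pair connected by a reuse edge
            if all(frozenset(p) in ew for p in combinations(cand, 2)):
                w = clique_weight(cand, vw, ew)
                if w > best_w:
                    best, best_w = set(cand), w
    return best, best_w

vw = {"v1": 3, "v2": 2, "v3": 4, "v4": 1, "v5": 2}
ew = {frozenset({"v1", "v2"}): 5, frozenset({"v1", "v3"}): 2,
      frozenset({"v2", "v3"}): 1, frozenset({"v4", "v5"}): 1}
cached, saved = max_weight_clique(list(vw), vw, ew)
```

Here the clique {v1, v2, v3} wins: caching those three loads together saves the most modeled traffic.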
Performance Results (1/2)
Cache-sensitive applications on a 16 KB cache:
– Average 1.32X performance improvement
– 8.6% energy savings
[Chart: normalized IPC for Default, Static, Dynamic, and Coordinated bypassing — Coordinated averages 1.32X]
Register File on GPUs
[Chart: register file size growing across Tesla, Fermi, and Kepler]
Large register file: 256 KB, larger than the L1 cache and shared memory combined (64 KB), and still increasing
[Figure: thread blocks sharing the per-SM register file]
“Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs”, IEEE/ACM International Symposium on Microarchitecture (MICRO), December, 2015
Thread Throttling Technique
Mitigate cache contention
• Balance between parallelism and cache contention
[Chart: performance vs. #thread blocks per SM — performance peaks at OptTLP, below MaxTLP]
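The throttling trade-off can be mimicked with a toy analytical model. The curve below is entirely synthetic; only its rise-then-fall shape, with a peak (OptTLP) below the maximum residency (MaxTLP), matters:

```python
def perf(tlp):
    """Synthetic performance: latency hiding improves with TLP but with
    diminishing returns, while cache contention grows quadratically."""
    hiding = 1 - 0.5 ** tlp        # diminishing latency-hiding benefit
    contention = 0.02 * tlp * tlp  # growing cache-contention cost
    return hiding - contention

max_tlp = 8                              # MaxTLP: all blocks resident
opt_tlp = max(range(1, 9), key=perf)     # OptTLP under this toy model
```

Under this model the optimum sits well below full residency, which is exactly why throttling helps cache-sensitive kernels.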
Thread Throttling helps, but …
[Chart: thread throttling yields a 1.42X speedup, but register utilization drops by 51.3%]
Register under-utilization
Performance Impact of Register Allocation
Fewer registers per thread cause register spilling and inserted spill code, hurting single-thread performance; more registers per thread reduce thread-level parallelism
Current Optimization Tool-chain
[Figure: current tool-chain — register allocation compiles PTX code to binary; thread throttling is applied separately]
PTX code example:
mov %r0, %tid.x; mov %r1, %ntid.x; mul %r3, %r2, %r1; add %r4, %r3, %r0; …
Thread throttling is constrained by registers, shared memory, thread-block limits, thread limits, and other resources; deciding allocation and throttling in isolation leaves registers under-utilized
Motivational Example (CFD)
[Charts: IPC, L1 cache hit rate, and register utilization for four configurations]
– MaxTLP: maximum TLP (TLP = 8, Reg = 32)
– OptTLP: optimal TLP (TLP = 7, Reg = 32)
– OptTLP + Reg (TLP = 7, Reg = 36)
– CRAT: coordinated (TLP = 5, Reg = 50)
Design Space
[Figure: the design space trades single-thread performance against TLP]
Complex Design Space Trade-off
[Figure: CRAT flow — input: original GPU PTX kernel; design space pruning, register allocation, and spilling optimization produce the optimized GPU PTX kernel as output]
Input:
.entry PTXkernel(){ … mul.lo.s32 %r3, %r2, %r1; add.s32 %r4, %r0, %r3; add.s32 %r3, %r2, %r1; sub.s32 %r5, %r2, %r1; … }
Output:
.entry PTXkernel(){ … mul.lo.s32 %r1, %r2, %r1; add.s32 %r2, %r0, %r1; … }
CRAT: Coordinated Register Allocation and Thread-level Parallelism Optimization
Design Space Pruning
Design space: bounded by MaxTLP, OptTLP, MinReg, and MaxReg
[Figure: in the (TLP, registers) space, possible solutions above OptTLP suffer cache contention; pruning leaves the candidate solutions between OptTLP and MaxReg]
Register Allocation
Register allocator
• Implemented in GPGPU-Sim on static single assignment (SSA) form
• Based on the Chaitin-Briggs register allocator
• Flow: control-flow analysis → data-flow analysis → register coloring → spill code insertion
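A minimal sketch of Chaitin-Briggs style coloring on an interference graph, under a toy example: simplify nodes of degree < k, optimistically push a spill candidate otherwise, then color in reverse removal order. The real allocator additionally handles SSA form, coalescing, and spill-cost heuristics:

```python
def chaitin_briggs(graph, k):
    """graph: {value: set of interfering values}; k registers.
    Returns {value: register index or None (None means spill)}."""
    g = {v: set(ns) for v, ns in graph.items()}
    stack = []
    while g:
        # prefer a trivially colorable node (degree < k)
        v = next((v for v in sorted(g) if len(g[v]) < k), None)
        if v is None:
            # no such node: optimistically push a spill candidate
            v = max(sorted(g), key=lambda u: len(g[u]))
        stack.append(v)
        del g[v]
        for ns in g.values():
            ns.discard(v)
    colors = {}
    for v in reversed(stack):
        used = {colors[n] for n in graph[v] if n in colors}
        free = [c for c in range(k) if c not in used]
        colors[v] = free[0] if free else None  # None -> must spill
    return colors

# triangle: three mutually interfering values, only k = 2 registers
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
alloc = chaitin_briggs(triangle, k=2)
```

With two registers one of the three values must spill, which is where the spilling optimization on the next slide takes over.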
Spilling Optimization
Spill to shared memory if possible
[Figure: after register coloring, spilled variables (V0, V1, …) on the spill stack are split into sub-stacks; spills are placed in fast shared memory when space allows, and only the remainder goes to slow local memory]
Performance Metric
• TPSC: Thread-level Parallelism and Spill Cost

TPSC = TLP_gain / (1 + Spill_cost)
TLP_gain = (TLP × BlockSize) / MaxThreadBlockSize
Spill_cost = Cost_local × Num_local + Cost_shm × Num_shm + Num_others
[Figure: candidate solutions are ranked by their mix of main-memory, shared-memory, and computing instructions together with their TLP]
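A sketch of how a TPSC-style score could rank candidate (TLP, register budget) solutions. The formula shape and the cost constants below are assumptions read off the slide, not the paper's exact definition:

```python
COST_LOCAL, COST_SHM = 10, 2   # assumed relative costs of spill targets

def tpsc(tlp, block_size, max_threads, n_local, n_shm, n_others=0):
    """Score a candidate: normalized TLP gain over its spill cost."""
    tlp_gain = tlp * block_size / max_threads
    spill_cost = COST_LOCAL * n_local + COST_SHM * n_shm + n_others
    return tlp_gain / (1 + spill_cost)

# hypothetical candidates for a 2048-thread SM and 256-thread blocks
candidates = [
    dict(tlp=8, block_size=256, max_threads=2048, n_local=6, n_shm=0),
    dict(tlp=7, block_size=256, max_threads=2048, n_local=0, n_shm=2),
    dict(tlp=5, block_size=256, max_threads=2048, n_local=0, n_shm=0),
]
best = max(candidates, key=lambda c: tpsc(**c))
```

Under these made-up numbers the spill-free TLP = 5 candidate wins over the higher-TLP candidates that pay for spills, mirroring the CFD example where CRAT settles on TLP = 5.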
Experimental Evaluation
[Charts: normalized IPC of MaxTLP, OptTLP, and CRAT; normalized energy of OptTLP and CRAT]
• Speedup: 1.25X
• Energy saving: 16.5%
Performance Analysis
[Charts: cache contention — thread blocks per SM drop from 5.1 (MaxTLP) to 2.6 (CRAT); register utilization of OptTLP vs. CRAT on ESP, DTC, FDTD, CFD, HST, BLK, STE; local memory accesses on DTC, FDTD, CFD, STE, and average]
Experimental Results
Kepler architecture
– 1.32X IPC (compared with OptTLP)
[Chart: overall normalized IPC of MaxTLP, OptTLP, and CRAT on STM, ESP, SPMV, KMN, LBM, DTC, FDTD, CFD, HST, BLK, STE, and the geometric mean]
CRAT: Open Source Project
http://ceca.pku.edu.cn/crat/
Downloads from CMU, Michigan, USC, etc. Invited internship at IBM T. J. Watson.
Multitasking for GPUs: Software Solution
“Efficient GPU Spatial-Temporal Multitasking”, IEEE Transactions on Parallel and Distributed Systems (TPDS), March, 2015
[Figure: spatial-temporal multitasking — a set of independent applications, each with its binary and profile, are co-scheduled; thread blocks (ids 0–5) of the different applications are interleaved via a leaky-bucket policy]
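The leaky-bucket thread-block interleaving can be sketched as a credit scheme: each kernel's bucket fills in proportion to its share, and the fullest bucket emits the next block of the merged grid. The function, shares, and bucket rule below are illustrative assumptions:

```python
def interleave(n_blocks, shares):
    """shares: {kernel: weight}. Returns a mapKernel-style list giving
    the kernel id for each block of the merged grid."""
    total = sum(shares.values())
    credit = {k: 0.0 for k in shares}
    mapping = []
    for _ in range(n_blocks):
        for k in credit:                      # each bucket fills by share
            credit[k] += shares[k] / total
        winner = max(credit, key=credit.get)  # fullest bucket fires
        credit[winner] -= 1.0                 # ... and leaks one block
        mapping.append(winner)
    return mapping

# kernel A gets twice the SM share of kernel B
mapping = interleave(6, {"A": 2, "B": 1})
```

The resulting mapping plays the role of the mapKernel array consumed by the merged scheduler kernel on the next slide.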
Multitasking for GPUs: Software Solution
Host (CPU):
  compute_mapping(); // compute mapKernel and mapBlock
  scheduler(……);
Device (GPU):
  __global__ void scheduler(…, mapBlk, mapKernel, gridDim_A, blkDim_A, gridDim_B, blkDim_B) {
    // bid is the block identifier within the merged scheduler kernel
    kernel_id = mapKernel[bid];
    if (kernel_id == 0)
      Kernel_A(…, mapBlk, blkDim_A, gridDim_A);
    else
      Kernel_B(…, mapBlk, blkDim_B, gridDim_B);
  }
[Chart: performance improvement (%) on Kepler GTX 680 and Kepler K20]
Multitasking for GPUs: Hardware Solution
• TLP modulation: blocks of grid A and grid B share each SM (SM 0 … SM 14)
• Cache bypassing: selected blocks bypass the L1 cache and access the L2 cache directly
[Chart: normalized IPC of TLP modulation vs. TLP modulation + cache bypassing across kernel pairs (BLK_HST, SPM_HST, SRD_KMS, KMS_STC, LBM_BKP, BLK_BKP, SPM_BKP, SPM_BLK, LBM_BLK, HST_KMS, LBM_SPM, SPM_SRD, SPM_KMS, LBM_KMS, LBM_HST, BLK_KMS, BLK_SRD, LBM_SRD, HST_STC, BKP_STC, LBM_STC, BKP_KMS, SPM_STC, BLK_STC, SRD_STC) and the geometric mean]
"Efficient Kernel Management on GPUs", in Proceedings of Design, Automation and Test in Europe (DATE), March 2016.
Control Flow Divergence Modeling
[Figure: program control flow graph; each thread is summarized by a Basic Block Vector (BBV) of per-block execution counts]
1. sub r0, r1, r2
2. mul r0, r2, 3
3. load r2, cb[r4]
4. madd r1, r2, r3
5. cmp r1
"An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization", Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2012.
(1) If statement:
D = input[tid];
if (D > 2)
{
  // computation;
}
(2) If-else statement:
D = input[tid];
if (D > 2)
{
  …
} else {
  if (…) // nested divergence
}
(3) For loop statement:
D = input[tid];
for (i = D; i < 100; i++) { // computation }
Control Flow Divergence Modeling
[Figure: static schedule vs. dynamic schedule of thread blocks tb0–tb5 across SM0 and SM1, un-weighted vs. weighted]
[Chart: speedup of the Sorting, Greedy, and K-means strategies on MC, SW, NW, SL, and SM]
• Simple sorting – each thread is represented by its BBV
• Greedy – repeatedly merges the two closest threads
• Clustering – K-means clustering
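The simple-sorting strategy can be sketched as follows, assuming toy BBVs and a warp size of 4 for readability (real warps are 32 threads; the BBV values are made up):

```python
WARP = 4  # real warps are 32 threads; 4 keeps the example readable

def regroup(bbvs):
    """bbvs: {tid: tuple of basic-block execution counts}.
    Sort threads by BBV so similar control flow shares a warp."""
    order = sorted(bbvs, key=lambda t: bbvs[t])
    return [order[i:i + WARP] for i in range(0, len(order), WARP)]

# two control-flow classes: counts for blocks [entry, then, else] —
# even tids took the branch, odd tids did not
bbvs = {0: (1, 1, 0), 1: (1, 0, 1), 2: (1, 1, 0), 3: (1, 0, 1),
        4: (1, 1, 0), 5: (1, 0, 1), 6: (1, 1, 0), 7: (1, 0, 1)}
warps = regroup(bbvs)
```

In the default tid order every warp would mix both branch outcomes; after sorting, each warp is control-flow uniform, so no lane sits idle at the divergent branch.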
Conclusion
• Ubiquitous GPU Computing
– Supercomputer, datacenter, embedded, IoT
• Challenges
– Performance optimization
• Contribution
– Automatic performance analysis and optimization techniques