Peking University Center for Energy-efficient Computing and Applications
Performance Optimization for GPUs
Yun (Eric) Liang, 梁云
Center for Energy-efficient Computing and Applications (CECA)
School of EECS, Peking University, China
Why GPUs?
Yun (Eric) Liang @ Peking University 2 9/23/2016
Massive Parallelism, Computing Power
[Figure: Graphics Processing Units — many Streaming Multiprocessors (SMs) per chip. Source: NVIDIA Inc.]
Applications of GPUs
• Embedded systems: NVIDIA Tegra series, Samsung Exynos, Qualcomm Snapdragon
• Supercomputing systems
– Titan (Oak Ridge National Lab): Cray XK7, Opteron 6274 16C @ 2.200 GHz, NVIDIA K20x
– Piz Daint (CSCS, Switzerland): Cray XC30, Xeon E5-2670 8C @ 2.600 GHz, NVIDIA K20x
Ubiquitous GPU Computing
Augmented Reality, Electronic Design Automation, Biology, 3D Graphics Rendering, Finance, Deep Learning
GPU Performance Optimization
Performance tuning is difficult
• Many architecture, compiler, and application parameters
GPU kernel development
• A heavy-lifting task
Research Summary
Heterogeneous System: Programming Model, Compilation, and Run-time System
• Applications: MapReduce (TPDS’14, BigData’13), SpMV (CGO’15), LTE (PACT’15)
• Register allocation (MICRO’15, ASPDAC’16)
• Multitasking (TPDS’15, DATE’16)
• Cache bypassing (HPCA’15, ICCAD’13, TCAD’15)
• Divergence and power (IPDPS’12, DAC’14, TCAD’16)
• High-level synthesis (FPGA’13, DAC’13, FCCM’14, TCAD’16)
• Tools (DAC’16)
• Memory (TCAD’15)
• Real-time (DAC’13)
On-chip Storage in GPUs
[Figure: warps within an SM share the on-chip storage — L1 cache, shared memory, and register file]
“Coordinated Static and Dynamic Cache Bypassing on GPUs”, International Symposium on High Performance Computer Architecture (HPCA), February 2015
Challenge for Cache: Massive Parallelism
[Chart: number of active threads across benchmarks, Fermi GTX 480]
16 KB cache: 10–20 bytes per thread; 48 KB cache: 30–80 bytes per thread
Challenge for Cache: Low Cache Hit Rate
[Chart: L1 cache hit rate (0–100%) across benchmarks on Fermi GTX 480, for 16 KB and 48 KB L1 configurations]
Challenges: Resource Congestion Stalls
[Figure: memory pipeline — memory requests are coalesced and probe the L1 cache; hits return data, while misses allocate Miss Status Holding Registers (MSHRs); when MSHRs or other resources run out, the memory stage stalls]
Cache Bypassing on GPUs
[Figure: with L1 bypassing, coalesced cache-line requests can skip the L1 cache — bypassed requests go directly to the L2 cache (and to off-chip memory on an L2 miss) and return data without allocating L1 lines or MSHR entries]
System Overview
[Figure: system overview]
• Static cache bypassing (compile time): each global load (ld.global) is classified as good, bad, or medium and compiled to a ca load (cache in L1), a cg load (bypass L1), or a cm load (left to the run-time), respectively
• Dynamic cache bypassing (run time): thread blocks are partitioned into cache thread blocks and bypass thread blocks to maintain the thread-level parallelism
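The compile-time classification above can be sketched as a mapping from an estimated reuse score to a PTX-style cache operator. The function name, scores, and thresholds below are hypothetical illustrations, not the paper's actual algorithm:

```python
def choose_cache_operator(reuse_score, good=0.6, bad=0.2):
    """Map an estimated locality score to a PTX-style cache operator:
    high reuse -> 'ca' (cache in L1), low reuse -> 'cg' (bypass L1),
    in between -> 'cm' (left to the dynamic, run-time scheme)."""
    if reuse_score >= good:
        return "ca"
    if reuse_score <= bad:
        return "cg"
    return "cm"

# toy per-load reuse scores for three hypothetical global loads
plan = {name: choose_cache_operator(score)
        for name, score in {"ld1": 0.8, "ld2": 0.1, "ld3": 0.4}.items()}
```

Only the "medium" loads are left for the dynamic scheme, which keeps the run-time decision cheap.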
Performance Model
• Definition: Traffic Reduction Graph G(V, E)
– v ∈ V: global load instructions
– e ∈ E: reuses between instructions
– weighted by L2 cache traffic: weight(v_i), weight(e_i,j)
• Selecting the loads to cache is formulated as a max-clique problem
[Figure: example graph with vertices V1–V5]
“An Efficient Compiler for Cache Bypassing on GPUs”, International Conference on Computer Aided Design (ICCAD), 2013
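The max-clique formulation can be illustrated with a brute-force max-weight clique search over a toy traffic-reduction graph. All vertex and edge weights below are made up; this is only a sketch, feasible because a single kernel has few global loads:

```python
from itertools import combinations

def clique_weight(nodes, vw, ew):
    """Total L2 traffic saved by caching this set: vertex weights plus
    the reuse-edge weights among them."""
    w = sum(vw[v] for v in nodes)
    w += sum(ew.get(frozenset(p), 0) for p in combinations(nodes, 2))
    return w

def max_weight_clique(vertices, vw, ew):
    """Exhaustive search: fine for the handful of loads in one kernel."""
    best, best_w = set(), 0
    for r in range(1, len(vertices) + 1):
        for cand in combinations(vertices, r):
            # a clique needs every pair connected by a reuse edge
            if all(frozenset(p) in ew for p in combinations(cand, 2)):
                w = clique_weight(cand, vw, ew)
                if w > best_w:
                    best, best_w = set(cand), w
    return best, best_w

vw = {"v1": 3, "v2": 2, "v3": 4, "v4": 1, "v5": 2}
ew = {frozenset({"v1", "v2"}): 5, frozenset({"v1", "v3"}): 2,
      frozenset({"v2", "v3"}): 1, frozenset({"v4", "v5"}): 1}
cached, saved = max_weight_clique(list(vw), vw, ew)
```

Here the clique {v1, v2, v3} wins: caching those three loads together saves the most modeled traffic.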
Performance Results (1/2)
Cache-sensitive applications on a 16 KB cache:
– Average 1.32X performance improvement
– 8.6% energy savings
[Chart: normalized IPC for Default, Static, Dynamic, and Coordinated bypassing — Coordinated averages 1.32X]
Register File on GPUs
[Chart: register file size growing across Tesla, Fermi, and Kepler]
Large register file: 256 KB, larger than the L1 cache and shared memory combined (64 KB), and still increasing
[Figure: thread blocks sharing the per-SM register file]
“Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs”, IEEE/ACM International Symposium on Microarchitecture (MICRO), December, 2015
Thread Throttling Technique
Mitigate cache contention
• Balance between parallelism and cache contention
[Chart: performance vs. #thread blocks per SM — performance peaks at OptTLP, below MaxTLP]
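The throttling trade-off can be mimicked with a toy analytical model. The curve below is entirely synthetic; only its rise-then-fall shape, with a peak (OptTLP) below the maximum residency (MaxTLP), matters:

```python
def perf(tlp):
    """Synthetic performance: latency hiding improves with TLP but with
    diminishing returns, while cache contention grows quadratically."""
    hiding = 1 - 0.5 ** tlp        # diminishing latency-hiding benefit
    contention = 0.02 * tlp * tlp  # growing cache-contention cost
    return hiding - contention

max_tlp = 8                              # MaxTLP: all blocks resident
opt_tlp = max(range(1, 9), key=perf)     # OptTLP under this toy model
```

Under this model the optimum sits well below full residency, which is exactly why throttling helps cache-sensitive kernels.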
Thread Throttling helps, but …
[Chart: thread throttling yields a 1.42X speedup, but register utilization drops by 51.3%]
Register under-utilization
Performance Impact of Register Allocation
Fewer registers per thread cause register spilling and inserted spill code, hurting single-thread performance; more registers per thread reduce thread-level parallelism
Current Optimization Tool-chain
[Figure: current tool-chain — register allocation compiles PTX code to binary; thread throttling is applied separately]
PTX code example:
mov %r0, %tid.x; mov %r1, %ntid.x; mul %r3, %r2, %r1; add %r4, %r3, %r0; …
Thread throttling is constrained by registers, shared memory, thread-block limits, thread limits, and other resources; deciding allocation and throttling in isolation leaves registers under-utilized
Motivational Example (CFD)
[Charts: IPC, L1 cache hit rate, and register utilization for four configurations]
– MaxTLP: maximum TLP (TLP = 8, Reg = 32)
– OptTLP: optimal TLP (TLP = 7, Reg = 32)
– OptTLP + Reg (TLP = 7, Reg = 36)
– CRAT: coordinated (TLP = 5, Reg = 50)
Design Space
[Figure: the design space trades single-thread performance against TLP]
Complex Design Space Trade-off
[Figure: CRAT flow — input: original GPU PTX kernel; design space pruning, register allocation, and spilling optimization produce the optimized GPU PTX kernel as output]
Input:
.entry PTXkernel(){ … mul.lo.s32 %r3, %r2, %r1; add.s32 %r4, %r0, %r3; add.s32 %r3, %r2, %r1; sub.s32 %r5, %r2, %r1; … }
Output:
.entry PTXkernel(){ … mul.lo.s32 %r1, %r2, %r1; add.s32 %r2, %r0, %r1; … }
CRAT: Coordinated Register Allocation and Thread-level Parallelism Optimization
Design Space Pruning
Design space: bounded by MaxTLP, OptTLP, MinReg, and MaxReg
[Figure: in the (TLP, registers) space, possible solutions above OptTLP suffer cache contention; pruning leaves the candidate solutions between OptTLP and MaxReg]
Register Allocation
Register allocator
• Implemented in GPGPU-Sim on static single assignment (SSA) form
• Based on the Chaitin-Briggs register allocator
• Flow: control-flow analysis → data-flow analysis → register coloring → spill code insertion
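A minimal sketch of Chaitin-Briggs style coloring on an interference graph, under a toy example: simplify nodes of degree < k, optimistically push a spill candidate otherwise, then color in reverse removal order. The real allocator additionally handles SSA form, coalescing, and spill-cost heuristics:

```python
def chaitin_briggs(graph, k):
    """graph: {value: set of interfering values}; k registers.
    Returns {value: register index or None (None means spill)}."""
    g = {v: set(ns) for v, ns in graph.items()}
    stack = []
    while g:
        # prefer a trivially colorable node (degree < k)
        v = next((v for v in sorted(g) if len(g[v]) < k), None)
        if v is None:
            # no such node: optimistically push a spill candidate
            v = max(sorted(g), key=lambda u: len(g[u]))
        stack.append(v)
        del g[v]
        for ns in g.values():
            ns.discard(v)
    colors = {}
    for v in reversed(stack):
        used = {colors[n] for n in graph[v] if n in colors}
        free = [c for c in range(k) if c not in used]
        colors[v] = free[0] if free else None  # None -> must spill
    return colors

# triangle: three mutually interfering values, only k = 2 registers
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
alloc = chaitin_briggs(triangle, k=2)
```

With two registers one of the three values must spill, which is where the spilling optimization on the next slide takes over.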
Spilling Optimization
Spill to shared memory if possible
[Figure: after register coloring, spilled variables (V0, V1, …) on the spill stack are split into sub-stacks; spills are placed in fast shared memory when space allows, and only the remainder goes to slow local memory]
Performance Metric
• TPSC: Thread-level Parallelism and Spill Cost

TPSC = TLP_gain / (1 + Spill_cost)
TLP_gain = (TLP × BlockSize) / MaxThreadBlockSize
Spill_cost = Cost_local × Num_local + Cost_shm × Num_shm + Num_others
[Figure: candidate solutions are ranked by their mix of main-memory, shared-memory, and computing instructions together with their TLP]
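A sketch of how a TPSC-style score could rank candidate (TLP, register budget) solutions. The formula shape and the cost constants below are assumptions read off the slide, not the paper's exact definition:

```python
COST_LOCAL, COST_SHM = 10, 2   # assumed relative costs of spill targets

def tpsc(tlp, block_size, max_threads, n_local, n_shm, n_others=0):
    """Score a candidate: normalized TLP gain over its spill cost."""
    tlp_gain = tlp * block_size / max_threads
    spill_cost = COST_LOCAL * n_local + COST_SHM * n_shm + n_others
    return tlp_gain / (1 + spill_cost)

# hypothetical candidates for a 2048-thread SM and 256-thread blocks
candidates = [
    dict(tlp=8, block_size=256, max_threads=2048, n_local=6, n_shm=0),
    dict(tlp=7, block_size=256, max_threads=2048, n_local=0, n_shm=2),
    dict(tlp=5, block_size=256, max_threads=2048, n_local=0, n_shm=0),
]
best = max(candidates, key=lambda c: tpsc(**c))
```

Under these made-up numbers the spill-free TLP = 5 candidate wins over the higher-TLP candidates that pay for spills, mirroring the CFD example where CRAT settles on TLP = 5.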
Experimental Evaluation
[Charts: normalized IPC of MaxTLP, OptTLP, and CRAT; normalized energy of OptTLP and CRAT]
• Speedup: 1.25X
• Energy saving: 16.5%
Performance Analysis
[Charts: cache contention — thread blocks per SM drop from 5.1 (MaxTLP) to 2.6 (CRAT); register utilization of OptTLP vs. CRAT on ESP, DTC, FDTD, CFD, HST, BLK, STE; local memory accesses on DTC, FDTD, CFD, STE, and average]
Experimental Results
Kepler architecture
– 1.32X IPC (compared with OptTLP)
[Chart: overall normalized IPC of MaxTLP, OptTLP, and CRAT on STM, ESP, SPMV, KMN, LBM, DTC, FDTD, CFD, HST, BLK, STE, and the geometric mean]
CRAT: Open Source Project
http://ceca.pku.edu.cn/crat/
Downloads from CMU, Michigan, USC, etc. Invited internship at IBM T. J. Watson.
Multitasking for GPUs: Software Solution
“Efficient GPU Spatial-Temporal Multitasking”, IEEE Transactions on Parallel and Distributed Systems (TPDS), March, 2015
[Figure: spatial-temporal multitasking — a set of independent applications, each with its binary and profile, are co-scheduled; thread blocks (ids 0–5) of the different applications are interleaved via a leaky-bucket policy]
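The leaky-bucket thread-block interleaving can be sketched as a credit scheme: each kernel's bucket fills in proportion to its share, and the fullest bucket emits the next block of the merged grid. The function, shares, and bucket rule below are illustrative assumptions:

```python
def interleave(n_blocks, shares):
    """shares: {kernel: weight}. Returns a mapKernel-style list giving
    the kernel id for each block of the merged grid."""
    total = sum(shares.values())
    credit = {k: 0.0 for k in shares}
    mapping = []
    for _ in range(n_blocks):
        for k in credit:                      # each bucket fills by share
            credit[k] += shares[k] / total
        winner = max(credit, key=credit.get)  # fullest bucket fires
        credit[winner] -= 1.0                 # ... and leaks one block
        mapping.append(winner)
    return mapping

# kernel A gets twice the SM share of kernel B
mapping = interleave(6, {"A": 2, "B": 1})
```

The resulting mapping plays the role of the mapKernel array consumed by the merged scheduler kernel on the next slide.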
Multitasking for GPUs: Software Solution
Host (CPU):
  compute_mapping(); // compute mapKernel and mapBlock
  scheduler(……);
Device (GPU):
  __global__ void scheduler(…, mapBlk, mapKernel, gridDim_A, blkDim_A, gridDim_B, blkDim_B) {
    // bid is the block identifier within the merged scheduler kernel
    kernel_id = mapKernel[bid];
    if (kernel_id == 0)
      Kernel_A(…, mapBlk, blkDim_A, gridDim_A);
    else
      Kernel_B(…, mapBlk, blkDim_B, gridDim_B);
  }
[Chart: performance improvement (%) on Kepler GTX 680 and Kepler K20]
Multitasking for GPUs: Hardware Solution
• TLP modulation: blocks of grid A and grid B share each SM (SM 0 … SM 14)
• Cache bypassing: selected blocks bypass the L1 cache and access the L2 cache directly
[Chart: normalized IPC of TLP modulation vs. TLP modulation + cache bypassing across kernel pairs (BLK_HST, SPM_HST, SRD_KMS, KMS_STC, LBM_BKP, BLK_BKP, SPM_BKP, SPM_BLK, LBM_BLK, HST_KMS, LBM_SPM, SPM_SRD, SPM_KMS, LBM_KMS, LBM_HST, BLK_KMS, BLK_SRD, LBM_SRD, HST_STC, BKP_STC, LBM_STC, BKP_KMS, SPM_STC, BLK_STC, SRD_STC) and the geometric mean]
"Efficient Kernel Management on GPUs", in Proceedings of Design, Automation and Test in Europe (DATE), March 2016.
Control Flow Divergence Modeling
[Figure: program control flow graph; each thread is summarized by a Basic Block Vector (BBV) of per-block execution counts]
1. sub r0, r1, r2
2. mul r0, r2, 3
3. load r2, cb[r4]
4. madd r1, r2, r3
5. cmp r1
"An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization", Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2012.
(1) If statement:
D = input[tid];
if (D > 2)
{
  // computation;
}
(2) If-else statement:
D = input[tid];
if (D > 2)
{
  …
} else {
  if (…) // nested divergence
}
(3) For loop statement:
D = input[tid];
for (i = D; i < 100; i++) { // computation }
Control Flow Divergence Modeling
[Figure: static schedule vs. dynamic schedule of thread blocks tb0–tb5 across SM0 and SM1, un-weighted vs. weighted]
[Chart: speedup of the Sorting, Greedy, and K-means strategies on MC, SW, NW, SL, and SM]
• Simple sorting – each thread is represented by its BBV
• Greedy – repeatedly merges the two closest threads
• Clustering – K-means clustering
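The simple-sorting strategy can be sketched as follows, assuming toy BBVs and a warp size of 4 for readability (real warps are 32 threads; the BBV values are made up):

```python
WARP = 4  # real warps are 32 threads; 4 keeps the example readable

def regroup(bbvs):
    """bbvs: {tid: tuple of basic-block execution counts}.
    Sort threads by BBV so similar control flow shares a warp."""
    order = sorted(bbvs, key=lambda t: bbvs[t])
    return [order[i:i + WARP] for i in range(0, len(order), WARP)]

# two control-flow classes: counts for blocks [entry, then, else] —
# even tids took the branch, odd tids did not
bbvs = {0: (1, 1, 0), 1: (1, 0, 1), 2: (1, 1, 0), 3: (1, 0, 1),
        4: (1, 1, 0), 5: (1, 0, 1), 6: (1, 1, 0), 7: (1, 0, 1)}
warps = regroup(bbvs)
```

In the default tid order every warp would mix both branch outcomes; after sorting, each warp is control-flow uniform, so no lane sits idle at the divergent branch.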
Conclusion
• Ubiquitous GPU Computing
– Supercomputer, datacenter, embedded, IoT
• Challenges
– Performance optimization
• Contribution
– Automatic performance analysis and optimization techniques