Divergence-Aware Warp Scheduling
Timothy G. Rogers (1), Mike O’Connor (2), Tor M. Aamodt (1)
(1) The University of British Columbia, (2) NVIDIA Research
MICRO 2013, Davis, CA
Tim Rogers Divergence-Aware Warp Scheduling 2
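[Figure: a GPU built from streaming multiprocessors. Each streaming multiprocessor has a warp scheduler choosing among warps (W1, W2, …), a memory unit, and an L1D cache, backed by an L2 cache and main memory. Threads are grouped into warps.]
• 10,000s of concurrent threads
• Grouped into warps
• Scheduler picks warp to issue each cycle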
2 Types of Divergence
Branch Divergence
• Threads in a warp take different paths at a branch, e.g. if(…) { … } with active mask 1 0 1 0
• Affects functional unit utilization
• Aware of branch divergence
Memory Divergence
• Threads in a warp access different lines in main memory
• Can waste memory bandwidth
• Aware of memory divergence AND focus on improving performance
Motivation
• Transfer locality management from SW to HW
  • Software solutions:
    • Complicate programming
    • Not always performance portable
    • Not guaranteed to improve performance
    • Sometimes impossible
• Improve performance of programs with memory divergence
  • Parallel irregular applications
  • Economically important (server computing, big data)
Programmability Case Study: Sparse Vector-Matrix Multiply
2 versions from SHOC
Divergent Version
• Divergence
• Each thread has locality
GPU-Optimized Version
• Added complication
• Dependent on warp size
• Parallel reduction
• Explicit scratchpad use
Previous Work vs. Divergence-Aware Warp Scheduling
Previous Work
• Scheduling used to capture intra-thread locality (MICRO 2012)
• Reactive: detects interference (lost locality), then throttles
• Unaware of branch divergence: all warps treated equally
• Outperformed by profiled static throttling
• Case Study: Divergent code 50% slowdown
Divergence-Aware Warp Scheduling
• Proactive: predict and be proactive
• Branch divergence aware: adapt to branch divergence
• Outperforms static solution
• Case Study: Divergent code <4% slowdown
[Figure: a warp scheduler in front of the memory unit and L1D issuing Go/Stop decisions to warps W1, W2, W3, …, WN, shown with example active masks 1 0 1 1 0 0 and 1 1 1 1 1 1.]
Divergence-Aware Warp Scheduling
How to be proactive
• Identify where locality exists
• Limit the number of warps executing in high locality regions
Adapt to branch divergence
• Create cache footprint prediction in high locality regions
• Account for number of active lanes to create per-warp footprint prediction
• Change the prediction as branch divergence occurs
Where is the locality?
• Examine every load instruction in program
[Figure: bar chart of hits/misses PKI (y-axis 0 to 60) for the static load instructions in the GC workload, Load 1 through Load 13, with the loads inside a loop marked.]
Locality Concentrated in Loops
Locality In Loops Limit Study
How much data should we keep around?
[Figure: stacked bar (average) of the fraction of cache hits in loops, y-axis 0 to 1, split into: line accessed this iteration, line accessed last iteration, other.]
Hits on data accessed in immediately previous trip
DAWS Objectives
1. Predict the amount of data accessed by each warp in a loop iteration.
2. Schedule warps in loops so that aggregate predicted footprint does not exceed L1D.
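The second objective can be pictured as a budget check. The sketch below is an illustration only, not the hardware mechanism from the talk: the function name `admit_warps`, the use of cache lines as the unit, and the index-order priority are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the talk): decide which warps may issue
// inside a loop with locality. footprints[w] is warp w's predicted
// footprint in cache lines; capacity_lines is the L1D capacity in lines.
// Warps are visited in index order (an assumed priority order). A warp is
// marked "Go" only if adding its footprint keeps the running total within
// capacity; otherwise it is marked "Stop" and the next warp is tried.
std::vector<bool> admit_warps(const std::vector<int>& footprints,
                              int capacity_lines) {
    std::vector<bool> go(footprints.size(), false);
    int used = 0;
    for (std::size_t w = 0; w < footprints.size(); ++w) {
        if (used + footprints[w] <= capacity_lines) {
            used += footprints[w];
            go[w] = true;  // "Go"
        }                  // else "Stop"
    }
    return go;
}
```

For instance, with a capacity of 4 lines, footprints {4, 3} let only the first warp proceed, while footprints {1, 3} let both proceed.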
Observations that enable prediction
• Memory divergence in static instructions is predictable
• Data touched by divergent loads depends on active mask
[Figure: the same static load diverges in Warp 0 and in Warp 1. For a divergent load, a warp with active mask 1 1 1 1 makes 4 accesses to main memory; a warp with active mask 1 0 1 0 makes 2 accesses.]
Both Used To Create Cache Footprint Prediction
Online characterization to create cache footprint prediction
1. Detect loops with locality
   • Some loops have locality, some don’t
   • Limit multithreading in the loops with locality
2. Classify loads in the loop
   • while(…) { load 1 … load 2 }
   • load 1: Diverged
   • load 2: Not Diverged
3. Compute footprint from active mask
   • Warp 0 in the loop with locality, four active lanes
   • Diverged load: 4 accesses
   • Not diverged load: 1 access
   • Warp 0’s Footprint = 4 + 1 = 5 cache lines
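The footprint arithmetic in step 3 can be written out as a short sketch. It assumes, as the slide's example suggests, that a diverged load touches one cache line per active lane and a non-diverged load touches one line per warp; the names `LoadKind` and `predict_footprint` are hypothetical and chosen only for this illustration.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Per-load classification, mirroring step 2 (hypothetical type name).
enum class LoadKind { Diverged, NotDiverged };

// Sketch of step 3 under the stated assumptions: a diverged load counts one
// cache line per active lane, a non-diverged load counts one line in total.
int predict_footprint(std::uint32_t active_mask,
                      const std::vector<LoadKind>& loads_in_loop) {
    const int active_lanes =
        static_cast<int>(std::bitset<32>(active_mask).count());
    if (active_lanes == 0) return 0;  // no active lanes, nothing is accessed
    int lines = 0;
    for (LoadKind kind : loads_in_loop) {
        lines += (kind == LoadKind::Diverged) ? active_lanes : 1;
    }
    return lines;
}
```

With four active lanes (mask 0xF), one diverged load, and one non-diverged load, the sketch gives 4 + 1 = 5 cache lines, the same total as Warp 0 in the slide.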
DAWS Operation Example
Example Compressed Sparse Row Kernel
int C[]={0,64,96,128,160,160,192,224,256};
void sum_row_csr(float* A, …) {
    float sum = 0;
    int i = C[tid];            // first element of this thread's row
    while(i < C[tid+1]) {      // divergent branch: rows differ in length
        sum += A[i];           // memory divergence: each thread reads its own row
        ++i;
    }
    …
}
Time0
• Warp0, 1st Iter., active mask 1 1 1 1: Go
• Memory divergence: the load brings A[0], A[64], A[96], A[128] into the cache
• No locality detected = no footprint
Time1
• Warp0, 2nd Iter., active mask 1 1 1 1: Go
• Cache: A[0], A[64], A[96], A[128]; all four threads hit, followed by Hit x30 per thread
• Locality detected, 1 diverged load detected
• 4 active threads: Footprint = 4 x 1 = 4
• Want to capture spatial locality
• Warp1, 1st Iter., active mask 0 1 1 1: Stop at the loop
Time2
• Warp0, 33rd Iter., active mask 1 0 0 0: Go
• Warp 0 has branch divergence: footprint decreased
• Warp1, 1st Iter., active mask 0 1 1 1: Footprint = 3 x 1, Go
• Early warps profile loop for later warps
• Cache: A[32], A[160], A[192], A[224]
• Both warps capture spatial locality together
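The active masks in this example follow directly from the row pointers in C[]. The sketch below recomputes them on the host; it is an illustration, not part of DAWS itself. The helper names `active_threads` and `loop_footprint` are hypothetical, the warp size of 4 matches the slide's drawing rather than real hardware, and the footprint rule (active threads x 1 diverged load) is taken from the slide.

```cpp
#include <cstddef>
#include <vector>

// Number of threads of `warp` still inside the CSR loop at 1-based
// iteration `iter`. Thread tid runs C[tid+1] - C[tid] loop iterations.
// warp_size defaults to 4 only to match the slide's example.
int active_threads(const std::vector<int>& C, int warp, int iter,
                   int warp_size = 4) {
    int active = 0;
    for (int lane = 0; lane < warp_size; ++lane) {
        const std::size_t tid =
            static_cast<std::size_t>(warp * warp_size + lane);
        if (tid + 1 >= C.size()) break;  // no row assigned to this thread
        const int trip_count = C[tid + 1] - C[tid];
        if (iter <= trip_count) ++active;
    }
    return active;
}

// With one diverged load in the loop, the predicted footprint is
// (active threads) x 1 cache lines, as on the slide.
int loop_footprint(const std::vector<int>& C, int warp, int iter) {
    return active_threads(C, warp, iter) * 1;
}
```

Using the slide's C[] = {0,64,96,128,160,160,192,224,256}, warp 0 has 4 active threads in its 1st iteration and 1 in its 33rd, and warp 1 has 3 in its 1st iteration (its first thread has an empty row). That matches the masks 1 1 1 1, 1 0 0 0, and 0 1 1 1, and the Time2 footprints of 1 and 3 sum to 4.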
Methodology
GPGPU-Sim (version 3.1.0)
• 30 Streaming Multiprocessors
• 32 warp contexts (1024 threads total)
• 32 KB L1D per streaming multiprocessor
• 1 MB L2 unified cache
Compared Schedulers
• Cache-Conscious Wavefront Scheduling (CCWS)
• Profile based Best-SWL
• Divergence-Aware Warp Scheduling (DAWS)
More schedulers in paper
Sparse MM Case Study Results
• Performance (normalized to optimized version)
[Figure: bar chart of divergent code execution time, normalized to the optimized version, y-axis 0 to 2.]
Within 4% of optimized with no programmer input
Sparse MM Case Study Results
• Properties (normalized to optimized version)
[Figure: bar chart of divergent code off-chip accesses, normalized to the optimized version, y-axis 0 to 3, with one off-scale bar labeled 13.3.]
• <20% increase in off-chip accesses
• Divergent code issues 2.8x fewer instructions
• Divergent version now has potential energy advantages
Cache-Sensitive Applications
• Breadth First Search (BFS)
• Memcached-GPU (MEMC)
• Sparse Matrix-Vector Multiply (SPMV-Scalar)
• Garbage Collector (GC)
• K-Means Clustering (KMN)
Cache-Insensitive Applications in paper
Results
[Figure: bar chart of normalized speedup (y-axis 0 to 1.8) for CCWS, Best-SWL, and DAWS on BFS, MEMC, SPMV-Scalar, GC, KMN, and HMean.]
• Outperforms Best-SWL in highly branch divergent applications
• Overall 26% improvement over CCWS
Summary
Divergent loads in GPU programs
• Software solutions complicate programming
DAWS
• Captures opportunities by accounting for divergence
Overall 26% performance improvement over CCWS
Case Study: Divergent code performs within 4% of code optimized to minimize divergence
Questions?