Yunsup Lee – UC Berkeley 1
Why is Supporting Control Flow Challenging in Data-Parallel Architectures?
for (i = 0; i < 16; i++) {
  a[i] = op0;
  b[i] = op1;
  if (a[i] < b[i]) {
    c[i] = op2;
  } else {
    c[i] = op3;
  }
  d[i] = op4;
}
[Figure: Thread0–Thread15 mapped onto a data-parallel issue unit]
• The divergence management architecture must not only partially sequence all execution paths for correctness, but also reconverge threads from different execution paths for efficiency.
For GPUs, Supporting Complex Control Flow in a SPMD Program is not Optional!
• Traditional vector compilers can always give up and run complex control flow on the control processor.
• However, for the SPMD compiler, supporting complex control flow is a functional requirement rather than an optional performance optimization.
Kernel() {
  if (a[tid] < b[tid]) {
    if (f[tid]) {
      c[tid]->vfunc();
      goto SKIP;
    }
    d[tid] = op;
  }
SKIP:
}
Kernel<<<16>>>();
for (i = 0; i < 16; i++) {
  if (a[i] < b[i]) {
    if (f[i]) {
      c[i]->vfunc();
      goto SKIP;
    }
    d[i] = op;
  }
SKIP:
}
Design Space of Divergence Management
The design space spans from Software (Explicitly Scheduled by Compiler) to Hardware (Implicitly Managed by Microarchitecture):
• Vector Predication (compiler figures out reconvergence points)
• Predication + Fork/Join
• Divergence Stack, with Predication (used in limited cases)
We can perform a design space exploration of divergence management on NVIDIA GPU silicon.
Executive Summary, Contributions of Paper

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens)
• CUDA 6.5 Production Compiler (Divergence Stack + Limited Predication) → CUDA Binary
• Modified CUDA 6.5 Production Compiler (Full Predication with New Compiler Algorithms) → CUDA Binary
• Both binaries run on an NVIDIA Tesla K20c (Kepler, GK110), collecting performance and statistics
Performance with predication is on par with performance with the divergence stack.
(1) Detailed Explanation and Categorization of Hardware and Software Divergence Management Schemes
(2) SPMD Predication Compiler Algorithms
(3) Apples-to-Apples Comparison using Production Silicon and Compiler
How do the hardware divergence stack and software predication handle control flow?
If-Then-Else Example: Divergence Stack
CUDA Program → (LLVM Compiler) → PTX
• The LLVM compiler takes a CUDA program and generates PTX, which encodes all data/control dependences.
• Note: all instructions on the right, despite being scalar instructions, are executed in SIMD fashion.
Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();
       a = op0
       b = op1
       p = slt a, b
       branch.eqz p, else
       c = op2
       j ipdom
else:  c = op3
ipdom: d = op4
PTX → (“ptxas” Backend Compiler) → SASS
If-Then-Else Example: Divergence Stack
CUDA Program → (LLVM Compiler) → PTX → (“ptxas” Backend Compiler) → SASS
       a = op0
       b = op1
       p = slt a, b
       push ipdom
       branch.eqz p, else
       c = op2.pop
else:  c = op3.pop
ipdom: d = op4
• The “ptxas” backend compiler takes PTX and generates SASS instructions, which execute natively on the GPU.
• Reconvergence points are analyzed and inserted by the backend compiler.
Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();
       a = op0
       b = op1
       p = slt a, b
       push ipdom
       branch.eqz p, else
       c = op2.pop
else:  c = op3.pop
ipdom: d = op4
If-Then-Else Example: Divergence Stack
Mask | Op
1111 | op0
1111 | op1
1111 | slt    | push
1111 | branch
1100 | op2    | pop
0011 | op3    | pop
1111 | op4

• Push: pushes <reconverge pc, current mask> to the stack.
• Pop: disregards the current <pc, mask>, pops the top of the stack, and executes the deferred <pc, mask>.

Divergence stack contents over time:
  after push:           <ipdom, 1111>
  after branch:         <else, 0011> on top of <ipdom, 1111>
  after the pop on op2: <ipdom, 1111>
  after the pop on op3: empty
Assume 4 threads are executing, and that threads 0 and 1 take the branch.
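The push/pop behavior above can be sketched in a few lines of Python. This is a minimal simulation of the divergence stack (not the actual hardware); masks are strings whose leftmost bit is thread 0, matching the slide.

```python
# Minimal sketch of the divergence stack on the if-then-else example.
def inv(mask):
    """Complement a mask string."""
    return "".join("0" if b == "1" else "1" for b in mask)

all_mask = "1111"
then_mask = "1100"               # threads where a < b holds
else_mask = inv(then_mask)       # "0011"

stack = []                       # divergence stack of (pc, mask) tokens
trace = []                       # (op, mask) pairs actually issued

stack.append(("ipdom", all_mask))    # push <reconverge pc, current mask>
stack.append(("else", else_mask))    # divergent branch defers the else path
trace.append(("op2", then_mask))     # then path executes first

pc, mask = stack.pop()               # .pop on op2: resume <else, 0011>
trace.append(("op3", mask))

pc, mask = stack.pop()               # .pop on op3: reconverge at <ipdom, 1111>
trace.append(("op4", mask))

print(trace)   # [('op2', '1100'), ('op3', '0011'), ('op4', '1111')]
```

The final entry shows all four threads reconverged at the immediate post-dominator before op4.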
If-Then-Else Example: Predication
• The compiler can schedule instructions
• Predicates also encode reconvergence information

Predication:
Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();
     a = op0
     b = op1
     f0 = slt a, b
@f0  c = op2
@!f0 c = op3
     d = op4
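A minimal per-thread sketch of the predicated version: there is no branch at all — every thread steps through the same instruction sequence, and the @f0 / @!f0 guards decide whether each write takes effect. The operand values below are made up for illustration.

```python
# Each thread runs the same straight-line sequence; guards nullify writes.
def run_thread(a, b):
    f0 = a < b                 # f0 = slt a, b
    c = None
    if f0:                     # @f0  c = op2 (write only if guard is true)
        c = "op2"
    if not f0:                 # @!f0 c = op3
        c = "op3"
    d = "op4"                  # d = op4: unguarded, all threads execute
    return c

# one thread takes each side of the original if-then-else
results = [run_thread(a, b) for (a, b) in [(0, 1), (2, 1)]]
print(results)   # ['op2', 'op3']
```

Reconvergence is implicit: once past the guarded region, all threads are active again at `d = op4`.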
Uniform Branch Conditions Across All Threads
Divergence Stack:
       a = op0
       b = op1
       p = slt a, b
       push ipdom
       branch.eqz p, else
       c = op2.pop
else:  c = op3.pop
ipdom: d = op4
• What if the branch condition is uniform across all threads?
• With predication, instructions guarded by a null predicate are still issued (they just have no effect).
• With the divergence stack, branches don’t push a token when the branch condition is uniform, so no extra work is issued.
Predication:
     a = op0
     b = op1
     f0 = slt a, b
@f0  c = op2
@!f0 c = op3
     d = op4
Runtime Branch-Uniformity Optimization with Consensual Branches
Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();
       a = op0
       b = op1
       f0 = slt a, b
       cbranch.ifnull f0, else
@f0    c = op2
else:  cbranch.ifnull !f0, ipdom
@!f0   c = op3
ipdom: d = op4
• We can optimize the predicated code with a consensual branch (cbranch), which is taken only when all threads consensually agree on the branch condition (ifnull).
• The code may jump around unnecessary work.
Thread-Aware Predication
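A hedged sketch of what the consensual branch buys at runtime: `cbranch.ifnull` skips a guarded region whenever its predicate is null across all threads, so uniform branches avoid issuing dead work. The mask values below are made-up examples.

```python
# Simulate which ops are issued under the runtime branch-uniformity
# optimization, for a given per-thread predicate vector f0.
def run(f0):
    issued = []
    if any(f0):                       # cbranch.ifnull f0, else (fall through)
        issued.append("op2")          # @f0 c = op2
    if any(not b for b in f0):        # cbranch.ifnull !f0, ipdom
        issued.append("op3")          # @!f0 c = op3
    issued.append("op4")              # ipdom: d = op4
    return issued

print(run([True] * 4))                  # uniformly taken: op3 never issued
print(run([True, False, True, False]))  # divergent: both sides issued
```

With plain predication both sides are always issued; the consensual branch recovers the uniform case at the cost of an extra branch instruction.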
       a = op0
       b = op1
       f0 = slt a, b
       cbranch.ifnull f0, else
       c = op2
else:  cbranch.ifnull !f0, ipdom
       c = op3
ipdom: d = op4
Static Branch-Uniformity Optimization with Consensual Branches
Kernel() { a = op0; b = op1; if (a < b) { c = op2; } else { c = op3; } d = op4; } Kernel<<<n>>>();
• If the compiler can prove that the branch condition is uniform across all threads, the compiler can omit the guard predicates.
Thread-Aware Predication
Loop Example: Consensual Branches are Key to Compile Loops with Predication
      f0 = true
loop: cbranch.ifnull f0, exit
@f0   a = op0
@f0   b = op1
@f0   f1 = slt a, b
      f0 = and f0, !f1
      j loop
exit: c = op2
• Intuitively, the compiler needs to keep sequencing the loop body until all threads are done executing the loop.
• A consensual branch (cbranch.ifnull) is used to check whether loop mask (f0) is null.
Kernel() {
  done = false;
  while (!done) {
    a = op0;
    b = op1;
    done = a < b;
  }
  c = op2;
}
Kernel<<<n>>>();
Thread-Aware Predication
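The loop-mask mechanics can be sketched as a small simulation. Here `f0` is the loop mask, each pass issues the body once for all still-active threads, and the consensual branch exits once `f0` is null; the per-thread trip counts are hypothetical inputs, not from the paper.

```python
# Simulate the predicated loop: count how many times the body is issued
# when each thread wants a different number of iterations.
def run(trip_counts):
    f0 = [True] * len(trip_counts)          # f0 = true
    iterations = 0
    while any(f0):                          # cbranch.ifnull f0, exit
        iterations += 1
        # @f0 body; per-thread "done" plays the role of f1 = slt a, b
        done = [iterations >= t for t in trip_counts]
        # f0 = and f0, !f1 : finished threads drop out of the loop mask
        f0 = [a and not d for a, d in zip(f0, done)]
    return iterations                       # exit: c = op2

print(run([1, 3, 2, 3]))   # the loop is issued max(trip_counts) times
```

As with the divergence stack, the SIMD unit pays for the longest-running thread; the loop mask just expresses that in predicate registers instead of stack tokens.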
Thread-Aware (TA) Predication Compiler Algorithms
Predication Compiler Algorithms
[Figure: CFG and CDG for the kernel below. CDG edges carry the branch predicates (P1/!P1 under N2, P2/!P2 under N4); walking the CDG from a block to the root conjoins these edge predicates, e.g. a block under both P1 and !P2 gets Guard Predicate = P1 && !P2.]
Kernel() {
  N1;
  N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) {
      N5;
    } else {
      N6;
    }
    N7;
  }
  N8;
}
Kernel<<<n>>>();
1. Generate CFG and CDG
2. Walk the CDG to get guard predicates for all basic blocks
3. Linearize control flow
4. Predicate all instructions with their guard predicate and rewire all basic blocks
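Step 2 can be sketched in a few lines. This uses a hand-written control-dependence map for the slide's kernel — each block maps to (controlling block, edge predicate), or None if it is control-independent; the map itself is my reading of the figure, not compiler output.

```python
# Walk a (hand-written) CDG to compute guard predicates per basic block.
cdg = {
    "N1": None, "N2": None, "N8": None,   # control-independent: always run
    "N3": ("N2", "!P1"),                  # if (!P1) { N3 }
    "N4": ("N2", "P1"),
    "N7": ("N2", "P1"),
    "N5": ("N4", "!P2"),                  # if (!P2) { N5 }
    "N6": ("N4", "P2"),
}

def guard(block):
    """Conjoin edge predicates along the CDG path up to the root."""
    terms = []
    node = cdg[block]
    while node is not None:
        parent, pred = node
        terms.append(pred)
        node = cdg[parent]
    return " && ".join(reversed(terms)) or "true"

print(guard("N5"))   # P1 && !P2 — matches the slide's guard predicate
print(guard("N8"))   # true — no guard needed
```

After this walk, linearization emits the blocks in a sequential order and each instruction is predicated on its block's guard.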
Runtime Branch-Uniformity Optimization
[Figure: same CFG, CDG, kernel, and compiler steps as on the previous slide]
Assume compiler cannot prove that P1 is uniform across all threads
During control-flow linearization, add a consensual branch that skips the then-side blocks when P1 is null, and another that skips the else-side blocks when !P1 is null.
Static Branch-Uniformity Optimization
[Figure: same CFG, CDG, kernel, and compiler steps as on the previous slides]
Assume compiler can prove that P1 is uniform across all threads
Guard Predicate = P2
[Figure: CFG and thread-aware CDG for the kernel below. The thread-aware CDG adds a loop node (L1), a loop mask, and two exit masks — one (E1) associated with the while condition (P2|E1) and one (E2) with the break (!P3|E2).]
Kernel() {
  if (!P1) {
    while (!P2) {
      if (!P3) {
        break;
      }
    }
  }
}
Kernel<<<n>>>();
Thread-Aware Control Dependence Graph (CDG) and CFG
Supporting Complex Control Flow
• Function Calls: supported by a straightforward calling convention
• Virtual Function Calls: with predication, sequence active threads through each unique call target
• Irreducible Control Flow: find the smallest region containing the irreducible control flow, and insert sequencing code at the entry and exit blocks to sequence active threads through the region one by one

With divergence stack:
      @p2 jalr r3

With predication:
loop: p0, r4 = find_unique p2, r3
      @p0 jalr r4
      p2 = p2 and !p0
      cbranch.ifany p2, loop
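A hedged simulation of the predicated virtual-call loop: given a per-thread function-pointer vector (`r3`) and an active mask (`p2`), `find_unique` picks one outstanding target at a time and calls it with the mask of threads that share it. The target names "f"/"g" are made up for illustration.

```python
# Serialize a divergent indirect call: one real call per unique target.
def call_unique(p2, r3):
    calls = []
    p2 = list(p2)
    while any(p2):                                   # cbranch.ifany p2, loop
        leader = p2.index(True)                      # find_unique: first
        target = r3[leader]                          #   active thread's target
        p0 = [a and t == target for a, t in zip(p2, r3)]
        calls.append((target, tuple(p0)))            # @p0 jalr r4
        p2 = [a and not u for a, u in zip(p2, p0)]   # p2 = p2 and !p0
    return calls

calls = call_unique([True] * 4, ["f", "g", "f", "g"])
print(calls)   # f called once for threads {0,2}, g once for threads {1,3}
```

The loop iterates once per distinct target, so a fully uniform call costs a single iteration.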
Evaluation
Predication CUDA Compiler
• Theoretically, we could have implemented our thread-aware predication pass in ptxas
• Instead, we implemented the bulk of the predication pass in LLVM for fast prototyping
CUDA Program → (LLVM Compiler) → Annotated PTX → (“ptxas” Backend Compiler) → SASS
• LLVM Compiler: runs the predication passes and generates throw-away pseudo PTX instructions carrying the predication information
• “ptxas” Backend Compiler: runs existing optimizations, then predicates instructions with information retrieved from the pseudo PTX instructions
Evaluation: Quick Recap

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens)
• CUDA 6.5 Production Compiler (Divergence Stack + Limited Predication) → CUDA Binary
• Modified CUDA 6.5 Production Compiler (Full Predication with TA Compiler Algorithms) → CUDA Binary
• Both binaries run on an NVIDIA Tesla K20c (Kepler, GK110), collecting performance and statistics
Compare performance and statistics. Five bars:
1) Baseline (Divergence Stack)
2) Limited Predication
3) TA Predication
4) TA+SBU (static branch-uniformity optimization)
5) TA+SBU+RBU (runtime branch-uniformity optimization)
Performance Results: Geomean

[Chart: speedup over the divergence-stack baseline (y-axis 0–1.5) per benchmark — r-pathfinder, p-cutcp, p-sgemm, p-sad, p-mri-q, p-stencil, r-nn, p-lbm, fft, r-backprop, nqueens, r-b+tree, r-hotspot, r-srad-v2, p-gridding, p-tpacf, p-spmv, r-gaussian, p-histo, p-bfs, r-srad-v1, r-scluster, r-lud, r-bfs — with five bars each: Baseline with divergence stack, Divergence Stack+Limited Predication, TA, TA+SBU, TA+SBU+RBU]
• The thread-aware predication compiler is competitive with the baseline compiler (divergence stack)
• Both static and runtime branch-uniformity optimizations play an important role
• Performance doesn’t change with the limited if-conversion heuristic implemented in the production compiler
[Chart: geomean speedup]
Performance Results: Speedups
[Chart: same per-benchmark speedup chart as above, highlighting benchmarks with speedups]
• Predication can expose better scheduling opportunities
• Extra consensual branches added for +RBU may act as scheduling barriers
Performance Results: Slowdowns
• Ten benchmarks are inconclusive (>90%, <100%)
• Five benchmarks are in the <90% range:
  • An increase in register pressure can reduce occupancy, which sometimes reduces performance
  • The compiler is not able to optimize for all branch uniformity exhibited at runtime
[Chart: same per-benchmark and geomean speedup charts as above, highlighting benchmarks with slowdowns]
Discussion on Area, Power, Energy
• Hard to quantify impact of predication on area, power, energy
• The experiment was done on GPU silicon
• The primary motivation of software divergence management is to reduce hardware design complexity and associated verification costs
• The area and power overheads of the divergence stack are not significant
• Performance tracks power/energy consumption, so the power/energy consumption of software divergence management ≈ the power/energy consumption of hardware divergence management
Fundamental Advantages of Software Divergence Management

[Figure: control flow graph of a short-circuit condition, blocks N1–N7]
• Short-circuit example:
  • The divergence stack cannot reconverge threads at N4
  • Predication can reconverge threads at N4
Conclusions
• Advantages of Divergence Stack:
  • Enables a fairly conventional thread compilation model
  • Makes register allocation easier
  • Simplifies the task of supporting irreducible control flow
• Advantages of Predication:
  • Simplifies the hardware without sacrificing programmability
  • Actual cases where predication can outperform the divergence stack:
    • Better scheduling opportunities
    • Better reconvergence of threads
• For divergence management, pushing complexity to the compiler is a better choice.
This work was funded by DARPA award HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, an NVIDIA graduate fellowship, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. It was also funded by DOE contract B599861. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.