Yunsup Lee – UC Berkeley (hwacha.org/papers/predication-micro2014-talk.pdf)

Transcript
Page 1:

Yunsup Lee – UC Berkeley
hwacha.org/papers/predication-micro2014-talk.pdf

Page 2:

Why is Supporting Control Flow Challenging in Data-Parallel Architectures?

for (i=0; i<16; i++) {
  a[i] = op0;
  b[i] = op1;
  if (a[i] < b[i]) { c[i] = op2; }
  else             { c[i] = op3; }
  d[i] = op4;
}

[Figure: sixteen threads (Thread0–Thread15) mapped onto a single data-parallel issue unit.]

•  The divergence management architecture must not only partially sequence all execution paths for correctness, but also reconverge threads from different execution paths for efficiency.

Page 3:

For GPUs, Supporting Complex Control Flow in a SPMD Program is not Optional!

•  Traditional vector compilers can always give up and run complex control flow on the control processor.

•  However, for the SPMD compiler, supporting complex control flow is a functional requirement rather than an optional performance optimization.

Kernel() {
  if (a[tid] < b[tid]) {
    if (f[tid]) {
      c[tid]->vfunc();
      goto SKIP;
    }
    d[tid] = op;
  }
SKIP:
}
Kernel<<<16>>>();

for (i=0; i<16; i++) {
  if (a[i] < b[i]) {
    if (f[i]) {
      c[i]->vfunc();
      goto SKIP;
    }
    d[i] = op;
  }
SKIP:
}

Page 4:

Design Space of Divergence Management

[Figure: design space of divergence management. One side is implicitly managed in hardware by the microarchitecture; the other is explicitly scheduled in software by the compiler, which figures out the reconvergence points. The schemes placed in this space are the divergence stack, predication as used today in limited cases, predication + fork/join, and vector predication.]

We can perform a design space exploration of divergence management on NVIDIA GPU silicon.

Page 5:

Executive Summary, Contributions of Paper

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens)

•  CUDA 6.5 production compiler (divergence stack + limited predication) → CUDA binary
•  Modified CUDA 6.5 production compiler (full predication with new compiler algorithms) → CUDA binary
•  Both binaries run on an NVIDIA Tesla K20c (Kepler, GK110), and performance and statistics are collected for each
•  Result: performance with predication is on par with performance with the divergence stack

(1) Detailed Explanation and Categorization of Hardware and Software Divergence Management Schemes

(2) SPMD Predication Compiler Algorithms

(3) Apples-to-Apples Comparison using Production Silicon and Compiler


Page 6:

How do the hardware divergence stack and software predication handle control flow?

Page 7:

If-Then-Else Example: Divergence Stack

CUDA Program → LLVM Compiler → PTX

•  The LLVM compiler takes a CUDA program and generates PTX, which encodes all data/control dependences
•  Note that all of the PTX instructions below, despite being scalar instructions, are executed in a SIMD fashion

CUDA Program:

Kernel() {
  a = op0;
  b = op1;
  if (a < b) { c = op2; }
  else       { c = op3; }
  d = op4;
}
Kernel<<<n>>>();

PTX:

a = op0
b = op1
p = slt a,b
branch.eqz p, else
c = op2
j ipdom
else:
c = op3
ipdom:
d = op4

PTX → “ptxas” Backend Compiler → SASS

Page 8:

If-Then-Else Example: Divergence Stack

CUDA Program → LLVM Compiler → PTX → “ptxas” Backend Compiler → SASS

SASS:

a = op0
b = op1
p = slt a,b
push ipdom
branch.eqz p, else
c = op2.pop
else:
c = op3.pop
ipdom:
d = op4

•  The “ptxas” backend compiler takes PTX and generates SASS instructions, which execute natively on the GPU

•  Reconvergence points are analyzed and inserted by the backend compiler

CUDA Program:

Kernel() {
  a = op0;
  b = op1;
  if (a < b) { c = op2; }
  else       { c = op3; }
  d = op4;
}
Kernel<<<n>>>();

Page 9:

If-Then-Else Example: Divergence Stack

a = op0
b = op1
p = slt a,b
push ipdom
branch.eqz p, else
c = op2.pop
else:
c = op3.pop
ipdom:
d = op4

MASK | OP
1111 | op0
1111 | op1
1111 | slt    | push
1111 | branch
1100 | op2    | pop
0011 | op3    | pop
1111 | op4

•  Push: pushes <reconverge pc, current mask> onto the stack
•  Pop: disregards the current <pc, mask>, pops the top of the stack, and executes the deferred <pc, mask>

Divergence stack contents over time (each entry is <PC, MASK>, top of stack first):

  before the push:            (empty)
  after "push ipdom":         ipdom | 1111
  after the divergent branch: else  | 0011
                              ipdom | 1111

Assume 4 threads are executing, and that threads 0 and 1 took the branch.
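As a sanity check on these semantics, the following small C++ sketch (not from the talk; it assumes a 4-lane warp and the branch outcome above) models the stack and reproduces the mask sequence 1111, 1100, 0011, 1111:

#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified divergence-stack model for one 4-lane warp running the SASS above:
//   push ipdom; branch.eqz p, else; c = op2.pop; else: c = op3.pop; ipdom: d = op4
// Masks are 4-bit values with lane 0 in bit 0; only the divergent case is modeled.
struct Entry { const char* pc; uint8_t mask; };

static void show(const char* op, uint8_t mask) {
    printf("%d%d%d%d | %s\n",
           (mask >> 3) & 1, (mask >> 2) & 1, (mask >> 1) & 1, mask & 1, op);
}

int main() {
    const uint8_t FULL = 0b1111;
    const uint8_t p = 0b1100;                 // assumed outcome: lanes 2,3 have a < b
    std::vector<Entry> stack;
    uint8_t mask = FULL;

    show("op0", mask);
    show("op1", mask);
    show("slt | push", mask);
    stack.push_back({"ipdom", mask});         // push <reconverge pc, current mask>

    show("branch", mask);                     // the branch itself sees the full mask
    uint8_t taken = mask & ~p & FULL;         // lanes with p == 0 branch to else
    stack.push_back({"else", taken});         // divergent: defer the taken side
    mask &= p;                                // fall through with the p == 1 lanes

    show("op2 | pop", mask);                  // expect 1100
    mask = stack.back().mask; stack.pop_back();   // pop: resume the deferred <else, 0011>
    show("op3 | pop", mask);                  // expect 0011
    mask = stack.back().mask; stack.pop_back();   // pop: resume at <ipdom, 1111>
    show("op4", mask);                        // expect 1111
    return 0;
}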

Page 10:

If-Then-Else Example: Predication

•  The compiler can schedule instructions
•  Predicates also encode reconvergence information

CUDA Program:

Kernel() {
  a = op0;
  b = op1;
  if (a < b) { c = op2; }
  else       { c = op3; }
  d = op4;
}
Kernel<<<n>>>();

Predication:

a = op0
b = op1
f0 = slt a,b
@f0  c = op2
@!f0 c = op3
d = op4
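For contrast, here is a minimal C++ sketch (not from the talk; the four lanes and the data values are assumptions) of the predicated version: every instruction is issued to all lanes, the guard predicate only gates whether a lane commits its result, and no divergence stack is involved.

#include <array>
#include <cstdio>

// Per-lane execution of:  f0 = slt a,b ; @f0 c = op2 ; @!f0 c = op3 ; d = op4
constexpr int LANES = 4;

int main() {
    std::array<int, LANES> a{0, 1, 2, 3}, b{2, 2, 2, 2}, c{}, d{};
    std::array<bool, LANES> f0{};

    for (int t = 0; t < LANES; ++t) f0[t] = a[t] < b[t];   // f0 = slt a,b
    for (int t = 0; t < LANES; ++t) if (f0[t])  c[t] = 2;  // @f0  c = op2
    for (int t = 0; t < LANES; ++t) if (!f0[t]) c[t] = 3;  // @!f0 c = op3
    for (int t = 0; t < LANES; ++t) d[t] = 4;              // d = op4 (all lanes)

    for (int t = 0; t < LANES; ++t) printf("lane %d: c=%d d=%d\n", t, c[t], d[t]);
    return 0;
}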

Page 11:

Uniform Branch Conditions Across All Threads

Divergence Stack:

a = op0
b = op1
p = slt a,b
push ipdom
branch.eqz p, else
c = op2.pop
else:
c = op3.pop
ipdom:
d = op4

•  What if the branch condition is uniform across all threads?
•  With predication, the instructions are still executed, just with a null predicate
•  With the divergence stack, branches don't push a token onto the divergence stack when the branch condition is uniform

Predication:

a = op0
b = op1
f0 = slt a,b
@f0  c = op2
@!f0 c = op3
d = op4

Page 12:

Runtime Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) { c = op2; }
  else       { c = op3; }
  d = op4;
}
Kernel<<<n>>>();

a = op0
b = op1
f0 = slt a,b
cbranch.ifnull f0, else
@f0  c = op2
else:
cbranch.ifnull !f0, ipdom
@!f0 c = op3
ipdom:
d = op4

•  We can optimize the predicated code with a consensual branch (cbranch), which is taken only when all threads consensually agree on the branch condition (ifnull).

•  The code may jump around unnecessary work.

Thread-Aware Predication
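A consensual branch can be modeled as an "is the guard mask null for every lane?" test that jumps over the guarded region. Below is a minimal C++ sketch of this runtime check (not from the talk; the lane count and the data values are assumptions):

#include <array>
#include <cstdio>

// Runtime branch-uniformity optimization in a lane-mask model: before issuing a
// guarded region, test whether its guard mask is null (no lane needs it) and, if
// so, consensually skip the whole region.
constexpr int LANES = 4;
using Pred = std::array<bool, LANES>;

static bool any_set(const Pred& p) {          // true unless the predicate is null
    for (bool b : p) if (b) return true;
    return false;
}

int main() {
    std::array<int, LANES> a{0, 1, 2, 3}, b{9, 9, 9, 9}, c{}, d{};
    Pred f0{}, nf0{};
    for (int t = 0; t < LANES; ++t) { f0[t] = a[t] < b[t]; nf0[t] = !f0[t]; }

    if (any_set(f0))                                            // cbranch.ifnull f0, else
        for (int t = 0; t < LANES; ++t) if (f0[t])  c[t] = 2;   // @f0  c = op2
    if (any_set(nf0))                                           // cbranch.ifnull !f0, ipdom
        for (int t = 0; t < LANES; ++t) if (nf0[t]) c[t] = 3;   // @!f0 c = op3
    for (int t = 0; t < LANES; ++t) d[t] = 4;                   // ipdom: d = op4

    // Here a < b holds for every lane, so the whole @!f0 region is skipped at runtime.
    for (int t = 0; t < LANES; ++t) printf("lane %d: c=%d d=%d\n", t, c[t], d[t]);
    return 0;
}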

Page 13:

Static Branch-Uniformity Optimization with Consensual Branches

a = op0
b = op1
f0 = slt a,b
cbranch.ifnull f0, else
c = op2
else:
cbranch.ifnull !f0, ipdom
c = op3
ipdom:
d = op4

Kernel() {
  a = op0;
  b = op1;
  if (a < b) { c = op2; }
  else       { c = op3; }
  d = op4;
}
Kernel<<<n>>>();

•  If the compiler can prove that the branch condition is uniform across all threads, the compiler can omit the guard predicates.

Thread-Aware Predication

Page 14:

Loop Example: Consensual Branches are Key to Compile Loops with Predication

f0 = true
loop:
cbranch.ifnull f0, exit
@f0 a = op0
@f0 b = op1
@f0 f1 = slt a, b
f0 = and f0, !f1
j loop
exit:
c = op2

•  Intuitively, the compiler needs to sequence the loop until all threads are done executing it.

•  A consensual branch (cbranch.ifnull) is used to check whether loop mask (f0) is null.

Kernel() {
  done = false;
  while (!done) {
    a = op0;
    b = op1;
    done = a < b;
  }
  c = op2;
}
Kernel<<<n>>>();

Thread-Aware Predication
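The following C++ sketch (not from the talk; the lane count and the per-lane data are assumptions) mimics this loop mask: the group keeps issuing the guarded loop body, lanes drop out of f0 as they finish, and the consensual check ends the loop once f0 is null.

#include <array>
#include <cstdio>

// Predicated loop execution with a loop mask f0: the group keeps issuing the
// loop body until no lane is still active; the consensual "is f0 null?" check
// (cbranch.ifnull) decides when everyone exits together.
constexpr int LANES = 4;

int main() {
    std::array<bool, LANES> f0;      // loop mask: is this lane still looping?
    f0.fill(true);                   // f0 = true
    std::array<int, LANES> a{}, b{}, trips{};
    int iter = 0;

    while (true) {
        bool any = false;            // cbranch.ifnull f0, exit
        for (bool active : f0) any = any || active;
        if (!any) break;

        ++iter;
        for (int t = 0; t < LANES; ++t) {
            if (!f0[t]) continue;    // the @f0 guards on the loop body
            a[t] = t;                // @f0 a = op0  (here: the lane id)
            b[t] = iter;             // @f0 b = op1  (here: the trip count)
            bool f1 = a[t] < b[t];   // @f0 f1 = slt a,b  (this lane is done)
            f0[t] = f0[t] && !f1;    // f0 = and f0, !f1
            trips[t]++;
        }
    }
    // exit: c = op2 runs here with the full mask again
    for (int t = 0; t < LANES; ++t)
        printf("lane %d executed the loop body %d times\n", t, trips[t]);
    return 0;
}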

Page 15:

Thread-Aware (TA) Predication Compiler Algorithms

Page 16:

Predication Compiler Algorithms

[Figure: Control Flow Graph (CFG) and Control Dependence Graph (CDG) for the kernel below, with basic blocks N1–N8 and branch predicates P1/!P1 and P2/!P2. Walking the CDG yields guard predicates such as P1 && !P2.]

Kernel() {
  N1;
  N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) { N5; }
    else     { N6; }
    N7;
  }
  N8;
}
Kernel<<<n>>>();

Algorithm steps:
1. Generate the CFG and CDG.
2. Walk the CDG to get guard predicates for all basic blocks.
3. Linearize control flow.
4. Predicate all instructions with their guard predicates and rewire all basic blocks.
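As an illustration of step 2, here is a toy C++ sketch (my own construction, not the paper's implementation) that hard-codes a control-dependence relation for the kernel above and derives each block's guard predicate as the conjunction of the branch conditions on the path to the root; it produces guards such as P1 && !P2 for N5.

#include <cstdio>
#include <string>
#include <vector>

// Toy version of "walk the CDG to get guard predicates" for:
//   if (!P1) { N3 } else { N4; if (!P2) { N5 } else { N6 } N7 } with N1, N2, N8 outside.
// Each block stores a hard-coded control-dependence parent and the branch condition
// on that CDG edge; its guard is the conjunction of the conditions up to the root.
struct Block {
    const char* name;
    int parent;              // index of the control-dependence parent, -1 for the root
    const char* cond;        // condition on the edge from that parent ("" at the root)
};

int main() {
    std::vector<Block> cdg = {
        {"N1", -1, ""},   {"N2", -1, ""},
        {"N3",  1, "!P1"},   // decided by the branch at the end of N2
        {"N4",  1, "P1"},
        {"N5",  3, "!P2"},   // decided by the branch at the end of N4
        {"N6",  3, "P2"},
        {"N7",  1, "P1"},
        {"N8", -1, ""},
    };

    for (size_t i = 0; i < cdg.size(); ++i) {
        std::string guard;
        for (int n = (int)i; n != -1; n = cdg[n].parent) {
            if (cdg[n].cond[0] == '\0') continue;
            guard = guard.empty() ? std::string(cdg[n].cond)
                                  : std::string(cdg[n].cond) + " && " + guard;
        }
        printf("%s: guard = %s\n", cdg[i].name, guard.empty() ? "true" : guard.c_str());
    }
    return 0;
}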

Page 17:

Runtime Branch-Uniformity Optimization

[Figure: the same CDG and CFG as on page 16 (basic blocks N1–N8, branch predicates P1/!P1 and P2/!P2).]

Algorithm steps:
1. Generate the CFG and CDG.
2. Walk the CDG to get guard predicates for all basic blocks.
3. Linearize control flow.
4. Predicate all instructions with their guard predicates and rewire all basic blocks.

Kernel() {
  N1;
  N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) { N5; }
    else     { N6; }
    N7;
  }
  N8;
}
Kernel<<<n>>>();

Assume the compiler cannot prove that P1 is uniform across all threads.

During control-flow linearization, additionally:
•  Add a consensual branch that skips the P1-guarded region if P1 is null
•  Add a consensual branch that skips the !P1-guarded region if !P1 is null
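Putting the pieces together, here is a lane-mask C++ sketch (not from the talk; the block bodies, the linearization order, and the per-lane predicate values are assumptions) of the linearized kernel with the two consensual branches this slide adds around the !P1 and P1 regions:

#include <array>
#include <cstdio>

// One possible linearization of the N1..N8 kernel under the runtime
// branch-uniformity optimization: every block keeps its guard predicate, and a
// consensual check skips a region whose guard mask is null for every lane.
constexpr int LANES = 4;
using Pred = std::array<bool, LANES>;

static bool any_set(const Pred& p) { for (bool b : p) if (b) return true; return false; }

static void run(const char* block, const Pred& guard) {   // stand-in for a guarded block
    for (int t = 0; t < LANES; ++t)
        if (guard[t]) printf("lane %d runs %s\n", t, block);
}

int main() {
    Pred all; all.fill(true);
    Pred p1{true, true, false, false};        // assumed per-lane value of P1
    Pred p2{true, false, true, false};        // assumed per-lane value of P2
    Pred np1{}, g5{}, g6{};
    for (int t = 0; t < LANES; ++t) {
        np1[t] = !p1[t];
        g5[t] = p1[t] && !p2[t];              // guard of N5
        g6[t] = p1[t] && p2[t];               // guard of N6
    }

    run("N1", all);
    run("N2", all);
    if (any_set(np1))                         // consensual branch: skip if !P1 is null
        run("N3", np1);
    if (any_set(p1)) {                        // consensual branch: skip if P1 is null
        run("N4", p1);
        run("N5", g5);
        run("N6", g6);
        run("N7", p1);
    }
    run("N8", all);
    return 0;
}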

Page 18:

Static Branch-Uniformity Optimization

[Figure: the same CDG and CFG as on page 16 (basic blocks N1–N8, branch predicates P1/!P1 and P2/!P2).]

Algorithm steps:
1. Generate the CFG and CDG.
2. Walk the CDG to get guard predicates for all basic blocks.
3. Linearize control flow.
4. Predicate all instructions with their guard predicates and rewire all basic blocks.

Kernel() {
  N1;
  N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) { N5; }
    else     { N6; }
    N7;
  }
  N8;
}
Kernel<<<n>>>();

Assume the compiler can prove that P1 is uniform across all threads.

Guard Predicate = P2 (the provably uniform P1 no longer appears in the guard)


Page 19:


Predication Compiler Algorithms: Loops

[Figure: CFG (basic blocks N1–N12) and thread-aware CDG for the loop kernel below. Besides the branch predicates P1, P2, and P3, the thread-aware CDG carries a loop mask and two exit masks (E1, E2); the loop-related edges are labeled P2|E1 and !P3|E2.]

Kernel() {
  if (!P1) {
    while (!P2) {
      if (!P3) { break; }
    }
  }
}
Kernel<<<n>>>();

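The following lane-mask C++ sketch (not from the talk; the per-lane predicate values are random placeholders) shows how such a loop can be sequenced: lanes enter the loop only where !P1 holds, and they leave the loop mask either through the normal while-exit or through the break, loosely corresponding to the loop mask and exit masks in the figure.

#include <array>
#include <cstdio>
#include <cstdlib>

// Lane-mask sketch of the loop-with-break kernel above:
//   if (!P1) { while (!P2) { if (!P3) break; } }
constexpr int LANES = 4;

static bool flip() { return std::rand() % 2 != 0; }   // placeholder predicate value

int main() {
    std::srand(42);
    std::array<bool, LANES> p1{false, false, true, true};
    std::array<bool, LANES> loop_mask{}, exit_while{}, exit_break{};
    for (int t = 0; t < LANES; ++t) loop_mask[t] = !p1[t];   // lanes entering the loop

    bool any = true;
    while (any) {                                // consensual check on the loop mask
        any = false;
        for (int t = 0; t < LANES; ++t) {
            if (!loop_mask[t]) continue;
            if (flip()) {                        // P2 true: normal while-exit
                loop_mask[t] = false; exit_while[t] = true;
            } else if (!flip()) {                // !P3: break out of the loop
                loop_mask[t] = false; exit_break[t] = true;
            }
            any = any || loop_mask[t];
        }
    }
    for (int t = 0; t < LANES; ++t)
        printf("lane %d: while-exit=%d break-exit=%d\n",
               t, (int)exit_while[t], (int)exit_break[t]);
    return 0;
}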

Page 20:

Supporting Complex Control Flow

•  Function calls: supported by a straightforward calling convention
•  Virtual function calls: the active threads are sequenced through each unique call target (see the code below)
•  Irreducible control flow: find the smallest region containing the irreducible control flow, and insert sequencing code at its entry and exit blocks to sequence the active threads through the region one by one

With Divergence Stack:

@p2 jalr r3

With Predication:

loop:
p0, r4 = find_unique p2, r3
@p0 jalr r4
p2 = p2 and !p0
cbranch.ifany p2, loop
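The find_unique loop can be pictured with a plain C++ sketch (not from the talk; the function pointers and the lane-to-target mapping are assumptions): each iteration calls one unique target for the sub-mask of lanes that share it, until no active lane remains.

#include <array>
#include <cstdio>

// Sketch of the find_unique sequencing loop above for a divergent indirect call:
// while any active lane still has a pending target, pick the target of one active
// lane, "call" it once for the sub-mask of lanes that share it, retire those lanes,
// and repeat. Plain function pointers stand in for the per-lane jalr targets (r3).
constexpr int LANES = 4;
using Fn = void (*)(int lane);

static void vfuncA(int lane) { printf("lane %d -> vfuncA\n", lane); }
static void vfuncB(int lane) { printf("lane %d -> vfuncB\n", lane); }

int main() {
    std::array<Fn, LANES> target{vfuncA, vfuncB, vfuncA, vfuncB};   // r3 per lane
    std::array<bool, LANES> p2{true, true, true, false};            // lanes making a call

    bool any = true;
    while (any) {                                   // cbranch.ifany p2, loop
        Fn unique = nullptr;                        // p0, r4 = find_unique p2, r3
        for (int t = 0; t < LANES; ++t) if (p2[t]) { unique = target[t]; break; }
        if (unique == nullptr) break;               // no active lane left

        for (int t = 0; t < LANES; ++t)             // @p0 jalr r4
            if (p2[t] && target[t] == unique) unique(t);

        any = false;                                // p2 = p2 and !p0
        for (int t = 0; t < LANES; ++t) {
            if (p2[t] && target[t] == unique) p2[t] = false;
            any = any || p2[t];
        }
    }
    return 0;
}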

Page 21:

Evaluation

Page 22:

Predication CUDA Compiler

•  Theoretically we could have implemented our thread-aware predication pass in ptxas

•  We implemented the bulk of the predication pass in LLVM for fast prototyping

CUDA Program
  ↓  LLVM Compiler Predication Passes: generate a throw-away pseudo PTX instruction carrying the predication information
Annotated PTX
  ↓  “ptxas” Backend Compiler: runs existing optimizations, then predicates with the information retrieved from the pseudo PTX instructions
SASS

Page 23:

Evaluation: Quick Recap

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens)

•  CUDA 6.5 production compiler (divergence stack + limited predication) → CUDA binary
•  Modified CUDA 6.5 production compiler (full predication with TA compiler algorithms) → CUDA binary
•  Both binaries run on an NVIDIA Tesla K20c (Kepler, GK110), and performance and statistics are collected for each


Compare performance and statistics

Five bars:
1)  Baseline (divergence stack)
2)  Limited predication
3)  TA predication
4)  TA+SBU (static branch-uniformity optimization)
5)  TA+SBU+RBU (runtime branch-uniformity optimization)

Page 24:

[Figure: per-benchmark speedup over the baseline (y-axis: speedup, 0–1.5) for r-pathfinder, p-cutcp, p-sgemm, p-sad, p-mri-q, p-stencil, r-nn, p-lbm, fft, r-backprop, nqueens, r-b+tree, r-hotspot, r-srad-v2, p-gridding, p-tpacf, p-spmv, r-gaussian, p-histo, p-bfs, r-srad-v1, r-scluster, r-lud, and r-bfs. Bars: Baseline with divergence stack, Divergence Stack+Limited Predication, TA, TA+SBU, TA+SBU+RBU.]

Performance Results: Geomean

•  The thread-aware predication compiler is competitive with the baseline compiler (divergence stack)
•  Both static and runtime branch-uniformity optimizations play an important role
•  Performance doesn't change with the limited if-conversion heuristic implemented in the production compiler

[Figure: geomean speedup over the baseline (y-axis: speedup, 0–1.5).]

Page 25:

[Figure: geomean speedup (y-axis: speedup, 0–1.5).]

Performance Results: Speedups

[Figure: per-benchmark speedups (same benchmarks and bars as on page 24).]

•  Predication can expose better scheduling opportunities

•  Extra consensual branches added for +RBU may act as scheduling barriers

Page 26:

Performance Results: Slowdowns

•  Ten benchmarks are inconclusive (>90%, <100%)

[Figure: per-benchmark speedups (same benchmarks and bars as on page 24).]

[Figure: geomean speedup (y-axis: speedup, 0–1.5).]

Page 27:

Performance Results: Slowdowns

•  Ten benchmarks are inconclusive (>90%, <100%)
•  Five benchmarks are in the <90% range
•  An increase in register pressure can reduce occupancy, which sometimes reduces performance
•  The compiler is not able to optimize for all of the branch uniformity exhibited at runtime

[Figure: per-benchmark speedups (same benchmarks and bars as on page 24).]

[Figure: geomean speedup (y-axis: speedup, 0–1.5).]

Page 28:

Discussion on Area, Power, Energy

•  Hard to quantify impact of predication on area, power, energy

•  Experiments were done on GPU silicon
•  The primary motivation of software divergence management is to reduce hardware design complexity and the associated verification costs
•  The area and power overheads of the divergence stack are not significant
•  Performance ≈ power/energy consumption
•  Power/energy consumption of software divergence management ≈ power/energy of hardware divergence management

Page 29:

Fundamental Advantages of Software Divergence Management

[Figure: control flow graph of a short-circuit example, basic blocks N1–N7.]

•  Short-circuit example
•  The divergence stack cannot reconverge threads at N4
•  Predication can reconverge threads at N4

Page 30:

Conclusions

•  Advantages of Divergence Stack
  •  Enables a fairly conventional thread compilation model
  •  Register allocation is easier
  •  Simplifies the task of supporting irreducible control flow
•  Advantages of Predication
  •  Simplifies the hardware without sacrificing programmability
  •  Actual cases where predication can outperform the divergence stack:
    •  Better scheduling opportunities
    •  Better reconvergence of threads

•  For divergence management, pushing complexity to the compiler is a better choice

This work was funded by DARPA award HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, an NVIDIA graduate fellowship, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. It was also funded by DOE contract B599861. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.