Parametric Flows: Automated Behavior Equivalencing for Symbolic Analysis of Races in CUDA Programs. Peng Li, Guodong Li, and Ganesh Gopalakrishnan, {peterlee, ligd, ganesh}@cs.utah.edu, School of Computing, University of Utah, Salt Lake City, UT 84112, USA
GPU-based Computing
• Titan [AMD + NVIDIA Kepler] is ranked 1st in the latest Top500!
• Various GPU programming models exist: CUDA (courtesy of NVIDIA), OpenCL (courtesy of AMD), C/C++ (courtesy of Intel), C++ AMP (courtesy of Microsoft)
CUDA programs harbor insidious bugs!
• Data races – caused by unsynchronized accesses, e.g.
    tid = 1:  ... = a[tid]
    tid = 2:  a[tid-1] = ...
  – Can produce unpredictable results
  – Compilers can misbehave if given code with races
• Deadlocks and other problems
CUDA Thread + Memory Organization
Thread → Warp → Block → Grid
Illustration of Race

__global__ void inc_gpu(int *A, int b, int N) {
  unsigned tid = threadIdx.x;
  A[tid] = A[(tid + 1) % 64] + b;
}

With 64 threads (tid 0 .. 63): t0 writes A[0] while t63 reads A[0] – RACE!
Illustration of Deadlock

Threads branch on tid % 2 == 0; a __syncthreads() placed inside the conditional is reached only by the threads taking the true branch (t0, t2), while t1 and t3 take the false branch – the barrier deadlocks.
Debugging CUDA Programs is hard!
Why Hard?

[Figure: exponentially many executions E1, E2, ..., En of threads t0 .. t4; a Read(Addr=10) in one thread can conflict with a Write(Addr=10) in another.]
Why Hard?
• Traditional methods – find bugs only w.r.t. current platforms + inputs + schedules
• Formal methods – bugs analyzed w.r.t.
  – future / different platforms (PORTING ISSUE!)
  – all relevant inputs
  – all relevant schedules
Solution to relevant inputs: symbolic execution

Input X is symbolic; branching on x < 3 and x < 10 splits execution into three paths:
  Path 1: x < 3
  Path 2: 3 <= x < 10
  Path 3: x >= 10
A constraint solver produces one concrete test case per path:
  Example test case 1: x = 2
  Example test case 2: x = 3
  Example test case 3: x = 11
Solution to relevant schedules: representative interleaving

__device__ int d[64];
__global__ void foo(int *d) {
  __shared__ int a[64];
  int tid = threadIdx.x;
  a[tid] = d[tid];
  __syncthreads();
  a[tid]++;
  if (tid % 2 == 0) {
    a[tid] = a[tid] + 2;
  } else {
    a[tid] = a[tid % 32];
  }
  __syncthreads();
}

The code between two __syncthreads() barriers forms a Barrier Interval. Within an interval, the 64 threads (t0 .. t63) need not be interleaved in every possible order: it suffices to check one SIMD-aware canonical schedule, which runs the divergent halves of each warp in turn: t0 t2 … t30, then t1 t3 … t31 (warp 0), then t32 t34 … t62, then t33 t35 … t63 (warp 1).

Checking the conflicting access pairs under this schedule yields around 16K pairs for this example.

SIMD-Aware Canonical Schedule – result in PPoPP'12: guaranteed to find races!
Evolution of Formal Analysis Tools for CUDA in our group
• Previous tool: GKLEE [PPoPP'12]
  – complete
  – does not scale, because every thread (e.g. 20K or more) is explicitly modeled
• This paper [SC'12]: GKLEEp
  – complete (in practice)
  – scales to 20K threads or more
GKLEEp's Flow

C++ CUDA programs with symbolic variable declarations → LLVM-GCC → LLVM byte-code instructions → Symbolic Analyzer and Scheduler (with Error Monitors)

Reported results:
• Data races
• Deadlocks
• Bank conflicts
• Warp divergences
• Non-coalesced accesses
• Concrete test inputs / test cases
  – provide high coverage
  – can be run on HW
Key Contributions
• Parametric flows are the control-flow equivalence classes of threads that diverge in the same manner
• GKLEEp found bugs missed by GKLEE (GKLEEp scales!)
  – GKLEE: up to 2K threads
  – GKLEEp: well beyond 20K threads
  – GKLEEp finds all races (except in contrived programs)
Key Idea: Branching on TDC (Thread-ID-Dependent Conditional)

__global__ void foo(int *d) {
  __shared__ int a[64];
  int tid = threadIdx.x;
  a[tid] = d[tid];
  __syncthreads();
TABLE I: SDK 2.0 kernel results. We set 7200 seconds as the threshold for timeout (abbreviated as T.O.). In each A/B entry, A is the tool runtime (in seconds) and B is the number of control flow paths.
Related formal-methods-based work: comparison with other formal tools
• [M. Zheng et al., PPoPP'11]: combination of static analysis and dynamic analysis
• [A. Leung et al., PLDI'12]: a single dynamic run can be used to learn much more information about a CUDA program's behavior
• [A. Betts et al., SPLASH'12]: two-thread abstraction; found errors in real SDK kernels

GKLEEp scales further and finds races in real kernels!
Conclusion
• New formal approach for analyzing CUDA kernels
• Employs a "parametric" reasoning style which capitalizes on thread symmetry
• Scales to over 10^5 threads on realistic CUDA programs
• Finds races missed by
  – traditional testing
  – previous formal approaches
• Tool will be released soon (check website: http://www.cs.utah.edu/fv/GKLEE)
Thanks!
Questions?
Extra Slides
• How to pick symbolic inputs?
  – taint analyzer being developed
  – helps pick the inputs that matter and make them symbolic
• Loop invariants
  – static analysis to avoid loop unrolling

A Motivating Example

__global__ void test(unsigned *a) {
  unsigned bid = blockIdx.x;
  unsigned tid = threadIdx.x;

  if (bid % 2 != 0) {
    if (tid < 1024) {
      unsigned idx = bid * blockDim.x + tid;