Parametric Flows: Automated Behavior Equivalencing for Symbolic Analysis of Races in CUDA Programs. Peng Li, Guodong Li, and Ganesh Gopalakrishnan, {peterlee, ligd, ganesh}@cs.utah.edu, School of Computing, University of Utah, Salt Lake City, UT 84112, USA
GPU-based Computing
• Titan [AMD + NVIDIA Kepler] is ranked 1st in the latest Top500!
• Various GPU programming models exist: CUDA (courtesy of NVIDIA), OpenCL (courtesy of AMD), C/C++ (courtesy of Intel), C++ AMP (courtesy of Microsoft)
CUDA programs harbor insidious bugs!
• Data races – caused by unsynchronized accesses, e.g.
    tid = 1:  ... = a[tid]
    tid = 2:  a[tid-1] = ...
  – Can produce unpredictable results
  – Compilers can misbehave if given code with races
• Deadlocks and other problems
CUDA Thread + Memory Organization
Thread → Warp → Block → Grid
Illustration of Race

__global__ void inc_gpu(int *A, int b, int N) {
  unsigned tid = threadIdx.x;
  A[tid] = A[(tid + 1) % 64] + b;
}

With 64 threads (tid 0 .. 63): t0 writes A[0] while t63 reads A[0] – RACE!
Illustration of Deadlock

Threads branch on tid % 2 == 0; a __syncthreads() placed inside the conditional is reached only by the threads taking the true branch (t0, t2), while t1 and t3 take the false branch – the barrier deadlocks.
Debugging CUDA Programs is hard!
Why Hard?

[Figure: exponentially many executions E1, E2, ..., En of threads t0 .. t4; a Read(Addr=10) in one thread can conflict with a Write(Addr=10) in another.]
Why Hard?
• Traditional methods – find bugs only w.r.t. current platforms + inputs + schedules
• Formal methods – bugs analyzed w.r.t.
  – future / different platforms (PORTING ISSUE!)
  – all relevant inputs
  – all relevant schedules
Solution to relevant inputs: symbolic execution

Input X is symbolic; branching on x < 3 and x < 10 splits execution into three paths:
  Path 1: x < 3
  Path 2: 3 <= x < 10
  Path 3: x >= 10
A constraint solver produces one concrete test case per path:
  Example test case 1: x = 2
  Example test case 2: x = 3
  Example test case 3: x = 11
Solution to relevant schedules: representative interleaving

__device__ int d[64];
__global__ void foo(int *d) {
  __shared__ int a[64];
  int tid = threadIdx.x;
  a[tid] = d[tid];
  __syncthreads();
  a[tid]++;
  if (tid % 2 == 0) {
    a[tid] = a[tid] + 2;
  } else {
    a[tid] = a[tid % 32];
  }
  __syncthreads();
}

The code between two __syncthreads() barriers forms a Barrier Interval. Within an interval, the 64 threads (t0 .. t63) need not be interleaved in every possible order: it suffices to check one SIMD-aware canonical schedule, which runs the divergent halves of each warp in turn: t0 t2 … t30, then t1 t3 … t31 (warp 0), then t32 t34 … t62, then t33 t35 … t63 (warp 1).

Checking the conflicting access pairs under this schedule yields around 16K pairs for this example.

SIMD-Aware Canonical Schedule – result in PPoPP'12: guaranteed to find races!
Evolution of Formal Analysis Tools for CUDA in our group
• Previous tool: GKLEE [PPoPP'12]
  – complete
  – does not scale, because every thread (e.g. 20K or more) is explicitly modeled
• This paper [SC'12]: GKLEEp
  – complete (in practice)
  – scales to 20K threads or more
GKLEEp's Flow

C++ CUDA programs with symbolic variable declarations → LLVM-GCC → LLVM byte-code instructions → Symbolic Analyzer and Scheduler (with Error Monitors)

Reported results:
• Data races
• Deadlocks
• Bank conflicts
• Warp divergences
• Non-coalesced accesses
• Concrete test inputs / test cases
  – provide high coverage
  – can be run on HW
Key Contributions
• Parametric flows are the control-flow equivalence classes of threads that diverge in the same manner
• GKLEEp found bugs missed by GKLEE (GKLEEp scales!)
  – GKLEE: up to 2K threads
  – GKLEEp: well beyond 20K threads
  – GKLEEp finds all races (except in contrived programs)
Key Idea: Branching on TDC (Thread-ID-Dependent Conditional)

__global__ void foo(int *d) {
  __shared__ int a[64];
  int tid = threadIdx.x;
  a[tid] = d[tid];
  __syncthreads();
TABLE I: SDK 2.0 kernel results. We set 7200 seconds as the threshold for timeout (abbreviated as T.O.). In each A/B entry, A is the tool runtime (in seconds) and B is the number of control flow paths.
Related formal-methods-based work: comparison with other formal tools
• [M. Zheng et al., PPoPP'11]: combination of static analysis and dynamic analysis
• [A. Leung et al., PLDI'12]: a single dynamic run can be used to learn much more information about a CUDA program's behavior
• [A. Betts et al., SPLASH'12]: two-thread abstraction; found errors in real SDK kernels

GKLEEp scales further and finds races in real kernels!
Conclusion
• New formal approach for analyzing CUDA kernels
• Employs a "parametric" reasoning style which capitalizes on thread symmetry
• Scales to over 10^5 threads on realistic CUDA programs
• Finds races missed by
  – traditional testing
  – previous formal approaches
• Tool will be released soon (check website: http://www.cs.utah.edu/fv/GKLEE)
Thanks!
Questions?
Extra Slides
• How to pick symbolic inputs?
  – taint analyzer being developed
  – helps pick the inputs that matter and make them symbolic
• Loop invariants
  – static analysis to avoid loop unrolling

A Motivating Example

__global__ void test(unsigned *a) {
  unsigned bid = blockIdx.x;
  unsigned tid = threadIdx.x;

  if (bid % 2 != 0) {
    if (tid < 1024) {
      unsigned idx = bid * blockDim.x + tid;