GKLEE: Concolic Verification and Test Generation for GPUs
Guodong Li, Fujitsu Labs of America
Peng Li, Geof Sawaya, and Ganesh Gopalakrishnan, School of Computing, University of Utah
Indradeep Ghosh and Sreeranga P. Rajan, Fujitsu Labs of America
Work associated with The Center for Parallel Computing (CPU), and the Gauss Group at Utah
http://www.cs.utah.edu/fv/GKLEE
• GPUs are exciting in so many ways
  – Parallelism for the masses!
  – Growing relevance: hand-held devices to Exascale
• There are many ways to arrive at GPU code:
  – Write it from scratch
  – Various compilation approaches
• Debugging GPU code is important
  – Library functions, students learning GPU programming, ...
  – Compiler transformations need to be verified as well
• We contribute GKLEE, a tool that finds real bugs
  – Main take-away message: formal methods can be exciting and practical in the GPU domain!
What is GKLEE?
• A CUDA/C++ concrete+symbolic (concolic) execution tool
  – Designers can decide which variables to declare as symbolic
  – Symbolic execution considers all possible values
    • Not just the test inputs that the designer happened to pick
    • This is made possible by the power of SMT (constraint) solving
  – Provides far more incisive coverage
  – Yet bugs are displayed as concrete traces
  – Concolic tools can also generate tests that can be run on the HW
• GKLEE also models all possible schedules
  – E.g., different warps executed in different orders
  – Helps expose bugs that are execution-platform dependent
  – GKLEE does this very efficiently by exploring a canonical schedule
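The difference between "all possible values" and "the tests the designer happened to pick" can be illustrated with a minimal host-side sketch. Everything here is hypothetical: `violatesAssertion` is an invented stand-in for a buggy branch, and the exhaustive loop in `findWitness` merely plays the role that an SMT solver plays in a concolic tool. It is not how GKLEE is implemented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical program under test: the "bug" fires for exactly one
// 16-bit input, so hand-picked or random tests are unlikely to hit it.
bool violatesAssertion(uint16_t x) {
    // Path condition of the buggy branch: (3*x + 7) mod 65536 == 40000.
    uint16_t y = static_cast<uint16_t>(x * 3u + 7u);
    return y == 40000;
}

// Stand-in for constraint solving: search the whole input domain for a
// value that satisfies the path condition. A real concolic tool would ask
// an SMT solver instead of enumerating. Returns -1 if no witness exists.
int findWitness() {
    for (uint32_t x = 0; x <= 0xFFFFu; ++x) {
        if (violatesAssertion(static_cast<uint16_t>(x))) {
            return static_cast<int>(x);
        }
    }
    return -1;
}
```

The witness returned by `findWitness()` is a concrete input, which mirrors the point on the slide: the analysis reasons over all values, yet the bug is reported as a concrete, replayable test.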
Value of GKLEE to CUDA Programmers
• Finds deadlocks caused by incorrect uses of __syncthreads
  – GKLEE detects barriers that are not textually aligned
• GKLEE can help verify functional correctness
  – Verification can be conducted over symbolic inputs
• Detects many types of races
  – Shared memory races:
    • Intra-warp under warp divergence (we call it “porting race”)
    • Intra-warp without warp divergence
    • Inter-warp races
  – Global memory races
  – GKLEE can solve control flow constraints and generate test input that exposes races (Example-5 presented later)
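The race categories above can be made concrete with a small host-side sketch. It assumes a simplified model that is not taken from GKLEE itself: all accesses belong to one thread block and one barrier interval, a race is any pair of accesses by different threads to the same address where at least one is a write, and the warp size is 32. The names `Access` and `classifyFirstRace` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One shared-memory access recorded within a single barrier interval.
struct Access {
    int thread;    // thread index within the block
    int addr;      // shared-memory address (word index)
    bool isWrite;  // true for a write, false for a read
};

const int kWarpSize = 32;  // assumption: 32 threads per warp

// Returns "none", "intra-warp", or "inter-warp" for the first conflicting
// pair found: different threads, same address, at least one write.
std::string classifyFirstRace(const std::vector<Access>& accesses) {
    for (std::size_t i = 0; i < accesses.size(); ++i) {
        for (std::size_t j = i + 1; j < accesses.size(); ++j) {
            const Access& a = accesses[i];
            const Access& b = accesses[j];
            bool conflict = a.thread != b.thread && a.addr == b.addr &&
                            (a.isWrite || b.isWrite);
            if (!conflict) continue;
            bool sameWarp = (a.thread / kWarpSize) == (b.thread / kWarpSize);
            return sameWarp ? "intra-warp" : "inter-warp";
        }
    }
    return "none";
}
```

For instance, a write by thread 0 and a read by thread 1 of the same word would be classified as intra-warp, while writes by threads 0 and 40 to the same word would be inter-warp. Distinguishing intra-warp races with and without divergence would additionally need the branch each thread took, which this sketch does not model.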
Value of GKLEE to CUDA Programmers (contd.)
• Detects many causes of performance loss
  – Bank conflicts, warp divergences, non-coalesced memory accesses
  – Currently reported as % of affected barrier intervals / warps
  – Considers all inputs and schedules
    • Again, it is sufficient to analyze the canonical schedule
• Multi-kernel examples with 2K threads have been verified
  – Additional scalability through parameterized verification (in progress)
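As a rough illustration of what a bank-conflict check computes, the sketch below estimates the conflict degree of one shared-memory request. It uses a simplified model stated here as an assumption: words are 32 bits wide, word `w` lives in bank `w % numBanks`, accesses to distinct words in the same bank serialize, and accesses to the same word are broadcast. The hardware rules differ in detail between compute capabilities (for example, 16 banks served per half-warp on 1.x and 32 banks per warp on 2.x), so `bankConflictDegree` is a hypothetical helper, not GKLEE's analysis.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

// wordIndices[t] is the non-negative 32-bit word index accessed by the
// t-th thread of the request. Returns the largest number of distinct words
// that map to one bank: 1 means conflict-free, n means an n-way conflict,
// and 0 is returned for an empty request.
int bankConflictDegree(const std::vector<int>& wordIndices, int numBanks) {
    std::map<int, std::set<int>> wordsPerBank;
    for (int w : wordIndices) {
        wordsPerBank[w % numBanks].insert(w);
    }
    int degree = 0;
    for (const auto& entry : wordsPerBank) {
        degree = std::max(degree, static_cast<int>(entry.second.size()));
    }
    return degree;
}
```

Under this model a stride-2 access by 16 threads yields a 2-way conflict with 16 banks but is conflict-free with 32 banks, which is one reason results need to be reported against a specific set of rules.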
Architecture of GKLEE
• GKLEE was realized by extending KLEE (Cadar, Dunbar, Engler, OSDI 2008)
• GKLEE employs a symbolic virtual machine that “understands” CUDA
Symbolic Virtual Machine of GKLEE
GKLEE through examples
– Basic usage (including Emacs mode)
– Example-1: Porting a prefix-sum example
– Example-2: Bitonic sorting
– Example-3: Deliberately introduced deadlock (Sanders/Kandrot, p. 88)
  • Textbook shows risk of “too much optimization”; GKLEE can be a safety net
– Example-4: A multi-kernel example: (AB)^T = B^T A^T
  • The whole assertion was verified for 2K threads
  • A broken calculation is caught immediately (no wading through results)
– Example-5: Detecting “unexpected” bank conflicts
  • Code claims that all bank conflicts have been eliminated
  • Yet GKLEE finds bank conflicts (and provides a scenario)
– Example-6: Input-dependent race/bank conflict in SDK kernel
  • The racing location was input-dependent (also for the bank conflict)
  • Without symbolic analysis, it is nearly impossible to hit these errors
GKLEE Features not covered by these examples (see our paper)
– Test generation and reduction heuristics
  • Scripts to convert GKLEE tests to hardware
– Different kinds of races
  • Shared memory vs. global memory races
  • Intra-warp races
    – With warp divergence (“porting race”)
    – Without warp divergence
  • Inter-warp races
– Bank conflicts and non-coalesced accesses
  • Computed with respect to CUDA compute capability 1.x and 2.x rules
– Bugs revealed as a function of compiler optimization level
  • Volatile bugs
  • Other compilation issues
– Handy emacs-mode with
  • Thread, block, warp stepping
  • Ability to see LLVM byte-codes
  • Trace actions wrt source code
bool verify(int data[], int ROM_data[], int length)
{
    // Do a prefix-sum sequentially onto ROM_data
    for (int i = 1; i < length; ++i) {
        ROM_data[i] += ROM_data[i-1];
        printf("ROM_data[%d]=%d\n", i, ROM_data[i]);
    }

    // Now, verify
    for (int i = 1; i < length; ++i) {
        if (data[i] != ROM_data[i]) {
            printf("error, results disagree at loc %d\n", i);
            return false;
        }
    }
    return true;
}
//#define BLOCK_SIZE 64
#define BLOCK_SIZE 32

__global__ void prefixsumblock(int *in, int *out, int length);
__global__ void correctsumends(int *ends, int *in, int *out);
__global__ void gathersumends(int *in, int *out);
__global__ void zarro(int *data, int length);
• With the indicated changes, the example can be easily verified
• With the trick to force the compiler to consider both paths, we can examine the behavior under two scenarios
• The kernel verifies fine
• Seeded calculation bugs are easily caught (try breaking the computation)
Example-2: Bitonic Sorting
• CUDA SDK 2.0 example
• Can be verified for functional correctness
• Concolic verifier generates 28 (or so) paths
  – For each conditional, GKLEE forks two executions
  – Test limiting heuristics are available
    • -Path-Reduce:
      – B: Item covered by some thread at least once
      – T: Item covered by all threads at least once
__device__ inline void swap(int & a, int & b) {
    int tmp = a;
    a = b;
    b = tmp;
}

__global__ void BitonicKernel(int * values) {
    unsigned int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    printf("tid: %d, blockDim: %d\n", tid, blockDim.x);
    __syncthreads();

    // Parallel bitonic sort.
    for (unsigned int k = 2; k <= blockDim.x; k *= 2) {
        for (unsigned int j = k / 2; j > 0; j /= 2) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                } else {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            // ...
int main() {
#ifdef _SYM
    //__device__ int values[NUM];
    __input__ int *values = (int *)malloc(sizeof(int) * NUM);
#else
    __input__ int values[NUM] = {6, 5}; // , 2, 1, 4, 3};
    //__input__ int values[NUM] = {6, 5, 2, 1, 4, 3};

    // for debugging
    printf("\nInput values:\n");
    for (int i = 0; i < NUM; i++) {
        printf("%d ", values[i]);
    }
    printf("\n");
#endif
    klee_make_symbolic(values, sizeof(int)*NUM, "values");

    // ...

#ifndef _SYM
    // for debugging
    for (int i = 0; i < NUM; i++) {
        printf("%d ", values[i]);
    }
    printf("\n");
#endif

    // here blockDim.x should be NUM; we use this hack
    for (int i = 1; i < NUM; i++) {
        if (dvalues[i] < dvalues[i-1]) {
            printf("The sorting algorithm is incorrect since values[%d] < values[%d]!\n", i, i-1);
            return 1;
        }
    }

    cudaFree(dvalues);
    cudaFree(values);

    return 0;
}
Example-3: Deadlock due to incorrect __syncthreads call in dot-product
(Illustration p. 88, Sanders and Kandrot, “CUDA By Example”)
// buggy code suggested on page 88
while (i != 0) {
    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }
    i /= 2;
}

// begin corrected code as suggested on page 88
while (i != 0) {
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}
Report:
GKLEE: Thread 128 and Thread 127 encounter different barrier sequences, one hits the end of kernel, but the other does not!
t128 found a deadlock: #barriers at the threads:
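Why threads 127 and 128 disagree can be seen by counting how many times each thread would reach __syncthreads() in the two loop variants. The sketch below is a host-side simulation, not GKLEE output. It assumes a block of 256 threads with `i` starting at `blockDim / 2`; that starting value is not visible in the excerpt and is an assumption. The function name `barrierCounts` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts, for every thread, how often it would execute __syncthreads() in
// the reduction loop. barrierInsideIf == true models the buggy variant
// (barrier inside the 'if'); false models the corrected variant.
std::vector<int> barrierCounts(int blockDim, bool barrierInsideIf) {
    std::vector<int> counts(blockDim, 0);
    for (int cacheIndex = 0; cacheIndex < blockDim; ++cacheIndex) {
        for (int i = blockDim / 2; i != 0; i /= 2) {
            if (!barrierInsideIf || cacheIndex < i) {
                ++counts[cacheIndex];
            }
        }
    }
    return counts;
}

// True when every thread reaches the same number of barriers.
bool allEqual(const std::vector<int>& v) {
    for (std::size_t k = 1; k < v.size(); ++k) {
        if (v[k] != v[0]) return false;
    }
    return true;
}
```

With 256 threads, the buggy variant has thread 127 reach one barrier and thread 128 reach none, while the corrected variant has every thread reach the same eight barriers. A mismatch in these counts is the kind of barrier misalignment the report describes.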
Example-4: Symbolic verification of multi-kernel example
Verify symbolically that (AB)^T = B^T A^T, for matrices A, B
int main(int argc, char* argv[])
{
    // const unsigned int seed = 99;
    //doGkleeTransposeTest();
    //doGkleeMultTest();

    // A^T ...
    int *A, *AT; // A: [64 * 32]
    cudaMalloc((void **)&A, sizeof(int) * AN);
    cudaMalloc((void **)&AT, sizeof(int) * AN);
    // Make the input 'A' as symbolic...
    klee_make_symbolic(A, sizeof(int) * AN, "A_var");
    __modify_Grid(GRIDSIZE_X, P/BLOCKSIZE);   // (1, 2)
    __modify_Block(BLOCKSIZE, BLOCKSIZE);     // (8, 8)
    __begin_GPU();
    MatTrans(A, AT);
    __end_GPU();
    printf("After A's transpose!\n");

    // B * A = T
    int *T;
    cudaMalloc((void **)&T, sizeof(int) * CN);
    __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X);
    __modify_Block(BLOCKSIZE, BLOCKSIZE);
    __begin_GPU();
    matrixMul(B, A, T, P, DIM_X);
    __end_GPU();

    // T^T = C'
    int *C_P;
    cudaMalloc((void **)&C_P, sizeof(int) * CN);
    __modify_Grid(GRIDSIZE_X, GRIDSIZE_Y);
    __modify_Block(BLOCKSIZE, BLOCKSIZE);
    __begin_GPU();
    MatTrans(T, C_P);
    __end_GPU();
    // ...
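The property being verified is the identity (AB)^T = B^T A^T. The host-side sketch below checks it for concrete, small matrices only. That is the kind of spot check symbolic verification is meant to generalize, and it is not a substitute for the symbolic run. The helpers `transpose`, `multiply`, and `transposeIdentityHolds` are hypothetical and unrelated to the `MatTrans` and `matrixMul` kernels named in the excerpt; they assume non-empty rectangular matrices with compatible dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Transpose of a non-empty rectangular matrix.
Matrix transpose(const Matrix& m) {
    Matrix t(m[0].size(), std::vector<int>(m.size(), 0));
    for (std::size_t r = 0; r < m.size(); ++r)
        for (std::size_t c = 0; c < m[0].size(); ++c)
            t[c][r] = m[r][c];
    return t;
}

// Product a * b; requires a's column count to equal b's row count.
Matrix multiply(const Matrix& a, const Matrix& b) {
    Matrix p(a.size(), std::vector<int>(b[0].size(), 0));
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < b[0].size(); ++c)
            for (std::size_t k = 0; k < b.size(); ++k)
                p[r][c] += a[r][k] * b[k][c];
    return p;
}

// Checks (AB)^T == B^T A^T for one concrete pair of matrices.
bool transposeIdentityHolds(const Matrix& a, const Matrix& b) {
    return transpose(multiply(a, b)) == multiply(transpose(b), transpose(a));
}
```

A seeded mistake, such as multiplying the transposes in the wrong order, makes the comparison fail for most inputs, which is the "broken calculation immediately caught" behaviour the example is meant to show.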
Concluding Remarks
• A concolic verifier for CUDA/C++
• Detects correctness / performance issues
• High coverage, automatic test generation
• Tool finds issues in well-known kernels (SDK)
• Tool demos during the talk will illustrate these examples
  – Can provide a LiveDVD or ISO image (will be posted at the URL below)
• Our paper provides details on all the issues glossed over here
  – Paper, user manual, and example code available from http://www.cs.utah.edu/fv/GKLEE
• Support for CUDA 4.0 features
  – Atomics + SIMD
  – GPU2GPU transfers
  – GPU + MPI
• Incorporate into GPU-oriented compilation frameworks
  – E.g., OpenACC, others
• Suggestions are welcome
Extra Slides
[Figure, repeated on each slide: threads P1, P2, ..., Pi, Pi+1, ..., Pj, ... with accesses A and B marked]
• Consider an arbitrary schedule that brings the execution to the illustrated point, where a race FIRST occurs. The race is between A and B.
• Then clearly, the red execution is equivalent to the race-free execution, because it occurs in the race-free region of the execution space.
• Our canonical schedule is shown by the dashed edges.
• The extra executions should not matter, unless they themselves race! But that race would then be caught!
• So in the absence of ANY race, ALL schedules within a barrier interval are equivalent.
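The equivalence claim can be checked by brute force on a toy model. The sketch below makes several assumptions that do not come from the slides: each thread performs exactly one read-modify-write `mem[addr] = mem[addr] * 2 + delta` within the barrier interval, a schedule is any ordering of the threads, and memory starts at zero. The update is deliberately order-sensitive so that a race shows up as differing results. The names `Op` and `countDistinctOutcomes` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// The single operation of one thread in the barrier interval:
// mem[addr] = mem[addr] * 2 + delta.
struct Op {
    int addr;
    int delta;
};

// Runs every schedule (every ordering of the threads) and returns the
// number of distinct final memory states that result.
int countDistinctOutcomes(const std::vector<Op>& ops, int memSize) {
    std::vector<int> order(ops.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        order[i] = static_cast<int>(i);  // ascending: first permutation
    }
    std::set<std::vector<int>> outcomes;
    do {
        std::vector<int> mem(memSize, 0);
        for (int idx : order) {
            const Op& op = ops[idx];
            mem[op.addr] = mem[op.addr] * 2 + op.delta;
        }
        outcomes.insert(mem);
    } while (std::next_permutation(order.begin(), order.end()));
    return static_cast<int>(outcomes.size());
}
```

When the threads touch disjoint addresses, every schedule yields the same memory, so exploring one canonical schedule loses nothing. When two threads update the same address, the schedules disagree, and that conflict is exactly what a race check on the canonical schedule would flag.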