1
ITCS 4/5010 GPU Programming, UNC-Charlotte, B. Wilkinson, Jan 14, 2013 CUDAProgModel.ppt
CUDA Programming Model
These notes will introduce:
• Basic GPU programming model
• CUDA kernel
• Simple CUDA program to add two vectors together
• Compiling the code on a Linux system
2
Programming Model
Historically, GPUs were designed for creating image data for displays.

That application involves manipulating image pixels (picture elements), often with the same operation applied to each pixel.

SIMD (single instruction, multiple data) model -- an efficient mode of operation in which the same operation is done on each data element at the same time.
3
SIMD (Single Instruction Multiple Data) model
Also known as data parallel computation (pattern). One instruction specifies the operation:

Instruction: a[] = a[] + k

[Diagram: multiple ALUs apply the operation to elements a[0], a[1], …, a[n-2], a[n-1] simultaneously]

Very efficient if this is what you want to do. One program. Computers can be designed to operate this way.
4
Single Instruction Multiple Thread Programming Model (SIMT)
A version of SIMD used in GPUs.
GPUs use a thread model to achieve very high parallel performance and to hide memory latency.

Multiple threads each execute the same instruction sequence.

On a GPU, a very large number of threads (tens of thousands) is possible.

Threads are mapped onto the available processors on the GPU (hundreds of processors, all executing the same program sequence).
5
Programming applications using SIMT model
Matrix operations -- very amenable to SIMT
• Same operations done on different elements of matrices

Some "embarrassingly" parallel computations such as Monte Carlo calculations
• Monte Carlo calculations use random selections
• Random selections are independent of each other

Data manipulations
• Some sorting can be done quite efficiently
…
6
To write a SIMT program, one needs to write a code sequence that all the threads on the GPU will execute.

In CUDA, this code sequence is called a kernel routine.

Kernel code will be regular C, except that one typically needs to use the thread ID in expressions to ensure that each thread accesses different data:
devA and devB are pointers to the destination in device memory; a and b are pointers to the source data on the host.
12
4. Declaring “kernel” routine to execute on device (GPU)
CUDA introduces a syntax addition to C: triple angle brackets mark a call from host code to device code. They contain the organization and number of threads in two parameters:
myKernel<<< n, m >>>(arg1, … );
n and m will define organization of thread blocks and threads in a block.
For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.

arg1, … -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
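When more than one block is used later, the built-in variables blockIdx and blockDim combine with threadIdx to give each thread a unique global index. A minimal sketch of this standard CUDA pattern (a generalization beyond the n = 1 case on this slide):

```cuda
// Launched as: myKernel<<<n, m>>>(devData);  -- n blocks of m threads
__global__ void myKernel(int *data) {
    // Unique global index across all blocks:
    // block number * threads per block + thread number within block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;   // each thread accesses a different element
}
```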
13
Example – Adding two vectors A and B
#define N 256

__global__ void vecAdd(int *a, int *b, int *c) {   // Kernel definition
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC
    vecAdd<<<1, N>>>(devA, devB, devC);   // Grid of one block, N threads in block
    …
}
Loosely derived from CUDA C programming guide, v 3.2, 2010, NVIDIA
Declaring a Kernel Routine
Each of the N threads performs one pair-wise addition:

Thread 0:   devC[0] = devA[0] + devB[0];
Thread 1:   devC[1] = devA[1] + devB[1];
…
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
threadIdx is a CUDA structure that provides the thread ID within the block.

A kernel is defined using the CUDA specifier __global__ (two underscores on each side).
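The allocation and copy steps elided in the example above might be filled in as follows. This is a sketch using the usual cudaMalloc/cudaMemcpy/cudaFree runtime calls with error checking omitted, not the exact code from the original slides:

```cuda
#include <stdio.h>

#define N 256

__global__ void vecAdd(int *a, int *b, int *c) {   // Kernel definition
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    int a[N], b[N], c[N];
    int *devA, *devB, *devC;               // device memory pointers

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Allocate device memory
    cudaMalloc((void **)&devA, N * sizeof(int));
    cudaMalloc((void **)&devB, N * sizeof(int));
    cudaMalloc((void **)&devC, N * sizeof(int));

    // Copy input data from host to device
    cudaMemcpy(devA, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, N * sizeof(int), cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);    // one block, N threads in block

    // Copy result back from device to host
    cudaMemcpy(c, devC, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(devA); cudaFree(devB); cudaFree(devC);

    printf("c[1] = %d\n", c[1]);           // c[1] = a[1] + b[1]
    return 0;
}
```

The device-to-host cudaMemcpy here is the step discussed in the next section.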
14
5. Transferring data from device (GPU) to host (CPU)