Programming with CUDA
WS 08/09
Lecture 3
Thu, 30 Oct, 2008
Previously

CUDA programming model
– GPU as co-processor
– Kernel definition and invocation
– Thread blocks: 1D, 2D, 3D
– Thread ID and threadIdx
– Global/shared memory for threads
– Compute capability
Today

Theory/practical course?
CUDA programming model
– Limitations on the number of threads
– Grids of thread blocks
Today

Theory/practical course?
– The course is meant to be practical
– Programming with CUDA
– Is that a problem for some of you?
– Should we change something?
The CUDA Programming Model (cont'd)
Number of threads

A kernel is executed on the device simultaneously by many threads
dim3 blockSize(Dx, Dy, Dz);
// for a 2D block, Dz = 1
// for a 1D block, Dy = Dz = 1
kernel<<<1, blockSize>>>(...);

– # threads per block = Dx*Dy*Dz
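As a concrete sketch of such a launch (the kernel name, sizes, and output buffer here are illustrative, not from the slides), a single 2D block could be invoked like this:

```cuda
#include <cstdio>

// Illustrative kernel: each thread writes its flattened in-block index.
__global__ void fillIndex(int *out)
{
    // Flatten the (x, y) thread coordinates into one linear index.
    int idx = threadIdx.y * blockDim.x + threadIdx.x;
    out[idx] = idx;
}

int main()
{
    const int Dx = 8, Dy = 4;       // 2D block, so Dz = 1
    dim3 blockSize(Dx, Dy);         // Dz defaults to 1

    int *d_out;
    cudaMalloc((void **)&d_out, Dx * Dy * sizeof(int));

    // One block of Dx*Dy = 32 threads executes the kernel.
    fillIndex<<<1, blockSize>>>(d_out);
    cudaDeviceSynchronize();        // wait for the device to finish

    cudaFree(d_out);
    return 0;
}
```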
A bit about the hardware

The GPU consists of several multiprocessors
Each multiprocessor consists of several processors
Each processor in a multiprocessor has its own local memory in the form of registers
All processors in a multiprocessor have access to a shared memory
Threads and processors

All threads in a block run on the same multiprocessor.
– They might not all run at the same time
– Therefore, threads should be independent of each other
– __syncthreads() causes all threads to reach the same execution point before carrying on
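A typical situation where __syncthreads() is needed: threads first fill a shared array together, then read elements that other threads wrote. The kernel below (an illustrative sketch, not from the slides) reverses a block-sized array in shared memory:

```cuda
// Reverse an array of n elements within a single block.
// The barrier guarantees every element is written before any is read.
__global__ void reverse(int *data, int n)
{
    extern __shared__ int tmp[];   // size supplied at launch time
    int i = threadIdx.x;

    if (i < n)
        tmp[i] = data[i];          // each thread writes one element

    __syncthreads();               // wait until tmp[] is fully populated

    if (i < n)
        data[i] = tmp[n - 1 - i];  // read a value another thread wrote
}

// Host side: reverse<<<1, n, n * sizeof(int)>>>(d_data, n);
```

Without the barrier, a thread could read tmp[n - 1 - i] before the thread responsible for that element has written it.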
Threads and processors

How many threads can run on a multiprocessor? That depends on:
– how much memory the multiprocessor has
– how much memory each thread requires
Threads and processors

How many threads can a block have? That depends on:
– how much memory the multiprocessor has
– how much memory each thread requires
Grids of Blocks

What if I want to run more threads?
– Launch multiple blocks of threads
– These form a grid of blocks

A grid can be 1D or 2D
Grids of Blocks

Example of a 1D grid. Invoke (in main):

int N;
// assign some value to N
dim3 blockDimension(N, N);
kernel<<<N, blockDimension>>>(...);
Example of a 2D grid. Invoke (in main):

int N;
// assign some value to N
dim3 blockDimension(N, N);
dim3 gridDimension(N, N);
kernel<<<gridDimension, blockDimension>>>(...);
Grids of Blocks

Invoking a grid:

kernel<<<gridDimension, blockDimension>>>(...);

– # threads = gridDimension * blockDimension, i.e. the number of blocks in the grid times the number of threads per block
Accessing block information

Grids can be 1D or 2D
The index of a block in a grid is available through the blockIdx variable
The dimensions of a block are available through the blockDim variable
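Combining blockIdx, blockDim, and threadIdx gives each thread a unique global index, which is the standard idiom for distributing the elements of a large array over a grid (the kernel name and launch parameters here are illustrative):

```cuda
// Each thread processes one element of an n-element array.
__global__ void scale(float *data, float factor, int n)
{
    // Global index: which block we are in, times the block width,
    // plus our position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)              // guard: the last block may be partly unused
        data[i] *= factor;
}

// Host side: round the block count up so all n elements are covered.
// int threadsPerBlock = 256;
// int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
```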
Arranging blocks

Threads in a block should be independent of other threads in the block
Blocks in a grid should be independent of other blocks in the grid
Memory available to threads

Each thread has a local memory
Threads in a block share a shared memory
All threads can access the global memory
Memory available to threads

All threads have read-only access to the constant and texture memories
Memory available to threads

An application is expected to manage:
– the global, constant, and texture memory spaces
– data transfer between host and device memories
– (de)allocating host and device memory
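In practice this management is done through the CUDA runtime API. A minimal host-side round trip (the buffer names and size are illustrative) looks like:

```cuda
#include <cstdlib>

int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host memory.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    // Allocate device (global) memory.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);

    // Copy host -> device, (a kernel launch would go here),
    // then copy the results back device -> host.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Free device and host memory.
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```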
Have a nice weekend.
See you next time!