Programming with CUDA
WS 08/09
Lecture 3
Thu, 30 Oct, 2008
Previously

CUDA programming model
– GPU as co-processor
– Kernel definition and invocation
– Thread blocks: 1D, 2D, 3D
– Thread ID and threadIdx
– Global/shared memory for threads
– Compute capability
Today

Theory/practical course?
CUDA programming model
– Limitations on the number of threads
– Grids of thread blocks
Today

Theory/practical course?
– The course is meant to be practical
– Programming with CUDA
– Is that a problem for some of you?
– Should we change something?
The CUDA Programming Model (cont'd)
Number of threads

A kernel is executed on the device simultaneously by many threads
dim3 blockSize(Dx, Dy, Dz);
// for a 2D block, Dz = 1
// for a 1D block, Dy = Dz = 1
kernel<<<1, blockSize>>>(...);

– # threads per block = Dx*Dy*Dz
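As a concrete sketch of such a launch (the kernel name, sizes, and output buffer here are illustrative, not from the slides), a single 2D block could be invoked like this:

```cuda
#include <cstdio>

// Illustrative kernel: each thread writes its flattened in-block index.
__global__ void fillIndex(int *out)
{
    // Flatten the (x, y) thread coordinates into one linear index.
    int idx = threadIdx.y * blockDim.x + threadIdx.x;
    out[idx] = idx;
}

int main()
{
    const int Dx = 8, Dy = 4;       // 2D block, so Dz = 1
    dim3 blockSize(Dx, Dy);         // Dz defaults to 1

    int *d_out;
    cudaMalloc((void **)&d_out, Dx * Dy * sizeof(int));

    // One block of Dx*Dy = 32 threads executes the kernel.
    fillIndex<<<1, blockSize>>>(d_out);
    cudaDeviceSynchronize();        // wait for the device to finish

    cudaFree(d_out);
    return 0;
}
```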
A bit about the hardware

The GPU consists of several multiprocessors
Each multiprocessor consists of several processors
Each processor in a multiprocessor has its own local memory in the form of registers
All processors in a multiprocessor have access to a shared memory
Threads and processors

All threads in a block run on the same multiprocessor.
– They might not all run at the same time
– Therefore, threads should be independent of each other
– __syncthreads() causes all threads to reach the same execution point before carrying on
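A typical situation where __syncthreads() is needed: threads first fill a shared array together, then read elements that other threads wrote. The kernel below (an illustrative sketch, not from the slides) reverses a block-sized array in shared memory:

```cuda
// Reverse an array of n elements within a single block.
// The barrier guarantees every element is written before any is read.
__global__ void reverse(int *data, int n)
{
    extern __shared__ int tmp[];   // size supplied at launch time
    int i = threadIdx.x;

    if (i < n)
        tmp[i] = data[i];          // each thread writes one element

    __syncthreads();               // wait until tmp[] is fully populated

    if (i < n)
        data[i] = tmp[n - 1 - i];  // read a value another thread wrote
}

// Host side: reverse<<<1, n, n * sizeof(int)>>>(d_data, n);
```

Without the barrier, a thread could read tmp[n - 1 - i] before the thread responsible for that element has written it.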
Threads and processors

How many threads can run on a multiprocessor? That depends on:
– how much memory the multiprocessor has
– how much memory each thread requires
Threads and processors

How many threads can a block have? That depends on:
– how much memory the multiprocessor has
– how much memory each thread requires
Grids of Blocks

What if I want to run more threads?
– Launch multiple blocks of threads
– These form a grid of blocks

A grid can be 1D or 2D
Grids of Blocks

Example of a 1D grid. Invoke (in main):

int N;
// assign some value to N
dim3 blockDimension(N, N);
kernel<<<N, blockDimension>>>(...);
Example of a 2D grid. Invoke (in main):

int N;
// assign some value to N
dim3 blockDimension(N, N);
dim3 gridDimension(N, N);
kernel<<<gridDimension, blockDimension>>>(...);
Grids of Blocks

Invoking a grid:

kernel<<<gridDimension, blockDimension>>>(...);

– # threads = gridDimension * blockDimension, i.e. the number of blocks in the grid times the number of threads per block
Accessing block information

Grids can be 1D or 2D
The index of a block in a grid is available through the blockIdx variable
The dimensions of a block are available through the blockDim variable
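Combining blockIdx, blockDim, and threadIdx gives each thread a unique global index, which is the standard idiom for distributing the elements of a large array over a grid (the kernel name and launch parameters here are illustrative):

```cuda
// Each thread processes one element of an n-element array.
__global__ void scale(float *data, float factor, int n)
{
    // Global index: which block we are in, times the block width,
    // plus our position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)              // guard: the last block may be partly unused
        data[i] *= factor;
}

// Host side: round the block count up so all n elements are covered.
// int threadsPerBlock = 256;
// int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
```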
Arranging blocks

Threads in a block should be independent of other threads in the block
Blocks in a grid should be independent of other blocks in the grid
Memory available to threads

Each thread has a local memory
Threads in a block share a shared memory
All threads can access the global memory
Memory available to threads

All threads have read-only access to the constant and texture memories
Memory available to threads

An application is expected to manage:
– the global, constant, and texture memory spaces
– data transfer between host and device memories
– (de)allocating host and device memory
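In practice this management is done through the CUDA runtime API. A minimal host-side round trip (the buffer names and size are illustrative) looks like:

```cuda
#include <cstdlib>

int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host memory.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    // Allocate device (global) memory.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);

    // Copy host -> device, (a kernel launch would go here),
    // then copy the results back device -> host.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Free device and host memory.
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```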
Have a nice weekend.
See you next time!