Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.


Previously

CUDA Runtime Component
– Common Component
    Built-in vector types
    Math functions
    Timing
    Textures
    – Texture fetch
    – Texture reference
    – Texture read modes
    – Normalized texture coordinates
    – Linear texture filtering

Today

CUDA Runtime Component
– Common Component
– Device Component
– Host Component

CUDA Runtime Component

Common Component
Device Component
Host Component

Device Runtime Component

Can only be used in device code
Math functions
– Faster, less accurate versions of functions from the common component
– Named __<common_function_name>
    e.g. logf and __logf
– See Appendix B of the Programming Guide
– To use the fast versions by default
    Compiler option -use_fast_math
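The accuracy trade-off can be observed directly. The following is a minimal sketch, not taken from the lecture: the kernel name logBoth and the host helper maxLogDifference are hypothetical, and the inputs 1, 2, ..., n are an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Each thread evaluates the natural logarithm of one input twice:
// with logf (common component) and with the __logf intrinsic (device only).
__global__ void logBoth(const float *in, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);
        fast[i]     = __logf(in[i]);
    }
}

// Hypothetical helper: returns the largest absolute difference between the
// two versions over the inputs 1, 2, ..., n (n > 0), or -1.0f on a CUDA error.
float maxLogDifference(int n)
{
    std::vector<float> h_in(n), h_acc(n), h_fast(n);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f + i;

    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_acc = nullptr, *d_fast = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_acc, bytes) == cudaSuccess &&
              cudaMalloc(&d_fast, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
        const int threads = 256, blocks = (n + threads - 1) / threads;
        logBoth<<<blocks, threads>>>(d_in, d_acc, d_fast, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_acc.data(), d_acc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(h_fast.data(), d_fast, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in); cudaFree(d_acc); cudaFree(d_fast);  // cudaFree(nullptr) is a no-op
    if (!ok) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(h_acc[i] - h_fast[i]));
    return worst;
}
```

When the file is compiled with -use_fast_math, logf is itself mapped to the intrinsic, so the reported difference should then be zero.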

Device Runtime Component

Synch function: __syncthreads()
– Synchronizes all threads in a block
– Avoids read-after-write, write-after-read and write-after-write hazards for commonly accessed shared memory
– Dangerous to use in conditionals
    Unless every thread of the block takes the same branch, the code may hang or produce unwanted effects
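A read-after-write hazard on shared memory, and the barrier that removes it, might look as follows. This is a sketch under assumptions: the kernel reverseBlock, the helper reverseEachBlock and the block size of 256 are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // assumed block size for this sketch

// Reverses the elements handled by each block through shared memory.
__global__ void reverseBlock(int *data)
{
    __shared__ int tile[BLOCK];
    const int t = threadIdx.x;
    const int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];
    // Every thread reads an element written by a different thread, so all
    // writes to tile must be finished first. The barrier is reached by all
    // threads of the block: it is not inside a conditional.
    __syncthreads();
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical helper: reverses every consecutive run of BLOCK elements in v.
// Returns false if v.size() is zero, not a multiple of BLOCK, or on a CUDA error.
bool reverseEachBlock(std::vector<int> &v)
{
    if (v.empty() || v.size() % BLOCK != 0) return false;
    const size_t bytes = v.size() * sizeof(int);
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    cudaMemcpy(d_data, v.data(), bytes, cudaMemcpyHostToDevice);
    reverseBlock<<<(unsigned)(v.size() / BLOCK), BLOCK>>>(d_data);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(v.data(), d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_data);
    return ok;
}
```

Removing the __syncthreads() call would let some threads read tile entries that have not been written yet.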

Device Runtime Component

Atomic functions
– Guaranteed to be performed without interference from other threads
    The memory address is locked for the duration of the operation
– Supported by devices of compute capability higher than 1.0
– Mostly operate on integers only
– See Appendix C of the Programming Guide
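As an illustration of an atomic function, many threads can safely increment one shared counter with atomicAdd. The sketch below is an assumption-laden example, not part of the lecture: the kernel countEven and the helper countEvenOnDevice are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread that sees an even value increments the same counter.
// Without atomicAdd, concurrent read-modify-write sequences would lose updates.
__global__ void countEven(const int *in, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0)
        atomicAdd(counter, 1);
}

// Hypothetical helper: returns the number of even values in v (v non-empty),
// or -1 on a CUDA error.
int countEvenOnDevice(const std::vector<int> &v)
{
    const int n = (int)v.size();
    int *d_in = nullptr, *d_counter = nullptr;
    int result = -1;

    if (cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&d_counter, sizeof(int)) == cudaSuccess) {
        cudaMemcpy(d_in, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(d_counter, 0, sizeof(int));  // counter starts at zero
        countEven<<<(n + 255) / 256, 256>>>(d_in, n, d_counter);
        int count = 0;
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(&count, d_counter, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess)
            result = count;
    }
    cudaFree(d_in);
    cudaFree(d_counter);
    return result;
}
```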

Device Runtime Component

Warp vote functions
– Supported by devices of compute capability >= 1.2
– Check a condition on all threads in a warp

    int __all (int predicate)
    true (non-zero) if predicate is true for all warp threads

    int __any (int predicate)
    true (non-zero) if predicate is true for any warp thread
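A warp vote could be sketched as below. Note the assumption: the code uses the __all_sync and __any_sync variants, which take an explicit mask of participating threads and are what recent CUDA toolkits provide in place of the __all and __any functions listed above. The kernel voteKernel and the helper warpVote are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Launched with exactly one warp (32 threads). Thread 0 stores the votes.
__global__ void voteKernel(const int *in, int *results)
{
    const int v = in[threadIdx.x];
    // Mask 0xffffffff: all 32 threads of the warp take part in the vote.
    const int allPositive = __all_sync(0xffffffffu, v > 0);
    const int anyNegative = __any_sync(0xffffffffu, v < 0);
    if (threadIdx.x == 0) {
        results[0] = (allPositive != 0);
        results[1] = (anyNegative != 0);
    }
}

// Hypothetical helper: votes over exactly 32 values.
// Returns false if values.size() != 32 or a CUDA call fails.
bool warpVote(const std::vector<int> &values, bool &allPositive, bool &anyNegative)
{
    if (values.size() != 32) return false;
    int *d_in = nullptr, *d_results = nullptr;
    int h_results[2] = {0, 0};
    bool ok = cudaMalloc(&d_in, 32 * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&d_results, 2 * sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, values.data(), 32 * sizeof(int), cudaMemcpyHostToDevice);
        voteKernel<<<1, 32>>>(d_in, d_results);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_results, d_results, sizeof(h_results), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_results);
    allPositive = (h_results[0] != 0);
    anyNegative = (h_results[1] != 0);
    return ok;
}
```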

Device Runtime Component

Texture functions: fetching textures, or texturing
– Texture data may be stored in linear memory or CUDA arrays
– Texturing from linear memory

    template<class Type>
    Type tex1Dfetch(
        texture<Type, 1, cudaReadModeElementType> texRef,
        int x);

    float tex1Dfetch(
        texture<Type, 1, cudaReadModeNormalizedFloat> texRef,
        int x);


Texture functions: fetching textures, or texturing
– Texturing from linear memory
– Type can be any of the supported 1-, 2- or 4-component vector types

    template<class Type>
    Type tex1Dfetch(
        texture<Type, 1, cudaReadModeElementType> texRef,
        int x);

    float4 tex1Dfetch(
        texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef,
        int x);


Texture functions: fetching textures, or texturing
– Texturing from linear memory
– No addressing modes supported
– No texture filtering supported
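Texturing from linear memory can be sketched as follows. An important assumption: the lecture describes texture references (texture<Type, 1, readMode>), whereas this sketch uses the texture object API (cudaTextureObject_t), which is what recent CUDA toolkits offer instead. The kernel readThroughTexture and the helper copyViaTexture are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Reads element i of a linear-memory buffer through the texture path.
// tex1Dfetch takes an integer index: no addressing mode, no filtering.
__global__ void readThroughTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Hypothetical helper: copies in to out via texture fetches (in non-empty).
bool copyViaTexture(const std::vector<float> &in, std::vector<float> &out)
{
    const int n = (int)in.size();
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the linear memory that backs the texture.
    cudaResourceDesc res;
    std::memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;

    // Element-type read mode: values are returned unchanged.
    cudaTextureDesc td;
    std::memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    bool ok = cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess;
    if (ok) {
        readThroughTexture<<<(n + 255) / 256, 256>>>(tex, d_out, n);
        out.resize(n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaDestroyTextureObject(tex);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```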


Texture functions: fetching textures, or texturing
– Texturing from CUDA arrays

    template<class Type, enum cudaTextureReadMode readMode>
    Type tex1D(texture<Type, 1, readMode> texRef,
               float x);

    template<class Type, enum cudaTextureReadMode readMode>
    Type tex2D(texture<Type, 2, readMode> texRef,
               float x, float y);

    template<class Type, enum cudaTextureReadMode readMode>
    Type tex3D(texture<Type, 3, readMode> texRef,
               float x, float y, float z);


Texture functions: fetching textures, or texturing
– Texturing from CUDA arrays
– Run-time attributes determine
    Coordinate normalization
    Addressing mode (clamp/wrap)
    Filtering

CUDA Runtime Component

Common Component
Device Component
Host Component

Host Runtime Component

Can only be used by host functions
Composed of 2 APIs
– High-level CUDA runtime API, which runs on top of
– Low-level CUDA driver API
No mixing: an application should use either one or the other.

Each API provides functions for
– Device management
– Context management
– Memory management
– Code module management
– Execution control
– Texture reference management
– OpenGL/Direct3D interoperability

Host Runtime Component

The CUDA runtime API implicitly provides
– Initialization
– Context management
– Module management

The CUDA driver API does not, and is harder to program.


Recall: nvcc parses an input source file
– Separates device and host code
– Device code is compiled to a cubin object
– Generated host code in C is compiled by an external tool


Generated host code
– Is in C format
– Includes the cubin object

Applications may
– Ignore the host code and run the cubin object directly using the low-level CUDA driver API
– Link to the generated host code and launch it using the high-level CUDA runtime API


The CUDA driver API
– Is harder to program
– Offers greater control
– Does not depend on C
– Does not offer device emulation


CUDA runtime functions and other entry points are prefixed by cuda

CUDA driver functions and other entry points are prefixed by cu


Device memory is always allocated as either of
– Linear memory
– CUDA arrays

Host Runtime Component - detour

Linear memory in device
– Contiguous segment of memory
– 32-bit addresses
– Can be referenced using pointers
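Because linear memory is contiguous and pointer-addressable, a kernel can index it like an ordinary array. A minimal sketch using the runtime API follows; the kernel scale and the helper scaleOnDevice are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// d_data is a plain device pointer into linear memory.
__global__ void scale(float *d_data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

// Hypothetical helper: multiplies every element of v (non-empty) by factor
// on the device. Returns false on a CUDA error.
bool scaleOnDevice(std::vector<float> &v, float factor)
{
    const int n = (int)v.size();
    const size_t bytes = n * sizeof(float);
    float *d_data = nullptr;

    // Allocate linear memory and copy the host data into it.
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_data, v.data(), bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_data, n, factor);

    // The blocking copy back also waits for the kernel to finish.
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(v.data(), d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_data);
    return ok;
}
```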


CUDA arrays
– "Opaque" memory layout
– 1D/2D/3D arrays of 1/2/4-component vectors of 8/16/32-bit integers or 16/32-bit floats
    16-bit floats from driver API only
– Optimized for texture fetching
– Accessible from kernels through texture fetches only




Both the CUDA runtime and CUDA driver APIs
– Can access device information
– Enable the host to read/write to linear memory/CUDA arrays
    With support for pinned memory
– Provide OpenGL/Direct3D interoperability
– Provide management for asynchronous execution


Asynchronous functions
– Kernel launches, and some others
– Async memory copies
– Device <-> device memory copies
– Memory setting

Concurrent execution of functions is managed through streams


Streams
– A queue of operations
– An application may have multiple stream objects simultaneously
– kernel<<<Ng,Nb,Ns,S>>>
– A kernel can be scheduled to execute on a stream
– Some memory copy functions can also be queued on a stream
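Queuing copies and a kernel on streams might be sketched as below, using the runtime API. The kernel addOne, the helper addOneWithTwoStreams, and the choice of two streams that each handle half of the array are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: fills an array with 0, 1, ..., n-1, adds one to every
// element using two streams, and returns the result.
// Assumes n is even and > 0. Returns an empty vector on a CUDA error.
std::vector<float> addOneWithTwoStreams(int n)
{
    const int half = n / 2;
    const size_t halfBytes = half * sizeof(float);
    float *h_pinned = nullptr, *d_data = nullptr;
    std::vector<float> result;

    // Asynchronous copies need page-locked (pinned) host memory.
    if (cudaMallocHost(&h_pinned, 2 * halfBytes) != cudaSuccess) return result;
    if (cudaMalloc(&d_data, 2 * halfBytes) != cudaSuccess) {
        cudaFreeHost(h_pinned);
        return result;
    }
    for (int i = 0; i < n; ++i) h_pinned[i] = (float)i;

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    // Each stream gets its own copy-in, kernel, copy-out sequence.
    // The fourth launch parameter selects the stream.
    const int threads = 256, blocks = (half + threads - 1) / threads;
    for (int s = 0; s < 2; ++s) {
        float *h = h_pinned + s * half;
        float *d = d_data + s * half;
        cudaMemcpyAsync(d, h, halfBytes, cudaMemcpyHostToDevice, stream[s]);
        addOne<<<blocks, threads, 0, stream[s]>>>(d, half);
        cudaMemcpyAsync(h, d, halfBytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for both queues to drain before reading the host buffer.
    bool ok = true;
    for (int s = 0; s < 2; ++s) {
        ok = (cudaStreamSynchronize(stream[s]) == cudaSuccess) && ok;
        cudaStreamDestroy(stream[s]);
    }
    if (ok) result.assign(h_pinned, h_pinned + n);
    cudaFreeHost(h_pinned);
    cudaFree(d_data);
    return result;
}
```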


Streams
– If no stream is specified, stream 0 is used by default.
– Operations within a stream are executed in order
    Previous stream operations have to end before a new one begins


CUDA runtime and driver APIs provide execution control through stream management
– <cu/cuda>StreamQuery()
    Is stream free?
– <cu/cuda>StreamSynchronize()
    Wait for stream operations to end


CUDA runtime and driver APIs provide execution control through stream management
– cudaThreadSynchronize() / cuCtxSynchronize()
    Wait for all streams to be free
– <cu/cuda>StreamDestroy()
    Wait for stream to get free
    Destroy stream
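The difference between querying and synchronizing can be sketched with the runtime API: cudaStreamQuery returns immediately, so the host can poll it while doing other work. The kernel fill and the helper pollsUntilStreamIdle are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void fill(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical helper: launches a kernel on its own stream and polls the
// stream until it is free. Returns how many polls reported "not ready"
// (possibly 0), or -1 on a CUDA error. Assumes n > 0.
int pollsUntilStreamIdle(int n)
{
    int *d_data = nullptr;
    cudaStream_t st;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaStreamCreate(&st) != cudaSuccess) { cudaFree(d_data); return -1; }

    fill<<<(n + 255) / 256, 256, 0, st>>>(d_data, n, 7);  // returns at once

    // cudaStreamQuery does not block: it returns cudaErrorNotReady while
    // operations are still pending and cudaSuccess once the stream is free.
    int polls = 0;
    cudaError_t status;
    while ((status = cudaStreamQuery(st)) == cudaErrorNotReady)
        ++polls;  // the host could do useful work here instead

    cudaStreamDestroy(st);
    cudaFree(d_data);
    return status == cudaSuccess ? polls : -1;
}
```

Replacing the polling loop with a single cudaStreamSynchronize(st) call would block the host until the stream is free.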


Accurate timing using events
– Events have to be created
    CUevent / cudaEvent_t start, stop;
    <cu/cuda>EventCreate (&start);
    <cu/cuda>EventCreate (&stop);
– Events have to be recorded
    <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
– Stream 0: the event is recorded after all preceding operations from all streams
– Stream N: the event is recorded after the preceding operations in stream N


Accurate timing using events
– Record, synchronize, then read the elapsed time
    <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
    <cu/cuda>EventSynchronize (stop);
    float time;
    <cu/cuda>EventElapsedTime (&time, start, stop);
– As the call to record is asynchronous, the event has to be synchronized before timing
– Destroy the events when done
    <cu/cuda>EventDestroy (start);
    <cu/cuda>EventDestroy (stop);
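Put together with the runtime API, event timing of a kernel might look like this sketch. The kernel square and the helper timeSquareKernelMs are hypothetical; cudaEventElapsedTime reports milliseconds.

```cuda
#include <cuda_runtime.h>

__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Hypothetical helper: returns the time in milliseconds taken by one launch
// of square over n elements (n > 0), or -1.0f on a CUDA error.
float timeSquareKernelMs(int n)
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                       // asynchronous
    square<<<(n + 255) / 256, 256>>>(d_data, n);     // stuff to time
    cudaEventRecord(stop, 0);                        // asynchronous

    // The record calls return immediately, so wait until the stop event
    // has actually been reached before reading the elapsed time.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return err == cudaSuccess ? ms : -1.0f;
}
```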


Asynchronous execution can get confusing
– Can be switched off
– Useful for debugging
– Set the environment variable CUDA_LAUNCH_BLOCKING to 1


Device Initialization
– CUDA Runtime API
    Automatically with first function call
– CUDA Driver API
    cuInit()
    MUST be called before calling any other API function
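With the driver API, explicit initialization could be sketched as follows. The helper driverDeviceCount is a hypothetical name; the program has to be linked against the driver library (for example with -lcuda), which is an assumption about the build setup.

```cuda
#include <cuda.h>
#include <cstdio>

// Hypothetical helper: initializes the driver API, prints the name of every
// device, and returns the number of devices, or -1 on a driver error.
int driverDeviceCount()
{
    // cuInit must come first: other driver calls fail before it.
    // The flags argument must currently be 0.
    if (cuInit(0) != CUDA_SUCCESS) return -1;

    int count = 0;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS) return -1;

    for (int ordinal = 0; ordinal < count; ++ordinal) {
        CUdevice dev;
        char name[256];
        if (cuDeviceGet(&dev, ordinal) == CUDA_SUCCESS &&
            cuDeviceGetName(name, (int)sizeof(name), dev) == CUDA_SUCCESS)
            std::printf("device %d: %s\n", ordinal, name);
    }
    return count;
}
```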


Device Management
– cudaDeviceProp / CUdevice device;
– int devCount;
    cudaGetDeviceCount (&devCount) / cuDeviceGetCount (&devCount)
– for dev = 0 to devCount - 1 do
    cudaGetDeviceProperties / cuDeviceGet (&device, dev)


Device Management
– cudaSetDevice()
    Sets the device to be used
    MUST be set before calling any __global__ function
    Device 0 used by default
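With the runtime API, device enumeration and selection might be sketched as below. The helper listDevicesAndSelectFirst is a hypothetical name; device indices run from 0 to devCount - 1.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints name and compute capability of every device,
// selects device 0 if there is one, and returns the device count
// (-1 on a CUDA error).
int listDevicesAndSelectFirst()
{
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess) return -1;

    for (int dev = 0; dev < devCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        std::printf("device %d: %s, compute capability %d.%d\n",
                    dev, prop.name, prop.major, prop.minor);
    }

    // Explicit selection. Without it, device 0 is used by default.
    if (devCount > 0 && cudaSetDevice(0) != cudaSuccess) return -1;
    return devCount;
}
```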


Stream Management
– CUstream / cudaStream_t st;
– cudaStreamCreate (&st); / cuStreamCreate (&st, 0);
– cudaStreamDestroy (st); / cuStreamDestroy (st);




Event management
– CUevent / cudaEvent_t start, stop;
    <cu/cuda>EventCreate (&start);
    <cu/cuda>EventCreate (&stop);
    <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
    <cu/cuda>EventSynchronize (stop);
    float time;
    <cu/cuda>EventElapsedTime (&time, start, stop);
    <cu/cuda>EventDestroy (start);
    <cu/cuda>EventDestroy (stop);


All for today

Next time
– More on the host runtime APIs
    Memory, stream, event, texture management
    Debug mode for runtime API
    Context, module, execution control for driver API
– Performance & Optimization

See you next week!
