Programming with CUDA, WS 08/09
Lecture 7, Thu, 13 Nov, 2008
Previously
  CUDA Runtime Component
    – Common Component
      Built-in vector types
      Math functions
      Timing
      Textures
        – Texture fetch
        – Texture reference
        – Texture read modes
        – Normalized texture coordinates
        – Linear texture filtering
Today
  CUDA Runtime Component
    – Common Component
    – Device Component
    – Host Component
CUDA Runtime Component
  Common Component
  Device Component
  Host Component
Device Runtime Component
  Can only be used in device code
  Math functions
    – Faster, less accurate versions of functions from the common component
    – Named __<common_function_name>, e.g. logf and __logf
    – Listed in Appendix B of the Programming Guide
    – To use the fast versions by default
      Compiler option -use_fast_math
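A minimal sketch of the difference between a common-component function and its fast counterpart. The kernel name logBoth and the host helper maxLogDifference are hypothetical names chosen for this illustration; the sketch assumes a CUDA-capable device and that the code is compiled without -use_fast_math (with that option, logf would itself be mapped to the fast version).

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Evaluates the common-component logf and the device-only intrinsic __logf.
__global__ void logBoth(const float* in, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);    // standard version
        fast[i]     = __logf(in[i]);  // faster, less accurate version
    }
}

// Hypothetical helper: largest |logf(x) - __logf(x)| over n samples in [1, 2).
// Returns -1 if any CUDA call fails. Assumes n > 0.
float maxLogDifference(int n)
{
    std::vector<float> in(n), accurate(n), fast(n);
    for (int i = 0; i < n; ++i) in[i] = 1.0f + static_cast<float>(i) / n;

    size_t bytes = n * sizeof(float);
    float *dIn = 0, *dAcc = 0, *dFast = 0;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dAcc, bytes);
    cudaMalloc((void**)&dFast, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    logBoth<<<(n + 255) / 256, 256>>>(dIn, dAcc, dFast, n);

    // These blocking copies also wait for the kernel to finish.
    cudaMemcpy(accurate.data(), dAcc, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(fast.data(), dFast, bytes, cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();  // picks up any earlier failure
    cudaFree(dIn);
    cudaFree(dAcc);
    cudaFree(dFast);
    if (err != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(accurate[i] - fast[i]));
    return worst;
}
```

Calling maxLogDifference(1024) from host code gives a rough feel for how far the two versions drift apart on a given device; the exact value depends on the hardware and toolkit.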
Device Runtime Component
  Synchronization function: __syncthreads()
    – Synchronizes all threads in a block
    – Avoids read-after-write, write-after-read, and write-after-write hazards on shared memory accessed by several threads
    – Dangerous to use in conditionals
      If some threads of the block skip the call, the code may hang or produce unwanted effects
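A minimal sketch of the barrier in use: one block reverses a small array through shared memory. The names reverseInBlock, reverseOnDevice and MAX_N are hypothetical, and the sketch assumes the whole array fits in a single block. Note that the barrier sits outside the bounds check, so every thread of the block reaches it.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define MAX_N 256  // assumption: the whole array fits in one block

__global__ void reverseInBlock(int* data, int n)
{
    __shared__ int tile[MAX_N];
    int t = threadIdx.x;
    if (t < n) tile[t] = data[t];          // each thread writes one element
    __syncthreads();                       // outside the conditional: all threads reach it
    if (t < n) data[t] = tile[n - 1 - t];  // now safe to read another thread's write
}

// Hypothetical helper: reverses v (size 1..MAX_N) on the device.
// Returns false if the size is out of range or a CUDA call fails.
bool reverseOnDevice(std::vector<int>& v)
{
    const int n = static_cast<int>(v.size());
    if (n < 1 || n > MAX_N) return false;

    const size_t bytes = n * sizeof(int);
    int* d = 0;
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);

    reverseInBlock<<<1, MAX_N>>>(d, n);

    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();
    cudaFree(d);
    return err == cudaSuccess;
}
```

Without the barrier, a thread could read tile[n - 1 - t] before the thread responsible for that element has written it, which is the read-after-write hazard named above.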
Device Runtime Component
  Atomic functions
    – Guaranteed to be performed without interference from other threads
      The memory address is locked for the duration of the operation
    – Supported by devices of compute capability above 1.0
    – Mostly operate on integers only
    – Listed in Appendix C of the Programming Guide
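As an illustration of an atomic function, the following sketch builds a 256-bin histogram with atomicAdd on unsigned integers in global memory. The names histogramKernel and histogram256 are hypothetical, and the sketch assumes a device that supports global-memory atomics.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread increments the bin of its input byte. atomicAdd makes the
// read-modify-write indivisible, so colliding threads cannot lose updates.
__global__ void histogramKernel(const unsigned char* in, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[in[i]], 1u);
}

// Hypothetical helper: 256-bin histogram of the input bytes.
// Returns an empty vector if a CUDA call fails.
std::vector<unsigned int> histogram256(const std::vector<unsigned char>& bytes)
{
    const int n = static_cast<int>(bytes.size());
    std::vector<unsigned int> bins(256, 0u);
    if (n == 0) return bins;

    unsigned char* dIn = 0;
    unsigned int* dBins = 0;
    cudaMalloc((void**)&dIn, n);
    cudaMalloc((void**)&dBins, 256 * sizeof(unsigned int));
    cudaMemcpy(dIn, bytes.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, 256 * sizeof(unsigned int));

    histogramKernel<<<(n + 255) / 256, 256>>>(dIn, n, dBins);

    cudaMemcpy(bins.data(), dBins, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();
    cudaFree(dIn);
    cudaFree(dBins);
    if (err != cudaSuccess) bins.clear();
    return bins;
}
```

With a plain bins[in[i]]++ instead of atomicAdd, two threads hitting the same bin could both read the old count and one increment would be lost.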
Device Runtime Component
  Warp vote functions
    – Supported by devices of compute capability >= 1.2
    – Check a condition on all threads in a warp
      int __all(int predicate)
        true (non-zero) if predicate is true for all threads of the warp
      int __any(int predicate)
        true (non-zero) if predicate is true for any thread of the warp
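A sketch of a warp vote over one warp of 32 threads. It makes one assumption that goes beyond the slides: current CUDA toolkits spell these functions __all_sync and __any_sync, with an extra first argument naming the participating threads, and the sketch uses those spellings so that it compiles on recent toolkits. The names voteKernel and warpVote are hypothetical.

```cuda
#include <cuda_runtime.h>

// One warp (32 threads) votes on the predicate "my value is positive".
// Assumption: __all_sync/__any_sync are the current spellings of the slide's
// __all/__any; the first argument is a mask of the participating threads.
__global__ void voteKernel(const int* values, int* result)
{
    int predicate = values[threadIdx.x] > 0;
    int all = __all_sync(0xffffffffu, predicate);
    int any = __any_sync(0xffffffffu, predicate);
    if (threadIdx.x == 0) {
        result[0] = all;  // non-zero if every thread's predicate held
        result[1] = any;  // non-zero if at least one thread's predicate held
    }
}

// Hypothetical helper: runs the vote over 32 host values.
// Returns false if a CUDA call fails.
bool warpVote(const int* values32, bool* allPositive, bool* anyPositive)
{
    int *dValues = 0, *dResult = 0;
    int result[2] = { 0, 0 };
    cudaMalloc((void**)&dValues, 32 * sizeof(int));
    cudaMalloc((void**)&dResult, 2 * sizeof(int));
    cudaMemcpy(dValues, values32, 32 * sizeof(int), cudaMemcpyHostToDevice);

    voteKernel<<<1, 32>>>(dValues, dResult);

    cudaMemcpy(result, dResult, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();
    cudaFree(dValues);
    cudaFree(dResult);
    *allPositive = result[0] != 0;
    *anyPositive = result[1] != 0;
    return err == cudaSuccess;
}
```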
Device Runtime Component
  Texture functions: fetching textures, or texturing
    – Texture data may be stored in linear memory or CUDA arrays
    – Texturing from linear memory
      template<class Type>
      Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
      float tex1Dfetch(texture<Type, 1, cudaReadModeNormalizedFloat> texRef, int x);
Device Runtime Component
  Texture functions: fetching textures, or texturing
    – Texturing from linear memory
    – Type can be any of the supported 1-, 2- or 4-component vector types
      template<class Type>
      Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
      float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);
Device Runtime Component
  Texture functions: fetching textures, or texturing
    – Texturing from linear memory
    – No addressing modes supported
    – No texture filtering supported
Device Runtime Component
  Texture functions: fetching textures, or texturing
    – Texturing from CUDA arrays
      template<class Type, enum cudaTextureReadMode readMode>
      Type tex1D(texture<Type, 1, readMode> texRef, float x);
      template<class Type, enum cudaTextureReadMode readMode>
      Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
      template<class Type, enum cudaTextureReadMode readMode>
      Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);
Device Runtime Component
  Texture functions: fetching textures, or texturing
    – Texturing from CUDA arrays
    – Run-time attributes determine
      Coordinate normalization
      Addressing mode (clamp/wrap)
      Filtering
CUDA Runtime Component
  Common Component
  Device Component
  Host Component
Host Runtime Component
  Can only be used by host functions
  Composed of 2 APIs
    – High-level CUDA runtime API, which runs on top of
    – Low-level CUDA driver API
    No mixing: an application should use either one or the other.
  Each API provides functions for
    – Device management
    – Context management
    – Memory management
    – Code module management
    – Execution control
    – Texture reference management
    – OpenGL/Direct3D interoperability
Host Runtime Component
  The CUDA runtime API implicitly provides
    – Initialization
    – Context management
    – Module management
  The CUDA driver API does not, and is harder to program.
Host Runtime Component
  Recall: nvcc parses an input source file
    – Separates device and host code
    – Device code is compiled to a cubin object
    – Generated host code, in C, is compiled by an external tool
Host Runtime Component
  Generated host code
    – Is in C format
    – Includes the cubin object
  Applications may
    – Ignore the host code and run the cubin object directly using the low-level CUDA driver API
    – Link to the generated host code and launch it using the high-level CUDA runtime API
Host Runtime Component
  The CUDA driver API
    – Is harder to program
    – Offers greater control
    – Does not depend on C
    – Does not offer device emulation
Host Runtime Component
  CUDA runtime functions and other entry points are prefixed by cuda
  CUDA driver functions and other entry points are prefixed by cu
Host Runtime Component
  Device memory is always allocated as either of
    – Linear memory
    – CUDA arrays
Host Runtime Component - detour
  Linear memory in device
    – Contiguous segment of memory
    – 32-bit addresses
    – Can be referenced using pointers
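A minimal sketch of linear memory in use with the runtime API: the host allocates it with cudaMalloc, copies data in and out with cudaMemcpy, and the kernel addresses it through an ordinary pointer. The names scaleKernel and scaleOnDevice are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Linear memory is addressed through a plain pointer inside the kernel.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: multiplies every element of v by factor on the device.
// Returns false if v is empty or a CUDA call fails.
bool scaleOnDevice(std::vector<float>& v, float factor)
{
    const int n = static_cast<int>(v.size());
    if (n == 0) return false;

    const size_t bytes = n * sizeof(float);
    float* d = 0;
    cudaMalloc((void**)&d, bytes);                            // allocate linear memory
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);   // host -> device

    scaleKernel<<<(n + 255) / 256, 256>>>(d, n, factor);

    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaError_t err = cudaGetLastError();
    cudaFree(d);
    return err == cudaSuccess;
}
```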
Host Runtime Component - detour
  CUDA arrays
    – "Opaque" memory layout
    – 1D/2D/3D arrays of 1/2/4-component vectors of 8/16/32-bit integers or 16/32-bit floats
      16-bit floats from the driver API only
    – Optimized for texture fetching
    – Accessible from kernels through texture fetches only
Host Runtime Component
  Both the CUDA runtime and CUDA driver APIs
    – Can access device information
    – Enable the host to read/write to linear memory/CUDA arrays
      With support for pinned memory
    – Provide OpenGL/Direct3D interoperability
    – Provide management for asynchronous execution
Host Runtime Component
  Asynchronous functions
    – Kernel launches, and some others
    – Async memory copies
    – Device <-> device memory copies
    – Memory setting
  Concurrent execution of functions is managed through streams
Host Runtime Component
  Streams
    – A queue of operations
    – An application may have multiple stream objects simultaneously
    – kernel<<<Ng,Nb,Ns,S>>>
    – A kernel can be scheduled to execute on a stream
    – Some memory copy functions can also be queued on a stream
Host Runtime Component
  Streams
    – If no stream is specified, stream 0 is used by default.
    – Operations within a stream are executed in order
      The previous operation in the stream has to end before the next one begins
Host Runtime Component
  CUDA runtime and driver APIs provide execution control through stream management
    – <cu/cuda>StreamQuery()
      Is the stream free?
    – <cu/cuda>StreamSynchronize()
      Wait for the stream's operations to end
Host Runtime Component
  CUDA runtime and driver APIs provide execution control through stream management
    – cudaThreadSynchronize() / cuCtxSynchronize()
      Wait for all streams to be free
    – <cu/cuda>StreamDestroy()
      Wait for the stream to get free
      Destroy the stream
Host Runtime Component
  Accurate timing using events
    – CUevent/cudaEvent_t start, stop;
      <cu/cuda>EventCreate (&start);
      <cu/cuda>EventCreate (&stop);
    – Events have to be recorded
      <cu/cuda>EventRecord (start, 0); // asynchronous
      // stuff to time
      <cu/cuda>EventRecord (stop, 0); // asynchronous
    – The second argument is the stream
      Stream 0: the event is recorded after all preceding operations from all streams
      Stream N: the event is recorded after all preceding operations in stream N
Host Runtime Component
  Accurate timing using events
    – <cu/cuda>EventRecord (start, 0); // asynchronous
      // stuff to time
      <cu/cuda>EventRecord (stop, 0); // asynchronous
      <cu/cuda>EventSynchronize (stop);
      float time;
      <cu/cuda>EventElapsedTime (&time, start, stop);
    – As the call to record is asynchronous, the stop event has to be synchronized before reading the elapsed time
    – <cu/cuda>EventDestroy (start);
      <cu/cuda>EventDestroy (stop);
Host Runtime Component
  Asynchronous execution can get confusing
    – Can be switched off
    – Useful for debugging
    – Set the environment variable CUDA_LAUNCH_BLOCKING to 1
Host Runtime Component
  Device Initialization
    – CUDA Runtime API
      Automatically with the first function call
    – CUDA Driver API
      cuInit()
      MUST be called before calling any other API function
Host Runtime Component
  Device Management
    – cudaDeviceProp / CUdevice device;
    – int devCount;
      cudaGetDeviceCount (&devCount) / cuDeviceGetCount (&devCount)
    – for dev = 0 to devCount - 1 do
        cudaGetDeviceProperties / cuDeviceGet (&device, dev)
Host Runtime Component
  Device Management
    – cudaSetDevice()
      Sets the device to be used
      MUST be set before calling any __global__ function
      Device 0 is used by default
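The device-management calls above can be put together as in the following sketch, which uses the runtime API only. The helper name listDevicesAndSelectFirst is hypothetical. Device numbers start at 0, so the loop runs from 0 to devCount - 1.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints every CUDA device, then selects device 0.
// Returns the number of devices, or -1 if a CUDA call fails.
int listDevicesAndSelectFirst()
{
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess) return -1;

    for (int dev = 0; dev < devCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    dev, prop.name, prop.major, prop.minor);
    }

    // Explicit selection; device 0 would also be used by default.
    if (devCount > 0 && cudaSetDevice(0) != cudaSuccess) return -1;
    return devCount;
}
```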
Host Runtime Component
  Stream Management
    – CUstream / cudaStream_t st;
    – cudaStreamCreate (&st); / cuStreamCreate (&st, 0);
    – cudaStreamDestroy (st);
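A sketch of two streams working on the two halves of an array, using the runtime API. Each stream queues a copy to the device, a kernel, and a copy back; operations within one stream run in order, while the two streams are free to overlap. The names addOne and addOneInTwoStreams are hypothetical. The host buffer is allocated as pinned memory with cudaMallocHost, which is an assumption of this sketch made so that the asynchronous copies can actually run asynchronously.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void addOne(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: adds 1 to every element of v, using one stream per half.
// Returns false if v has fewer than 2 elements or a CUDA call fails.
bool addOneInTwoStreams(std::vector<int>& v)
{
    const int n = static_cast<int>(v.size());
    if (n < 2) return false;
    const int half = n / 2;
    const int count[2]  = { half, n - half };
    const int offset[2] = { 0, half };

    int *host = 0, *dev = 0;
    if (cudaMallocHost((void**)&host, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dev, n * sizeof(int)) != cudaSuccess) {
        cudaFreeHost(host);
        return false;
    }
    std::copy(v.begin(), v.end(), host);

    cudaStream_t st[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&st[s]);

    for (int s = 0; s < 2; ++s) {
        const size_t bytes = count[s] * sizeof(int);
        cudaMemcpyAsync(dev + offset[s], host + offset[s], bytes,
                        cudaMemcpyHostToDevice, st[s]);
        addOne<<<(count[s] + 255) / 256, 256, 0, st[s]>>>(dev + offset[s], count[s]);
        cudaMemcpyAsync(host + offset[s], dev + offset[s], bytes,
                        cudaMemcpyDeviceToHost, st[s]);
    }

    // Wait for each stream's queue to drain before touching the results.
    cudaError_t err = cudaSuccess;
    for (int s = 0; s < 2; ++s) {
        cudaError_t e = cudaStreamSynchronize(st[s]);
        if (e != cudaSuccess) err = e;
    }
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err == cudaSuccess) std::copy(host, host + n, v.begin());

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(st[s]);  // takes the handle by value
    cudaFree(dev);
    cudaFreeHost(host);
    return err == cudaSuccess;
}
```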
Host Runtime Component
  Event management
    – CUevent/cudaEvent_t start, stop;
      <cu/cuda>EventCreate (&start);
      <cu/cuda>EventCreate (&stop);
      <cu/cuda>EventRecord (start, 0); // asynchronous
      // stuff to time
      <cu/cuda>EventRecord (stop, 0); // asynchronous
      <cu/cuda>EventSynchronize (stop);
      float time;
      <cu/cuda>EventElapsedTime (&time, start, stop);
      <cu/cuda>EventDestroy (start);
      <cu/cuda>EventDestroy (stop);
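The event sequence above, written out for the runtime API as a minimal sketch. The kernel busyKernel and the helper timeKernelMs are hypothetical names; the kernel only exists to give the events something to time. cudaEventElapsedTime reports the time in milliseconds.

```cuda
#include <cuda_runtime.h>

// Placeholder workload: any kernel could stand here.
__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: returns the kernel's elapsed time in milliseconds,
// or -1 if a CUDA call fails. Assumes n > 0.
float timeKernelMs(int n)
{
    float* d = 0;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                       // asynchronous
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);      // stuff to time
    cudaEventRecord(stop, 0);                        // asynchronous
    cudaEventSynchronize(stop);                      // wait until stop is recorded

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Skipping cudaEventSynchronize would let the host ask for the elapsed time before the stop event has been recorded, in which case cudaEventElapsedTime reports an error instead of a time.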
All for today
  Next time
    – More on the host runtime APIs
      Memory, stream, event, texture management
      Debug mode for runtime API
      Context, module, execution control for driver API
    – Performance & Optimization
  See you next week!