Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Dec 18, 2015

Page 1: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Programming with CUDA, WS 08/09

Lecture 7, Thu, 13 Nov, 2008

Page 2: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Previously

CUDA Runtime Component
– Common Component
    Built-in vector types
    Math functions
    Timing
    Textures
      – Texture fetch
      – Texture reference
      – Texture read modes
      – Normalized texture coordinates
      – Linear texture filtering

Page 3: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Today

CUDA Runtime Component
– Common Component
– Device Component
– Host Component

Page 4: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

CUDA Runtime Component
Common Component
Device Component
Host Component

Page 5: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Can only be used in device code
Math functions
– Faster, less accurate versions of functions from the common component
– Named __<common_function_name>, e.g. logf and __logf
– See Appendix B of the Programming Guide
– To use fast math by default
    Compiler option -use_fast_math
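The trade-off between the two families of math functions can be made concrete with a small comparison. The following is a minimal sketch, not part of the lecture: the kernel name `logBothWays` and the host helper `maxLogDifference` are made up for illustration, and the helper assumes `n > 0`. It evaluates `logf` and `__logf` on the same inputs and reports the largest difference.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Each thread computes the natural log of one input twice: once with the
// common-component function and once with the device-only intrinsic.
__global__ void logBothWays(const float* in, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);    // standard single-precision log
        fast[i]     = __logf(in[i]);  // faster, less accurate intrinsic
    }
}

// Hypothetical helper: returns the largest absolute difference between the
// two results for the inputs 1, 2, ..., n, or -1.0f if a CUDA call fails.
float maxLogDifference(int n)
{
    std::vector<float> hIn(n), hAccurate(n), hFast(n);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i + 1);

    const size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dAccurate = nullptr, *dFast = nullptr;
    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dAccurate, bytes) == cudaSuccess
           && cudaMalloc(&dFast, bytes) == cudaSuccess
           && cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        logBothWays<<<blocks, threads>>>(dIn, dAccurate, dFast, n);
        // cudaMemcpy waits for the kernel to finish before copying.
        ok = cudaMemcpy(hAccurate.data(), dAccurate, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(hFast.data(), dFast, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dAccurate); cudaFree(dFast);
    if (!ok) return -1.0f;

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(hAccurate[i] - hFast[i]));
    return maxDiff;
}
```

Calling `maxLogDifference(1024)` from `main` gives a small non-negative number on a working device. When the file is compiled with `-use_fast_math`, the compiler maps `logf` to `__logf` as well, so the reported difference should drop to zero.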

Page 6: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Synchronization function: __syncthreads()
– Synchronizes all threads in a block
– Avoids read-after-write, write-after-read and write-after-write hazards for commonly accessed shared memory
– Dangerous to use in conditionals, unless the condition evaluates identically for every thread in the block
    Code hangs / unwanted effects
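A typical use of the barrier is to separate a phase in which threads write shared memory from a phase in which they read slots written by other threads. The sketch below illustrates this with an in-block array reversal; `BLOCK_SIZE`, `reverseInBlock` and the host helper `reverseWorks` are illustrative names, not taken from the lecture.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 64

// Reverses BLOCK_SIZE integers in place with a single thread block.
__global__ void reverseInBlock(int* data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;

    tile[t] = data[t];   // every thread writes one shared-memory slot

    // Unconditional barrier: all threads of the block reach it, so no thread
    // reads a slot before its owner has written it (read-after-write hazard).
    __syncthreads();

    data[t] = tile[BLOCK_SIZE - 1 - t];   // read a slot written by another thread
}

// Hypothetical helper: reverses 0..BLOCK_SIZE-1 on the device and returns
// true if the result is BLOCK_SIZE-1..0.
bool reverseWorks()
{
    int host[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) host[i] = i;

    int* dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseInBlock<<<1, BLOCK_SIZE>>>(dev);
        ok = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);

    for (int i = 0; ok && i < BLOCK_SIZE; ++i)
        ok = (host[i] == BLOCK_SIZE - 1 - i);
    return ok;
}
```

The barrier sits outside any `if`. Placing it inside a branch that only some threads of the block take is the situation the slide warns about.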

Page 7: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Atomic functions
– Guaranteed to be performed without interference from other threads
    Memory address is locked
– Supported by devices of compute capability above 1.0 (1.1 and higher)
– Mostly operate on integers only
– See Appendix C of the Programming Guide
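As an illustration of why atomics matter, consider many threads incrementing a few shared counters. The following minimal sketch builds a small histogram with `atomicAdd`; the kernel `countRemainders` and the helper `remainderHistogram` are hypothetical names chosen for this example, and the helper assumes `n > 0` and `numBins > 0`.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void countRemainders(const int* values, int n, int* bins, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Many threads may hit the same bin; atomicAdd makes each
        // read-modify-write indivisible, so no increment is lost.
        atomicAdd(&bins[values[i] % numBins], 1);
    }
}

// Hypothetical helper: histogram of i % numBins for i = 0..n-1.
// Returns an empty vector if a CUDA call fails.
std::vector<int> remainderHistogram(int n, int numBins)
{
    std::vector<int> hValues(n), hBins(numBins, 0);
    for (int i = 0; i < n; ++i) hValues[i] = i;

    int *dValues = nullptr, *dBins = nullptr;
    bool ok = cudaMalloc(&dValues, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dBins, numBins * sizeof(int)) == cudaSuccess
           && cudaMemcpy(dValues, hValues.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(dBins, 0, numBins * sizeof(int)) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        countRemainders<<<(n + threads - 1) / threads, threads>>>(dValues, n, dBins, numBins);
        ok = cudaMemcpy(hBins.data(), dBins, numBins * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dValues); cudaFree(dBins);
    if (!ok) hBins.clear();
    return hBins;
}
```

With `remainderHistogram(1000, 4)` each of the four bins should hold 250. Replacing `atomicAdd` with a plain `bins[...] += 1` would typically lose updates, because concurrent threads would overwrite each other's increments.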

Page 8: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Warp vote functions
– Supported by devices of compute capability 1.2 and higher
– Check a condition on all threads in a warp
    int __all (int predicate)
      true (non-zero) if predicate is true for all warp threads
    int __any (int predicate)
      true (non-zero) if predicate is true for any warp thread
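A short sketch of a warp vote follows. Note one deliberate substitution: it uses `__all_sync` and `__any_sync`, the variants with an explicit lane mask that newer CUDA toolkits provide in place of `__all` and `__any`, so that it compiles on current hardware. The names `VoteResult`, `voteOnPositive` and `voteOnFirstK` are made up for this example.

```cuda
#include <cuda_runtime.h>

struct VoteResult { int all; int any; bool ok; };

// Intended to be launched with exactly one warp (32 threads).
__global__ void voteOnPositive(const int* values, int* result)
{
    int lane = threadIdx.x;
    int predicate = values[lane] > 0;

    const unsigned fullMask = 0xffffffffu;       // all 32 lanes take part
    int all = __all_sync(fullMask, predicate);   // non-zero if true on every lane
    int any = __any_sync(fullMask, predicate);   // non-zero if true on some lane

    if (lane == 0) { result[0] = all; result[1] = any; }
}

// Hypothetical helper: the first k of 32 values are positive, the rest zero.
VoteResult voteOnFirstK(int k)
{
    int hValues[32];
    for (int i = 0; i < 32; ++i) hValues[i] = (i < k) ? 1 : 0;
    int hResult[2] = {0, 0};

    int *dValues = nullptr, *dResult = nullptr;
    bool ok = cudaMalloc(&dValues, sizeof(hValues)) == cudaSuccess
           && cudaMalloc(&dResult, sizeof(hResult)) == cudaSuccess
           && cudaMemcpy(dValues, hValues, sizeof(hValues),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        voteOnPositive<<<1, 32>>>(dValues, dResult);
        ok = cudaMemcpy(hResult, dResult, sizeof(hResult),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dValues); cudaFree(dResult);

    VoteResult r = { hResult[0], hResult[1], ok };
    return r;
}
```

For `k = 32` both votes are non-zero, for `k = 5` only the "any" vote is, and for `k = 0` neither is.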

Page 9: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texture data may be stored in linear memory or CUDA arrays
– Texturing from linear memory
    template<class Type>
    Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
    float tex1Dfetch(texture<Type, 1, cudaReadModeNormalizedFloat> texRef, int x);

Page 10: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from linear memory
– Type can be any of the supported 1-, 2- or 4-component vector types
    template<class Type>
    Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
    float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

Page 11: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from linear memory
– No addressing modes supported
– No texture filtering supported

Page 12: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from CUDA arrays
    template<class Type, enum cudaTextureReadMode readMode>
    Type tex1D(texture<Type, 1, readMode> texRef, float x);
    template<class Type, enum cudaTextureReadMode readMode>
    Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
    template<class Type, enum cudaTextureReadMode readMode>
    Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);

Page 13: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from CUDA arrays
– Run-time attributes determine
    Coordinate normalization
    Addressing mode (clamp/wrap)
    Filtering

Page 14: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

CUDA Runtime Component
Common Component
Device Component
Host Component

Page 15: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Can only be used by host functions
Composed of 2 APIs
– High-level CUDA runtime API, which runs on top of
– Low-level CUDA driver API
No mixing: an application should use either one or the other.

Page 16: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Each API provides functions for
– Device management
– Context management
– Memory management
– Code module management
– Execution control
– Texture reference management
– OpenGL/Direct3D interoperability

Page 17: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
The CUDA runtime API implicitly provides
– Initialization
– Context management
– Module management
The CUDA driver API does not, and is harder to program.

Page 18: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Recall: nvcc parses an input source file
– Separates device and host code
– Device code is compiled to a cubin object
– Generated host code in C is compiled by an external tool

Page 19: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Generated host code
– Is in C format
– Includes the cubin object
Applications may
– Ignore the host code and run the cubin object directly using the low-level CUDA driver API
– Link to the generated host code and launch it using the high-level CUDA runtime API

Page 20: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
The CUDA driver API
– Is harder to program
– Offers greater control
– Does not depend on C
– Does not offer device emulation

Page 21: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
CUDA runtime functions and other entry points are prefixed by cuda
CUDA driver functions and other entry points are prefixed by cu

Page 22: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component - detour
Device memory is always allocated as either of
– Linear memory
– CUDA arrays

Page 23: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component - detour
Linear memory in device
– Contiguous segment of memory
– 32-bit addresses
– Can be referenced using pointers

Page 24: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component - detour
CUDA arrays
– "Opaque" memory layout
– 1D/2D/3D arrays of 1/2/4-component vectors of 8/16/32-bit integers or 16/32-bit floats
    16-bit floats from driver API only
– Optimized for texture fetching
– Accessible from kernels through texture fetches only

Page 25: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Both the CUDA runtime and CUDA driver APIs
– Can access device information
– Enable the host to read/write to linear memory/CUDA arrays
    With support for pinned memory

Page 26: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Both the CUDA runtime and CUDA driver APIs
– Can access device information
– Enable the host to read/write to linear memory/CUDA arrays
    With support for pinned memory
– Provide OpenGL/Direct3D interoperability
– Provide management for asynchronous execution

Page 27: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Asynchronous functions
– Kernel launches, and some others
– Async memory copies
– Device <-> device memory copies
– Memory setting
Concurrent execution of functions is managed through streams

Page 28: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Streams
– A queue of operations
– An application may have multiple stream objects simultaneously
– kernel<<<Ng,Nb,Ns,S>>>
– A kernel can be scheduled to execute on a stream
– Some memory copy functions can also be queued on a stream

Page 29: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Streams
– If no stream is specified, stream 0 is used by default.
– Operations within a stream are executed in order
    Previous operations in the stream have to end before a new one begins
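To make the stream model concrete, the sketch below splits one array into two halves and queues a copy-in, a kernel and a copy-out for each half on its own stream. It is only an illustration under stated assumptions: `addOne` and `incrementInTwoStreams` are hypothetical names, `n >= 2` is assumed, and error checking is reduced to the final synchronization for brevity.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: increments n zeros, one half in each of two streams.
// Returns true if every element ends up as 1.
bool incrementInTwoStreams(int n)
{
    if (n < 2) return false;
    const size_t bytes = n * sizeof(float);

    float *hData = nullptr, *dData = nullptr;
    // Pinned (page-locked) host memory lets the copies run asynchronously.
    if (cudaMallocHost(&hData, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dData, bytes) != cudaSuccess) { cudaFreeHost(hData); return false; }
    for (int i = 0; i < n; ++i) hData[i] = 0.0f;

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);  // unchecked for brevity

    const int half = n / 2;
    const int threads = 256;
    for (int s = 0; s < 2; ++s) {
        const int offset = s * half;
        const int count = (s == 0) ? half : n - half;
        const size_t chunk = count * sizeof(float);
        // These three operations are queued on stream s and run in order there;
        // work queued on the other stream may overlap with them.
        cudaMemcpyAsync(dData + offset, hData + offset, chunk,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(count + threads - 1) / threads, threads, 0, streams[s]>>>(dData + offset, count);
        cudaMemcpyAsync(hData + offset, dData + offset, chunk,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    bool ok = true;
    for (int s = 0; s < 2; ++s) {
        // Block the host until everything queued on stream s has finished.
        ok = (cudaStreamSynchronize(streams[s]) == cudaSuccess) && ok;
        cudaStreamDestroy(streams[s]);
    }
    for (int i = 0; ok && i < n; ++i) ok = (hData[i] == 1.0f);

    cudaFree(dData);
    cudaFreeHost(hData);
    return ok;
}
```

The fourth launch parameter selects the stream, matching the `kernel<<<Ng,Nb,Ns,S>>>` form. Whether the two streams actually overlap depends on the device.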

Page 30: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
CUDA runtime and driver APIs provide execution control through stream management
– <cu/cuda>StreamQuery()
    Is stream free?
– <cu/cuda>StreamSynchronize()
    Wait for stream operations to end

Page 31: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
CUDA runtime and driver APIs provide execution control through stream management
– cudaThreadSynchronize() / cuCtxSynchronize()
    Wait for all streams to be free
– <cu/cuda>StreamDestroy()
    Wait for stream to get free
    Destroy stream
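The difference between querying and synchronizing can be sketched with the runtime API as follows. The kernel `scale` and the helper `pollUntilDone` are hypothetical names for this illustration, and `n > 0` is assumed. The host polls `cudaStreamQuery()`, which returns `cudaErrorNotReady` while work is still pending, instead of blocking in `cudaStreamSynchronize()`.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: launches a kernel on its own stream and polls until
// the stream is free. Returns the number of polls that found the stream
// busy, or -1 if a CUDA call fails.
int pollUntilDone(int n)
{
    float* dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(dData, 0, n * sizeof(float));

    cudaStream_t st;
    if (cudaStreamCreate(&st) != cudaSuccess) { cudaFree(dData); return -1; }

    const int threads = 256;
    scale<<<(n + threads - 1) / threads, threads, 0, st>>>(dData, n, 2.0f);

    int busyPolls = 0;
    cudaError_t status;
    // "Is the stream free?" The host could do useful work inside this loop.
    while ((status = cudaStreamQuery(st)) == cudaErrorNotReady)
        ++busyPolls;

    cudaStreamDestroy(st);
    cudaFree(dData);
    return (status == cudaSuccess) ? busyPolls : -1;
}
```

The returned count varies from run to run, since it depends on how quickly the kernel finishes relative to the host loop.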

Page 32: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Accurate timing using events
–   CUevent/cudaEvent_t start, stop;
    <cu/cuda>EventCreate (&start); // cuEventCreate takes an additional flags argument
    <cu/cuda>EventCreate (&stop);
– Events have to be recorded
    <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
– Stream 0: the event is recorded after all preceding operations from all streams have completed
– Stream N: the event is recorded after all preceding operations in stream N have completed

Page 33: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Accurate timing using events
–   <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
    <cu/cuda>EventSynchronize (stop);
    float time;
    <cu/cuda>EventElapsedTime (&time, start, stop);
– As the call to record is asynchronous, the event has to be synchronized before timing
–   <cu/cuda>EventDestroy (start);
    <cu/cuda>EventDestroy (stop);
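Written out with the runtime API only, the timing pattern might look like the sketch below. The kernel `busyKernel` and the helper `timeKernelMs` are placeholders invented for this example, and `n > 0` is assumed. The reported time is in milliseconds.

```cuda
#include <cuda_runtime.h>

__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = sqrtf(data[i] + 1.0f);
}

// Hypothetical helper: returns the kernel's elapsed time in milliseconds,
// or -1.0f if a CUDA call fails.
float timeKernelMs(int n)
{
    float* dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(dData, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    cudaEventRecord(start, 0);                                   // asynchronous
    busyKernel<<<(n + threads - 1) / threads, threads>>>(dData, n);  // stuff to time
    cudaEventRecord(stop, 0);                                    // asynchronous

    // The record calls return immediately, so wait until 'stop' has actually
    // been recorded before asking for the elapsed time.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

Omitting the `cudaEventSynchronize(stop)` call would make `cudaEventElapsedTime` fail with a "not ready" error whenever the kernel has not finished yet.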

Page 34: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Asynchronous execution can get confusing
– Can be switched off
– Useful for debugging
– Set the environment variable CUDA_LAUNCH_BLOCKING to 1

Page 35: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Device Initialization
– CUDA Runtime API
    Automatically with first function call
– CUDA Driver API
    cuInit()
    MUST be called before calling any other API function
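For the driver API, explicit initialization could be sketched as follows. The helper name `driverDeviceCount` is hypothetical, and the file is assumed to be linked against the driver library (for example with `-lcuda`).

```cuda
#include <cuda.h>
#include <cstdio>

// Hypothetical helper: initializes the driver API, prints the name of every
// device and returns the device count, or -1 if a driver call fails.
int driverDeviceCount()
{
    // Must come before any other driver API call; the flags argument is 0.
    if (cuInit(0) != CUDA_SUCCESS) return -1;

    int count = 0;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS) return -1;

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256];
        if (cuDeviceGet(&dev, i) == CUDA_SUCCESS &&
            cuDeviceGetName(name, sizeof(name), dev) == CUDA_SUCCESS)
            std::printf("Device %d: %s\n", i, name);
    }
    return count;
}
```

A runtime API program needs no counterpart to `cuInit()`: its first `cuda...` call performs the initialization implicitly.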

Page 36: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Device Management
– cudaDeviceProp / CUdevice device;
– int devCount;
    cudaGetDeviceCount (&devCount) / cuDeviceGetCount (&devCount)
– for dev = 0 to devCount - 1 do
    cudaGetDeviceProperties / cuDeviceGet (&device, dev)
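With the runtime API, the enumeration loop above might be written as in this minimal sketch. The helper name `listDevices` and the choice of printed fields are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints a summary line per device and returns the
// device count, or -1 if the count cannot be obtained.
int listDevices()
{
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess) return -1;

    // Device indices run from 0 to devCount - 1.
    for (int dev = 0; dev < devCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("Device %d: %s, compute capability %d.%d, %zu bytes of global memory\n",
                    dev, prop.name, prop.major, prop.minor, prop.totalGlobalMem);
    }
    return devCount;
}
```

The compute capability printed here is the number that the slides on atomic and warp vote functions refer to.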

Page 37: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Device Management
– cudaSetDevice()
    Sets the device to be used
    MUST be set before calling any __global__ function
    Device 0 used by default

Page 38: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Stream Management
– CUstream / cudaStream_t st;
– cudaStreamCreate (&st); / cuStreamCreate (&st, 0);
– cudaStreamDestroy (st); / cuStreamDestroy (st);

Page 39: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Accurate timing using events
–   <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
    <cu/cuda>EventSynchronize (stop);
    float time;
    <cu/cuda>EventElapsedTime (&time, start, stop);
– As the call to record is asynchronous, the event has to be synchronized before timing
–   <cu/cuda>EventDestroy (start);
    <cu/cuda>EventDestroy (stop);

Page 40: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

Host Runtime Component
Event management
–   CUevent/cudaEvent_t start, stop;
    <cu/cuda>EventCreate (&start);
    <cu/cuda>EventCreate (&stop);
    <cu/cuda>EventRecord (start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord (stop, 0); // asynchronous
    <cu/cuda>EventSynchronize (stop);
    float time;
    <cu/cuda>EventElapsedTime (&time, start, stop);
    <cu/cuda>EventDestroy (start);
    <cu/cuda>EventDestroy (stop);

Page 41: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

All for today

Next time
– More on the host runtime APIs
    Memory, stream, event, texture management
    Debug mode for runtime API
    Context, module, execution control for driver API
– Performance & Optimization

Page 42: Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

See you next week!