CUDA Architecture Overview
Transcript
1. CUDA Architecture Overview

2. PROGRAMMING ENVIRONMENT

3. CUDA APIs
   An API allows the host to manage the device: allocate memory & transfer data, launch kernels.
   CUDA C Runtime API: high level of abstraction - start here!
   CUDA C Driver API: more control, more verbose.
   (OpenCL: similar to the CUDA C Driver API.)

4. CUDA C and OpenCL
   OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C.
   Both share back-end compiler and optimization technology.

5. Processing Flow (PCI Bus)
   1. Copy input data from CPU memory to GPU memory.

6. Processing Flow (PCI Bus)
   1. Copy input data from CPU memory to GPU memory.
   2. Load GPU program and execute, caching data on chip for performance.

7. Processing Flow (PCI Bus)
   1. Copy input data from CPU memory to GPU memory.
   2. Load GPU program and execute, caching data on chip for performance.
   3. Copy results from GPU memory to CPU memory.

8. CUDA Parallel Computing Architecture
   Parallel computing architecture and programming model.
   Includes a CUDA C compiler and support for OpenCL and DirectCompute.
   Architected to natively support multiple computational interfaces (standard languages and APIs).

9. C for CUDA: C with a few keywords

   Standard C code:

      void saxpy_serial(int n, float a, float *x, float *y)
      {
          for (int i = 0; i < n; ++i)
              y[i] = a*x[i] + y[i];
      }

      // Invoke serial SAXPY kernel
      saxpy_serial(n, 2.0, x, y);

   Parallel C code:

      __global__ void saxpy_parallel(int n, float a, float *x, float *y)
      {
          int i = blockIdx.x*blockDim.x + threadIdx.x;
          if (i < n) y[i] = a*x[i] + y[i];
      }

      // Invoke parallel SAXPY kernel with 256 threads/block
      int nblocks = (n + 255) / 256;
      saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

10. CUDA Parallel Computing Architecture
    CUDA defines a programming model, a memory model, and an execution model.
    CUDA uses the GPU, but is for general-purpose computing; it facilitates heterogeneous computing: CPU + GPU.
    CUDA is scalable: it scales to run on 100s of cores and 1000s of parallel threads.

11. Compiling CUDA C Applications (Runtime API)
    (diagram: a mixed C/CUDA source file - serial_function(), other_function(), and the key kernel saxpy_serial(), modified into parallel CUDA code - is split by NVCC; the kernels are compiled by NVCC (Open64) into CUDA object files, the rest of the C application is compiled by the CPU compiler into CPU object files, and the linker combines both into a CPU-GPU executable)

12. PROGRAMMING MODEL: CUDA Kernels
    The parallel portion of an application executes as a kernel; the entire GPU executes the kernel, with many threads.
    CUDA threads: lightweight, fast switching, 1000s execute simultaneously.
    CPU = host: executes functions. GPU = device: executes kernels.

13. CUDA Kernels: Parallel Threads
    A kernel is a function executed on the GPU as an array of threads, in parallel:

      float x = input[threadID];
      float y = func(x);
      output[threadID] = y;

    All threads execute the same code, but can take different paths.
    Each thread has an ID, used to select input/output data and to make control decisions.

14. CUDA Kernels: Subdivide into Blocks

15. CUDA Kernels: Subdivide into Blocks
    Threads are grouped into blocks.

16. CUDA Kernels: Subdivide into Blocks
    Threads are grouped into blocks.
    Blocks are grouped into a grid.

17. CUDA Kernels: Subdivide into Blocks
    Threads are grouped into blocks.
    Blocks are grouped into a grid.
    A kernel is executed as a grid of blocks of threads.

18. CUDA Kernels: Subdivide into Blocks (GPU diagram)
    Threads are grouped into blocks.
    Blocks are grouped into a grid.
    A kernel is executed as a grid of blocks of threads.

19. Communication Within a Block
    Threads may need to cooperate: memory accesses, sharing results.
    They cooperate using shared memory, which is accessible by all threads within a block.
    Restricting cooperation to within a block permits scalability: fast communication between N threads is not feasible when N is large.
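To make slide 19 concrete, the sketch below shows threads in one block cooperating through shared memory: each thread stages one element, the block synchronizes, and each thread then reads an element written by a neighbour. This example is not from the deck; the kernel name, the 256-thread tile size, and the assumption that the array length is a multiple of the tile size are all illustrative.

   #define TILE 256

   // Illustrative: each block reverses one 256-element tile of the array in place.
   __global__ void reverse_tile(float *data)
   {
       __shared__ float tile[TILE];             // one copy per block, visible to all its threads
       int i = blockIdx.x * blockDim.x + threadIdx.x;

       tile[threadIdx.x] = data[i];             // each thread stages one element
       __syncthreads();                         // wait until every thread in the block has written

       data[i] = tile[TILE - 1 - threadIdx.x];  // read an element another thread wrote
   }

A launch such as reverse_tile<<<n / TILE, TILE>>>(d_data) gives every block its own tile. Note that __syncthreads() only spans one block; no primitive synchronizes across blocks, which is exactly the restriction slide 19 describes.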
20. Transparent Scalability: G84
    (diagram: the grid's twelve blocks execute a few at a time on a small GPU)

21. Transparent Scalability: G80
    (diagram: the same twelve blocks execute more at a time on a larger GPU)

22. Transparent Scalability: GT200
    (diagram: all twelve blocks execute at once on a still larger GPU, leaving some cores idle)
    Because blocks never depend on running concurrently, the same grid runs unchanged on each device; blocks are simply scheduled onto however many cores are available.

23. CUDA Programming Model - Summary
    A kernel executes as a grid of thread blocks.
    A block is a batch of threads; they communicate through shared memory.
    Each block has a block ID; each thread has a thread ID.
    (diagram: the host launches Kernel 1 as a 1D grid of blocks 0-3 on the device, and Kernel 2 as a 2D grid of blocks (0,0)-(1,3))

24. MEMORY MODEL: Memory hierarchy
    Thread: registers.

25. Memory hierarchy
    Thread: registers.
    Thread: local memory.

26. Memory hierarchy
    Thread: registers.
    Thread: local memory.
    Block of threads: shared memory.

27. Memory hierarchy
    Thread: registers.
    Thread: local memory.
    Block of threads: shared memory.

28. Memory hierarchy
    Thread: registers.
    Thread: local memory.
    Block of threads: shared memory.
    All blocks: global memory.

29. Memory hierarchy
    Thread: registers.
    Thread: local memory.
    Block of threads: shared memory.
    All blocks: global memory.

30. Additional Memories
    The host can also allocate textures and arrays of constants.
    Textures and constants have dedicated caches.
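As a companion to slide 23, here is a minimal sketch (mine, not the deck's) of launching a 2D grid like Kernel 2's and reading block IDs inside the kernel; the kernel name, buffer name, and block size are assumptions.

   __global__ void record_block_id(int *ids)
   {
       // In a 2D grid, the block ID has .x and .y components.
       int bid = blockIdx.y * gridDim.x + blockIdx.x;   // flatten the 2D block ID
       if (threadIdx.x == 0)                            // one thread per block records its ID
           ids[bid] = bid;
   }

   void launch_2d_grid(int *d_ids)
   {
       dim3 grid(4, 2);                      // 2 rows x 4 columns of blocks, as on slide 23
       record_block_id<<<grid, 64>>>(d_ids); // 64 threads per block
   }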
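To tie the hierarchy of slides 24-29 to source code, this annotated kernel (an illustration of mine, assuming blocks of at most 256 threads) marks where each declaration lives.

   __global__ void where_data_lives(const float *in, float *out)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar automatics live in registers

       float scratch[2];                 // per-thread array; large or dynamically indexed
                                         // arrays are placed in per-thread local memory
       __shared__ float block_buf[256];  // shared memory: one copy per block

       scratch[0] = in[i];               // in/out point into global memory, visible to all blocks
       block_buf[threadIdx.x] = scratch[0];
       __syncthreads();

       out[i] = block_buf[threadIdx.x];
   }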
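Slide 30's "arrays of constants" correspond to __constant__ memory in CUDA C. A minimal sketch, with the coefficient array, its size, and the function names all assumed:

   __constant__ float coeffs[16];   // constant memory, read through its dedicated cache

   __global__ void scale(float *y, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) y[i] = coeffs[0] * y[i];   // every thread reads the same cached constant
   }

   void fill_constants_and_launch(float *d_y, int n)
   {
       float h_coeffs[16] = { 2.0f };        // host-side values; remaining entries are zero
       cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));  // host fills constant memory
       scale<<<(n + 255) / 256, 256>>>(d_y, n);
   }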
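Finally, the three-step processing flow of slides 5-7, expressed as host code that reuses the saxpy_parallel kernel from slide 9. A minimal sketch: the wrapper name is assumed, and error checking is omitted for brevity.

   #include <cuda_runtime.h>

   void saxpy_on_gpu(int n, float a, float *x, float *y)
   {
       float *d_x, *d_y;
       size_t bytes = n * sizeof(float);

       // 1. Copy input data from CPU memory to GPU memory (across the PCI bus).
       cudaMalloc(&d_x, bytes);
       cudaMalloc(&d_y, bytes);
       cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
       cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

       // 2. Load the GPU program and execute, caching data on chip for performance.
       int nblocks = (n + 255) / 256;
       saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

       // 3. Copy results from GPU memory back to CPU memory.
       cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
       cudaFree(d_x);
       cudaFree(d_y);
   }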