ME964 High Performance Computing for Engineering Applications
CUDA Memory Model & CUDA API
Sept. 16, 2008
Before we get started…
Last Time:
- Traced back the evolution of the GPU
- GPGPU and the step forward taken by CUDA
- CUDA-related nomenclature
- Memory layout of a typical NVIDIA GPU

Today:
- The CUDA API
- Start discussing the CUDA programming model
- A look at a matrix multiplication example
The CUDA Access Situation
- You can install CUDA on your computer even if you don't have a GPU card
- You can do 95% of your HW2 without needing a GPU
- Cards to be installed this afternoon in 1235ME
- CAE doesn't want to make Visual Studio 2005 available (they use Visual Studio 2008)
- I'm looking into opening up my lab in case 1235ME doesn't prove to be an option
- Linux accounts available at UIUC on a GPU-based supercomputer; see the Forum posting for details
HW2: A word on getting started with CUDA

After you unzip the emailed assignment file, you should get a collection of files:
- Double click helloworld.sln to get started
- The directory Linux contains a makefile and the required files to get you going with this OS
- NOTE: readme.doc contains the text of the assignment
Execution Configuration: Grids and Blocks
A kernel is executed as a grid of blocks of threads
- All threads share the global memory space
- A block [of threads] is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Threads from two different blocks cannot cooperate!!!
  - This has important software design implications
(A short sketch of intra-block cooperation follows the figure below.)
[Figure: the host launches Kernel 1 as Grid 1, a 3 x 2 array of blocks, and Kernel 2 as Grid 2; Block (1, 1) is expanded to show its 5 x 3 array of threads. Courtesy: NVIDIA]
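As promised, here is a minimal sketch of intra-block cooperation (the kernel name and the fixed shared array size are our own assumptions, not from the slides): threads stage data in shared memory, synchronize, and then read entries written by other threads in the same block.

__global__ void ReverseEachBlock(float* d, int n)
{
    __shared__ float s[256];                         // per-block shared memory; assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = d[i];                           // assumes n is a multiple of blockDim.x
    __syncthreads();                                 // all writes visible: hazard-free from here on
    d[i] = s[blockDim.x - 1 - threadIdx.x];          // read a value another thread wrote
}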
Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D

Why this 2D and 3D layout? It simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on subdomains
- ...
(See the indexing sketch after the figure below.)
[Figure: the same grid/block/thread diagram as on the previous slide. Courtesy: NVIDIA]
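For instance, a hedged sketch (the kernel name and parameters are ours) of how a thread combines its block and thread IDs to pick its element of a width x height array:

__global__ void ScaleKernel(float* data, int width, int height, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x spans the row direction
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y spans the column direction
    if (col < width && row < height)                   // guard partial blocks at the edges
        data[row * width + col] *= s;                  // row-wise storage
}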
CUDA Device Memory Space Overview
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
(A sketch of these spaces in code follows the note below.)
[Figure: CUDA device memory spaces. Each thread has its own registers and local memory; each block has a shared memory; the grid as a whole has global, constant, and texture memories.]
The host can R/W global, constant, and texture memory
IMPORTANT NOTE: Global, constant, and texture memory spaces are persistent across kernels called by the same host application.
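A hedged sketch (all names are ours) of how these memory spaces appear in code, with the constant initialized from the host via cudaMemcpyToSymbol:

__constant__ float cGain;                  // per-grid constant memory, read-only on the device
__device__   float gOffset = 1.0f;         // per-grid global memory, persistent across kernels

__global__ void AffineKernel(float* gData, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a per-thread register
    if (i < n)
        gData[i] = cGain * gData[i] + gOffset;
}

// On the host:  float gain = 2.0f;
//               cudaMemcpyToSymbol(cGain, &gain, sizeof(float));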
Global, Constant, and Texture Memories (Long Latency Accesses by Host)

Global memory
- Main means of communicating R/W data between host and device
- Contents visible to all threads

Texture and constant memories
- Constants initialized by the host
- Contents visible to all threads
[Figure: the same device memory diagram as on the previous slide. Courtesy: NVIDIA]
NOTE: We will not emphasize texture memory in this class.
End: Memory Layout on the GPU
Begin: CUDA API
What is an API?
Application Programming Interface (API)
- A set of functions, procedures, or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
- Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.

A cooked-up analogy (for the mechanical engineer)
- Think of a car: a series of pedals, gauges, a steering wheel, etc. This would be its Device Operating Interface (DOI)

In this context, CUDA is the API that enables you to tap into the computational resources of the NVIDIA GPU
- This is what replaced the old GPGPU way of programming the hardware
Overview
- CUDA programming model: basic concepts and data types
- CUDA application programming interface: the basics
- A simple example to illustrate basic concepts and functionality
- Performance features will be covered later
Talking about the API: The CUDA Software Stack
The figure on the original slide indicates where the API fits in the picture: the application sits on top of the CUDA libraries, the CUDA runtime, and the CUDA driver, which in turn talk to the GPU. Each API layer is indicated by a thick red line.
- Dealing with the CUDA Driver API is tedious
- We'll only discuss the CUDA Runtime API, which handles all the dirty laundry for you (under the hood, it might deal with the CUDA Driver)
- Examples of CUDA libraries: CUDA FFT and CUDA BLAS
CUDA Highlights: Easy and Lightweight
The entire CUDA API is an extension of the ANSI C programming language
- Low learning curve

The hardware is designed to enable a lightweight runtime and driver
- High performance
Here we go…
CUDA Device Memory Allocation
[Figure: the same device memory diagram as on earlier slides; cudaMalloc() and cudaFree() act on global memory.]
cudaMalloc()
- Allocates an object in the device's global memory
- Requires two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object

cudaFree()
- Frees an object from device global memory
- Takes the pointer to the freed object
A Small Detour: A Matrix Data Type
- NOT part of CUDA, but it will be frequently used in many code examples
- 2D matrix with single precision float elements
- width * height elements
- pitch is meaningful when the matrix is actually a sub-matrix of another matrix
- Matrix entries are attached to the pointer-to-float member called "elements"
- Matrix is stored row-wise

typedef struct {
    int width;        // number of columns
    int height;       // number of rows
    int pitch;        // allocated row width; can exceed width for sub-matrices
    float* elements;  // entries, stored row-wise
} Matrix;
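Given the row-wise storage, entry (row, col) lives at offset row * pitch + col. A minimal sketch (the helper name is ours, not part of the slides):

float GetElement(const Matrix M, int row, int col)
{
    return M.elements[row * M.pitch + col];   // pitch, not width, strides between rows
}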
CUDA Device Memory Allocation (cont.)
Code example:
- Allocate a 64 * 64 single precision float array
- Attach the allocated storage to Md.elements
- "d" is often used to indicate a device data structure

const int BLOCK_SIZE = 64;
Matrix Md;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);

cudaMalloc((void**)&Md.elements, size);
// ... use Md on the device ...
cudaFree(Md.elements);
All the details are spelled out in the CUDA Programming Guide 1.1 (see the resources section of the class website).
VERY USEFUL, PLEASE READ…
CUDA Host-Device Data Transfer
cudaMemcpy()
- Memory data transfer
- Requires four parameters (in this order):
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer: host to host, host to device, device to host, or device to device
[Figure: the same device memory diagram as on earlier slides; cudaMemcpy() moves data between the host and the device's global memory.]
CUDA Host-Device Data Transfer (cont.)
Code example:
- Transfer a 64 * 64 single precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);
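Putting the last few slides together, a minimal host-side round trip (a sketch under the slides' assumptions; M.elements is presumed already allocated and filled on the host, and M's dimensions set):

Matrix M, Md;                                                      // M: host copy, Md: device copy
int size = M.width * M.height * sizeof(float);

cudaMalloc((void**)&Md.elements, size);                            // device allocation
cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice); // upload
// ... launch a kernel that reads/writes Md.elements ...
cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost); // download results
cudaFree(Md.elements);                                             // release device memory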
CUDA Function Declarations
                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together
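A brief hedged sketch of the qualifiers in use (all names are ours):

__host__ __device__ float Square(float x) { return x * x; }  // compiled for both CPU and GPU

__device__ float DevDouble(float x) { return 2.0f * x; }     // device only, called from device code

__global__ void TransformKernel(float* v, int n)             // kernel: returns void, launched by the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = Square(v[i]) + DevDouble(v[i]);
}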
CUDA Function Declarations (cont.)
__device__ functions can’t have their address taken
For functions executed on the device: No recursion No static variable declarations inside the function No variable number of arguments
Something like printf would not work…
20HK-UIUC
Calling a Kernel Function, and the Concept of Execution Configuration

A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);    // declaration

dim3   DimGrid(100, 50);            // 5000 thread blocks
dim3   DimBlock(4, 8, 8);           // 256 threads per block
size_t SharedMemBytes = 64;         // 64 bytes of shared memory

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed if the host must block until the kernel completes.
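For instance, a hedged sketch of blocking the host after a launch (useful before timing a kernel or reading back results; cudaThreadSynchronize() is the CUDA 1.x runtime's synchronization call):

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
cudaThreadSynchronize();   // host blocks here until the kernel has finished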