Overview: Graphics Processing Units
● advent of GPUs
● GPU architecture
  ■ the NVIDIA Fermi processor
● the CUDA programming model
  ■ simple example, threads organization, memory model
  ■ case study: matrix multiply
  ■ memories, thread synchronization, scheduling
  ■ case study: reductions
  ■ performance considerations: bandwidth, scheduling, resource conflicts, instruction mix
    ◆ host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs
● the OpenCL programming model
● directive-based programming models
● refs: CUDA Toolkit Documentation; An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page; Programming Massively Parallel Processors, Kirk & Hwu, Morgan Kaufmann, 2010; CUDA by Example, Sanders & Kandrot; OpenCL web page; OpenCL in Action, Matthew Scarpino

COMP4300/8300 L21,22: Graphics Processing Units 2017 (slide 1)
CUDA Program: Simple Example
● reverse an array (reverseArray.cu)

    __global__ void reverseArray(int *a_d, int N) {
        int idx = threadIdx.x;
        int v = a_d[N-idx-1]; a_d[N-idx-1] = a_d[idx]; a_d[idx] = v;
    }
    #define N (1<<16)
    int main() { // may not dereference a_d!
        int a[N], *a_d, a_size = N * sizeof(int);
        ...
        cudaMalloc((void **) &a_d, a_size);
        cudaMemcpy(a_d, a, a_size, cudaMemcpyHostToDevice);
        reverseArray<<<1, N/2>>>(a_d, N);
        cudaThreadSynchronize(); // wait till threads finish
        cudaMemcpy(a, a_d, a_size, cudaMemcpyDeviceToHost);
        cudaFree(a_d); ...
    }

● cf. OpenMP on a normal multicore: style; practicality?

    #pragma omp parallel num_threads(N/2) default(shared)
    {   int idx = omp_get_thread_num();
        int v = a[N-idx-1]; a[N-idx-1] = a[idx]; a[idx] = v;
    }
CUDA Thread Organization and Memory Model
● a 2×1 grid with 2×1 blocks (figure courtesy Real World Tech.)
● the memory model (left in the figure) reflects that of the GPU
● a 2×2 grid with 4×2×2 blocks (figure courtesy NCSC)
● each thread computes one element of C, C(i,j)
● invocation with W×W thread blocks (assume W | N, i.e. W divides N)
  ■ why better than using an N×N thread block? (2 reasons, both important!)
● for thread (tx, ty) of block (bx, by), i = by·W + ty and j = bx·W + tx
(figure courtesy xfig)
CUDA Matrix Multiply: Implementation
● kernel (note: matrices are stored column-major):

    __global__ void matMult(int N, int K, double *A_d, double *B_d, double *C_d) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        double cij = C_d[i + j*N];
        for (int k = 0; k < K; k++)
            cij += A_d[i + k*N] * B_d[k + j*K];
        C_d[i + j*N] = cij;
    }

● main program: needs to allocate device versions of A, B & C (A_d, B_d, and C_d) and cudaMemcpy() host versions into them
● invocation with W×W thread blocks (assume W | N):

    dim3 dimG(N/W, N/W);
    dim3 dimB(W, W); // in the kernel, blockDim.x == W
    matMult<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);

● what if N % W > 0? add if (i < N && j < N) to the kernel and declare

    dim3 dimG((N+W-1)/W, (N+W-1)/W);

  ■ note: the SIMD nature of the SPs ⇒ cycles for both branches of the if are consumed
CUDA Memories and Thread Synchronization
● GPUs can potentially suffer still more from the memory wall
  ■ DRAM access may still be 100s of cycles
  ■ bandwidth is limited for load/store-intensive kernels
● the shared memory is on-chip (hence very fast)
  ■ the __shared__ type qualifier may be used to denote a (fixed-size) array allocated in shared memory
● threads within a block can synchronize via the __syncthreads() intrinsic (efficient – why?)
  ■ (SM-level) atomic instructions can enforce data consistency within a block
● note: there is no way to synchronize between blocks, or safely ensure data consistency across blocks
  ■ this can only be done across separate kernel invocations
● threads (tx,0) ... (tx,W−1) all access B(k, bx·W+tx); threads (0,ty) ... (W−1,ty) all access A(by·W+ty, k)
  ■ high ratio of load to FP instructions
  ■ harder to hide L1 cache latencies; strains memory bandwidth
● the kernel can be improved by utilizing SM shared memory:

    __global__ void matMult_s(int N, int K, double *A_d, double *B_d, double *C_d) {
        __shared__ double A_s[W][W], B_s[W][W];
        int ty = threadIdx.y, tx = threadIdx.x;
        int i = blockIdx.y*W + ty, j = blockIdx.x*W + tx;
        double cij = C_d[i + j*N];
        for (int k = 0; k < K; k += W) {
            A_s[ty][tx] = A_d[i + (k+tx)*N];
            B_s[ty][tx] = B_d[(k+ty) + j*K];
            __syncthreads();
            for (int w = 0; w < W; w++)
                cij += A_s[ty][w] * B_s[w][tx];
            __syncthreads(); // can this be avoided?
        }
        C_d[i + j*N] = cij;
    }
GPU Scheduling – Warps
● the GigaThread scheduler assigns the (independently executable) thread blocks to each SM
● each block is divided into groups (of 32) called warps
  ■ grouping occurs in linear order by tx + bx·ty + bx·by·tz, where bx, by here denote the block's x and y dimensions (the figure shows a warp size of 4)
● the warp scheduler determines which warps are ready to run
  ■ with 32-thread warps, suitable block sizes range from 4×8 to 16×16
  ■ SIMT: each SP executes the next instruction SIMD-style (note: this requires only a single instruction fetch!)
● thus, a kernel with enough blocks can scale across a GPU with any number of cores
(figures courtesy NVIDIA)
Reductions and Thread Divergence
● threads within a single 1D block summing A[0..N−1]:
(figure not recovered)

● goal: keep the SPs' FPUs fully occupied doing useful operations
  ■ every other kind of instruction (loads, address calculations, branches) hinders this!
● matrix multiply revisited:
  ■ strategy 1: 'unroll' the k loop:

    for (int k = 0; k < K; k += 2)
        cij += A_d[i + k*N] * B_d[k + j*K]
             + A_d[i + (k+1)*N] * B_d[k+1 + j*K];

    halves the loop index increments & branches
  ■ strategy 2: each thread computes a 2×2 'tile' of C instead of a single element
    reduces load instructions; reduces branches by 4× – but may require 4× the registers!
    also increases thread granularity: may help if K is not large
Host-Device Issues: Multiple GPUs, NVLink, and Unified Memory
● transfer of data between host and device is error-prone and potentially a performance bottleneck (what if the array for an advection solver could not fit in GPU memory?)
● the problem is exacerbated when multiple GPUs are connected to one host
  ■ we can select the required device by cudaSetDevice():

    cudaSetDevice(0);
    cudaMalloc(&a_d, n); cudaMemcpy(a_d, a, n, ...);
    reverseArray<<<1, n/2>>>(a_d, n);
    cudaThreadSynchronize();
    cudaMemcpyPeer(a_b, 0, b_d, 1, n);
    cudaSetDevice(1);
    reverseArray<<<1, n/2>>>(b_d, n);

● fast interconnects such as NVLink will reduce the transfer costs (e.g. the Sierra system)
● CUDA's Unified Memory will improve programmability (and in some cases, performance)
  ■ cudaMallocManaged(&a, n); allocates the array on the host so that it can migrate, page-by-page, to/from the GPU(s) transparently and on demand
● alternatively, have the device and CPU use the same memory, as on AMD's APU for Exascale Computing
The Open Compute Language for Devices and Regular Cores
● open standard – not proprietary like CUDA; based on C (no C++)
● design philosophy: treat GPUs and CPUs as peers; data- and task-parallel compute model
● similar execution model to CUDA:
  ■ NDRange (CUDA grid): operates on global data; units within cannot synchronize
  ■ WorkGroup (CUDA block): units within can use local data (CUDA __shared__) and synchronize
  ■ WorkItem (CUDA thread): independent unit of execution, also has private data
● example kernel:

    __kernel void reverseArray(__global int *a_d, int N) {
        int idx = get_global_id(0);
        int v = a_d[N-idx-1]; a_d[N-idx-1] = a_d[idx]; a_d[idx] = v;
    }

● recall that in CUDA, we could launch as reverseArray<<<1, N/2>>>(a_d, N), but in OpenCL...
OpenCL Kernel Launch
● must explicitly create a device handle, compute context and work-queue, load and compile the kernel, and finally enqueue it for execution:

    clGetDeviceIDs(..., CL_DEVICE_TYPE_GPU, 1, &device, ...);
    context = clCreateContext(0, 1, &device, ...);
    queue = clCreateCommandQueue(context, device, ...);
    program = clCreateProgramWithSource(context, "reverseArray.cl", ...);
    clBuildProgram(program, 1, &device, ...);
    reverseArray_k = clCreateKernel(program, "reverseArray", ...);
    clSetKernelArg(reverseArray_k, 0, sizeof(cl_mem), &a_d);
    clSetKernelArg(reverseArray_k, 1, sizeof(int), &N);
    cnDimension = 1; cnBlockSize = N/2;
    clEnqueueNDRangeKernel(queue, reverseArray_k, 1, 0,
                           &cnDimension, &cnBlockSize, 0, 0, 0);

● note: CUDA device code is compiled into .cubin intermediate files, and the CUDA host code follows a similar sequence under the hood
● for usage on a normal core (CL_DEVICE_TYPE_CPU), a WorkItem corresponds to an item in a work queue that a number of (kernel-level) threads get work from
  ■ the compiler may aggregate these to reduce overheads
Directive-Based Programming Models
● OpenACC enables us to specify which code is to run on a device, and how to transfer data to/from it:

    #pragma acc parallel loop copyin(a,b) copy(c)

  ■ the data directive may be used to specify data placement across kernels
  ■ the code can also be compiled to run across multiple CPUs
● OpenMP 4.0 operates similarly; for the above example:

    #pragma omp target map(to:A[0:N*K],B[0:N*K]) map(tofrom:C[0:N*N])
    #pragma omp parallel for default(shared)

● studies on complex applications where all data must be kept on the device indicate a productivity gain and a performance loss of ≈ 2× relative to CUDA (e.g. Zhe14)