GPU Programming with CUDA
Jan Lemeire
Dept. ETRO
November 6th 2008
Parallel Systems Course: Chapter IV
Overview
1. CUDA-enabled GPU architecture
2. Programming for GPUs
3. How a CUDA program runs
4. Optimizing CUDA programs
5. Analysis & Conclusions
1. CUDA-enabled GPU architecture
Utilization of the graphics card processor (GPU) for high-performance computing
Via NVIDIA's CUDA API:
http://www.nvidia.com/object/cuda_home.html
The PC graphics market largely subsidizes the development of these GPGPUs (General-Purpose computation on GPUs)
Cards that support CUDA (Link 1): the GeForce 8, 9 and 200 series
Goal of chapter
Understand the benefits & disadvantages of the technology:
if you have to decide whether or not a new technology should be introduced,
you should understand the consequences!
Why Are GPUs So Fast?
The GPU is specialized for math-intensive, highly parallel computation, so more transistors can be devoted to data processing rather than data caching and flow control.
Commodity industry: provides economies of scale.
Competitive industry: fuels innovation.
[Figure: CPU vs. GPU chip layout: the CPU spends its transistors on control logic and cache, the GPU on many ALUs; both attach to DRAM.]
© NVIDIA Corporation 2007
G80 GPU Computing
Processors execute computing threads
Thread Execution Manager issues threads
128 Thread Processors
Parallel Data Cache accelerates processing
[Figure: G80 block diagram: the host and Input Assembler feed the Thread Execution Manager, which issues threads to eight groups of Thread Processors, each group with its Parallel Data Caches; all access Global Memory through load/store.]
© NVIDIA Corporation 2007
Goal: Scaling the Architecture
Same program, scalable performance
[Figure: the same block diagram at two scales: a GPU with eight groups of Thread Processors next to one with only two; the same program runs on both.]
© NVIDIA Corporation 2007
Graphics Programming Model
Graphics Application → Vertex Program → Rasterization → Fragment Program → Display
© NVIDIA Corporation 2007
What’s Wrong With GPGPU?
[Figure: the graphics pipeline again, zooming in on the fragment/pixel program: input registers, constants, texture and temporary registers feed a program that can only write output registers.]
APIs are specific to graphics
Limited texture size and dimension
Limited shader outputs
No scatter
Limited instruction set
No thread communication
Limited local storage
Building a Better Pixel Thread
[Figure: a thread program with registers, constants, texture access, a thread number, and output registers.]
Features
Millions of instructions
Full integer and bit instructions
No limits on branching, looping
1D, 2D, or 3D thread ID allocation

Global Memory
[Figure: the thread program now reads and writes Global Memory directly.]
Features
Fully general load/store to GPU memory
Untyped, not fixed texture types
Pointer support

Parallel Data Cache
[Figure: threads additionally share a Parallel Data Cache.]
Features
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers
© NVIDIA Corporation 2007
Hardware Implementation: Memory Architecture
The local, global, constant, and texture spaces are regions of device memory.
Each multiprocessor has:
A set of 32-bit registers per processor
On-chip shared memory, where the shared memory space resides
A read-only constant cache, to speed up access to the constant memory space
A read-only texture cache, to speed up access to the texture memory space
[Figure: a device as a set of multiprocessors 1..N, each with M processors and their registers, an instruction unit, shared memory, a constant cache and a texture cache, all connected to off-chip device memory.]
© NVIDIA Corporation 2007
Example Fluid Algorithm
CPU: a single thread computes each Pn’ = P1+P2+P3+P4 out of the cache.
GPGPU: several programs compute in parallel, but sharing data requires multiple passes through video memory.
GPU Computing with CUDA: threads compute in parallel and share their data (P1..P5) through the on-chip Parallel Data Cache.
[Figure: the three designs side by side: Control/ALU with cache and DRAM (CPU); several Control/ALU pairs around video memory (GPGPU); the Thread Execution Manager, ALUs and shared data in the Parallel Data Cache over DRAM (CUDA).]
2. Programming for GPUs
CUDA: Programming GPU in C
Philosophy: provide the minimal set of extensions necessary to expose the power

Declaration specifiers to indicate where things live:
__global__ void KernelFunc(...);  // kernel callable from host
__device__ void DeviceFunc(...);  // function callable on device
__device__ int GlobalVar;         // variable in device memory
__shared__ int SharedVar;         // shared in PDC by thread block

Extended function invocation syntax for parallel kernel launch:
KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each

Special variables for thread identification in kernels:
dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;  dim3 gridDim;

Intrinsics that expose specific operations in kernel code:
__syncthreads();                  // barrier synchronization within kernel
© NVIDIA Corporation 2007
CUDA: Runtime support
Explicit memory allocation returns pointers to GPU memory:
cudaMalloc(), cudaFree()
Explicit memory copy for host ↔ device, device ↔ device:
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management:
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability:
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
© NVIDIA Corporation 2007
Example: Vector Addition Kernel
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
© NVIDIA Corporation 2007
Example: Host code for memory
// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
© NVIDIA Corporation 2007
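The slide stops at the kernel launch; for completeness, a minimal sketch of the remaining steps, assuming a hypothetical host buffer h_C for the result:

// copy the result back from device to host memory
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

// free device and host memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);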
CUDA SDK
[Figure: the CUDA toolchain: integrated CPU and GPU C source code goes into the NVIDIA C compiler, which produces NVIDIA assembly for computing (GPU) and CPU host code for a standard C compiler; around it sit libraries (FFT, BLAS, …), example source code, the CUDA driver, and a debugger and profiler.]
Example program: matrix multiplication in one block
__global__ void matrixMultiplicationInOneBlock(float *inputA, float *inputB,
                                               float *output, int size)
{
    // allocate shared memory for the maximal matrix size
    __shared__ float matrixA[512], matrixB[512];
    float result = 0.0f;
    const int tx = threadIdx.x, ty = threadIdx.y;

    // each thread loads one element of each input matrix into shared memory
    int position = ty * size + tx;
    matrixA[position] = inputA[position];
    matrixB[position] = inputB[position];
    __syncthreads();   // wait until both matrices are fully loaded

    for (int i = 0; i < size; i++)
        result += matrixA[ty * size + i] * matrixB[i * size + tx];

    output[position] = result;
}
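A hedged launch sketch for this kernel (d_A, d_B and d_C are assumed device pointers): since both shared arrays hold 512 floats and a block contains at most 512 threads, the whole matrix must fit in a single block, i.e. size × size ≤ 512.

dim3 block(size, size);   // one thread per matrix element
matrixMultiplicationInOneBlock<<<1, block>>>(d_A, d_B, d_C, size);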
3. How a CUDA program runs
Threads: grouped in blocks & warps
A block of threads is executed on the same multiprocessor; its threads use the same shared memory (16 KB) and can be synchronized.
A block is divided into warps, which are run ‘together’.
One multiprocessor can run 4 thread blocks in parallel.
The warp size is 32: 32 threads are executed in SIMD fashion on the 8 cores of the multiprocessor, which keeps the deep FPU pipelines full; issuing a memory or arithmetic operation takes 4 cycles per warp.
A 32-bit ActiveMask is used: a bit for every running thread in a warp.
CUDA Scalable Execution Model
A hierarchy of threads: threads execute a kernel in blocks; blocks are organized in a grid.
Threads within a block cooperate: they share on-chip memory in the PDC, with barrier synchronization.
Blocks within a grid are independent: blocks run to completion in unspecified order; no global sync, no per-block mutex.
Guarantees scalable execution!
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of thread blocks, and Kernel 2 on Grid 2; one block, Block (1, 1), is shown as a 5×3 array of threads.]
© NVIDIA Corporation 2006
How thread blocks are partitioned
Thread blocks are partitioned into warps; thread IDs within a warp are consecutive and increasing.
Warp 0 starts with thread ID 0.
For a 2D block: ThreadID = threadIdx.x + blockDim.x * threadIdx.y
The partitioning is always the same, so you can use this knowledge in control flow (covered next).
However, DO NOT rely on any ordering between warps.
If there are any dependencies between threads, you must __syncthreads() to get correct results.
© NVIDIA Corporation 2007
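As a small sketch of the partitioning rule above (not from the original slides), a thread can derive its warp from its linearized ID:

int tid  = threadIdx.x + blockDim.x * threadIdx.y;  // linear thread ID in a 2D block
int warp = tid / 32;   // warps take consecutive IDs: warp 0 holds IDs 0..31
int lane = tid % 32;   // position within the warp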
A quick review
device = GPU = set of multiprocessors
multiprocessor = set of processors & shared memory
kernel = GPU program
grid = array of thread blocks that execute a kernel
thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached   Access       Who
Local      Off-chip   No       Read/write   One thread
Shared     On-chip    N/A      Read/write   All threads in a block
Global     Off-chip   No       Read/write   All threads + host
Constant   Off-chip   Yes      Read         All threads + host
Texture    Off-chip   Yes      Read         All threads + host
© NVIDIA Corporation 2006
Quick terminology review
Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
The unit of parallelism in CUDA
Note difference from CPU threads: creation cost, resource usage, and switching cost of GPU threads is much smaller
Warp: a group of threads executed physically in parallel (SIMD)
Half-warp: the first or second half of a warp of threads
Thread Block: a group of threads that are executed together and can share memory on a single multiprocessor
Grid: a group of thread blocks that execute a single CUDA program logically in parallel
© NVIDIA Corporation 2007
Device Runtime Component: Synchronization Function
void __syncthreads();
Synchronizes all threads in a block.
Once all threads have reached this point, execution resumes normally.
Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory.
Allowed in conditional code only if the conditional is uniform across the entire thread block.
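A minimal sketch (hypothetical kernel, assuming 256 threads per block) of the read-after-write hazard the barrier prevents: each thread reads a shared-memory element written by its neighbour, which is only safe once all writes are visible.

__global__ void rotate(float *in, float *out)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = in[threadIdx.x];                // every thread writes one element
    __syncthreads();                                   // all writes now visible to the block
    out[threadIdx.x] = buf[(threadIdx.x + 1) % 256];   // safely read a neighbour's element
}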
Thread divergence in a SIMD processor
Thread divergence is supported by the hardware!
For example: if (x < 5) y = 5; else y = -5;
The SIMD hardware performs the three steps in turn:
the condition x < 5 is evaluated by all threads of the warp;
y = 5; is executed only on the threads for which x < 5 holds;
y = -5; is executed on all the others.
Only when threads in the same warp do the same thing => effective parallelism.
Even more general: instruction predication.
© NVIDIA Corporation 2006
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
© NVIDIA Corporation 2006
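A hedged sketch (hypothetical kernel) contrasting the two branch patterns above:

__global__ void branches(float *y)
{
    int t = threadIdx.x;
    if (t > 2)          // divergent: threads 0..2 and 3..31 of a warp take different paths
        y[t] = 5.0f;
    if (t / 32 > 2)     // warp-aligned: all 32 threads of a warp take the same path
        y[t] = -5.0f;
}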
Instruction Predication
Comparison instructions set condition codes (CC)
Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
Compiler tries to predict if a branch condition is likely to produce many divergent warps
If guaranteed not to diverge: only predicates if < 4 instructions
If not guaranteed: only predicates if < 7 instructions
May replace branches with instruction predication
ALL predicated instructions take execution cycles:
those with false conditions don’t write their output, and don’t invoke memory loads and stores;
this saves branch instructions, so it can be cheaper than serializing divergent paths.
© NVIDIA Corporation 2006
Memory Instruction Latency
Memory instructions take 4 cycles per warp to issue:
global and local memory loads / stores (not cached),
constant and texture loads (cached),
shared memory reads / writes.

Example:
__shared__ float shared[];
__device__ float global[];
shared[threadIdx.x] = global[threadIdx.x];

4 cycles to issue the read from global (device) memory, 4 cycles to issue the write to shared memory.
400–600 cycles to read a float from global (device) memory!
But this latency can be hidden by scheduling independent math instructions, or even other loads / stores, if there are enough active threads.
© NVIDIA Corporation 2006
Arithmetic Instruction Latency
int and float add, shift, min, max and float mul, mad: 4 cycles per warp
int multiply (*) is by default 32-bit
requires multiple cycles / warp
Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply
Integer divide and modulo are more expensive
Compiler will convert literal power-of-2 divides to shifts
But we have seen it miss some cases
Be explicit in cases where compiler can’t tell that divisor is a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
© NVIDIA Corporation 2006
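A short illustrative fragment of the two tricks above (a, b, x and n are assumed ints, n a power of 2); __mul24() is the intrinsic named on the slide:

int prod = __mul24(a, b);   // 24-bit integer multiply: 4 cycles per warp
int rem  = x & (n - 1);     // same result as x % n, without the expensive modulo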
Arithmetic Instruction Latency
Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
These are the versions prefixed with “__”
Examples:__rcp(), __sin(), __exp()
Other functions are combinations of the above
y / x == rcp(x) * y takes 20 cycles per warp
sqrt(x) == rcp(rsqrt(x)) takes 32 cycles per warp
Latency Hiding for Memory Accesses
During global to shared memory copying
During shared memory reads
Keep the multiprocessors busy with a huge number of threads.
One multiprocessor can simultaneously execute multiple thread blocks of at most 512 threads each.
The number of resident blocks is limited by the amount of shared memory and registers needed by each thread.
Note: the GPU communicates with the CPU via the relatively slow PCI Express bus (500 Mb/s).
4. Optimizing CUDA programs
Optimizing CUDA (Mark Harris, AstroGPU 2007)
CUDA is fast and efficient
CUDA enables efficient use of the massive parallelism of NVIDIA GPUs
Direct execution of data-parallel programs, without the overhead of a graphics API
Using CUDA on Tesla GPUs can provide large speedups on data-parallel computations straight out of the box!
Even higher speedups are achievable by understanding and tuning for GPU architecture
This presentation covers general performance, common pitfalls, and useful strategies
CUDA Optimization Strategies
Optimize Algorithms for the GPU
Optimize Memory Access Coherence
Take Advantage of On-Chip Shared Memory
Use Parallelism Efficiently
Optimize Algorithms for the GPU
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Sometimes it’s better to recompute than to cache: the GPU spends its transistors on ALUs, not memory.
Do more computation on the GPU to avoid costly data transfers
Even low parallelism computations can sometimes be faster than transferring back and forth to host
Optimize Memory Coherence
Coalesced vs. non-coalesced access to global / local device memory = an order of magnitude
Optimize for spatial locality in cached texture memory
In shared memory, avoid high-degree bank conflicts
Coalesced Access: Reading floats
[Figure: threads t0–t15 reading consecutive floats from addresses 128–188; access is coalesced both when all threads participate and when some threads do not participate.]
Uncoalesced Access: Reading floats
[Figure: two uncoalesced patterns for threads t0–t15: permuted access by threads, and a misaligned starting address (not a multiple of 64).]
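A hedged sketch (hypothetical kernel) of the patterns in these figures; on G80 hardware, coalescing requires the k-th thread of a half-warp to access the k-th word of an aligned segment:

__global__ void accessPatterns(float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = in[i];        // coalesced: consecutive threads read consecutive addresses
    float b = in[2 * i];    // uncoalesced on G80: stride-2 reads break the pattern
    out[i] = a + b;
}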
Take Advantage of Shared Memory
Hundreds of times faster than global memory.
Threads can cooperate via shared memory.
Use one / a few threads to load / compute data shared by all threads.
Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing (matrix transpose sketch below).
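A minimal sketch of that staging idea, assuming a square width × width matrix with width a multiple of 16, launched with 16×16-thread blocks; the tile is padded by one column to avoid the bank conflicts discussed later:

__global__ void transpose(float *in, float *out, int width)
{
    __shared__ float tile[16][17];                        // +1 column avoids bank conflicts
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;                    // swap the block coordinates
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}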
Use Parallelism Efficiently
Partition your computation to keep the GPU multiprocessors equally busy
Many threads, many thread blocks
Keep resource usage low enough to support multiple active thread blocks per multiprocessor
Registers, shared memory
Optimizing threads per block
Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps.
More threads per block == better memory latency hiding.
But more threads per block == fewer registers per thread; kernel invocations can fail if too many registers are used.
Heuristics:
Minimum: 64 threads per block, and only if there are multiple concurrent blocks.
128 to 256 threads is a better choice, and usually still leaves enough registers to compile and invoke successfully.
This all depends on your computation: experiment! (A configuration sketch follows.)
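Applied to the earlier vecAdd kernel, a hedged configuration sketch (the kernel then needs an if (i < N) guard when N is not a multiple of the block size):

int threadsPerBlock = 256;                                   // a multiple of the warp size
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // round up to cover all N
vecAdd<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C);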
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy.
Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently.
Minimize occupancy requirements by minimizing latency.
Maximize occupancy by optimizing threads per multiprocessor.
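A worked example under G80’s limit of 24 resident warps per multiprocessor (assuming registers and shared memory permit): blocks of 256 threads contain 256 / 32 = 8 warps each, so three resident blocks give 24 / 24 = 100% occupancy.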
Parameterize Your Application
Parameterization helps adaptation to different GPUs.
GPUs vary in many ways:
number of multiprocessors,
memory bandwidth,
shared memory size,
register file size,
threads per block.
You can even make apps self-tuning (like FFTW and ATLAS)
“Experiment” mode discovers and saves optimal configuration
Wavefront algorithm
About wavefront parallelism: see the exercises.
A 512×512 image divided into 8×8 blocks => 64 × 64 = 4096 blocks.
On a GTX 280: 240 cores => 30 multiprocessors.
Conclusion: enough blocks to keep all multiprocessors busy.
Parallel Memory Architecture
In a parallel machine, many threads access memory, so memory is divided into banks; this is essential to achieve high bandwidth.
Each bank can service one address per cycle; a memory can service as many simultaneous accesses as it has banks.
Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized.
[Figure: shared memory drawn as banks 0–15.]
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast case:
if all threads of a half-warp access different banks, there is no bank conflict;
if all threads of a half-warp read the identical address, there is no bank conflict (broadcast).
The slow case:
bank conflict: multiple threads in the same half-warp access the same bank;
the accesses must be serialized;
cost = max # of simultaneous accesses to a single bank.
Bank Addressing Examples
[Figure: conflict-free patterns: linear addressing with stride 1 (thread i → bank i), and a random 1:1 permutation of threads 0–15 onto banks 0–15.]
[Figure: conflicting patterns: linear addressing with stride 2 gives 2-way bank conflicts; linear addressing with stride 8 gives 8-way bank conflicts.]
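An illustrative fragment of the two strides above, assuming G80’s 16 banks of 32-bit words:

__shared__ float s[512];
float a = s[threadIdx.x];        // stride 1: thread i hits bank i mod 16, conflict-free
float b = s[2 * threadIdx.x];    // stride 2: threads i and i+8 hit the same bank, 2-way conflict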
Unrolling Last Steps
Only one warp is active during the last few steps.
Unroll them and remove the unneeded __syncthreads().

// t = threadIdx.x, bd = blockDim.x, data points to shared memory
for (unsigned int s = bd/2; s > 32; s >>= 1)
{
    if (t < s)
        data[t] += data[t + s];
    __syncthreads();
}
// within the final warp the threads run in lockstep, so no barrier is needed
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8)  data[t] += data[t + 8];
if (t < 4)  data[t] += data[t + 4];
if (t < 2)  data[t] += data[t + 2];
if (t < 1)  data[t] += data[t + 1];
© NVIDIA Corporation 2006
CUDA Optimization Priorities
Memory coalescing is the #1 priority: the highest bang-for-the-buck optimization; optimize for locality.
Take advantage of shared memory: very high bandwidth; threads can cooperate to save work.
Use parallelism efficiently: keep the GPU busy at all times; high arithmetic / bandwidth ratio; many threads & thread blocks.
Leave bank conflicts for last! 4-way and smaller conflicts are usually not worth avoiding if avoiding them costs more instructions.
5. Analysis & Conclusions
Strategy
Light-weight threads, supported by the hardware: thread processors, up to 96 threads per processor.
A context switch can happen in 1 cycle!
No caching mechanism, branch prediction, …: the GPU does not try to be efficient for every program, and does not spend transistors on optimization.
Simple straightforward sequential programming should be abandoned…
Less higher-level memory:
GPU: 16 KB shared memory per SIMD multiprocessor;
CPU: the L2 cache contains several MBs.
Massive floating-point computation power.
Transparent system organization, whereas modern (sequential) CPUs present themselves as a simple Von Neumann architecture.
Strategy II
Don't write explicitly threaded code; the compiler handles it => no chance of deadlocks or race conditions.
Think differently: analyze the data instead of the algorithm.
In contrast, with modern superscalar CPUs the programmer writes sequential (single-threaded) code and the processor tries to execute it in parallel, through pipelining etc. (instruction-level parallelism). But because of data and resource dependencies, no further speedup can be reached beyond 4-way superscalar CPUs; about 1.5 instructions per cycle seems to be the maximum.
Link 1: white paper
Results
Performance doubling every 6 months!
1000s of threads possible!
High bandwidth.
The PCI Express bus (the GPU–CPU connection) is the bottleneck.
Enormous possibilities for latency hiding.
Matrix multiplication runs 13 times faster on a standard GPU (GeForce 8500 GT) than on a state-of-the-art CPU (Intel dual core),
and 200 times faster on a high-end GPU (50 times when compared with a quad-core CPU).
Low threshold:
C, good documentation, many examples, easy to install, automatic card detection, easy compilation.
How to get maximal performance, or call it ... limitations
Create many threads, make them ‘aggressively’ parallel.
Keep the threads within a warp busy.
Align memory reads (global memory <> shared memory).
Use shared memory.
Limited memory per thread.
Close to the hardware architecture: the hardware is made for exploiting data parallelism.
When to use CUDA?
Special, computationally intensive programs.
Keep it simple
…
Disadvantages
Maintenance…
CUDA = NVIDIA
Alternatives:
– OpenCL: a standard language for writing code for GPUs and multicores. Supported by ATI, NVIDIA, Apple, …
– RapidMind’s Multicore Development platform supports multiple architectures, making you less dependent on one vendor
– AMD, IBM, Intel, Microsoft and others are working on standard parallel-processing extensions to C/C++
– Larrabee (Intel): combining the processing power of GPUs with the programmability of x86 processors
CUDA promises an abstract, scalable hardware model, but is it true?
Link 1: white paper
Links in Scientific Study section
Heterogeneous Chip Designs
Augment standard CPU with attached processors performing the compute-intensive portions:
Graphics Processing Units (GPUs)
Field Programmable Gate Arrays (FPGAs)
Cell processors, designed for video games
Cell processor
8 Synergistic Processing Elements (SPEs)
128-bit wide data paths for vector instructions
256 KB on-chip RAM per SPE
No memory coherence: performance and simplicity
Programmers should carefully manage data movement