GPU Programming with CUDA
Jan Lemeire
Dept. ETRO
November 6th 2008
Parallel Systems Course: Chapter IV
Overview
1. CUDA-enabled GPU architecture
2. Programming for GPUs
3. How a CUDA program runs
4. Optimizing CUDA programs
5. Analysis & Conclusions
1. CUDA-enabled GPU architecture
Utilization of the graphics card processor (GPU) for high-performance computing
Via NVIDIA's CUDA API:
http://www.nvidia.com/object/cuda_home.html
The PC graphics market largely subsidizes the development of these GPGPUs (General-Purpose computation on GPUs)
Cards that support CUDA (Link 1): the GeForce 8, 9 and 200 series
Goal of chapter
Understand the benefits & disadvantages of the technology:
if you have to decide whether or not a new technology should be introduced,
you should understand the consequences!
Why Are GPUs So Fast?
The GPU is specialized for math-intensive, highly parallel computation, so more transistors can be devoted to data processing rather than data caching and flow control.
Commodity industry: provides economies of scale.
Competitive industry: fuels innovation.
[Figure: CPU vs. GPU chip layout: the CPU spends its transistors on control logic and cache, the GPU on many ALUs; both attach to DRAM.]
© NVIDIA Corporation 2007
G80 GPU Computing
Processors execute computing threads
Thread Execution Manager issues threads
128 Thread Processors
Parallel Data Cache accelerates processing
[Figure: G80 block diagram: the host and Input Assembler feed the Thread Execution Manager, which issues threads to eight groups of Thread Processors, each group with its Parallel Data Caches; all access Global Memory through load/store.]
© NVIDIA Corporation 2007
Goal: Scaling the Architecture
Same program, scalable performance
[Figure: the same block diagram at two scales: a GPU with eight groups of Thread Processors next to one with only two; the same program runs on both.]
© NVIDIA Corporation 2007
Graphics Programming Model
Graphics Application → Vertex Program → Rasterization → Fragment Program → Display
© NVIDIA Corporation 2007
What’s Wrong With GPGPU?
[Figure: the graphics pipeline again, zooming in on the fragment/pixel program: input registers, constants, texture and temporary registers feed a program that can only write output registers.]
APIs are specific to graphics
Limited texture size and dimension
Limited shader outputs
No scatter
Limited instruction set
No thread communication
Limited local storage
Building a Better Pixel Thread
[Figure: a thread program with registers, constants, texture access, a thread number, and output registers.]
Features
Millions of instructions
Full integer and bit instructions
No limits on branching, looping
1D, 2D, or 3D thread ID allocation

Global Memory
[Figure: the thread program now reads and writes Global Memory directly.]
Features
Fully general load/store to GPU memory
Untyped, not fixed texture types
Pointer support

Parallel Data Cache
[Figure: threads additionally share a Parallel Data Cache.]
Features
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers
© NVIDIA Corporation 2007
Hardware Implementation: Memory Architecture
The local, global, constant, and texture spaces are regions of device memory.
Each multiprocessor has:
A set of 32-bit registers per processor
On-chip shared memory, where the shared memory space resides
A read-only constant cache, to speed up access to the constant memory space
A read-only texture cache, to speed up access to the texture memory space
[Figure: a device as a set of multiprocessors 1..N, each with M processors and their registers, an instruction unit, shared memory, a constant cache and a texture cache, all connected to off-chip device memory.]
© NVIDIA Corporation 2007
Example Fluid Algorithm
CPU: a single thread computes each Pn’ = P1+P2+P3+P4 out of the cache.
GPGPU: several programs compute in parallel, but sharing data requires multiple passes through video memory.
GPU Computing with CUDA: threads compute in parallel and share their data (P1..P5) through the on-chip Parallel Data Cache.
[Figure: the three designs side by side: Control/ALU with cache and DRAM (CPU); several Control/ALU pairs around video memory (GPGPU); the Thread Execution Manager, ALUs and shared data in the Parallel Data Cache over DRAM (CUDA).]
2. Programming for GPUs
CUDA: Programming GPU in C
Philosophy: provide the minimal set of extensions necessary to expose the power

Declaration specifiers to indicate where things live:
__global__ void KernelFunc(...);  // kernel callable from host
__device__ void DeviceFunc(...);  // function callable on device
__device__ int GlobalVar;         // variable in device memory
__shared__ int SharedVar;         // shared in PDC by thread block

Extended function invocation syntax for parallel kernel launch:
KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each

Special variables for thread identification in kernels:
dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;  dim3 gridDim;

Intrinsics that expose specific operations in kernel code:
__syncthreads();                  // barrier synchronization within kernel
© NVIDIA Corporation 2007
CUDA: Runtime support
Explicit memory allocation returns pointers to GPU memory:
cudaMalloc(), cudaFree()
Explicit memory copy for host ↔ device, device ↔ device:
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management:
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability:
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
© NVIDIA Corporation 2007
Example: Vector Addition Kernel
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
© NVIDIA Corporation 2007
Example: Host code for memory
// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
© NVIDIA Corporation 2007
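The slide stops at the kernel launch; for completeness, a minimal sketch of the remaining steps, assuming a hypothetical host buffer h_C for the result:

// copy the result back from device to host memory
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

// free device and host memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);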
CUDA SDK
[Figure: the CUDA toolchain: integrated CPU and GPU C source code goes into the NVIDIA C compiler, which produces NVIDIA assembly for computing (GPU) and CPU host code for a standard C compiler; around it sit libraries (FFT, BLAS, …), example source code, the CUDA driver, and a debugger and profiler.]
Example program: matrix multiplication in one block
__global__ void matrixMultiplicationInOneBlock(float *inputA, float *inputB,
                                               float *output, int size)
{
    // allocate shared memory for the maximal matrix size
    __shared__ float matrixA[512], matrixB[512];
    float result = 0.0f;
    const int tx = threadIdx.x, ty = threadIdx.y;

    // each thread loads one element of each input matrix into shared memory
    int position = ty * size + tx;
    matrixA[position] = inputA[position];
    matrixB[position] = inputB[position];
    __syncthreads();   // wait until both matrices are fully loaded

    for (int i = 0; i < size; i++)
        result += matrixA[ty * size + i] * matrixB[i * size + tx];

    output[position] = result;
}
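A hedged launch sketch for this kernel (d_A, d_B and d_C are assumed device pointers): since both shared arrays hold 512 floats and a block contains at most 512 threads, the whole matrix must fit in a single block, i.e. size × size ≤ 512.

dim3 block(size, size);   // one thread per matrix element
matrixMultiplicationInOneBlock<<<1, block>>>(d_A, d_B, d_C, size);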
3. How a CUDA program runs
Threads: grouped in blocks & warps
A block of threads is executed on the same multiprocessor; its threads use the same shared memory (16 KB) and can be synchronized.
A block is divided into warps, which are run ‘together’.
One multiprocessor can run 4 thread blocks in parallel.
The warp size is 32: 32 threads are executed in SIMD fashion on the 8 cores of the multiprocessor, which keeps the deep FPU pipelines full; issuing a memory or arithmetic operation takes 4 cycles per warp.
A 32-bit ActiveMask is used: a bit for every running thread in a warp.
CUDA Scalable Execution Model
A hierarchy of threads: threads execute a kernel in blocks; blocks are organized in a grid.
Threads within a block cooperate: they share on-chip memory in the PDC, with barrier synchronization.
Blocks within a grid are independent: blocks run to completion in unspecified order; no global sync, no per-block mutex.
Guarantees scalable execution!
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of thread blocks, and Kernel 2 on Grid 2; one block, Block (1, 1), is shown as a 5×3 array of threads.]
© NVIDIA Corporation 2006
How thread blocks are partitioned
Thread blocks are partitioned into warps; thread IDs within a warp are consecutive and increasing.
Warp 0 starts with thread ID 0.
For a 2D block: ThreadID = threadIdx.x + blockDim.x * threadIdx.y
The partitioning is always the same, so you can use this knowledge in control flow (covered next).
However, DO NOT rely on any ordering between warps.
If there are any dependencies between threads, you must __syncthreads() to get correct results.
© NVIDIA Corporation 2007
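As a small sketch of the partitioning rule above (not from the original slides), a thread can derive its warp from its linearized ID:

int tid  = threadIdx.x + blockDim.x * threadIdx.y;  // linear thread ID in a 2D block
int warp = tid / 32;   // warps take consecutive IDs: warp 0 holds IDs 0..31
int lane = tid % 32;   // position within the warp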
A quick review
device = GPU = set of multiprocessors
multiprocessor = set of processors & shared memory
kernel = GPU program
grid = array of thread blocks that execute a kernel
thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached   Access       Who
Local      Off-chip   No       Read/write   One thread
Shared     On-chip    N/A      Read/write   All threads in a block
Global     Off-chip   No       Read/write   All threads + host
Constant   Off-chip   Yes      Read         All threads + host
Texture    Off-chip   Yes      Read         All threads + host
© NVIDIA Corporation 2006
Quick terminology review
Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
The unit of parallelism in CUDA
Note difference from CPU threads: creation cost, resource usage, and switching cost of GPU threads is much smaller
Warp: a group of threads executed physically in parallel (SIMD)
Half-warp: the first or second half of a warp of threads
Thread Block: a group of threads that are executed together and can share memory on a single multiprocessor
Grid: a group of thread blocks that execute a single CUDA program logically in parallel
© NVIDIA Corporation 2007
Device Runtime Component: Synchronization Function
void __syncthreads();
Synchronizes all threads in a block.
Once all threads have reached this point, execution resumes normally.
Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory.
Allowed in conditional code only if the conditional is uniform across the entire thread block.
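A minimal sketch (hypothetical kernel, assuming 256 threads per block) of the read-after-write hazard the barrier prevents: each thread reads a shared-memory element written by its neighbour, which is only safe once all writes are visible.

__global__ void rotate(float *in, float *out)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = in[threadIdx.x];                // every thread writes one element
    __syncthreads();                                   // all writes now visible to the block
    out[threadIdx.x] = buf[(threadIdx.x + 1) % 256];   // safely read a neighbour's element
}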
Thread divergence in a SIMD processor
Thread divergence is supported by the hardware!
For example: if (x < 5) y = 5; else y = -5;
The SIMD hardware performs the three steps in turn:
the condition x < 5 is evaluated by all threads of the warp;
y = 5; is executed only on the threads for which x < 5 holds;
y = -5; is executed on all the others.
Only when threads in the same warp do the same thing => effective parallelism.
Even more general: instruction predication.
© NVIDIA Corporation 2006
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
© NVIDIA Corporation 2006
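A hedged sketch (hypothetical kernel) contrasting the two branch patterns above:

__global__ void branches(float *y)
{
    int t = threadIdx.x;
    if (t > 2)          // divergent: threads 0..2 and 3..31 of a warp take different paths
        y[t] = 5.0f;
    if (t / 32 > 2)     // warp-aligned: all 32 threads of a warp take the same path
        y[t] = -5.0f;
}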
Instruction Predication
Comparison instructions set condition codes (CC)
Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
Compiler tries to predict if a branch condition is likely to produce many divergent warps
If guaranteed not to diverge: only predicates if < 4 instructions
If not guaranteed: only predicates if < 7 instructions
May replace branches with instruction predication
ALL predicated instructions take execution cycles:
those with false conditions don’t write their output, and don’t invoke memory loads and stores;
this saves branch instructions, so it can be cheaper than serializing divergent paths.
© NVIDIA Corporation 2006
Memory Instruction Latency
Memory instructions take 4 cycles per warp to issue:
global and local memory loads / stores (not cached),
constant and texture loads (cached),
shared memory reads / writes.

Example:
__shared__ float shared[];
__device__ float global[];
shared[threadIdx.x] = global[threadIdx.x];

4 cycles to issue the read from global (device) memory, 4 cycles to issue the write to shared memory.
400–600 cycles to read a float from global (device) memory!
But this latency can be hidden by scheduling independent math instructions, or even other loads / stores, if there are enough active threads.
© NVIDIA Corporation 2006
Arithmetic Instruction Latency
int and float add, shift, min, max and float mul, mad: 4 cycles per warp
int multiply (*) is by default 32-bit
requires multiple cycles / warp
Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply
Integer divide and modulo are more expensive
Compiler will convert literal power-of-2 divides to shifts
But we have seen it miss some cases
Be explicit in cases where compiler can’t tell that divisor is a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
© NVIDIA Corporation 2006
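A short illustrative fragment of the two tricks above (a, b, x and n are assumed ints, n a power of 2); __mul24() is the intrinsic named on the slide:

int prod = __mul24(a, b);   // 24-bit integer multiply: 4 cycles per warp
int rem  = x & (n - 1);     // same result as x % n, without the expensive modulo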
Arithmetic Instruction Latency
Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
These are the versions prefixed with “__”
Examples:__rcp(), __sin(), __exp()
Other functions are combinations of the above
y / x == rcp(x) * y takes 20 cycles per warp
sqrt(x) == rcp(rsqrt(x)) takes 32 cycles per warp
Latency Hiding for Memory Accesses
During global to shared memory copying
During shared memory reads
Keep the multiprocessors busy with a huge number of threads.
One multiprocessor can simultaneously execute multiple thread blocks of at most 512 threads each.
The number of resident blocks is limited by the amount of shared memory and registers needed by each thread.
Note: the GPU communicates with the CPU via the relatively slow PCI Express bus (500 Mb/s).
4. Optimizing CUDA programs
Optimizing CUDA (Mark Harris, AstroGPU 2007)
CUDA is fast and efficient
CUDA enables efficient use of the massive parallelism of NVIDIA GPUs
Direct execution of data-parallel programs, without the overhead of a graphics API
Using CUDA on Tesla GPUs can provide large speedups on data-parallel computations straight out of the box!
Even higher speedups are achievable by understanding and tuning for GPU architecture
This presentation covers general performance, common pitfalls, and useful strategies
CUDA Optimization Strategies
Optimize Algorithms for the GPU
Optimize Memory Access Coherence
Take Advantage of On-Chip Shared Memory
Use Parallelism Efficiently
Optimize Algorithms for the GPU
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Sometimes it’s better to recompute than to cache: the GPU spends its transistors on ALUs, not memory.
Do more computation on the GPU to avoid costly data transfers
Even low parallelism computations can sometimes be faster than transferring back and forth to host
Optimize Memory Coherence
Coalesced vs. non-coalesced access to global / local device memory = an order of magnitude
Optimize for spatial locality in cached texture memory
In shared memory, avoid high-degree bank conflicts
Coalesced Access: Reading floats
[Figure: threads t0–t15 reading consecutive floats from addresses 128–188; access is coalesced both when all threads participate and when some threads do not participate.]
Uncoalesced Access: Reading floats
[Figure: two uncoalesced patterns for threads t0–t15: permuted access by threads, and a misaligned starting address (not a multiple of 64).]
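A hedged sketch (hypothetical kernel) of the patterns in these figures; on G80 hardware, coalescing requires the k-th thread of a half-warp to access the k-th word of an aligned segment:

__global__ void accessPatterns(float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = in[i];        // coalesced: consecutive threads read consecutive addresses
    float b = in[2 * i];    // uncoalesced on G80: stride-2 reads break the pattern
    out[i] = a + b;
}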
Take Advantage of Shared Memory
Hundreds of times faster than global memory.
Threads can cooperate via shared memory.
Use one / a few threads to load / compute data shared by all threads.
Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing (matrix transpose sketch below).
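A minimal sketch of that staging idea, assuming a square width × width matrix with width a multiple of 16, launched with 16×16-thread blocks; the tile is padded by one column to avoid the bank conflicts discussed later:

__global__ void transpose(float *in, float *out, int width)
{
    __shared__ float tile[16][17];                        // +1 column avoids bank conflicts
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;                    // swap the block coordinates
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}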
Use Parallelism Efficiently
Partition your computation to keep the GPU multiprocessors equally busy
Many threads, many thread blocks
Keep resource usage low enough to support multiple active thread blocks per multiprocessor
Registers, shared memory
Optimizing threads per block
Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps.
More threads per block == better memory latency hiding.
But more threads per block == fewer registers per thread; kernel invocations can fail if too many registers are used.
Heuristics:
Minimum: 64 threads per block, and only if there are multiple concurrent blocks.
128 to 256 threads is a better choice, and usually still leaves enough registers to compile and invoke successfully.
This all depends on your computation: experiment! (A configuration sketch follows.)
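Applied to the earlier vecAdd kernel, a hedged configuration sketch (the kernel then needs an if (i < N) guard when N is not a multiple of the block size):

int threadsPerBlock = 256;                                   // a multiple of the warp size
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // round up to cover all N
vecAdd<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C);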
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy.
Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently.
Minimize occupancy requirements by minimizing latency.
Maximize occupancy by optimizing threads per multiprocessor.
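A worked example under G80’s limit of 24 resident warps per multiprocessor (assuming registers and shared memory permit): blocks of 256 threads contain 256 / 32 = 8 warps each, so three resident blocks give 24 / 24 = 100% occupancy.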
Parameterize Your Application
Parameterization helps adaptation to different GPUs.
GPUs vary in many ways:
number of multiprocessors,
memory bandwidth,
shared memory size,
register file size,
threads per block.
You can even make apps self-tuning (like FFTW and ATLAS)
“Experiment” mode discovers and saves optimal configuration
Wavefront algorithm
About wavefront parallelism: see the exercises.
A 512×512 image divided into 8×8 blocks => 64 × 64 = 4096 blocks.
On a GTX 280: 240 cores => 30 multiprocessors.
Conclusion: enough blocks to keep all multiprocessors busy.
Parallel Memory Architecture
In a parallel machine, many threads access memory, so memory is divided into banks; this is essential to achieve high bandwidth.
Each bank can service one address per cycle; a memory can service as many simultaneous accesses as it has banks.
Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized.
[Figure: shared memory drawn as banks 0–15.]
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast case:
if all threads of a half-warp access different banks, there is no bank conflict;
if all threads of a half-warp read the identical address, there is no bank conflict (broadcast).
The slow case:
bank conflict: multiple threads in the same half-warp access the same bank;
the accesses must be serialized;
cost = max # of simultaneous accesses to a single bank.
Bank Addressing Examples
[Figure: conflict-free patterns: linear addressing with stride 1 (thread i → bank i), and a random 1:1 permutation of threads 0–15 onto banks 0–15.]
[Figure: conflicting patterns: linear addressing with stride 2 gives 2-way bank conflicts; linear addressing with stride 8 gives 8-way bank conflicts.]
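An illustrative fragment of the two strides above, assuming G80’s 16 banks of 32-bit words:

__shared__ float s[512];
float a = s[threadIdx.x];        // stride 1: thread i hits bank i mod 16, conflict-free
float b = s[2 * threadIdx.x];    // stride 2: threads i and i+8 hit the same bank, 2-way conflict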
Unrolling Last Steps
Only one warp is active during the last few steps.
Unroll them and remove the unneeded __syncthreads().

// t = threadIdx.x, bd = blockDim.x, data points to shared memory
for (unsigned int s = bd/2; s > 32; s >>= 1)
{
    if (t < s)
        data[t] += data[t + s];
    __syncthreads();
}
// within the final warp the threads run in lockstep, so no barrier is needed
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8)  data[t] += data[t + 8];
if (t < 4)  data[t] += data[t + 4];
if (t < 2)  data[t] += data[t + 2];
if (t < 1)  data[t] += data[t + 1];
© NVIDIA Corporation 2006
CUDA Optimization Priorities
Memory coalescing is the #1 priority: the highest bang-for-the-buck optimization; optimize for locality.
Take advantage of shared memory: very high bandwidth; threads can cooperate to save work.
Use parallelism efficiently: keep the GPU busy at all times; high arithmetic / bandwidth ratio; many threads & thread blocks.
Leave bank conflicts for last! 4-way and smaller conflicts are usually not worth avoiding if avoiding them costs more instructions.
5. Analysis & Conclusions
Strategy
Light-weight threads, supported by the hardware: thread processors, up to 96 threads per processor.
A context switch can happen in 1 cycle!
No caching mechanism, branch prediction, …: the GPU does not try to be efficient for every program, and does not spend transistors on optimization.
Simple straightforward sequential programming should be abandoned…
Less higher-level memory:
GPU: 16 KB shared memory per SIMD multiprocessor;
CPU: the L2 cache contains several MBs.
Massive floating-point computation power.
Transparent system organization, whereas modern (sequential) CPUs present themselves as a simple Von Neumann architecture.
Strategy II
Don't write explicitly threaded code; the compiler handles it => no chance of deadlocks or race conditions.
Think differently: analyze the data instead of the algorithm.
In contrast, with modern superscalar CPUs the programmer writes sequential (single-threaded) code and the processor tries to execute it in parallel, through pipelining etc. (instruction-level parallelism). But because of data and resource dependencies, no further speedup can be reached beyond 4-way superscalar CPUs; about 1.5 instructions per cycle seems to be the maximum.
Link 1: white paper
Results
Performance doubling every 6 months!
1000s of threads possible!
High bandwidth.
The PCI Express bus (the GPU–CPU connection) is the bottleneck.
Enormous possibilities for latency hiding.
Matrix multiplication runs 13 times faster on a standard GPU (GeForce 8500 GT) than on a state-of-the-art CPU (Intel dual core),
and 200 times faster on a high-end GPU (50 times when compared with a quad-core CPU).
Low threshold:
C, good documentation, many examples, easy to install, automatic card detection, easy compilation.
How to get maximal performance, or call it ... limitations
Create many threads, make them ‘aggressively’ parallel.
Keep the threads within a warp busy.
Align memory reads (global memory <> shared memory).
Use shared memory.
Limited memory per thread.
Close to the hardware architecture: the hardware is made for exploiting data parallelism.
When to use CUDA?
Special, computationally intensive programs.
Keep it simple
…
Disadvantages
Maintenance…
CUDA = NVIDIA
Alternatives:
– OpenCL: a standard language for writing code for GPUs and multicores. Supported by ATI, NVIDIA, Apple, …
– RapidMind’s Multicore Development platform supports multiple architectures, making you less dependent on one vendor
– AMD, IBM, Intel, Microsoft and others are working on standard parallel-processing extensions to C/C++
– Larrabee (Intel): combining the processing power of GPUs with the programmability of x86 processors
CUDA promises an abstract, scalable hardware model, but is it true?
Link 1: white paper
Links in Scientific Study section
Heterogeneous Chip Designs
Augment standard CPU with attached processors performing the compute-intensive portions:
Graphics Processing Units (GPUs)
Field Programmable Gate Arrays (FPGAs)
Cell processors, designed for video games
Cell processor
8 Synergistic Processing Elements (SPEs)
128-bit wide data paths for vector instructions
256 KB on-chip RAM per SPE
No memory coherence: performance and simplicity
Programmers should carefully manage data movement