David Luebke NVIDIA Research GPU Architecture & Implications
Dec 25, 2015
© NVIDIA Corporation 2007
GPU Architecture
CUDA provides a parallel programming model
The Tesla GPU architecture implements this
This talk will describe the characteristics, goals, and implications of that architecture
G80 GPU Implementation: Tesla C870
681 million transistors470 mm2 in 90 nm CMOS
128 thread processors518 GFLOPS peak1.35 GHz processor clock
1.5 GB DRAM76 GB/s peak800 MHz GDDR3 clock384 pin DRAM interface
ATX form factor cardPCI Express x16170 W max with DRAM
© NVIDIA Corporation 2007
G80 (launched Nov 2006)
128 Thread Processors execute kernel threads
Up to 12,288 parallel threads active
Per-block shared memory (PBSM) accelerates processing
Block Diagram Redux
Thread Execution Manager
Input Assembler
Host
PBSM
Global Memory
Load/store
PBSM
Thread Processors
PBSM
Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors
PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSMPBSM
© NVIDIA Corporation 2007
Streaming Multiprocessor (SM)
Processing elements8 scalar thread processors (SP)32 GFLOPS peak at 1.35 GHz8192 32-bit registers (32KB)
½ MB total register file space!
usual ops: float, int, branch, …
Hardware multithreadingup to 8 blocks resident at onceup to 768 active threads in total
16KB on-chip memorylow latency storageshared amongst threads of a blocksupports thread communication
SP
SharedMemory
MT IU
SM t0 t1 … tB
Goal: Scalability
Scalable executionProgram must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling
Hierarchical execution modelDecompose problem into sequential steps (kernels)
Decompose kernel into computing parallel blocks
Decompose block into computing parallel threads
Hardware distributes independent blocks to SMs as available
Blocks Run on Multiprocessors
Kernel launched by host
. . .
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
. . .
Device processor array
Device Memory
Goal: easy to program
Strategies:Familiar programming language mechanics
C/C++ with small extension
Simple parallel abstractionsSimple barrier synchronization
Shared memory semantics
Hardware-managed hierarchy of threads
© NVIDIA Corporation 2007
Hardware Multithreading
Hardware allocates resources to blocksblocks need: thread slots, registers, shared memory
blocks don’t run until resources are available
Hardware schedules threadsthreads have their own registers
any thread not waiting for something can run
context switching is (basically) free – every cycle
Hardware relies on threads to hide latencyi.e., parallelism is necessary for performance
SP
SharedMemory
MT IU
SM
Goal: Performance per millimeter
For GPUs, perfomance == throughput
Strategy: hide latency with computation not cacheHeavy multithreading – already discussed by Kevin
Implication: need many threads to hide latencyOccupancy – typically need 128 threads/SM minimumMultiple thread blocks/SM good to minimize effect of barriers
Strategy: Single Instruction Multiple Thread (SIMT)Balances performance with ease of programming
© NVIDIA Corporation 2007
SIMT Thread Execution
Groups of 32 threads formed into warpsalways executing same instructionshared instruction fetch/dispatchsome become inactive when code path divergeshardware automatically handles divergence
Warps are the primitive unit of schedulingpick 1 of 24 warps for each instruction slot
SIMT execution is an implementation choicesharing control logic leaves more space for ALUslargely invisible to programmermust understand for performance, not correctness
SP
SharedMemory
MT IU
SM
© NVIDIA Corporation 2007gh07 Hot3D: Tesla GPU Computing12
SIMT Multithreaded Execution
Weaving: the original parallel thread technology is about 10,000 years oldWarp: a set of 32 parallel threadsthat execute a SIMD instruction
SM hardware implements zero-overhead warp and thread schedulingEach SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads
Threads can execute independentlySIMD warp automatically diverges and converges when threads branchBest efficiency and performance when threads of a warp execute togetherSIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency
warp 8 instruction 11
SM multithreadedinstruction scheduler
warp 1 instruction 42
warp 3 instruction 95
warp 8 instruction 12
...
time
warp 3 instruction 96
© NVIDIA Corporation 2007
SPSharedMemory
MT
IU
Device Memory
Texture Cache Constant Cache
I Cac
heMemory Architecture
Direct load/store access to device memorytreated as the usual linear sequence of bytes (i.e., not pixels)
Texture & constant caches are read-only access paths
On-chip shared memory shared amongst threads of a blockimportant for communication amongst threads
provides low-latency temporary storage (~100x less than DRAM)
HostMemory
PCIe
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphicsNO: CUDA compiles directly to the hardware
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines… NO: warps are 32-wide
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive… NOPE
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers. NO: scalar thread processors
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient: No – 4-10x perf/W advantage, up to 89x reported for some studies
GPUs don’t do real floating point
Myths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are:Very wide (1000s) SIMD machines…
…on which branching is impossible or prohibitive…
…with 4-wide vector registers.
GPUs are power-inefficient:
GPUs don’t do real floating point
GPU Floating Point FeaturesG80 SSE IBM Altivec Cell SPE
Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754
Rounding modes for FADD and FMUL
Round to nearest and round to zero
All 4 IEEE, round to nearest, zero, inf, -inf
Round to nearest only
Round to zero/truncate only
Denormal handling Flush to zeroSupported,1000’s of cycles
Supported,1000’s of cycles
Flush to zero
NaN support Yes Yes Yes No
Overflow and Infinity support
Yes, only clamps to max norm
Yes Yes No, infinity
Flags No Yes Yes Some
Square root Software only Hardware Software only Software only
Division Software only Hardware Software only Software only
Reciprocal estimate accuracy
24 bit 12 bit 12 bit 12 bit
Reciprocal sqrt estimate accuracy
23 bit 12 bit 12 bit 12 bit
log2(x) and 2^x estimates accuracy
23 bit No 12 bit No
Do GPUs Do Real IEEE FP?
G8x GPU FP is IEEE 754Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways
GPU FP getting better every generationDouble precision support shortly
Goal: best of class by 2009
© NVIDIA Corporation 2007
GPU Computing Sweet Spots
Applications:
High arithmetic intensity:
Dense linear algebra, PDEs, n-body, finite difference, …
High bandwidth: Sequencing (virus scanning, genomics), sorting, database…
Visual computing:Graphics, image processing, tomography, machine vision…
© NVIDIA Corporation 2007
GPU Computing Example Markets
Computational
Modeling
Computational
Chemistry
Computational
Medicine
Computational
Science
Computational
Biology
Computational
Finance
Computational
Geoscience
Image
Processing
© NVIDIA Corporation 2007
Applications - Condensed
3D image analysisAdaptive radiation therapyAcousticsAstronomyAudioAutomobile visionBioinfomaticsBiological simulationBroadcastCellular automataComputational Fluid DynamicsComputer VisionCryptographyCT reconstructionData MiningDigital cinema/projectionsElectromagnetic simulationEquity training
FilmFinancial - lots of areasLanguagesGISHolographics cinemaImaging (lots)Mathematics researchMilitary (lots)Mine planningMolecular dynamicsMRI reconstructionMultispectral imagingnbodyNetwork processingNeural networkOceanographic researchOptical inspectionParticle physics
Protein foldingQuantum chemistryRay tracingRadarReservoir simulationRobotic vision/AIRobotic surgerySatellite data analysisSeismic imagingSurgery simulationSurveillanceUltrasoundVideo conferencingTelescopeVideoVisualizationWirelessX-ray
© NVIDIA Corporation 2007
GPU Computing Sweet Spots
From cluster to workstationThe “personal supercomputing” phase change
From lab to clinic
From machine room to engineer, grad student desks
From batch processing to interactive
From interactive to real-time
GPU-enabled clusters A 100x or better speedup changes the science
Solve at different scales
Direct brute-force methods may outperform cleverness
New bottlenecks may emerge
Approaches once inconceivable may become practical
© NVIDIA Corporation 2007
New Applications
Real-time options implied volatility engine
Swaption volatility cube calculator
Manifold 8 GIS
Ultrasound imaging
HOOMD Molecular Dynamics
Also…Image rotation/classificationGraphics processing toolboxMicroarray data analysisData parallel primitivesAstrophysics simulations
SDK: Mandelbrot, computer vision
Seismic migration
© NVIDIA Corporation 2007
The Future of GPUs
GPU Computing drives new applicationsReducing “Time to Discovery”
100x Speedup changes science and research methods
New applications drive the future of GPUs and GPU Computing
Drives new GPU capabilities
Drives hunger for more performance
Some exciting new domains: Vision, acoustic, and embedded applications
Large-scale simulation & physics
© NVIDIA Corporation 2007
GPU Floating Point FeaturesG80 SSE IBM Altivec Cell SPE
Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754
Rounding modes for FADD and FMUL
Round to nearest and round to zero
All 4 IEEE, round to nearest, zero, inf, -inf
Round to nearest only
Round to zero/truncate only
Denormal handling Flush to zeroSupported,1000’s of cycles
Supported,1000’s of cycles
Flush to zero
NaN support Yes Yes Yes No
Overflow and Infinity support
Yes, only clamps to max norm
Yes Yes No, infinity
Flags No Yes Yes Some
Square root Software only Hardware Software only Software only
Division Software only Hardware Software only Software only
Reciprocal estimate accuracy
24 bit 12 bit 12 bit 12 bit
Reciprocal sqrt estimate accuracy
23 bit 12 bit 12 bit 12 bit
log2(x) and 2^x estimates accuracy
23 bit No 12 bit No
© NVIDIA Corporation 2007
Do GPUs Do Real IEEE FP?
G8x GPU FP is IEEE 754Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways
GPU FP getting better every generationDouble precision support shortly
Goal: best of class by 2009
© NVIDIA Corporation 2007
CUDA Performance Advantages
Performance:BLAS1: 60+ GB/sec
BLAS3: 127 GFLOPS
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec
SSEARCH: 5.2 Gcells/sec
Black Scholes: 4.7 GOptions/sec
VMD: 290 GFLOPS
How:Leveraging shared memory
GPU memory bandwidth
GPU GFLOPS performance
Custom hardware intrinsics
__sinf(), __cosf(), __expf(), __logf(), …
All benchmarks are compiled code!
© NVIDIA Corporation 2007
Problem: GPGPU
OLD: GPGPU – trick the GPU into general-purpose computing by casting problem as graphics
Turn data into images (“texture maps”)
Turn algorithms into image synthesis (“rendering passes”)
Promising results, but:Tough learning curve, particularly for non-graphics experts
Potentially high overhead of graphics API
Highly constrained memory layout & access model
Need for many passes drives up bandwidth consumption
© NVIDIA Corporation 2007
Solution: CUDA
NEW: GPU Computing with CUDACUDA = Compute Unified Driver Architecture
Co-designed hardware & software for direct GPU computing
Hardware: fully general data-parallel architecture
Software: program the GPU in C
General thread launch
Global load-store
Parallel data cache
Scalar architecture
Integers, bit operations
Double precision (soon)
Scalable data-parallel execution/memory model
C with minimal yet powerful extensions
© NVIDIA Corporation 2007
Graphics Programming Model
Graphics Application
Vertex Program
Rasterization
Fragment Program
Display
© NVIDIA Corporation 2007
Streaming GPGPU Programming
OpenGL Program to Add A and B
Vertex Program
Rasterization
Fragment Program
CPU Reads Texture Memory
for Results
Start by creating a quad
“Programs” created with raster operation
Write answer to texture memory as a “color”
Read textures as input to OpenGL shader program
All this just to do A + B
© NVIDIA Corporation 2007
What’s Wrong With GPGPU?
Application
Vertex Program
Rasterization
Pixel Program
Display
Input Registers
Pixel Program
Output Registers
Constants
Texture
Temp Registers
© NVIDIA Corporation 2007
What’s Wrong With GPGPU?
Application
Vertex Program
Rasterization
Fragment Program
Display
Input Registers
Fragment Program
Output Registers
Constants
Texture
Temp Registers
APIs are specific to graphics
Limited texture size and dimension
Limited shader outputs
No scatter
Limited instruction set
No thread communication
Limited local storage
© NVIDIA Corporation 2007
Building a Better Pixel
Input Registers
Fragment Program
Output Registers
Constants
Texture
Registers
© NVIDIA Corporation 2007
Building a Better Pixel Thread
Thread Program
Output Registers
Constants
Texture
Registers
Thread Number
Features
Millions of instructions
Full Integer and Bit instructions
No limits on branching, looping
1D, 2D, or 3D thread ID allocation
© NVIDIA Corporation 2007
Global Memory
Thread Program
Global Memory
Thread Number
Constants
Texture
Registers
Features
Fully general load/store to
GPU memory
Untyped, not fixed texture types
Pointer support
© NVIDIA Corporation 2007
Parallel Data Cache
Thread Program
Global Memory
Thread Number
Constants
Texture
Registers
Features
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers
Parallel Data Cache
© NVIDIA Corporation 2007
Example Algorithm - Fluids
So the pressure for each particle is…
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
Pressure depends on neighbors
Goal: Calculate PRESSURE in a fluid
Pressure = Sum of neighboring pressures
Pn’ = P1 + P2 + P3 + P4
© NVIDIA Corporation 2007
Example Fluid Algorithm
CPU GPGPU
GPU Computingwith CUDA
Multiple passes through video
memory
Parallel execution through cache
Single thread out of cache
Program/Control
Data/Computation
Control
ALU
Cache DRAM
P1
P2
P3
P4
Pn’=P1+P2+P3+P4
ALU
VideoMemory
Control
ALU
Control
ALU
Control
ALU
P1,P2P3,P4
P1,P2P3,P4
P1,P2P3,P4
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
ParallelData Cache
ThreadExecutionManager
ALU
Control
ALU
Control
ALU
Control
ALU
DRAM
P1
P2
P3
P4
P5
SharedData
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
© NVIDIA Corporation 2007
Parallel Data Cache
Bring the data closer to the ALU
Addresses a fundamental problem of stream computing:
• The data are far from the FLOPS, video RAM latency is high
• Threads can only communicate their results through this high latency RAM
GPGPU
Multiple passes through video
memory
ALU
VideoMemory
Control
ALU
Control
ALU
Control
ALU
P1,P2P3,P4
P1,P2
P3,P4
P1,P2P3,P4
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
© NVIDIA Corporation 2007
Parallel Data Cache
Parallel execution through cache
ParallelData Cache
ThreadExecutionManager
ALU
Control
ALU
Control
ALU
Control
ALU
DRAM
P1
P2
P3
P4
P5
SharedData
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
Pn’=P1+P2+P3+P4
Bring the data closer to the ALU
Stage computation for the parallel data cache
Minimize trips to external memory
Share values to minimize overfetch and computation
Increases arithmetic intensity by keeping data close to the processors
User managed generic memory, threads read/write arbitrarily
© NVIDIA Corporation 2007
Streaming vs. GPU Computing
GPGPU
CUDA
ALU
ALU
StreamingGather in, Restricted write
Memory is far from ALU
No inter-element communication
GPU Computing with CUDAMore general data parallel model
Full Scatter / Gather
PDC brings the data closer to the ALU
App decides how to decompose the problem across threads
Share and communicate between threads to solve problems efficiently
© NVIDIA Corporation 2007
CPU/GPU Parallelism
Moore’s Law gives you more and more transistorsWhat do you want to do with them? CPU strategy: make the workload (one compute thread) run as fast as possible
Tactics: – Cache (area limiting)– Instruction/Data prefetch– Speculative executionlimited by “perimeter” – communication bandwidth…then add task parallelism…multi-core
GPU strategy: make the workload (as many threads as possible) run as fast as possible
Tactics:
– Parallelism (1000s of threads)– Pipelining limited by “area” – compute capability
© NVIDIA Corporation 2007
Background: Unified Design
Shader D
Shader A
Shader B
Shader C
Shader Core
ibuffer ibuffer ibuffer ibuffer
obuffer obuffer obufferobuffer
Discrete Design Unified Design
© NVIDIA Corporation 2007
Hardware Implementation:Collection of SIMT Multiprocessors
Each multiprocessor is a set of SIMT thread processors
Single Instruction Multiple Thread
Each thread processor has:program counter, register file, etc.scalar data pathread/write memory access
Unit of SIMT execution: warpexecute same instruction/clockHardware handles thread scheduling and divergence transparently
Warps enable a friendly data-parallel programming model!
Device
Multiprocessor N
Multiprocessor 2
Multiprocessor 1
InstructionUnit
Processor 1 …Processor 2 Processor M
© NVIDIA Corporation 2007
Hardware Implementation:Memory Architecture
The device has local device memory
Can be read and written by the host and by the multiprocessors
Each multiprocessor has:A set of 32-bit registers per processor
on-chip shared memory
A read-only constant cache
A read-only texture cache
Device
Multiprocessor N
Multiprocessor 2
Multiprocessor 1
Device memory
Shared Memory
InstructionUnit
Processor 1
Registers
…Processor 2
Registers
Processor M
Registers
ConstantCache
TextureCache
© NVIDIA Corporation 2007
Hardware Implementation:Memory Model
Each thread can:Read/write per-block on-chip shared memory
Read per-grid cached constant memory
Read/write non-cached device memory:
Per-grid global memory
Per-thread local memory
Read cached texture memory
Grid
ConstantMemory
TextureMemory
GlobalMemory
Block (0, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
Block (1, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
© NVIDIA Corporation 2007
CUDA SDK
NVIDIA C Compiler
NVIDIA Assemblyfor Computing CPU Host Code
Integrated CPUand GPU C Source Code
Libraries:FFT, BLAS,…Example Source Code
CUDADriver
DebuggerProfiler
Standard C Compiler
GPU CPU
© NVIDIA Corporation 2007
CUDA: Features available to kernels
Standard mathematical functionssinf, powf, atanf, ceil, etc.
Built-in vector typesfloat4, int4, uint4, etc. for dimensions 1..4
Texture accesses in kernelstexture<float,2> my_texture; // declare texture reference
float4 texel = texfetch(my_texture, u, v);
© NVIDIA Corporation 2007
G8x CUDA = C with Extensions
Philosophy: provide minimal set of extensions necessary to expose power
Function qualifiers:__global__ void MyKernel() { }__device__ float MyDeviceFunc() { }
Variable qualifiers:__constant__ float MyConstantArray[32];__shared__ float MySharedArray[32];
Execution configuration:dim3 dimGrid(100, 50); // 5000 thread blocks dim3 dimBlock(4, 8, 8); // 256 threads per block MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel
Built-in variables and functions valid in device code:dim3 gridDim; // Grid dimensiondim3 blockDim; // Block dimensiondim3 blockIdx; // Block indexdim3 threadIdx; // Thread indexvoid __syncthreads(); // Thread synchronization
© NVIDIA Corporation 2007
CUDA: Runtime support
Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()
Explicit memory copy for host ↔ device, device ↔ device
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
© NVIDIA Corporation 2007
Example: Adding matrices w/ 2D grids
CPU C program CUDA C program
void addMatrix(float *a, float *b,
float *c, int N)
{
int i, j, index;
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
index = i + j * N;
c[index]=a[index] + b[index];
}
}
}
void main()
{
.....
addMatrix(a, b, c, N);
}
__global__ void addMatrix(float *a,float *b,
float *c, int N)
{
int i=blockIdx.x*blockDim.x+threadIdx.x;
int j=blockIdx.y*blockDim.y+threadIdx.y;
int index = i + j * N;
if ( i < N && j < N)
c[index]= a[index] + b[index];
}
void main()
{
..... // allocate & transfer data to GPU
dim3 dimBlk (blocksize, blocksize);
dim3 dimGrd (N/dimBlk.x, N/dimBlk.y);
addMatrix<<<dimGrd,dimBlk>>>(a, b, c,N);
}
© NVIDIA Corporation 2007
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
© NVIDIA Corporation 2007
Example: Invoking the Kernel
__global__ void vecAdd(float* A, float* B, float* C);
void main()
{
// Execute on N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
© NVIDIA Corporation 2007
Example: Host code for memory
// allocate host (CPU) memoryfloat* h_A = (float*) malloc(N * sizeof(float));float* h_B = (float*) malloc(N * sizeof(float));… initalize h_A and h_B …
// allocate device (GPU) memoryfloat* d_A, d_B, d_C;cudaMalloc( (void**) &d_A, N * sizeof(float));cudaMalloc( (void**) &d_B, N * sizeof(float));cudaMalloc( (void**) &d_C, N * sizeof(float));
// copy host memory to devicecudaMemcpy( d_A, h_A, N * sizeof(float),cudaMemcpyHostToDevice));cudaMemcpy( d_B, h_B, N * sizeof(float),cudaMemcpyHostToDevice));
// execute the kernel on N/256 blocks of 256 threads eachvecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
© NVIDIA Corporation 2007
A quick review
device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Memory Location Cached Access Who
Local Off-chip No Read/write One thread
Shared On-chip N/A - resident Read/write All threads in a block
Global Off-chip No Read/write All threads + host
Constant Off-chip Yes Read All threads + host
Texture Off-chip Yes Read All threads + host
Scan Literature
Pre-Hibernation
First proposed in APL by Iverson (1962)
Used as a data parallel primitive in the Connection Machine (1990)Feature of C* and CM-Lisp
Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here
Blelloch, 1990, “Prefix Sums and Their Applications”
Post-Democratization
O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)Applied to Summed Area Tables by Hensley et al. (EG05)
O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
O(n) work & space GPU implementation by Harris et al. (2007)NVIDIA CUDA SDK and GPU Gems 3
Applied to radix sort, stream compaction, and summed area tables
Parallel Reduction Complexity
Log(N) parallel steps, each step S does N/2S independent ops
Step Complexity is O(log N)
For N=2D, performs S[1..D]2D-S = N-1 operations
Work Complexity is O(N) – It is work-efficient
i.e. does not perform more operations than a sequential algorithm
With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
Unrolling Last Steps
Only one warp is active during the last few steps
Unroll them and remove unneeded __syncthreads()
for (unsigned int s = bd/2; s > 32; s >>= 1) {
if (t < s) { data[t] += data[t + s]; } __syncthreads(); } if (t < 32) data[t] += data[t + 32]; if (t < 16) data[t] += data[t + 16]; if (t < 8) data[t] += data[t + 8]; if (t < 4) data[t] += data[t + 4]; if (t < 2) data[t] += data[t + 2]; if (t < 1) data[t] += data[t + 1];
Unrolling the Loop Completely
When block size is known at compile time, we can completely unroll the loop
It often is, since the maximum thread block size of 512 constrains us
Use templates…
#define STEP(d) \ if (t < (d)) data[t] += data[t+(d));#define SYNC __syncthreads();
template <unsigned int bsize>__global__ void d_reduce(int *g_idata, int *g_odata){ ... if (bsize == 512) STEP(512) SYNC if (bsize >= 256) STEP(256) SYNC if (bsize >= 128) STEP(128) SYNC if (bsize >= 64) STEP(64) SYNC if (bsize >= 32) { STEP(32) STEP(16)
STEP(8) STEP(4) STEP(2)
STEP(1) }}
© NVIDIA Corporation 2007
Extreme Growth in Raw Data
Source: John Bates, NOAA Nat. Climate Center
NOAA Weather Data
Source: Alexa, YouTube 2006
YouTube Bandwidth Growth
Mill
ions
Source: Hedburg, CPI, Walmart
Walmart Transaction Tracking
Mill
ions
Source: Jim Farnsworth, BP May 2005
BP Oil and Gas Active Data
Ter
abyt
es
Pet
abyt
es
NOAA NASA Weather Data in Petabytes
0
10
20
30
40
50
60
70
80
90
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
© NVIDIA Corporation 2007
Computational Horsepower
GPU is a massively parallel computation engineHigh memory bandwidth (5-10x CPU)
High floating-point performance (5-10x CPU)
© NVIDIA Corporation 2007
Benchmarking: CPU vs. GPU Computing
G80 vs. Core2 Duo 2.66 GHzMeasured against commercial CPU benchmarks when possible
“Free” Massively Parallel Processors
It’s not science fiction, it’s just funded by them
Asst Master Chief Harvard
© NVIDIA Corporation 2007
Success Stories: Data to Design
Acceleware EM Field simulation technology for the GPU
3D Finite-Difference and Finite-Element (FDTD)
Modeling of:
Cell phone irradiation
MRI Design / Modeling
Printed Circuit Boards
Radar Cross Section (Military)
Pacemaker with Transmit Antenna10X
1X
4 GPUs2 GPUs1 GPUCPU3.2 GHz
0
100
200
300
400
500
600
700
Performance (Mcells/s)
20X
5X
© NVIDIA Corporation 2007
EvolvedMachines
130X Speed up
Simulate brain circuitry
Sensory computing: vision, olfactory
EvolvedMachines
© NVIDIA Corporation 2007
10X with MATLAB CPU+GPU
Pseudo-spectral simulation of 2D Isotropic turbulence
Matlab: Language of Science
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
http://developer.nvidia.com/object/matlab_cuda.html
© NVIDIA Corporation 2007
MATLAB Example:Advection of an elliptic vortex
Matlab 168 seconds
Matlab with CUDA(single precision FFTs)20 seconds
256x256 mesh, 512 RK4 steps, Linux, MATLAB filehttp://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m
© NVIDIA Corporation 2007
MATLAB Example:Pseudo-spectral simulation of 2D Isotropic turbulence
MATLAB 992 seconds
MATLAB with CUDA(single precision FFTs)93 seconds
512x512 mesh, 400 RK4 steps, Windows XP, MATLAB filehttp://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
© NVIDIA Corporation 2007
NAMD/VMD Molecular Dynamics
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
240X speedup
Computational biology
© NVIDIA Corporation 2007
Molecular Dynamics Example
Case study: molecular dynamics research at U. Illinois Urbana-Champaign
(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)
Next slides stolen from a nice description of problem, algorithms, and iterative optimization process available at:
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
© NVIDIA Corporation 2007
Molecular Modeling: Ion Placement
Biomolecular simulations attempt to replicate in vivo conditions in silico.Model structures are initially constructed in vacuumSolvent (water) and ions are added as necessary for the required biological conditionsComputational requirements scale with the size of the simulated structure
© NVIDIA Corporation 2007
Evolution of Ion Placement Code
First implementation was sequentialVirus structure with 10^6 atoms would require 10 CPU daysTuned for Intel C/C++ vectorization+SSE, ~20x speedupParallelized /w pthreads: high data parallelism = linear speedupParallelized GPU accelerated implementation: 3 GeForce 8800GTX cards outrun ~300 Itanium2 CPUs!Virus structure now runs in 25 seconds on 3 GPUs!Further speedups should still be possible…
© NVIDIA Corporation 2007
Multi-GPU CUDA Coulombic Potential Map Performance
Host: Intel Core 2 Quad, 8GB RAM, ~$3,000
3 GPUs: NVIDIA GeForce 8800GTX, ~$550 each
32-bit RHEL4 Linux (want 64-bit CUDA!!)
235 GFLOPS per GPU for current version of coulombic potential map kernel
705 GFLOPS total for multithreaded multi-GPU version Three GeForce 8800GTX GPUs
in a single machine, cost ~$4,650
© NVIDIA Corporation 2007
NVIDIA Professor Partnership
Support faculty research & teaching effortsSmall equipment gifts (1-2 GPUs) Significant discounts on GPU purchases
Especially Quadro, Tesla equipmentUseful for cost matching
Research contracts Small cash grants (typically ~$25K gifts)Medium-scale equipment donations (10-30 GPUs)
Informal proposals, reviewed quarterlyFocus areas: GPU computing, especially with an educational mission or component
http://www.nvidia.com/page/professor_partnership.html
Easy
Competitive