David Luebke, NVIDIA Research: GPU Architecture & Implications

Transcript
Page 1: David Luebke NVIDIA Research GPU Architecture & Implications.

David Luebke

NVIDIA Research

GPU Architecture & Implications

Page 2: David Luebke NVIDIA Research GPU Architecture & Implications.


GPU Architecture

CUDA provides a parallel programming model

The Tesla GPU architecture implements this

This talk will describe the characteristics, goals, and implications of that architecture

Page 3: David Luebke NVIDIA Research GPU Architecture & Implications.

G80 GPU Implementation: Tesla C870

681 million transistors, 470 mm² in 90 nm CMOS

128 thread processors, 518 GFLOPS peak, 1.35 GHz processor clock

1.5 GB DRAM, 76 GB/s peak, 800 MHz GDDR3 clock, 384-pin DRAM interface

ATX form factor card, PCI Express x16, 170 W max with DRAM

Page 4: David Luebke NVIDIA Research GPU Architecture & Implications.


G80 (launched Nov 2006)

128 Thread Processors execute kernel threads

Up to 12,288 parallel threads active

Per-block shared memory (PBSM) accelerates processing

Block Diagram Redux

[Diagram: the Host feeds an Input Assembler and Thread Execution Manager, which distribute work across eight groups of Thread Processors, each paired with per-block shared memory (PBSM); all groups load/store to Global Memory.]

Page 5: David Luebke NVIDIA Research GPU Architecture & Implications.


Streaming Multiprocessor (SM)

Processing elements: 8 scalar thread processors (SP), 32 GFLOPS peak at 1.35 GHz, 8192 32-bit registers (32 KB)

½ MB of total register file space!

usual ops: float, int, branch, …

Hardware multithreading: up to 8 blocks resident at once, up to 768 active threads in total

16 KB on-chip memory: low-latency storage, shared amongst the threads of a block, supports thread communication

[Diagram: an SM, with its multithreaded instruction unit (MT IU), SPs, shared memory, and resident threads t0, t1, …, tB]

Page 6: David Luebke NVIDIA Research GPU Architecture & Implications.

Goal: Scalability

Scalable execution: the program must be insensitive to the number of cores

Write one program for any number of SM cores

Program runs on any size GPU without recompiling

Hierarchical execution model: decompose the problem into sequential steps (kernels)

Decompose each kernel into parallel blocks

Decompose each block into parallel threads

Hardware distributes independent blocks to SMs as they become available (see the sketch below)
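A minimal sketch (not from the slides) of what that insensitivity looks like in CUDA source: the grid size is derived from the problem size n, never from the number of SMs, so the same program scales across GPU sizes. The names scaleKernel and launchScale are illustrative assumptions.

__global__ void scaleKernel(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= s;                               // each thread handles one element
}

void launchScale(float *d_data, float s, int n)
{
    int threadsPerBlock = 256;                                 // fixed block shape
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grows with n, not with SM count
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, s, n);
}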

Page 7: David Luebke NVIDIA Research GPU Architecture & Implications.

Blocks Run on Multiprocessors

Kernel launched by host

[Diagram: the host launches a kernel; its blocks are distributed across the device processor array of SMs, each with an MT IU, SPs, and shared memory, above a common Device Memory.]

Page 8: David Luebke NVIDIA Research GPU Architecture & Implications.

Goal: easy to program

Strategies: familiar programming language mechanics

C/C++ with small extensions

Simple parallel abstractions: simple barrier synchronization

Shared memory semantics

Hardware-managed hierarchy of threads (see the sketch below)
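A hedged sketch of those two abstractions (reverseBlock is an illustrative kernel, not from the talk): threads of a block stage data in __shared__ memory and meet at a barrier before reading one another's values.

__global__ void reverseBlock(float *data)
{
    __shared__ float tile[256];   // per-block shared memory; assumes blockDim.x <= 256
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];     // each thread writes one slot
    __syncthreads();              // barrier: all writes now visible to the block

    // safe to read a slot written by another thread of the same block
    data[base + t] = tile[blockDim.x - 1 - t];
}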

Page 9: David Luebke NVIDIA Research GPU Architecture & Implications.


Hardware Multithreading

Hardware allocates resources to blocks: blocks need thread slots, registers, and shared memory

blocks don't run until resources are available

Hardware schedules threads: threads have their own registers

any thread not waiting for something can run

context switching is (basically) free – every cycle

Hardware relies on threads to hide latency, i.e., parallelism is necessary for performance


Page 10: David Luebke NVIDIA Research GPU Architecture & Implications.

Goal: Performance per millimeter

For GPUs, performance == throughput

Strategy: hide latency with computation, not cache

Heavy multithreading – already discussed by Kevin

Implication: need many threads to hide latency

Occupancy: typically need 128 threads/SM minimum

Multiple thread blocks per SM are good to minimize the effect of barriers

Strategy: Single Instruction, Multiple Thread (SIMT) balances performance with ease of programming

Page 11: David Luebke NVIDIA Research GPU Architecture & Implications.


SIMT Thread Execution

Groups of 32 threads are formed into warps: always executing the same instruction, with shared instruction fetch/dispatch; some threads become inactive when code paths diverge, and the hardware handles divergence automatically (see the sketch below)

Warps are the primitive unit of scheduling: pick 1 of 24 warps for each instruction slot

SIMT execution is an implementation choice: sharing control logic leaves more space for ALUs; it is largely invisible to the programmer, who must understand it for performance, not correctness
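A small sketch of divergence (a hypothetical kernel): when threads of one warp take different paths, the hardware serializes the two paths and masks inactive threads, so the result is correct but the warp executes both sides.

__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)              // splits every warp: both paths run serially
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
    // branching on (i / 32) % 2 instead would keep each 32-thread warp coherent
}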


Page 12: David Luebke NVIDIA Research GPU Architecture & Implications.


SIMT Multithreaded Execution

Weaving: the original parallel thread technology, about 10,000 years old

Warp: a set of 32 parallel threads that execute a SIMD instruction

SM hardware implements zero-overhead warp and thread scheduling

Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads

Threads can execute independently: a SIMD warp automatically diverges and converges when threads branch

Best efficiency and performance when threads of a warp execute together

SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

[Diagram: the SM multithreaded instruction scheduler issues, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96]

Page 13: David Luebke NVIDIA Research GPU Architecture & Implications.


Memory Architecture

Direct load/store access to device memory, treated as the usual linear sequence of bytes (i.e., not pixels)

Texture & constant caches are read-only access paths

On-chip shared memory is shared amongst the threads of a block: important for communication amongst threads

provides low-latency temporary storage (~100x lower latency than DRAM)

[Diagram: each SM (MT IU, SPs, shared memory, instruction cache) accesses Device Memory directly for load/store and through read-only Texture and Constant caches; Host Memory connects over PCIe]

Page 14: David Luebke NVIDIA Research GPU Architecture & Implications.

Myths of GPU Computing

GPUs layer normal programs on top of graphics… NO: CUDA compiles directly to the hardware

GPU architectures are:

Very wide (1000s) SIMD machines… NO: warps are 32-wide

…on which branching is impossible or prohibitive… NOPE

…with 4-wide vector registers. NO: scalar thread processors

GPUs are power-inefficient… NO: 4-10x perf/W advantage, up to 89x reported for some studies

GPUs don’t do real floating point… NO: see the IEEE 754 comparison on the following pages

Page 24: David Luebke NVIDIA Research GPU Architecture & Implications.

GPU Floating Point Features (G80 / SSE / IBM Altivec / Cell SPE)

Precision: IEEE 754 / IEEE 754 / IEEE 754 / IEEE 754

Rounding modes for FADD and FMUL: round to nearest and round to zero / all 4 IEEE (nearest, zero, inf, -inf) / round to nearest only / round to zero (truncate) only

Denormal handling: flush to zero / supported, 1000’s of cycles / supported, 1000’s of cycles / flush to zero

NaN support: yes / yes / yes / no

Overflow and infinity support: yes, only clamps to max norm / yes / yes / no, infinity

Flags: no / yes / yes / some

Square root: software only / hardware / software only / software only

Division: software only / hardware / software only / software only

Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit / 12 bit

Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit / 12 bit

log2(x) and 2^x estimate accuracy: 23 bit / no / 12 bit / no

Page 25: David Luebke NVIDIA Research GPU Architecture & Implications.

Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754, comparable to other processors / accelerators

More precise / usable in some ways

Less precise in other ways

GPU FP is getting better every generation: double precision support shortly

Goal: best of class by 2009

Page 26: David Luebke NVIDIA Research GPU Architecture & Implications.

Questions?

David Luebke, [email protected]

Page 27: David Luebke NVIDIA Research GPU Architecture & Implications.

Applications & Sweet Spots

Page 28: David Luebke NVIDIA Research GPU Architecture & Implications.


GPU Computing Sweet Spots

Applications:

High arithmetic intensity:

Dense linear algebra, PDEs, n-body, finite difference, …

High bandwidth: Sequencing (virus scanning, genomics), sorting, database…

Visual computing: graphics, image processing, tomography, machine vision…

Page 29: David Luebke NVIDIA Research GPU Architecture & Implications.


GPU Computing Example Markets

Computational Modeling

Computational Chemistry

Computational Medicine

Computational Science

Computational Biology

Computational Finance

Computational Geoscience

Image Processing

Page 30: David Luebke NVIDIA Research GPU Architecture & Implications.


Applications - Condensed

3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, computational fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading, film, financial (lots of areas), languages, GIS, holographic cinema, imaging (lots), mathematics research, military (lots), mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, n-body, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray

Page 31: David Luebke NVIDIA Research GPU Architecture & Implications.


GPU Computing Sweet Spots

From cluster to workstation: the “personal supercomputing” phase change

From lab to clinic

From machine room to engineer, grad student desks

From batch processing to interactive

From interactive to real-time

GPU-enabled clusters

A 100x or better speedup changes the science

Solve at different scales

Direct brute-force methods may outperform cleverness

New bottlenecks may emerge

Approaches once inconceivable may become practical

Page 32: David Luebke NVIDIA Research GPU Architecture & Implications.


New Applications

Real-time options implied volatility engine

Swaption volatility cube calculator

Manifold 8 GIS

Ultrasound imaging

HOOMD Molecular Dynamics

Also… image rotation/classification, graphics processing toolbox, microarray data analysis, data-parallel primitives, astrophysics simulations

SDK: Mandelbrot, computer vision

Seismic migration

Page 33: David Luebke NVIDIA Research GPU Architecture & Implications.


The Future of GPUs

GPU Computing drives new applications, reducing “Time to Discovery”

100x Speedup changes science and research methods

New applications drive the future of GPUs and GPU Computing

Drives new GPU capabilities

Drives hunger for more performance

Some exciting new domains: Vision, acoustic, and embedded applications

Large-scale simulation & physics

Page 34: David Luebke NVIDIA Research GPU Architecture & Implications.

Accuracy & Performance


Page 37: David Luebke NVIDIA Research GPU Architecture & Implications.


CUDA Performance Advantages

Performance: BLAS1: 60+ GB/sec

BLAS3: 127 GFLOPS

FFT: 52 benchFFT* GFLOPS

FDTD: 1.2 Gcells/sec

SSEARCH: 5.2 Gcells/sec

Black Scholes: 4.7 GOptions/sec

VMD: 290 GFLOPS

How: leveraging shared memory

GPU memory bandwidth

GPU GFLOPS performance

Custom hardware intrinsics (see the sketch below):

__sinf(), __cosf(), __expf(), __logf(), …

All benchmarks are compiled code!
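A sketch of how the intrinsics named above appear in kernel code (wave is an illustrative kernel, not from the talk); the __-prefixed versions explicitly select the fast, reduced-precision hardware paths:

__global__ void wave(float *out, const float *phase, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(phase[i]) * __expf(-phase[i]);  // hardware intrinsics
}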

Page 38: David Luebke NVIDIA Research GPU Architecture & Implications.

GPGPU vs. GPU Computing

Page 39: David Luebke NVIDIA Research GPU Architecture & Implications.


Problem: GPGPU

OLD: GPGPU – trick the GPU into general-purpose computing by casting problem as graphics

Turn data into images (“texture maps”)

Turn algorithms into image synthesis (“rendering passes”)

Promising results, but: a tough learning curve, particularly for non-graphics experts

Potentially high overhead of graphics API

Highly constrained memory layout & access model

Need for many passes drives up bandwidth consumption

Page 40: David Luebke NVIDIA Research GPU Architecture & Implications.


Solution: CUDA

NEW: GPU Computing with CUDA

CUDA = Compute Unified Device Architecture

Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture

Software: program the GPU in C

General thread launch

Global load-store

Parallel data cache

Scalar architecture

Integers, bit operations

Double precision (soon)

Scalable data-parallel execution/memory model

C with minimal yet powerful extensions

Page 41: David Luebke NVIDIA Research GPU Architecture & Implications.


Graphics Programming Model

Graphics Application

Vertex Program

Rasterization

Fragment Program

Display

Page 42: David Luebke NVIDIA Research GPU Architecture & Implications.


Streaming GPGPU Programming

OpenGL Program to Add A and B

[Diagram: Vertex Program → Rasterization → Fragment Program → CPU reads texture memory for results]

Start by creating a quad

“Programs” created with raster operation

Write the answer to texture memory as a “color”

Read textures as input to the OpenGL shader program

All this just to do A + B

Page 43: David Luebke NVIDIA Research GPU Architecture & Implications.


What’s Wrong With GPGPU?

[Diagram: Application → Vertex Program → Rasterization → Pixel Program → Display; the pixel program reads input registers, constants, texture, and temp registers, and writes only output registers]

Page 44: David Luebke NVIDIA Research GPU Architecture & Implications.


What’s Wrong With GPGPU?

[Diagram: the same pipeline and fragment-program register model as the previous page]

APIs are specific to graphics

Limited texture size and dimension

Limited shader outputs

No scatter

Limited instruction set

No thread communication

Limited local storage

Page 45: David Luebke NVIDIA Research GPU Architecture & Implications.


Building a Better Pixel

[Diagram: a fragment program reads input registers, constants, texture, and registers, and writes output registers]

Page 46: David Luebke NVIDIA Research GPU Architecture & Implications.


Building a Better Pixel Thread

[Diagram: a thread program, with its thread number, reads constants, texture, and registers, and writes output registers]

Features

Millions of instructions

Full Integer and Bit instructions

No limits on branching, looping

1D, 2D, or 3D thread ID allocation

Page 47: David Luebke NVIDIA Research GPU Architecture & Implications.


Global Memory

[Diagram: the thread program, with its thread number, constants, texture, and registers, now reads and writes Global Memory directly]

Features

Fully general load/store to GPU memory

Untyped, not fixed texture types

Pointer support (see the sketch below)
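A minimal sketch of what this enables relative to streaming GPGPU: a scatter through an index array, via ordinary pointers into global memory (the names scatter and idx are illustrative).

__global__ void scatter(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];  // arbitrary write address: a scatter, impossible in the streaming model
}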

Page 48: David Luebke NVIDIA Research GPU Architecture & Implications.


Parallel Data Cache

[Diagram: thread programs access Global Memory and a shared Parallel Data Cache]

Features

Dedicated on-chip memory

Shared between threads for inter-thread communication

Explicitly managed

As fast as registers

Page 49: David Luebke NVIDIA Research GPU Architecture & Implications.


Example Algorithm - Fluids

Goal: calculate PRESSURE in a fluid

Pressure = sum of neighboring pressures: Pn’ = P1 + P2 + P3 + P4

Pressure depends on neighbors, so the pressure for each particle is:

Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10

Page 50: David Luebke NVIDIA Research GPU Architecture & Implications.


Example Fluid Algorithm

[Diagram comparing the three approaches:

CPU: a single thread out of cache; one control unit and ALU with cache and DRAM, fetching P1…P4 and summing them serially (Pn’ = P1+P2+P3+P4).

GPGPU: multiple passes through video memory; many control+ALU pairs each re-fetch P1, P2, P3, P4 from video memory to compute Pn’ = P1+P2+P3+P4.

GPU Computing with CUDA: parallel execution through cache; the thread execution manager feeds many ALUs that share P1…P5 in the parallel data cache (shared data), each computing Pn’ = P1+P2+P3+P4.]

Page 51: David Luebke NVIDIA Research GPU Architecture & Implications.


Parallel Data Cache

Bring the data closer to the ALU

Addresses a fundamental problem of stream computing:

• The data are far from the FLOPS; video RAM latency is high

• Threads can only communicate their results through this high-latency RAM

[Diagram: the GPGPU approach again: multiple passes through video memory, each ALU re-fetching P1, P2, P3, P4 to compute Pn’ = P1+P2+P3+P4]

Page 52: David Luebke NVIDIA Research GPU Architecture & Implications.


Parallel Data Cache

Parallel execution through cache

[Diagram: the CUDA approach again: the thread execution manager and parallel data cache let the ALUs share P1…P5 while each computes Pn’ = P1+P2+P3+P4]

Bring the data closer to the ALU

Stage computation for the parallel data cache

Minimize trips to external memory

Share values to minimize overfetch and computation

Increases arithmetic intensity by keeping data close to the processors

User-managed generic memory; threads read and write it arbitrarily (see the sketch below)
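A hedged sketch of staging the fluid example through the parallel data cache: each block loads its window of pressures once, then every thread sums its four inputs from on-chip memory instead of re-fetching them from DRAM. TILE and the indexing follow the slide's pattern (output k sums the four pressures p[2k]…p[2k+3], 0-indexed) but are assumptions, not code from the talk.

#define TILE 128  // assumes blockDim.x == TILE, n a multiple of TILE, and p holding 2*n + 2 values

__global__ void pressureSum(float *out, const float *p, int n)
{
    __shared__ float sp[2 * TILE + 2];    // block's window plus a two-element halo
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;  // one output per thread

    sp[2 * t]     = p[2 * i];             // cooperative load: each value fetched once
    sp[2 * t + 1] = p[2 * i + 1];
    if (t == 0) {                         // one thread loads the halo
        int e = 2 * (blockIdx.x + 1) * blockDim.x;
        sp[2 * blockDim.x]     = p[e];
        sp[2 * blockDim.x + 1] = p[e + 1];
    }
    __syncthreads();                      // all pressures staged on chip

    if (i < n)
        out[i] = sp[2*t] + sp[2*t + 1] + sp[2*t + 2] + sp[2*t + 3];
}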

Page 53: David Luebke NVIDIA Research GPU Architecture & Implications.


Streaming vs. GPU Computing

Streaming (GPGPU): gather in, restricted write

Memory is far from the ALU

No inter-element communication

GPU Computing with CUDA: a more general data-parallel model

Full Scatter / Gather

PDC brings the data closer to the ALU

App decides how to decompose the problem across threads

Share and communicate between threads to solve problems efficiently

Page 54: David Luebke NVIDIA Research GPU Architecture & Implications.

GPU Design

Page 55: David Luebke NVIDIA Research GPU Architecture & Implications.


CPU/GPU Parallelism

Moore’s Law gives you more and more transistors. What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible

Tactics:

– Cache (area limiting)

– Instruction/data prefetch

– Speculative execution

Limited by “perimeter” – communication bandwidth. …then add task parallelism… multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible

Tactics:

– Parallelism (1000s of threads)

– Pipelining

Limited by “area” – compute capability

Page 56: David Luebke NVIDIA Research GPU Architecture & Implications.


Background: Unified Design

[Diagram: a discrete design routes work through separate Shader A/B/C/D units with input and output buffers; a unified design runs all shader stages on a single Shader Core]

Page 57: David Luebke NVIDIA Research GPU Architecture & Implications.


Hardware Implementation: Collection of SIMT Multiprocessors

Each multiprocessor is a set of SIMT thread processors

Single Instruction, Multiple Thread

Each thread processor has: program counter, register file, etc.; a scalar data path; read/write memory access

Unit of SIMT execution: the warp; it executes the same instruction each clock, and hardware handles thread scheduling and divergence transparently

Warps enable a friendly data-parallel programming model!

[Diagram: a device of N multiprocessors, each containing processors 1…M and a shared instruction unit]

Page 58: David Luebke NVIDIA Research GPU Architecture & Implications.


Hardware Implementation: Memory Architecture

The device has local device memory

Can be read and written by the host and by the multiprocessors

Each multiprocessor has: a set of 32-bit registers per processor

on-chip shared memory

A read-only constant cache

A read-only texture cache

[Diagram: a device of N multiprocessors; each contains per-processor registers, shared memory, an instruction unit, and constant and texture caches, all above device memory]

Page 59: David Luebke NVIDIA Research GPU Architecture & Implications.


Hardware Implementation: Memory Model

Each thread can: read/write per-block on-chip shared memory

Read per-grid cached constant memory

Read/write non-cached device memory:

Per-grid global memory

Per-thread local memory

Read cached texture memory

[Diagram: a grid of blocks; each block has its own shared memory, and each thread its own registers and local memory; all blocks share the grid's global, constant, and texture memory]

Page 60: David Luebke NVIDIA Research GPU Architecture & Implications.

CUDA Programming

Page 61: David Luebke NVIDIA Research GPU Architecture & Implications.


CUDA SDK

[Diagram: integrated CPU and GPU C source code enters the NVIDIA C compiler, which emits NVIDIA assembly for computing (run on the GPU through the CUDA driver, with debugger and profiler) and CPU host code (built with a standard C compiler); libraries (FFT, BLAS, …) and example source code sit alongside]

Page 62: David Luebke NVIDIA Research GPU Architecture & Implications.


CUDA: Features available to kernels

Standard mathematical functions: sinf, powf, atanf, ceil, etc.

Built-in vector types: float4, int4, uint4, etc., for dimensions 1..4

Texture accesses in kernels:

texture<float,2> my_texture; // declare texture reference

float4 texel = texfetch(my_texture, u, v);

Page 63: David Luebke NVIDIA Research GPU Architecture & Implications.


G8x CUDA = C with Extensions

Philosophy: provide minimal set of extensions necessary to expose power

Function qualifiers:

__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:

__constant__ float MyConstantArray[32];
__shared__   float MySharedArray[32];

Execution configuration:

dim3 dimGrid(100, 50);  // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel

Built-in variables and functions valid in device code:

dim3 gridDim;   // Grid dimension
dim3 blockDim;  // Block dimension
dim3 blockIdx;  // Block index
dim3 threadIdx; // Thread index
void __syncthreads(); // Thread synchronization

Page 64: David Luebke NVIDIA Research GPU Architecture & Implications.


CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memory

cudaMalloc(), cudaFree()

Explicit memory copy for host ↔ device, device ↔ device

cudaMemcpy(), cudaMemcpy2D(), ...

Texture management

cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability

cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …

Page 65: David Luebke NVIDIA Research GPU Architecture & Implications.


Example: Adding matrices w/ 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    ..... // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}

Page 66: David Luebke NVIDIA Research GPU Architecture & Implications.


Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Page 67: David Luebke NVIDIA Research GPU Architecture & Implications.


Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
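The launch above assumes N is a multiple of 256. A hedged variant for arbitrary N (not from the slides) rounds the block count up and guards the tail inside the kernel:

__global__ void vecAddGuarded(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)                 // guard the final, partially filled block
        C[i] = A[i] + B[i];
}

// round the block count up so every element is covered
vecAddGuarded<<< (N + 255) / 256, 256 >>>(d_A, d_B, d_C, N);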

Page 68: David Luebke NVIDIA Research GPU Architecture & Implications.


Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
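The slide stops at the kernel launch; a minimal continuation under the same names (an assumption, not on the slide) copies the result back and releases both sets of buffers:

// copy the result back to the host
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

// release device and host memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);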

Page 69: David Luebke NVIDIA Research GPU Architecture & Implications.


A quick review

device = GPU = set of multiprocessors

Multiprocessor = set of processors & shared memory

Kernel = GPU program

Grid = array of thread blocks that execute a kernel

Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory    Location   Cached           Access       Who
Local     Off-chip   No               Read/write   One thread
Shared    On-chip    N/A - resident   Read/write   All threads in a block
Global    Off-chip   No               Read/write   All threads + host
Constant  Off-chip   Yes              Read         All threads + host
Texture   Off-chip   Yes              Read         All threads + host
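A short sketch of where the spaces in this table appear in CUDA source (an illustrative kernel; texture access is omitted, since it goes through the separate texture path shown earlier):

__constant__ float coef[4];  // constant memory: off-chip, cached; host fills it via cudaMemcpyToSymbol()

__global__ void spaces(float *out, const float *in)  // in/out point into global memory
{
    __shared__ float tile[256];  // shared memory: on-chip, per block; assumes blockDim.x <= 256
    float acc = 0.0f;            // a register (or per-thread local memory if spilled)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];   // global -> shared
    __syncthreads();

    for (int k = 0; k < 4; ++k)
        acc += coef[k] * tile[threadIdx.x];
    out[i] = acc;                // register -> global
}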

Page 70: David Luebke NVIDIA Research GPU Architecture & Implications.

Data-Parallel Programming

Page 71: David Luebke NVIDIA Research GPU Architecture & Implications.

Scan Literature

Pre-Hibernation

First proposed in APL by Iverson (1962)

Used as a data-parallel primitive in the Connection Machine (1990); a feature of C* and CM-Lisp

Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here

Blelloch, 1990, “Prefix Sums and Their Applications”

Post-Democratization

O(n log n)-work GPU implementation by Daniel Horn (GPU Gems 2); applied to summed area tables by Hensley et al. (EG05)

O(n)-work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)

O(n) work & space GPU implementation by Harris et al. (2007), in the NVIDIA CUDA SDK and GPU Gems 3

Applied to radix sort, stream compaction, and summed area tables

Page 72: David Luebke NVIDIA Research GPU Architecture & Implications.

Parallel Reduction Complexity

Log(N) parallel steps; each step S does N/2^S independent ops

Step complexity is O(log N)

For N = 2^D, the reduction performs Σ_{S=1..D} 2^(D−S) = N − 1 operations

Work complexity is O(N) – it is work-efficient

i.e., it does not perform more operations than a sequential algorithm

With P threads physically in parallel (P processors), time complexity is O(N/P + log N)

Compare to O(N) for sequential reduction (a baseline kernel is sketched below)
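For reference, a baseline shared-memory reduction kernel of the kind these bounds describe and the next pages optimize; a sketch using the same names (data, t, bd) as the loop on the next page, launched as reduce<<<blocks, bd, bd * sizeof(int)>>>(…):

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int data[];            // bd ints, supplied at launch
    unsigned int t  = threadIdx.x;
    unsigned int bd = blockDim.x;

    data[t] = g_idata[blockIdx.x * bd + t];  // load one element per thread
    __syncthreads();

    for (unsigned int s = bd / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];          // step S does N/2^S independent adds
        __syncthreads();
    }
    if (t == 0)
        g_odata[blockIdx.x] = data[0];       // one partial sum per block
}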

Page 73: David Luebke NVIDIA Research GPU Architecture & Implications.

Unrolling Last Steps

Only one warp is active during the last few steps

Unroll them and remove unneeded __syncthreads()

for (unsigned int s = bd/2; s > 32; s >>= 1) {
    if (t < s) { data[t] += data[t + s]; }
    __syncthreads();
}
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8)  data[t] += data[t + 8];
if (t < 4)  data[t] += data[t + 4];
if (t < 2)  data[t] += data[t + 2];
if (t < 1)  data[t] += data[t + 1];

Page 74: David Luebke NVIDIA Research GPU Architecture & Implications.

Unrolling the Loop Completely

When block size is known at compile time, we can completely unroll the loop

It often is, since the maximum thread block size of 512 constrains us

Use templates…

#define STEP(d) if (t < (d)) data[t] += data[t + (d)];
#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{
    ...
    if (bsize >= 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >=  64) STEP(64)  SYNC
    if (bsize >=  32) {
        STEP(32) STEP(16) STEP(8) STEP(4) STEP(2) STEP(1)
    }
}
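A template kernel must be instantiated at a compile-time block size, so the host typically dispatches on the runtime value; a sketch (launch_reduce is an illustrative name, not from the slides):

void launch_reduce(int *g_idata, int *g_odata, int blocks, int threads)
{
    size_t smem = threads * sizeof(int);  // dynamic shared memory per block
    switch (threads) {
    case 512: d_reduce<512><<<blocks, 512, smem>>>(g_idata, g_odata); break;
    case 256: d_reduce<256><<<blocks, 256, smem>>>(g_idata, g_odata); break;
    case 128: d_reduce<128><<<blocks, 128, smem>>>(g_idata, g_odata); break;
    case  64: d_reduce< 64><<<blocks,  64, smem>>>(g_idata, g_odata); break;
    }
}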

Page 75: David Luebke NVIDIA Research GPU Architecture & Implications.

GPU Computing Motivation

Page 76: David Luebke NVIDIA Research GPU Architecture & Implications.


Computing Challenge

[Graphic contrasting Task Computing and Data Computing]

Page 77: David Luebke NVIDIA Research GPU Architecture & Implications.


Extreme Growth in Raw Data

[Charts: NOAA/NASA weather data in petabytes, 2002–2017 (source: John Bates, NOAA Nat. Climate Center); YouTube bandwidth growth, in millions (source: Alexa, YouTube 2006); Walmart transaction tracking, in millions (source: Hedburg, CPI, Walmart); BP oil and gas active data, in terabytes (source: Jim Farnsworth, BP May 2005)]

Page 78: David Luebke NVIDIA Research GPU Architecture & Implications.


Computational Horsepower

The GPU is a massively parallel computation engine:

High memory bandwidth (5-10x CPU)

High floating-point performance (5-10x CPU)

Page 79: David Luebke NVIDIA Research GPU Architecture & Implications.


Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHz, measured against commercial CPU benchmarks when possible

Page 80: David Luebke NVIDIA Research GPU Architecture & Implications.

“Free” Massively Parallel Processors

It’s not science fiction, it’s just funded by them


Page 81: David Luebke NVIDIA Research GPU Architecture & Implications.

Success Stories

Page 82: David Luebke NVIDIA Research GPU Architecture & Implications.


Success Stories: Data to Design

Acceleware EM Field simulation technology for the GPU

3D Finite-Difference and Finite-Element (FDTD)

Modeling of:

Cell phone irradiation

MRI Design / Modeling

Printed Circuit Boards

Radar Cross Section (Military)

Pacemaker with Transmit Antenna

[Chart: performance in Mcells/s, scale 0–700, for a 3.2 GHz CPU (1X) vs. 1 GPU (5X), 2 GPUs (10X), and 4 GPUs (20X)]

Page 83: David Luebke NVIDIA Research GPU Architecture & Implications.


EvolvedMachines

130X speedup

Simulates brain circuitry

Sensory computing: vision, olfactory

Page 84: David Luebke NVIDIA Research GPU Architecture & Implications.


10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D Isotropic turbulence

Matlab: Language of Science

http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

http://developer.nvidia.com/object/matlab_cuda.html

Page 85: David Luebke NVIDIA Research GPU Architecture & Implications.


MATLAB Example: Advection of an Elliptic Vortex

MATLAB: 168 seconds

MATLAB with CUDA (single-precision FFTs): 20 seconds

256x256 mesh, 512 RK4 steps, Linux; MATLAB file: http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m

Page 86: David Luebke NVIDIA Research GPU Architecture & Implications.


MATLAB Example: Pseudo-Spectral Simulation of 2D Isotropic Turbulence

MATLAB: 992 seconds

MATLAB with CUDA (single-precision FFTs): 93 seconds

512x512 mesh, 400 RK4 steps, Windows XP; MATLAB file: http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

Page 87: David Luebke NVIDIA Research GPU Architecture & Implications.


NAMD/VMD Molecular Dynamics

http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

240X speedup

Computational biology

Page 88: David Luebke NVIDIA Research GPU Architecture & Implications.


Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign

(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)

Next slides stolen from a nice description of problem, algorithms, and iterative optimization process available at:

http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

Page 89: David Luebke NVIDIA Research GPU Architecture & Implications.


Page 90: David Luebke NVIDIA Research GPU Architecture & Implications.


Molecular Modeling: Ion Placement

Biomolecular simulations attempt to replicate in vivo conditions in silico.

Model structures are initially constructed in vacuum

Solvent (water) and ions are added as necessary for the required biological conditions

Computational requirements scale with the size of the simulated structure

Page 91: David Luebke NVIDIA Research GPU Architecture & Implications.


Evolution of Ion Placement Code

First implementation was sequential: a virus structure with 10^6 atoms would require 10 CPU days

Tuned for Intel C/C++ vectorization + SSE: ~20x speedup

Parallelized with pthreads: high data parallelism = linear speedup

Parallelized, GPU-accelerated implementation: 3 GeForce 8800GTX cards outrun ~300 Itanium2 CPUs!

The virus structure now runs in 25 seconds on 3 GPUs!

Further speedups should still be possible…

Page 92: David Luebke NVIDIA Research GPU Architecture & Implications.


Multi-GPU CUDA Coulombic Potential Map Performance

Host: Intel Core 2 Quad, 8GB RAM, ~$3,000

3 GPUs: NVIDIA GeForce 8800GTX, ~$550 each

32-bit RHEL4 Linux (want 64-bit CUDA!!)

235 GFLOPS per GPU for current version of coulombic potential map kernel

705 GFLOPS total for the multithreaded multi-GPU version

Three GeForce 8800GTX GPUs in a single machine, cost ~$4,650

Page 93: David Luebke NVIDIA Research GPU Architecture & Implications.

Professor Partnership

Page 94: David Luebke NVIDIA Research GPU Architecture & Implications.


NVIDIA Professor Partnership

Support faculty research & teaching efforts

Easy:

Small equipment gifts (1-2 GPUs)

Significant discounts on GPU purchases, especially Quadro and Tesla equipment; useful for cost matching

Competitive:

Research contracts: small cash grants (typically ~$25K gifts), medium-scale equipment donations (10-30 GPUs)

Informal proposals, reviewed quarterly

Focus areas: GPU computing, especially with an educational mission or component

http://www.nvidia.com/page/professor_partnership.html