
Vivek Sarkar

Department of Computer Science, Rice University

[email protected]

September 24, 2007

COMP 635: Seminar on Heterogeneous Processors

Lecture 4: Introduction to General-Purpose computation on GPUs (GPGPUs)

www.cs.rice.edu/~vsarkar/comp635

Slide 2: Announcements

• Acknowledgments
— Wen-mei Hwu & David Kirk, UIUC ECE 498 AL1 course, “Programming Massively Parallel Processors”
– http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html
— Dana Schaa, “Using CUDA for High Performance Scientific Computing”
– http://www.ece.neu.edu/~dschaa/files/dschaa_cuda.ppt

• Class TA: Raghavan Raman

• Reading list for next lecture (10/1) --- volunteers needed to lead discussion!
1. “Scan Primitives for GPU Computing”, S. Sengupta et al., Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware
– http://graphics.idav.ucdavis.edu/publications/func/return_pdf?pub_id=915
2. “EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system”, P. Wang et al., PLDI 2007
– http://doi.acm.org/10.1145/1250734.1250753

• Additional references
— NVIDIA CUDA Programming Guide, NVIDIA, 2007
– http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
— Hubert Nguyen, GPU Gems 3, Addison Wesley, 2007

• First Workshop on General Purpose Processing on Graphics Processing Units, October 4, 2007, Boston
— For details, see http://www.ece.neu.edu/GPGPU/
— Contact me if you’re interested in attending, either to work on a class project or to give a summary report back to the class


Slide 3: Why GPUs?

• Two major trends
1. Increasing performance gap relative to mainstream CPUs
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
2. Availability of more general (non-graphics) programming interfaces

— GPU in every PC and workstation – massive volume and potential impact

Slide 4: What is GPGPU?

• General Purpose computation using GPU in applications other than 3D graphics
— GPU accelerates critical path of application

• Data parallel algorithms leverage GPU attributes
— Large data arrays, streaming throughput
— Fine-grain SIMD parallelism
— Low-latency floating point (FP) computation

• Applications – see GPGPU.org
— Game effects (FX) physics, image processing
— Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting


Slide 5: Nvidia GeForce 8800 GTX (a.k.a. G80)

• The device is a set of 16 multiprocessors

• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture – shared instruction unit

• Each multiprocessor has:
— 32 32-bit registers per processor
— 16KB on-chip shared memory per multiprocessor
— A read-only constant cache
— A read-only texture cache

[Figure: device block diagram – multiprocessors 1..N, each containing processors 1..M with per-processor registers, a shared instruction unit, shared memory, a constant cache, and a texture cache, all connected to device memory.]

Slide 6: CUDA Taxonomy

• CUDA = Compute Unified Device Architecture

• Device = GPU, Host = CPU, Kernel = GPU program

• Thread = instance of a kernel program

• Warp = group of threads that execute in SIMD mode
— Maximum warp size for the Nvidia G80 is 32 threads

• Thread block = group of warps (all warps in the same block must be of equal size)
— One thread block at a time is assigned to a multiprocessor
— Each warp contains threads of consecutive, increasing thread indices, with the first warp containing thread 0
— Maximum block size for the Nvidia G80 is 16 warps

• Grid = array of thread blocks
— Blocks within a grid cannot be synchronized
— Nvidia G80 has 16 multiprocessors
– need a minimum of 16 blocks (2^4 * 2^4 * 2^5 = 8K threads) to fully utilize the device?
— A multiprocessor can hold multiple blocks if resources (registers, thread space, shared memory) permit
— Maximum of 64K threads permitted per grid
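A minimal host-side sketch (not from the slides) of how these limits combine on the G80; the kernel name and vector length are hypothetical:

    // Hypothetical kernel: one thread per element of a length-n vector.
    __global__ void scale(float* data, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= alpha;
    }

    void launch_scale(float* d_data, float alpha, int n)
    {
        // 256 threads per block = 8 warps of 32 threads (within the 16-warp block limit),
        // and with enough blocks each of the 16 multiprocessors gets work.
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        scale<<<grid, block>>>(d_data, alpha, n);
    }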


Slide 7: CUDA Taxonomy (contd.)

[Figure: a grid drawn as an array of blocks, each block containing several warps (W).]

Slide 8: Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
— All threads share data memory space

• A thread block is a batch of threads that can cooperate with each other by:
— Synchronizing their execution
– For hazard-free shared memory accesses
— Efficiently sharing data through a low-latency shared memory

• Two threads from two different blocks cannot cooperate

[Figure: the host launches Kernel 1 on Grid 1 (a 3×2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded to show its 5×3 array of threads. Courtesy: NVIDIA.]


Slide 9: Block and Thread IDs

• Threads and blocks have IDs
— So each thread can decide what data to work on
— Block ID: 1D or 2D
— Thread ID: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data
— Image processing
— Solving PDEs on volumes
— …

[Figure: same grid/block/thread diagram as on Slide 8 (Grid 1 of blocks, with Block (1,1) expanded into its threads). Courtesy: NVIDIA.]
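A minimal sketch (not from the slides) of the addressing this enables; each thread uses its 2D block and thread IDs to pick one pixel of a row-major image (names hypothetical):

    __global__ void brighten(float* image, int width, int height, float delta)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            image[y * width + x] += delta;
    }

    // Typical launch: a 2D grid of 2D blocks covering the image, e.g.
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   brighten<<<grid, block>>>(d_image, width, height, 0.1f);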

Slide 10: Device Memory Space Overview

• Each thread can:
— R/W per-thread registers
— R/W per-thread local memory
— R/W per-block shared memory
— R/W per-grid global memory
— Read only per-grid constant memory
— Read only per-grid texture memory

[Figure: CUDA memory spaces – each thread has registers and local memory; each block has shared memory; all blocks in the grid share global, constant, and texture memory; the host sits outside the device.]

• The host can R/W global, constant, and texture memories

• These memory spaces are persistent across kernels called by the same application.


Slide 11: CUDA Host-Device Data Transfer

• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

• Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of
— cudaMemcpyHostToHost
— cudaMemcpyHostToDevice
— cudaMemcpyDeviceToHost
— cudaMemcpyDeviceToDevice

• The memory areas may not overlap

• Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in undefined behavior.

• Synchronous in CUDA 1.0 --- will become asynchronous in a future CUDA version?

[Figure: same device memory-space diagram as Slide 10, with the host connected to global, constant, and texture memory.]
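A minimal host-side sketch of the transfer pattern above (array length and names are hypothetical):

    #include <cuda_runtime.h>

    void round_trip(float* h_data, int n)
    {
        size_t bytes = n * sizeof(float);
        float* d_data = 0;

        cudaMalloc((void**)&d_data, bytes);                         // allocate device (global) memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device
        // ... launch kernels that operate on d_data ...
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
        cudaFree(d_data);
    }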

Slide 12: Access Times

• Register – dedicated HW - single cycle

• Shared Memory – dedicated HW - single cycle

• Local Memory – DRAM, no cache - *slow*

• Global Memory – DRAM, no cache - *slow*

• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality

• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality

• Instruction Memory (invisible) – DRAM, cached


Slide 13: CUDA Variable Declarations

• __device__ is optional when used with __local__, __shared__, or __constant__

• Automatic variables without any qualifier reside in a register
— Except arrays that reside in local memory

• Pointers can only point to memory allocated or declared in global memory:
— Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
— Obtained as the address of a global variable: float* ptr = &GlobalVar;

Declaration                              | Memory   | Scope  | Lifetime
__device__ __local__    int LocalVar;    | local    | thread | thread
__device__ __shared__   int SharedVar;   | shared   | block  | block
__device__              int GlobalVar;   | global   | grid   | application
__device__ __constant__ int ConstantVar; | constant | grid   | application
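A minimal sketch (not from the slides) using the qualifiers from the table; it assumes a launch with 256 threads per block, and the variable names are hypothetical:

    __device__ __constant__ float coeff[16];   // constant memory: grid scope, application lifetime
    __device__              float bias;        // global memory:   grid scope, application lifetime

    __global__ void example(float* out)        // out points into global memory allocated by the host
    {
        __shared__ float tile[256];            // shared memory: block scope, block lifetime
                                               // (__device__ is optional with __shared__, per the slide)
        int idx = threadIdx.x;                 // automatic scalar: resides in a register
        tile[idx] = coeff[idx % 16] + bias;
        out[blockIdx.x * blockDim.x + idx] = tile[idx];
    }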

Slide 14: CUDA Function Declarations

Declaration                    | Executed on the: | Only callable from the:
__host__   float HostFunc()    | host             | host
__global__ void  KernelFunc()  | device           | host
__device__ float DeviceFunc()  | device           | device

• __global__ defines a kernel function
— Must return void

• __device__ and __host__ can be used together

• __device__ functions cannot have their address taken

• For functions executed on the device:
— No recursion
— No static variable declarations inside the function
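A minimal sketch (not from the slides) combining the qualifiers; function names are hypothetical:

    __device__ float square(float x)                 // runs on the device, callable only from device code
    {
        return x * x;
    }

    __global__ void square_all(float* data, int n)   // kernel: runs on the device, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    __host__ __device__ float clamp01(float x)       // __host__ and __device__ together: compiled for both
    {
        return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
    }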


Slide 15: Invoking a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50); // 5000 thread blocks

dim3 DimBlock(4, 8, 8); // 256 threads per block

size_t SharedMemBytes = 64; // 64 bytes of shared memory

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synch needed for blocking

• cudaThreadSynchronize() explicitly forces the runtime to wait until all preceding device tasks have finished
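A minimal usage sketch of the configuration above together with explicit synchronization; the kernel body is assumed to be defined elsewhere:

    #include <cuda_runtime.h>

    __global__ void KernelFunc(float* data);    // defined elsewhere

    void run(float* d_data)
    {
        dim3 DimGrid(100, 50);                  // 5000 thread blocks
        dim3 DimBlock(4, 8, 8);                 // 256 threads per block
        size_t SharedMemBytes = 64;             // 64 bytes of dynamically allocated shared memory

        KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(d_data);  // returns immediately
        cudaThreadSynchronize();                // block the host until the kernel has finished
    }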

Slide 16: Language Extensions: Built-in Types & Variables

• [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
— Structures accessed with x, y, z, w fields:
uint4 param;
int y = param.y;

• dim3

— Based on uint3
— Used to specify dimensions

• dim3 gridDim;

— Dimensions of the grid in blocks (gridDim.z unused)

• dim3 blockDim;

— Dimensions of the block in threads

• dim3 blockIdx;
— Block index within the grid

• dim3 threadIdx;

— Thread index within the block
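A minimal sketch (not from the slides) that reads all four built-in variables; the kernel name is hypothetical:

    __global__ void stride_copy(float* dst, const float* src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D thread index
        int total = gridDim.x * blockDim.x;              // total threads in the grid
        for (; i < n; i += total)                        // grid-stride loop: covers n > total
            dst[i] = src[i];
    }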


Slide 17: Host Runtime Component

• Provides functions to deal with:
— Device management (including multi-device systems)
— Memory management
— Error handling

• Initializes the first time a runtime function is called

• A host thread can invoke device code on only one device
— Multiple host threads required to run on multiple devices
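A minimal device-management sketch using CUDA runtime calls (cudaGetDeviceCount, cudaSetDevice, cudaGetLastError); the choice of device 0 is arbitrary:

    #include <cuda_runtime.h>
    #include <stdio.h>

    void pick_device(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);              // how many CUDA-capable devices are present
        printf("%d CUDA device(s) found\n", count);

        cudaSetDevice(0);                        // bind this host thread to device 0

        cudaError_t err = cudaGetLastError();    // error handling via the runtime
        if (err != cudaSuccess)
            printf("CUDA error: %s\n", cudaGetErrorString(err));
    }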

Slide 18: Device Mathematical Functions

• Some mathematical functions (e.g. sin(x)) have a less accurate, but faster, device-only version (e.g. __sin(x))
— __pow
— __log, __log2, __log10
— __exp
— __sin, __cos, __tan
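A minimal sketch contrasting the two versions; note that the CUDA Programming Guide spells the single-precision device intrinsics with an f suffix (__sinf, __expf, ...), which is the spelling used below:

    __global__ void wave(float* out, int n, float freq)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            float x = freq * (float)i;
            out[i] = sinf(x);        // standard single-precision version: more accurate
            // out[i] = __sinf(x);   // device-only intrinsic: faster, less accurate
        }
    }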


Slide 19: Device Synchronization Function

• void __syncthreads();

• Synchronizes all threads in a block

• Once all threads have reached this point, execution resumes normally

• Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory

• Allowed in conditional constructs only if the conditional is uniform across the entire thread block
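A minimal sketch (not from the slides) of the hazard the barrier prevents: each thread writes one shared-memory element and then reads an element written by a different thread, so __syncthreads() must separate the write from the read. It assumes 256 threads per block and a hypothetical kernel name:

    __global__ void reverse_tile(float* data)
    {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = data[base + t];                   // write my element of the tile
        __syncthreads();                            // wait until the whole tile is populated
        data[base + t] = tile[blockDim.x - 1 - t];  // read an element written by another thread
    }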

Slide 20: Extended C (Summary)

• Declspecs
— global, device, shared, local, constant

• Keywords
— threadIdx, blockIdx

• Intrinsics
— __syncthreads

• Runtime API
— Memory, symbol, execution management

• Function launch

__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>((float *)myimage);


Slide 21

[Figure: compilation flow for Extended C – the integrated source (foo.cu) is processed by cudacc (EDG C/C++ frontend, Open64 Global Optimizer); the CPU host code (foo.cpp) is compiled by gcc / cl, while the GPU assembly (foo.s) is translated by OCG into G80 SASS (foo.sass).]

Slide 22: Matrix Multiply Example in CUDA (Host Code)

Source: NVIDIA Compute Unified Device Architecture Programming Guide, Version 1.0
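The slide shows the host-code listing from the guide (not reproduced here). A simplified sketch of the same pattern for square N×N matrices (untiled; names hypothetical, not the guide's exact code):

    #include <cuda_runtime.h>

    // Forward declaration of the kernel sketched on the next slide.
    __global__ void MatMulKernel(const float* A, const float* B, float* C, int N);

    void MatMul(const float* h_A, const float* h_B, float* h_C, int N)
    {
        size_t bytes = N * N * sizeof(float);
        float *d_A, *d_B, *d_C;

        cudaMalloc((void**)&d_A, bytes);
        cudaMalloc((void**)&d_B, bytes);
        cudaMalloc((void**)&d_C, bytes);

        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

        dim3 block(16, 16);                                   // 256 threads per block
        dim3 grid((N + block.x - 1) / block.x,
                  (N + block.y - 1) / block.y);
        MatMulKernel<<<grid, block>>>(d_A, d_B, d_C, N);

        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);  // synchronous copy also waits for the kernel

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }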


Slide 23: Matrix Multiply Example in CUDA (Device Code)
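The device-code listing is likewise not reproduced here. A simplified, untiled sketch in which each thread computes one element of C (the guide's version additionally stages tiles of A and B through shared memory):

    __global__ void MatMulKernel(const float* A, const float* B, float* C, int N)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < N && col < N)
        {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)               // dot product of row of A with column of B
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }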

Slide 24: Ideal CUDA Programs

• High intrinsic parallelism
— per-pixel or per-element operations

• Minimal communication (if any) between threads
— limited synchronization

• High ratio of arithmetic to memory operations

• Few control flow statements
— Divergent execution paths among threads from the same warp must be serialized (costly)
— The compiler may replace conditional instructions by predicated instructions to reduce divergence
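A minimal sketch (not from the slides) of the divergence concern; the kernel names are hypothetical. In the first kernel, threads within a warp take different paths, so both paths execute serially; in the second, the branch condition is uniform per warp (assuming blockDim.x is a multiple of 32), so there is no intra-warp divergence:

    __global__ void divergent(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)              // even and odd lanes of the same warp diverge
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }

    __global__ void warp_uniform(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0)       // whole warps take the same path
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }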


Slide 25: Sample GPU Applications

Application | Description                                                                         | Source | Kernel | % time
FDTD        | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation      | 1,365  | 93     | 16%
MRI-Q       | Computing a matrix Q, a scanner’s configuration in MRI reconstruction              | 490    | 33     | >99%
TRACF       | Two Point Angular Correlation Function                                              | 536    | 98     | 96%
SAXPY       | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952    | 31     | >99%
PNS         | Petri Net simulation of a distributed system                                        | 322    | 160    | >99%
RPES        | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion                  | 1,104  | 281    | 99%
FEM         | Finite element modeling, simulation of 3D graded materials                          | 1,874  | 146    | 99%
RC5-72      | Distributed.net RC5-72 challenge client code                                        | 1,979  | 218    | >99%
LBM         | SPEC ’06 version, change to single precision and print fewer reports               | 1,481  | 285    | >99%
H.264       | SPEC ’06 version, change in guess vector                                            | 34,811 | 194    | 35%

Slide 26: Performance of Sample Kernels and Applications

• GeForce 8800 GTX vs. 2.2 GHz Opteron 248

• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads

• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized

• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel

Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt