COMP 635: Seminar on Heterogeneous Processors
Lecture 4: Introduction to General-Purpose Computation on GPUs (GPGPUs)
Vivek Sarkar, Department of Computer Science, Rice University
[email protected]
September 24, 2007
www.cs.rice.edu/~vsarkar/comp635

Announcements
• Acknowledgments
  — Wen-mei Hwu & David Kirk, UIUC ECE 498 AL1 course, “Programming Massively Parallel Processors”
    – http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html
  — Dana Schaa, “Using CUDA for High Performance Scientific Computing”
    – http://www.ece.neu.edu/~dschaa/files/dschaa_cuda.ppt
• Class TA: Raghavan Raman
• Reading list for next lecture (10/1) --- volunteers needed to lead discussion!
  1. “Scan Primitives for GPU Computing”, S. Sengupta et al., Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware
     – http://graphics.idav.ucdavis.edu/publications/func/return_pdf?pub_id=915
  2. “EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system”, P. Wang et al., PLDI 2007
     – http://doi.acm.org/10.1145/1250734.1250753
• Additional references
  — NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007
    – http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
  — Hubert Nguyen, GPU Gems 3, Addison Wesley, 2007
• First Workshop on General Purpose Processing on Graphics Processing Units, October 4, 2007, Boston
  — For details, see http://www.ece.neu.edu/GPGPU/
  — Contact me if you’re interested in attending, either to work on a class project or to give a summary report back to the class
Why GPUs?
• Two major trends
  1. Increasing performance gap relative to mainstream CPUs
     – Calculation: 367 GFLOPS vs. 32 GFLOPS
     – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  2. Availability of more general (non-graphics) programming interfaces
• GPU in every PC and workstation – massive volume and potential impact
What is GPGPU?
• General-purpose computation using the GPU in applications other than 3D graphics
  — The GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  — Large data arrays, streaming throughput
  — Fine-grain SIMD parallelism
  — Low-latency floating point (FP) computation
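As a concrete illustration of this data-parallel style (not from the slides), the following minimal CUDA sketch assigns one thread per array element; the kernel name and parameters are made up for the example.

// Hypothetical data-parallel kernel: each thread updates one array element.
__global__ void scaleAdd(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element index
    if (i < n)                                       // guard threads past the end of the array
        y[i] = a * x[i] + y[i];
}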
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture – shared instruction unit
• Each multiprocessor has:
  — 32 32-bit registers per processor
  — 16KB on-chip shared memory per multiprocessor
  — A read-only constant cache
  — A read-only texture cache
[Figure: hardware model – a Device contains Multiprocessors 1..N; each multiprocessor contains Processors 1..M with per-processor Registers, a shared Instruction Unit, on-chip Shared Memory, a Constant Cache, and a Texture Cache, all backed by Device memory.]
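To relate these per-multiprocessor resources to an actual device, the host can query them through the CUDA runtime, as in the sketch below; cudaGetDeviceProperties is a standard runtime call, but the exact set of fields available depends on the toolkit version (an assumption worth checking against the installed CUDA release).

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    printf("Device:                  %s\n", prop.name);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
    printf("Warp size:               %d\n", prop.warpSize);
    return 0;
}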
CUDA Taxonomy
• CUDA = Compute Unified Device Architecture
• Device = GPU, Host = CPU, Kernel = GPU program
• Thread = instance of a kernel program
• Warp = group of threads that execute in SIMD mode
  — Maximum warp size for the NVIDIA G80 is 32 threads
• Thread block = group of warps (all warps in the same block must be of equal size)
  — One thread block at a time is assigned to a multiprocessor
  — Each warp contains threads of consecutive, increasing thread indices, with the first warp containing thread 0
  — Maximum block size for the NVIDIA G80 is 16 warps
• Grid = array of thread blocks
  — Blocks within a grid cannot be synchronized
  — The NVIDIA G80 has 16 multiprocessors
    – need a minimum of 16 blocks (2^4 * 2^4 * 2^5 = 8K threads) to fully utilize the device?
  — A multiprocessor can hold multiple blocks if resources (registers, thread space, shared memory) permit
  — Maximum of 64K threads permitted per grid
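Under the G80 limits quoted above (32-thread warps, at most 16 warps per block), a host-side launch might be sized as in the sketch below; the kernel name and problem size are placeholders, not from the slides.

// Hypothetical launch configuration for a G80-class device.
__global__ void myKernel(float *data, int n);                     // placeholder kernel

void launch(float *d_data, int n)
{
    const int threadsPerBlock = 16 * 32;                          // 16 warps x 32 threads/warp = 512 threads
    const int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;

    // With 16 multiprocessors and one resident block per multiprocessor,
    // at least 16 blocks are needed to keep the whole device busy.
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
}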
CUDA Taxonomy (contd.)
[Figure: a Grid shown as an array of 16 thread Blocks, each Block consisting of a group of Warps (W).]
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  — All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  — Synchronizing their execution
    – For hazard-free shared memory accesses
  — Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the Host launches Kernel 1 on Grid 1 (Blocks (0,0) through (2,1)) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5×3 array of Threads. Courtesy: NVIDIA]
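Host code corresponding to the figure might look roughly like the sketch below: two kernels, each launched as its own grid of blocks. The kernel names and the second grid's dimensions are invented for illustration.

// Hypothetical host code: two kernel launches, each with its own grid/block shape.
__global__ void kernel1(float *data);   // placeholder kernels
__global__ void kernel2(float *data);

void run(float *d_data)
{
    dim3 grid1(3, 2);                       // Grid 1: 3 x 2 blocks (as in the figure)
    dim3 block1(5, 3);                      // each block: 5 x 3 threads
    kernel1<<<grid1, block1>>>(d_data);

    dim3 grid2(2, 2);                       // Grid 2: a different (assumed) shape
    dim3 block2(8, 8);
    kernel2<<<grid2, block2>>>(d_data);
}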
Block and Thread IDs
• Threads and blocks have IDs
  — So each thread can decide what data to work on
  — Block ID: 1D or 2D
  — Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  — Image processing
  — Solving PDEs on volumes
  — …
[Figure: same Grid/Block/Thread diagram as the previous slide, illustrating 2D block IDs within Grid 1 and 2D thread IDs within Block (1,1). Courtesy: NVIDIA]
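For 2D data such as an image, block and thread IDs combine into global pixel coordinates as in this sketch; the kernel, image layout, and operation are assumptions for illustration.

// Hypothetical kernel: map 2D block/thread IDs to pixel coordinates of a
// row-major image and brighten each pixel.
__global__ void brighten(unsigned char *img, int width, int height, int delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        img[y * width + x] = min(img[y * width + x] + delta, 255);
}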
Device Memory Space Overview
• Each thread can:
  — R/W per-thread registers
  — R/W per-thread local memory
  — R/W per-block shared memory
  — R/W per-grid global memory
  — Read only per-grid constant memory
  — Read only per-grid texture memory
[Figure: device memory spaces – within the (Device) Grid, each Thread has Registers and Local Memory, each Block has Shared Memory, and all blocks access per-grid Global, Constant, and Texture Memory; the Host accesses Global, Constant, and Texture Memory.]
• The host can R/W global, constant, and texture memories
• These memory spaces are persistent across kernels called by the same application
• The runtime initializes the first time a runtime function is called
• A host thread can invoke device code on only one device
  — Multiple host threads are required to run on multiple devices
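A host-side sketch of writing these spaces uses standard runtime calls (cudaMalloc/cudaMemcpy for global memory, cudaMemcpyToSymbol for constant memory); the variable names below are assumptions.

#include <cuda_runtime.h>

__constant__ float c_coeffs[16];     // per-grid constant memory: written by the host, read-only on the device

void setup(const float *h_data, const float *h_coeffs, size_t n, float **d_data)
{
    // Global memory: host R/W, device R/W, persistent across kernel launches
    // within the same application.
    cudaMalloc((void **)d_data, n * sizeof(float));
    cudaMemcpy(*d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Constant memory: copied from the host into the per-grid constant space.
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, 16 * sizeof(float));
}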
Device Mathematical Functions
• Some mathematical functions (e.g., sinf(x)) have a less accurate but faster device-only version (e.g., __sinf(x))
  — __powf
  — __logf, __log2f, __log10f
  — __expf
  — __sinf, __cosf, __tanf
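For example (a sketch with a made-up kernel name), the fast intrinsic can replace the standard single-precision function where reduced accuracy is acceptable; compiling with nvcc's -use_fast_math option applies such substitutions globally.

// Sketch: standard vs. fast device-only math functions.
__global__ void compareSines(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float accurate = sinf(in[i]);       // standard single-precision sine
        float fast     = __sinf(in[i]);     // faster, less accurate device intrinsic
        out[i] = accurate - fast;           // e.g., inspect the accuracy difference
    }
}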
Device Synchronization Function
• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point, execution resumes normally
• Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
• Allowed in conditional constructs only if the conditional is uniform across the entire thread block
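A typical use (a sketch with made-up names) stages data in shared memory and places __syncthreads() between the write phase and the read phase, so no thread reads an element before another thread has written it; note the barrier is executed unconditionally by every thread in the block.

// Sketch: reverse each block's segment of an array via shared memory.
// Assumes the array length is a multiple of the block size (256 here).
__global__ void reverseBlocks(float *data)
{
    __shared__ float tile[256];                        // one element per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];                       // write phase into shared memory

    __syncthreads();                                   // all writes complete before any reads (avoids RAW hazards);
                                                       // executed uniformly, not inside a divergent conditional

    data[i] = tile[blockDim.x - 1 - threadIdx.x];      // read phase from shared memory
}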