CUDA - 2


Jan 04, 2016

Transcript
Page 1: CUDA - 2

Page 2: Arrays of Parallel Threads

- A CUDA kernel is executed by a grid (array) of threads.
- All threads in a grid run the same kernel code (SPMD).
- Each thread has indexes that it uses to compute memory addresses and make control decisions: blockIdx.x * blockDim.x + threadIdx.x
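The index expression above is what a kernel typically uses to map threads to data. A minimal sketch (the vecAdd kernel is a hypothetical example, not from the slides):

```cuda
// Each thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    // Global index built from block and thread indexes.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // control decision: guard threads past the end of the array
        C[i] = A[i] + B[i]; // memory address computed from the index
    }
}
```

A launch such as `vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);` would create enough 256-thread blocks to cover n elements.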

Page 3: Thread Blocks: Scalable Cooperation

- Divide the thread array into multiple blocks.
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
- Threads in different blocks do not interact.

Page 4: Transparent Scalability

- Each block can execute in any order relative to others.
- Hardware is free to assign blocks to any processor at any time.
- A kernel scales to any number of parallel processors.

Page 5: Example: Executing Thread Blocks

- Threads are assigned to Streaming Multiprocessors (SMs) at block granularity.
- Up to 8 blocks are assigned to each SM, as resources allow.
- A Fermi SM can take up to 1536 threads: for example, 256 threads/block * 6 blocks, or 512 threads/block * 3 blocks, etc.
- The SM maintains thread/block indexes and manages/schedules thread execution.

Page 6: Warps as Scheduling Units

- Each block is executed as 32-thread warps.
- This is an implementation decision, not part of the CUDA programming model.
- Warps are the scheduling units on an SM; threads in a warp execute in SIMD fashion.

Page 7: Warp Example

If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
- Each block is divided into 256/32 = 8 warps.
- There are 8 * 3 = 24 warps.

Page 8: Example: Thread Scheduling (Cont.)

- The SM implements zero-overhead warp scheduling.
- Warps whose next instruction has its operands ready for consumption are eligible for execution.
- Eligible warps are selected for execution according to a prioritized scheduling policy.
- All threads in a warp execute the same instruction when selected.

Page 9: How Thread Blocks Are Partitioned

- Thread blocks are partitioned into warps.
- Thread IDs within a warp are consecutive and increasing; warp 0 starts with thread ID 0.
- Partitioning is always the same, so you can use this knowledge in control flow. However, the exact size of warps may change from generation to generation.
- DO NOT rely on any ordering within or between warps. If there are any dependencies between threads, you must use __syncthreads() to get correct results (more later).

Page 10: Partial Overview of CUDA Memories

- Device code can:
  - Read/write per-thread registers
  - Read/write all shared global memory
- Host code can:
  - Transfer data to/from per-grid global memory

Page 11: CUDA Device Memory Management API Functions

cudaMalloc()
- Allocates an object in device global memory.
- Two parameters: the address of a pointer to the allocated object, and the size of the allocated object in bytes.

cudaFree()
- Frees an object from device global memory.
- One parameter: a pointer to the freed object.
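The two calls can be used as in this minimal host-side sketch (the buffer name and size are hypothetical; real code should also check the returned cudaError_t):

```cuda
#include <cuda_runtime.h>

int main(void) {
    int n = 1024;                        // hypothetical element count
    size_t size = n * sizeof(float);
    float *d_A = NULL;

    // First parameter: address of the pointer; second: size in bytes.
    cudaMalloc((void **)&d_A, size);

    // ... launch kernels that read/write d_A ...

    cudaFree(d_A);                       // single parameter: pointer to the freed object
    return 0;
}
```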

Page 12: Host-Device Data Transfer API Functions

cudaMemcpy()
- Memory data transfer.
- Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and the type/direction of the transfer.
- Transfer to the device is asynchronous.
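A round trip through device memory shows the four parameters and both transfer directions (array contents are hypothetical):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const int n = 4;
    float h_in[n] = {1, 2, 3, 4};        // host source
    float h_out[n];                      // host destination
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    // Four parameters: destination, source, byte count, direction.
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    return 0;
}
```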

Page 13: Inter-Warp/Thread-Level Synchronization

- Finer-grained control of execution order within a kernel.
- If such synchronizations can be avoided easily, they should be; sometimes, however, you can't avoid them.

Page 14: __syncthreads()

The simplest form:
- __syncthreads();
- When reached, a thread will block until all threads in the block have reached the call.
- Maintains memory access order: the CUDA compiler will try to delay memory writes as long as it can.
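A sketch of the barrier in action: each thread reads an element that a different thread wrote to shared memory, so the block must synchronize between the write and the read. The kernel is a hypothetical example and assumes a single block of exactly 256 threads:

```cuda
__global__ void reverseBlock(float *d) {
    __shared__ float s[256];          // assumes blockDim.x == 256
    int t = threadIdx.x;

    s[t] = d[t];                      // every thread writes one shared element
    __syncthreads();                  // all writes to s[] complete before any read below

    d[t] = s[blockDim.x - 1 - t];     // reads an element written by another thread
}
```

Without the barrier, a thread could read s[blockDim.x - 1 - t] before the owning thread has written it, giving nondeterministic results.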

Page 15: Programmer View of CUDA Memories

Page 16: Declaring CUDA Variables

- __device__ is optional when used with __shared__ or __constant__.
- Automatic variables reside in a register, except per-thread arrays, which reside in global memory.

Variable declaration                       Memory    Scope   Lifetime
int LocalVar;                              Register  Thread  Thread
__device__ __shared__ int SharedVar;       Shared    Block   Block
__device__ int GlobalVar;                  Global    Grid    Application
__device__ __constant__ int ConstantVar;   Constant  Grid    Application
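The four declaration forms can appear together as in this sketch (the variable names match the table; the kernel body is a hypothetical illustration):

```cuda
__device__ int GlobalVar;                  // global memory: grid scope, application lifetime
__device__ __constant__ int ConstantVar;   // constant memory: read-only inside kernels

__global__ void scopesDemo(void) {
    int LocalVar = threadIdx.x;            // automatic variable: lives in a register, per thread
    __shared__ int SharedVar;              // shared memory: one copy per block

    if (threadIdx.x == 0)
        SharedVar = GlobalVar + ConstantVar;
    __syncthreads();                       // make thread 0's write visible to the block

    LocalVar += SharedVar;
    (void)LocalVar;                        // suppress unused-variable warning
}
```

Note that __constant__ variables are typically initialized from the host (e.g., with cudaMemcpyToSymbol) before the kernel launch.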

Page 17: Where to Declare Variables?

Page 18: Shared Memory in CUDA

- A special type of memory whose contents are explicitly declared and used in the source code.
- There is one in each SM.
- Accessed at much higher speed (in both latency and throughput) than global memory.
- Still accessed by memory access instructions.
- A form of scratchpad memory in computer architecture.

Page 19: Hardware View of CUDA Memories

Page 20: A Common Programming Strategy

Partition data into subsets or tiles that fit into shared memory, then use one thread block to handle each tile by:
- Loading the tile from global memory to shared memory, using multiple threads
- Performing the computation on the subset from shared memory, reducing traffic to global memory
- Upon completion, writing results from shared memory back to global memory
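The three steps above can be sketched in a small 1D kernel (a hypothetical example: each element is scaled by the first element of its tile, so tile[0] is re-read from fast shared memory rather than global memory):

```cuda
#define TILE 256

__global__ void tileScale(const float *in, float *out, int n) {
    __shared__ float tile[TILE];               // one tile per block, fits in shared memory
    int i = blockIdx.x * TILE + threadIdx.x;   // global index of this thread's element

    // 1) Load the tile from global to shared memory, using multiple threads.
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                           // whole tile loaded before any thread computes

    // 2) Compute on the subset in shared memory (repeated tile[0] reads stay on-chip).
    float result = 0.0f;
    if (i < n) result = tile[threadIdx.x] * tile[0];

    // 3) Write results back to global memory.
    if (i < n) out[i] = result;
}
```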
