Page 1: CUDA Lecture 7 CUDA Threads and Atomics

Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 7: CUDA Threads and Atomics

Page 2: CUDA Lecture 7 CUDA Threads and Atomics

Topic 1: Atomics

The Problem: how do you do global communication?

Finish a kernel and start a new one: all writes from all threads complete before a kernel finishes. You would need to decompose kernels into before and after parts:

step1<<<grid1,blk1>>>(...);
// The system ensures that all
// writes from step 1 complete
step2<<<grid2,blk2>>>(...);

Page 3: CUDA Lecture 7 CUDA Threads and Atomics

Race Conditions

Alternatively, write to a predefined memory location. Race condition! Updates can be lost.

threadID: 0                     threadID: 1917

// vector[0] was equal to zero
vector[0] += 5;                 vector[0] += 1;
...                             ...
a = vector[0];                  a = vector[0];

What is the value of a in thread 0? In thread 1917?

Page 4: CUDA Lecture 7 CUDA Threads and Atomics

Race Conditions (cont.)

Thread 0 could have finished execution before 1917 started, or the other way around, or both could be executing at the same time.

Answer: not defined by the programming model; the result can be arbitrary.

threadID: 0                     threadID: 1917

// vector[0] was equal to zero
vector[0] += 5;                 vector[0] += 1;
...                             ...
a = vector[0];                  a = vector[0];

Page 5: CUDA Lecture 7 CUDA Threads and Atomics

Atomics

CUDA provides atomic operations to deal with this problem. An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.

The name atomic comes from the fact that it is uninterruptible.

No dropped data, but ordering is still arbitrary.

Page 6: CUDA Lecture 7 CUDA Threads and Atomics

Atomics (cont.)

CUDA provides atomic operations to deal with this problem. They require hardware with compute capability 1.1 and above.

Different types of atomic instructions:
- Addition/subtraction: atomicAdd, atomicSub
- Minimum/maximum: atomicMin, atomicMax
- Conditional increment/decrement: atomicInc, atomicDec
- Exchange/compare-and-swap: atomicExch, atomicCAS
- More types on Fermi: atomicAnd, atomicOr, atomicXor

Page 7: CUDA Lecture 7 CUDA Threads and Atomics

Example: Histogram

// Determine frequency of colors in a picture
// colors have already been converted into integers
// Each thread looks at one pixel and increments
// a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}

atomicAdd returns the previous value at a certain address. This is useful for grabbing variable amounts of data from a list.
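
The return value matters: because atomicAdd hands back the old count, a thread can reserve a contiguous range of slots in a shared output list. A minimal sketch (not from the original slides; append_items, out, and out_count are hypothetical names, and out is assumed large enough):

// Each thread appends a variable number of items by using the
// value atomicAdd returns to reserve a contiguous range of slots.
__global__ void append_items(int* out, int* out_count)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int n = i % 4;                       // this thread produces n items
  int base = atomicAdd(out_count, n);  // reserve n slots; base = old count
  for (int j = 0; j < n; ++j)
    out[base + j] = i;                 // fill only the reserved range
}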

Page 8: CUDA Lecture 7 CUDA Threads and Atomics

Example: Workqueue

// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense for
// threads to continuously grab work from a queue
__global__ void workq(int* work_q, unsigned int* q_counter,
                      int* output, unsigned int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  // atomicInc wraps back to zero once queue_max is reached,
  // so q_index stays within the queue bounds
  int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);  // do_work: a device function
  output[i] = result;
}

Page 9: CUDA Lecture 7 CUDA Threads and Atomics

Compare and Swap

int atomicCAS(int* address, int compare, int val)

If compare equals the old value stored at address, then val is stored instead. In either case, the routine returns the old value.

It seems a bizarre routine at first sight, but it can be very useful for atomic locks.

Page 10: CUDA Lecture 7 CUDA Threads and Atomics

Compare and Swap (cont.)

Most general type of atomic: you can emulate all the others with CAS.

// The semantics of atomicCAS, as if performed in one
// indivisible step:
int atomicCAS(int* address, int compare, int val)
{
  int old_reg_val = *address;
  if (old_reg_val == compare)
    *address = val;
  return old_reg_val;
}
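
To see why CAS is the most general atomic, here is a sketch of how atomicAdd on ints could be emulated with an atomicCAS retry loop (illustrative only; in practice you would use the built-in atomicAdd, and my_atomicAdd is a hypothetical name):

__device__ int my_atomicAdd(int* address, int val)
{
  int old = *address;
  int assumed;
  do {
    assumed = old;
    // attempt the update; this fails and returns the newer
    // value if another thread changed *address in the meantime
    old = atomicCAS(address, assumed, assumed + val);
  } while (old != assumed);
  return old;  // like atomicAdd, return the previous value
}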

Page 11: CUDA Lecture 7 CUDA Threads and Atomics

Atomics

Atomics are slower than normal load/store operations. Most of them are associative operations on signed/unsigned integers:
- quite fast for data in shared memory
- slower for data in device memory

You can have the whole machine queuing on a single location in memory.

Atomics are unavailable on G80!

Page 12: CUDA Lecture 7 CUDA Threads and Atomics

Example: Global Min/Max (Naïve)

// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}

Page 13: CUDA Lecture 7 CUDA Threads and Atomics

Example: Global Min/Max (Better)

// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* max,
                           int* regional_maxes, int num_regions)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
    atomicMax(max, val);
}

Page 14: CUDA Lecture 7 CUDA Threads and Atomics

Global Max/Min

A single value causes a serial bottleneck. Create a hierarchy of values for more parallelism, as sketched below. Performance will still be slow, so use atomics judiciously.
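
One way to build such a hierarchy, sketched here under the assumption that blockDim.x is a power of two (block_max is a hypothetical name): reduce each block to a single maximum in shared memory, so only one atomicMax per block touches the global value.

__global__ void block_max(int* values, int* gl_max)
{
  extern __shared__ int smax[];  // one slot per thread in the block
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  smax[threadIdx.x] = values[i];
  __syncthreads();

  // tree reduction in shared memory
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + s]);
    __syncthreads();
  }

  if (threadIdx.x == 0)
    atomicMax(gl_max, smax[0]);  // one global atomic per block
}

A launch would pass the dynamic shared-memory size explicitly, e.g. block_max<<<grid, block, block * sizeof(int)>>>(values, gl_max);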

Page 15: CUDA Lecture 7 CUDA Threads and Atomics

Atomics: Summary

- Can't use normal load/store for inter-thread communication because of race conditions.
- Use atomic instructions for sparse and/or unpredictable global communication.
- Decompose data (very limited use of a single global sum/max/min/etc.) for more parallelism.

Page 16: CUDA Lecture 7 CUDA Threads and Atomics

Topic 2: Streaming Multiprocessor Execution and Divergence

How a streaming multiprocessor (SM) executes threads:
- Overview of how a streaming multiprocessor works
- SIMT execution
- Divergence

Page 17: CUDA Lecture 7 CUDA Threads and Atomics

Scheduling Blocks onto SMs

- Hardware schedules thread blocks onto available SMs.
- No guarantee of ordering among thread blocks.
- Hardware will schedule a thread block as soon as a previous thread block finishes.

[Figure: thread blocks (e.g., 5, 27, 61, 2001) waiting to be scheduled onto a streaming multiprocessor]

Page 18: CUDA Lecture 7 CUDA Threads and Atomics

Recall: Warps

A warp = 32 threads launched together; they usually execute together as well.

[Figure: one control unit per ALU vs. a single control unit shared by several ALUs (SIMD execution)]

Page 19: CUDA Lecture 7 CUDA Threads and Atomics

Mapping of Thread Blocks

Each thread block is mapped to one or more warps. The hardware schedules each warp independently.

[Figure: Thread Block N (128 threads) split into warps TB N W1 through TB N W4]

Page 20: CUDA Lecture 7 CUDA Threads and Atomics

Thread Scheduling Example

The SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by an SM.
- Warps whose next instruction has its inputs ready for consumption are eligible for execution.
- Eligible warps are selected for execution on a prioritized scheduling policy.
- All threads in a warp execute the same instruction when selected.

[Figure: warp scheduling timeline (TB = Thread Block, W = Warp). Instructions from warps TB1 W1, TB2 W1, TB3 W1, TB3 W2, TB1 W2, TB1 W3, ... are interleaved over time; whenever a warp stalls (TB1 W1, TB3 W2, TB2 W1), the scheduler switches to another eligible warp.]

Page 21: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence

Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time. What happens if you have the following code?

if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}

Page 22: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

This is called warp divergence. CUDA will generate correct code to handle it, but to understand the performance you need to understand what CUDA does with it.

if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}

Page 23: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

[Figure (from Fung et al. MICRO '07): at a divergent branch the warp is split; the hardware executes Path A and then Path B serially before the threads reconverge.]

Page 24: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

Nested branches are handled as well:

if (foo(threadIdx.x))
{
  if (bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();

Page 25: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

[Figure: the nested case. The outer branch separates Path C from the inner branch, which in turn serializes Path A and Path B.]

Page 26: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

- You don't have to worry about divergence for correctness. (Mostly true, except corner cases such as intra-warp locks.)
- You might have to think about it for performance; this depends on your branch conditions (see the sketch below).
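
Divergence only costs you within a warp: if the branch condition is uniform across each group of 32 threads, whole warps take one path or the other and nothing is serialized. A sketch (not from the slides) of the earlier two-way branch rewritten this way; do_A and do_B are placeholder device functions, as on the slides:

__device__ void do_A() { /* ... */ }
__device__ void do_B() { /* ... */ }

__global__ void no_divergence()
{
  // all 32 threads of a warp compute the same warp index,
  // so the branch condition is uniform within each warp
  int warp_id = threadIdx.x / 32;
  if (warp_id % 2 == 0)
    do_A();
  else
    do_B();
}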

Page 27: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

One solution: NVIDIA GPUs have predicated instructions which are carried out only if a logical flag is true. In the previous example, all threads compute the logical predicate and two predicated instructions:

p = (foo(threadIdx.x));
 p: do_A();
!p: do_B();

Page 28: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

Performance drops off with the degree of divergence:

switch (threadIdx.x % N)
{
  case 0: ...
  case 1: ...
}

[Figure: measured performance vs. degree of divergence (0 to 18) for this switch statement; performance drops as divergence increases.]

Page 29: CUDA Lecture 7 CUDA Threads and Atomics

Control Flow Divergence (cont.)

Performance drops off with the degree of divergence. In the worst case you effectively lose a factor of 32 in performance if one thread needs an expensive branch while the rest do nothing.

Another example: processing a long list of elements where, depending on run-time values, a few require very expensive processing. GPU implementation (see the sketch below):
- first process the list to build two sub-lists of "simple" and "expensive" elements
- then process the two sub-lists separately
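
A sketch of the first step, with hypothetical names (is_expensive stands in for the run-time test, and both counters are assumed to start at zero): each thread appends its element's index to the appropriate sub-list with atomicAdd.

__device__ int is_expensive(int x) { return x > 1000; }  // placeholder test

__global__ void partition(int* elements, int n,
                          int* simple_list, int* simple_count,
                          int* expensive_list, int* expensive_count)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i >= n) return;
  // atomicAdd returns the old count, i.e. a unique slot index
  if (is_expensive(elements[i]))
    expensive_list[atomicAdd(expensive_count, 1)] = i;
  else
    simple_list[atomicAdd(simple_count, 1)] = i;
}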

Page 30: CUDA Lecture 7 CUDA Threads and Atomics

Synchronization

We have already introduced __syncthreads(), which forms a barrier: all threads wait until every one has reached this point.

When writing conditional code, you must be careful to make sure that all threads do reach the __syncthreads(). Otherwise, you can end up in deadlock, as in the sketch below.
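
A sketch of the pitfall with two hypothetical kernels: in bad_kernel only some threads reach the barrier, so the block deadlocks; good_kernel hoists the barrier out of the conditional so every thread arrives.

__global__ void bad_kernel(int* data)
{
  if (threadIdx.x < 16) {
    data[threadIdx.x] *= 2;
    __syncthreads();  // DEADLOCK: threads >= 16 never arrive
  }
}

__global__ void good_kernel(int* data)
{
  if (threadIdx.x < 16)
    data[threadIdx.x] *= 2;
  __syncthreads();    // every thread reaches the barrier
}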

Page 31: CUDA Lecture 7 CUDA Threads and Atomics

Synchronization (cont.)

Fermi supports some new synchronization instructions which are similar to __syncthreads() but have extra capabilities:
- int __syncthreads_count(predicate): counts how many predicates are true
- int __syncthreads_and(predicate): returns non-zero (true) if all predicates are true
- int __syncthreads_or(predicate): returns non-zero (true) if any predicate is true
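
For example, a sketch (Fermi or later) assuming a hypothetical per-block output array block_counts: __syncthreads_count acts as a barrier and tallies how many threads in the block satisfy a condition, in one step.

__global__ void count_matches(int* data, int target, int* block_counts)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  // barrier + block-wide count of true predicates;
  // every thread receives the same value n
  int n = __syncthreads_count(data[i] == target);
  if (threadIdx.x == 0)
    block_counts[blockIdx.x] = n;
}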

Page 32: CUDA Lecture 7 CUDA Threads and Atomics

Warp Voting

There are similar warp voting instructions which operate at the level of a warp:
- int __all(predicate): returns non-zero (true) if all predicates in the warp are true
- int __any(predicate): returns non-zero (true) if any predicate in the warp is true
- unsigned int __ballot(predicate): sets the nth bit based on the nth thread's predicate
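
A sketch of warp voting in use, with hypothetical names and assuming blockDim.x is a multiple of 32: __ballot packs one bit per thread in the warp, and the bit-count intrinsic __popc turns that into a count.

__global__ void warp_vote_demo(int* data, int* warp_counts)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  // one bit per lane: bit n is set if thread n's predicate is true
  unsigned int mask = __ballot(data[i] > 0);
  if (threadIdx.x % 32 == 0)
    warp_counts[i / 32] = __popc(mask);  // count of positive elements
}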

Page 33: CUDA Lecture 7 CUDA Threads and Atomics

Topic 3: Locks

- Use very judiciously.
- Always include a max_iter in your spinloop! (See the sketch below.)
- Decompose your data and your locks.
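
A sketch of the max_iter advice, reusing the global lock variable from the next slide and following the "decompose your locks" advice by letting only one thread per block contend (bounded_lock_kernel is a hypothetical name):

__device__ int lock = 0;  // 0 unlocked, 1 locked

__global__ void bounded_lock_kernel(int max_iter)
{
  if (threadIdx.x == 0) {      // one contender per block
    int iter = 0;
    bool acquired = true;
    while (atomicCAS(&lock, 0, 1) != 0) {
      if (++iter >= max_iter) { acquired = false; break; }  // give up
    }
    if (acquired) {
      // ... critical section ...
      __threadfence();         // flush global writes (see next slides)
      atomicExch(&lock, 0);    // release the lock
    }
  }
}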

Page 34: CUDA Lecture 7 CUDA Threads and Atomics

Example: Global atomic lock

// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  lock = 0;
}

Problem: when a thread writes data to device memory, the order of completion is not guaranteed, so global writes may not have completed by the time the lock is unlocked.

Page 35: CUDA Lecture 7 CUDA Threads and Atomics

Example: Global atomic lock (cont.)

// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  __threadfence(); // wait for writes to finish
  lock = 0;
}

Page 36: CUDA Lecture 7 CUDA Threads and Atomics

Example: Global atomic lock (cont.)

- __threadfence_block(): wait until all global and shared memory writes are visible to all threads in the block
- __threadfence(): wait until all global and shared memory writes are visible to all threads in the block (for shared data) or to all threads (for global data)

Page 37: CUDA Lecture 7 CUDA Threads and Atomics

Summary

- Lots of esoteric capabilities; don't worry about most of them.
- It is essential to understand warp divergence, which can have a very big impact on performance.
- __syncthreads() is vital.
- The rest can be ignored until you have a critical need; then read the documentation carefully and look for examples in the SDK.

Page 38: CUDA Lecture 7 CUDA Threads and Atomics

End Credits

Based on original material from:
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 8/8/2011.