CUDA Lecture 8: CUDA Memories
Prepared 8/9/2011 by T. O'Neil for 3460:677, Fall 2011, The University of Akron.

Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory

CUDA Memories Slide 2
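As a minimal sketch of those five access rules (all names here are illustrative, not from the lecture; assumes blockDim.x <= 256):

#include <cuda_runtime.h>

__constant__ float scale;          // per-grid constant memory (read-only in kernels)
__device__   float bias;           // per-grid global memory

__global__ void touch_each_space(float *out)   // out: per-grid global memory
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int t = threadIdx.x;             // automatic scalar -> per-thread register
  float scratch[4];                // automatic array  -> per-thread local memory
  __shared__ float tile[256];      // per-block shared memory

  scratch[0] = bias;               // read global
  tile[t]    = scratch[0];         // write shared
  __syncthreads();                 // make shared writes visible to the block
  out[i]     = tile[t] * scale;    // read constant and shared, write global
}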

Hardware Implementation of CUDA Memories
[diagram: Grid containing Block (0, 0) and Block (1, 0), each with its own Shared Memory and per-thread Registers for Thread (0, 0) and Thread (1, 0); Global Memory and Constant Memory are shared by the whole grid and accessible from the Host]
__device__ is optional when used with __local__, __shared__ or __constant__

CUDA Memories Slide 3

CUDA Variable Type Qualifiers

Variable declaration                                    Memory    Scope   Lifetime
int LocalVar;                                           register  thread  thread
__device__ __local__ int LocalVar;  int ArrayVar[10];   local     thread  thread
__device__ __shared__ int SharedVar;                    shared    block   block
__device__ int GlobalVar;                               global    grid    application
__device__ __constant__ int ConstantVar;                constant  grid    application

Automatic scalar variables without any qualifier reside in a register; the compiler will spill to thread-local memory.
Automatic array variables without any qualifier reside in thread-local memory.

CUDA Memories Slide 4

CUDA Variable Type Qualifiers (cont.)

Variable declaration                                    Memory    Scope   Lifetime
int LocalVar;                                           register  thread  thread
__device__ __local__ int LocalVar;  int ArrayVar[10];   local     thread  thread
__device__ __shared__ int SharedVar;                    shared    block   block
__device__ int GlobalVar;                               global    grid    application
__device__ __constant__ int ConstantVar;                constant  grid    application

Scalar variables reside in fast, on-chip registers.
Shared variables reside in fast, on-chip memories.
Thread-local arrays and global variables reside in uncached off-chip memory.
Constant variables reside in cached off-chip memory.

CUDA Memories Slide 5

CUDA Variable Type Performance

Variable declaration                                    Memory    Penalty
int LocalVar;                                           register  1x
__device__ __local__ int LocalVar;  int ArrayVar[10];   local     100x
__device__ __shared__ int SharedVar;                    shared    1x
__device__ int GlobalVar;                               global    100x
__device__ __constant__ int ConstantVar;                constant  1x

100,000s of per-thread variables, each R/W by 1 thread.
100s of shared variables, each R/W by 100s of threads.
1 global variable is R/W by 100,000s of threads.
1 constant variable is readable by 100,000s of threads.

CUDA Memories Slide 6

CUDA Variable Type Scale

Variable declaration                                    Instances  Visibility
int LocalVar;                                           100,000s   1
__device__ __local__ int LocalVar;  int ArrayVar[10];   100,000s   1
__device__ __shared__ int SharedVar;                    100s       100s
__device__ int GlobalVar;                               1          100,000s
__device__ __constant__ int ConstantVar;                1          100,000s

CUDA Memories Slide 7

Where to declare variables?
Can the host access it?
- Yes: declare outside of any function
    __constant__ int ConstantVar;
    __device__ int GlobalVar;
- No: declare in the kernel
    int LocalVar;
    int ArrayVar[10];
    __shared__ int SharedVar;

CUDA Memories Slide 8

Example: Thread-local Variables
// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];

  // per-thread heap goes in off-chip memory
  float2 heap[10];

  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...

  // write out the contents of heap to result
  ...
}

CUDA Memories Slide 9

Example: Shared Variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

CUDA Memories Slide 10

Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // what are the bandwidth requirements of this kernel?
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
// Answer: two loads

CUDA Memories Slide 11

Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // how many times does this kernel load input[i]?
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
// Answer: once by thread i
//         and again by thread i+1

CUDA Memories Slide 12

Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // Idea: eliminate redundancy by sharing data
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

CUDA Memories Slide 13

Example: Shared Variables (cont.)
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;

  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];

  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];

  // avoid race condition: ensure all loads complete before continuing
  __syncthreads();

  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}

CUDA Memories Slide 14

Example: Shared Variables (cont.)
// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
// (num_blocks and block_size are illustrative names)
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

Experiment performed on a GT200 chip:
- Improvement likely better on an older architecture
- Improvement likely worse on a newer architecture
- Optimizations tend to come with a development cost

CUDA Memories Slide 15

Optimization Analysis

Implementation                 Original    Improved
Global loads                   2N          N + N/BLOCK_SIZE
Global stores                  N           N
Throughput                     36.8 GB/s   57.5 GB/s
Source lines of code (SLOCs)   18          35
Relative improvement           1x          1.57x
Improvement/SLOC               1x          0.81x

Pointers can only point to memory allocated or declared in global memory:

- Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr)

- Obtained as the address of a global variable: float* ptr = &GlobalVar;

CUDA Memories Slide 16

Variable Type Restrictions
So, in principle, you can use pointers to point at any memory space:

CUDA Memories Slide 17

Variable Type Restrictions (cont.)
__device__ int my_global_variable;
__constant__ int my_constant_variable = 13;

__global__ void foo(void)
{
  __shared__ int my_shared_variable;

  int *ptr_to_global = &my_global_variable;
  const int *ptr_to_constant = &my_constant_variable;
  int *ptr_to_shared = &my_shared_variable;

  ...

  *ptr_to_global = *ptr_to_shared;
}

Pointers aren't typed on memory space.

Where does ptr point?
ptr is a __shared__ pointer variable, not a pointer to a __shared__ variable!

CUDA Memories Slide 18

Variable Type Restrictions (cont.)
__shared__ int *ptr;

CUDA Memories Slide 19

Don't confuse the compiler!
__device__ int my_global_variable;

__global__ void foo(int *input)
{
  __shared__ int my_shared_variable;

  int *ptr = 0;
  if (input[threadIdx.x] % 2)
    ptr = &my_global_variable;
  else
    ptr = &my_shared_variable;
  // where does ptr point?
}

Prefer dereferencing pointers in simple, regular access patterns.
Avoid propagating pointers.
Avoid pointers to pointers:
- The GPU would rather not pointer chase
- Linked lists will not perform well
Pay attention to compiler warning messages:
- "Warning: Cannot tell what pointer points to, assuming global memory space"
- This is a crash waiting to happen

CUDA Memories Slide 20

Advice
Global memory resides in device memory (DRAM): much slower access than shared memory.
So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
- Generalize from the adjacent_difference example
- Divide and conquer

CUDA Memories Slide 21

A Common Programming Strategy
Partition data into subsets that fit into shared memory.

CUDA Memories Slide 22

A Common Programming Strategy (cont.)
Handle each data subset with one thread block as follows:

CUDA Memories Slide 23

A Common Programming Strategy (cont.)
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.

CUDA Memories Slide 24

A Common Programming Strategy (cont.)
Perform the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element.

CUDA Memories Slide 25

A Common Programming Strategy (cont.)
Copy the results from shared memory back to global memory.

CUDA Memories Slide 26

A Common Programming Strategy (cont.)
Constant memory also resides in device memory (DRAM): much slower access than shared memory.
But: cached! Highly efficient access for read-only data.

CUDA Memories Slide 27

A Common Programming Strategy (cont.)
Carefully partition data according to access patterns:
- Read-only -> __constant__ memory (very fast if in cache)
- R/W & shared within block -> __shared__ memory (very fast)
- R/W within each thread -> registers (very fast)
- Indexed R/W within each thread -> local memory (slow)
- R/W inputs/results -> cudaMalloc'ed global memory (very slow)

CUDA Memories Slide 28
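As an aside on the __constant__ case above: constant memory is written from the host rather than from kernels. A minimal hedged sketch (the names coeffs, apply_coeffs, and setup_coeffs are illustrative, not from the lecture):

#include <cuda_runtime.h>

// Illustrative sketch: 16 read-only coefficients placed in cached constant memory.
__constant__ float coeffs[16];

__global__ void apply_coeffs(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= coeffs[i % 16];            // read-only, served from the constant cache
}

void setup_coeffs(const float *h_coeffs)   // h_coeffs: 16 floats on the host
{
  // constant memory is filled from the host before kernel launch
  cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}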
A Common Programming Strategy (cont.)
This is a race condition; the result is undefined.
The order in which threads access the variable is undefined without explicit coordination.
Two ways to enforce well-defined semantics:

CUDA Memories Slide 29

Communication through Memory
__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
}

Use barriers (e.g., __syncthreads) to ensure data is ready for access.
The state of the entire data array is now well-defined for all threads in this block.

CUDA Memories Slide 30

Communication through Memory (cont.)
__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
}

Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable.

After this kernel exits, the value of *result will be the sum of the inputs.

CUDA Memories Slide 31

Communication through Memory (cont.)
// assume *result is initialized to 0

__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}

Atomic operations aren't cheap; they imply serialized access to a variable.

How many threads will contend for exclusive access to result?

CUDA Memories Slide 32

Resource Contention
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}

sum<<<num_blocks, block_size>>>(input, result);  // illustrative launch configuration

Divide and conquer:
- per-thread atomicAdd to a __shared__ partial sum
- per-block atomicAdd to the total sum

CUDA Memories Slide 33

Hierarchical Atomics
[diagram: per-block partial sums S0, S1, ..., Si combined into a single total sum S]

CUDA Memories Slide 34

Hierarchical Atomics (cont.)
__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;

  // thread 0 is responsible for initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();

  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();

  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}

Use barriers such as __syncthreads to wait until __shared__ data is ready.
Prefer barriers to atomics when data access patterns are regular or predictable.
Prefer atomics to barriers when data access patterns are sparse or unpredictable.
Atomics to __shared__ variables are much faster than atomics to global variables.
Don't synchronize or serialize unnecessarily.

CUDA Memories Slide 35

Advice
Generalize the adjacent_difference example:
- AB = A * B
- Each element ABij = dot(row(A,i), col(B,j))
Parallelization strategy:
- one thread per element ABij
- 2D kernel

CUDA Memories Slide 36

Example: Matrix Multiplication using Shared Memory
[figure: matrices A and B producing the product AB]

CUDA Memories Slide 37

First Try: Matrix Multiply Kernel using Multiple Blocks
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float result = 0;

  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row * width + k] * b[k * width + col];

  ab[row * width + col] = result;
}

CUDA Memories Slide 38

How will this perform?
- How many loads per term of the dot product? 2 (one from a, one from b) = 8 bytes
- How many floating point (FP) operations? 2 (multiply and addition)
- Global memory access to flop ratio (GMAC): 8 bytes / 2 ops = 4 B/op
- What is the peak FP performance of the GeForce GTX 260? 805 GFLOPS
- Lower bound on bandwidth required to reach peak FP performance: GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
- What is the actual memory bandwidth of the GeForce GTX 260? 112 GB/s
- Then what is an upper bound on the performance of our implementation? Actual BW / GMAC = 112 / 4 = 28 GFLOPS

All threads access global memory for their input matrix elements.
The actual code runs at about 15 GFLOPS.
We need to drastically cut down memory accesses to get closer to the peak 805 GFLOPS.

CUDA Memories Slide 39

How will this perform? (cont.)
[diagram: Grid containing Block (0, 0) and Block (1, 0), each with Shared Memory and per-thread Registers; Global Memory and Constant Memory shared by the grid and accessible from the Host]
Each input element is read by width threads.
Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth.

CUDA Memories Slide 40

Idea: Use __shared__ memory to reuse global data
[figure: matrices A, B, and AB of dimension width]
Partition the kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B.
Load a tile of both matrices into __shared__ memory each phase.

CUDA Memories Slide 41

Tiled Multiply
[figure: A, B, and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles]
Each phase:
- each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH
- each thread computes a partial result for one element of ABsub

CUDA Memories Slide 42

Tiled Multiply (cont.)
[figure: A, B, and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles]

CUDA Memories Slide 43

A Small Example
[figure: the elements of A (A0,0 ... A3,1), B (B0,0 ... B1,3), and AB (AB0,0 ... AB3,3)]
Every A and B element is used exactly twice in generating a 2-by-2 tile of AB.

CUDA Memories Slide 44

A Small Example (cont.)

Access order   AB0,0 (thread0,0)   AB1,0 (thread1,0)   AB0,1 (thread0,1)   AB1,1 (thread1,1)
1              A0,0 * B0,0         A0,0 * B1,0         A0,1 * B0,0         A0,1 * B1,0
2              A1,0 * B0,1         A1,0 * B1,1         A1,1 * B0,1         A1,1 * B1,1
3              A2,0 * B0,2         A2,0 * B1,2         A2,1 * B0,2         A2,1 * B1,2
4              A3,0 * B0,3         A3,0 * B1,3         A3,1 * B0,3         A3,1 * B1,3

CUDA Memories Slide 45

Breaking A and B into Tiles
[figure: the same matrices partitioned into 2-by-2 tiles]
Each phase of a thread block uses one tile from A and one from B.

CUDA Memories Slide 46

Breaking A and B into Tiles (cont.)

Thread   Phase 1                                          Phase 2
T0,0     A0,0 -> s_a0,0   B0,0 -> s_b0,0                  A2,0 -> s_a0,0   B0,2 -> s_b0,0
         AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1           AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1
T1,0     A1,0 -> s_a1,0   B1,0 -> s_b1,0                  A3,0 -> s_a1,0   B1,2 -> s_b1,0
         AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1           AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1
T0,1     A0,1 -> s_a0,1   B0,1 -> s_b0,1                  A2,1 -> s_a0,1   B0,3 -> s_b0,1
         AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1           AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1
T1,1     A1,1 -> s_a1,1   B1,1 -> s_b1,1                  A3,1 -> s_a1,1   B1,3 -> s_b1,1
         AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1           AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1
(time runs left to right across the phases)

Each phase:
- each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH
- each thread computes a partial result for one element of ABsub

CUDA Memories Slide 47

Tiled Multiply (cont.)
[figure: A, B, and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles]
Set up the execution configuration:

CUDA Memories Slide 48

Better Implementation
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
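Wrapping that configuration in a small host helper, a hedged sketch of the launch (d_a, d_b, d_ab are assumed device pointers from cudaMalloc; Width is assumed to be a multiple of TILE_WIDTH):

// Illustrative host-side launch of the tiled kernel defined below.
void launch_mat_mul(float *d_a, float *d_b, float *d_ab, int Width)
{
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  mat_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_ab, Width);
}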
CUDA Memories Slide 49

Better Implementation (cont.)
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x,  by = blockIdx.y;

  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

  // calculate the row & col index
  int row = by * blockDim.y + ty;
  int col = bx * blockDim.x + tx;
  float result = 0;

CUDA Memories Slide 50

Better Implementation (cont.)
  // loop over the tiles of the input in phases
  for (int p = 0; p < width/TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__
    s_a[ty][tx] = a[row * width + (p * TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p * TILE_WIDTH + ty) * width + col];
    __syncthreads();

    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }

  ab[row * width + col] = result;
}

Two barriers per phase:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
Note that the second __syncthreads in phase p guards the load in phase p+1.

Use barriers to guard data:
- Guard against using uninitialized data
- Guard against bashing live data

CUDA Memories Slide 51

Use of Barriers in mat_mul
Each thread block should have many threads:
- TILE_WIDTH = 16 -> 16*16 = 256 threads
There should be many thread blocks:
- 1024-by-1024 matrices -> 64*64 = 4096 thread blocks
- TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads -> full occupancy
Each thread block performs 2 * 256 = 512 32-bit loads from global memory for 256 * (2 * 16) = 8,192 FP operations.
Memory bandwidth is no longer a limiting factor.

CUDA Memories Slide 52

First Order Size Considerations
Experiment performed on a GT200.
This optimization was clearly worth the effort.
Better performance is still possible in theory.

CUDA Memories Slide 53

Optimization Analysis

Implementation            Original      Improved
Global loads              2N^3          2N^2 * (N/TILE_WIDTH)
Throughput                10.7 GFLOPS   183.9 GFLOPS
SLOCs                     20            44
Relative improvement      1x            17.2x
Improvement/SLOC          1x            7.8x

Effective use of different memory resources reduces the number of accesses to global memory.
These resources are finite!
The more memory locations each thread requires, the fewer threads an SM can accommodate.

CUDA Memories Slide 54

Memory Resources as Limit to Parallelism

Resource            Per GT200 SM   Full occupancy on GT200
Registers           16384          16384 / 768 threads = ~21 per thread
__shared__ memory   16 KB          16 KB / 8 blocks = 2 KB per block

Each SM in GT200 has 16 KB of shared memory (the size is implementation dependent!).
For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory, so up to 8 thread blocks can potentially be executing at the same time.
This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time.
Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16.
The 112 GB/s bandwidth can now support (112/4)*16 = 448 GFLOPS!
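A quick host-side check of that shared-memory arithmetic (illustrative helper, not from the slides; assumes GT200's 16 KB of shared memory per SM):

#include <cstdio>

// how many thread blocks fit in one SM's shared memory for a given tile width
int blocks_per_sm_by_smem(int tile_width, int smem_per_sm_bytes)
{
  // two tiles (s_a and s_b) of tile_width x tile_width floats per block
  int smem_per_block = 2 * tile_width * tile_width * (int)sizeof(float);
  return smem_per_sm_bytes / smem_per_block;
}

int main()
{
  const int smem = 16 * 1024;                              // GT200: 16 KB per SM
  printf("TILE_WIDTH 16: %d blocks/SM by shared memory\n",
         blocks_per_sm_by_smem(16, smem));                 // 16 KB / 2 KB = 8
  printf("TILE_WIDTH 32: %d blocks/SM by shared memory\n",
         blocks_per_sm_by_smem(32, smem));                 // 16 KB / 8 KB = 2
  return 0;
}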
CUDA Memories Slide 55

GT200 Shared Memory and Threading

CUDA Memories Slide 56

TILE_SIZE Effects
[chart: GFLOPS by TILE_SIZE; data tabulated at the end of the document]
- Global variable declarations
  - __host__, __device__... __global__, __constant__, __texture__
- Function prototypes
  - __global__ void kernelOne(...)
  - float handyFunction(...)
- main()
  - allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl...)
  - execution configuration setup
  - kernel call: kernelOne<<<execution configuration>>>(args...);
  - transfer results from device to host: cudaMemcpy(h_GlblVarPtr, ...)
  - optional: compare against golden (host-computed) solution
  (the main() steps repeat as needed)
- Kernel: void kernelOne(type args, ...)
  - variable declarations: __local__, __shared__
  - automatic variables transparently assigned to registers or local memory
  - __syncthreads()
- Other functions
  - float handyFunction(int inVar...);

CUDA Memories Slide 57

Summary: Typical Structure of a CUDA Program
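Tying the outline above together, a minimal end-to-end sketch (kernel and variable names are illustrative, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_by_two(int *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variable -> register
  if (i < n)
    data[i] = 2 * data[i];
}

int main()
{
  const int n = 256, bytes = n * sizeof(int);
  int h_data[n];
  for (int i = 0; i < n; ++i) h_data[i] = i;

  // allocate memory space on the device
  int *d_data;
  cudaMalloc((void**)&d_data, bytes);

  // transfer data from host to device
  cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

  // execution configuration setup + kernel call
  dim3 block(128), grid((n + block.x - 1) / block.x);
  scale_by_two<<<grid, block>>>(d_data, n);

  // transfer results from device to host
  cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

  // optional: compare against a golden (host-computed) solution
  printf("h_data[5] = %d (expected 10)\n", h_data[5]);

  cudaFree(d_data);
  return 0;
}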
Effective use of the CUDA memory hierarchy decreases bandwidth consumption to increase throughput.
Use __shared__ memory to eliminate redundant loads from global memory:
- Use __syncthreads barriers to protect __shared__ data
- Use atomics if access patterns are sparse or unpredictable
Optimization comes with a development cost.
Memory resources ultimately limit parallelism.
CUDA Memories Slide 58

Final Thoughts

Reading: Chapter 5, Programming Massively Parallel Processors by Kirk and Hwu.
Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 8/9/2011.

CUDA Memories Slide 59

End Credits

Chart data for the TILE_SIZE Effects slide (GFLOPS by tile size):

TILE_SIZE   GFLOPS
untiled     10.6022
2x2         3.40486
4x4         15.9686
8x8         53.8047
12x12       58.2551
14x14       64.885
15x15       67.9359
16x16       183.179
