Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

# CUDA Lecture 8: CUDA Memories



[Figure: CUDA memory spaces. The host communicates with a grid of thread blocks; each block (e.g., Block (0, 0), Block (1, 0)) has its own shared memory, and each thread its own registers.]




## Example: Thread-Local Variables

A per-thread array (here, a heap of candidate points) is placed in off-chip memory. Fragment of a nearest-neighbors kernel that maintains, for each thread, the 10 points from `qs` nearest its query point `p`:

```cuda
// per-thread heap goes in off-chip memory
float2 heap[10];

// read through num_qs points, maintaining
// the nearest 10 qs to p in the heap
...

// write out the contents of heap to result
...
```

## Example: Shared Variables

Shared variables are motivated by the Adjacent Difference application: compute `result[i] = input[i] - input[i-1]`.

```cuda
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
```

What are the bandwidth requirements of this kernel? Each thread performs two loads from global memory. How many times does this kernel load `input[i]`? Twice: once by thread i and again by thread i+1. Idea: eliminate the redundancy by sharing data.

```cuda
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;

  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];

  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];

  // avoid race condition: ensure all loads
  // complete before continuing
  __syncthreads();

  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}
```

When the size of the array isn't known at compile time, use `extern` to indicate that a `__shared__` array will be allocated dynamically at kernel launch time:

```cuda
__global__ void adj_diff(int *result, int *input)
{
  extern __shared__ int s_data[];
  ...
}
```

Pass the size of the per-block array, in bytes, as the third argument between the triple chevrons:

```cuda
// grid and block dimensions as before; the third launch argument
// is the dynamic shared memory size in bytes
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
```

## Optimization Analysis

The experiment was performed on a GT200 chip; the improvement is likely better on an older architecture and worse on a newer one. Optimizations tend to come with a development cost.

| | Original | Improved |
|---|---|---|
| Global loads | 2N | N + N/BLOCK_SIZE |
| Global stores | N | N |
| Throughput | 36.8 GB/s | 57.5 GB/s |
| Source lines of code (SLOCs) | 18 | 35 |
| Relative improvement | 1x | 1.57x |
| Improvement/SLOC | 1x | 0.81x |

## Variable Type Restrictions

Pointers can only point to memory allocated or declared in global memory:

- Allocated in the host and passed to the kernel: `__global__ void KernelFunc(float* ptr)`
- Obtained as the address of a global variable: `float* ptr = &GlobalVar;`

So you can use pointers and point at any memory space per se:

```cuda
__device__ int my_global_variable;
__constant__ int my_constant_variable = 13;

__global__ void foo(void)
{
  __shared__ int my_shared_variable;

  int *ptr_to_global = &my_global_variable;
  const int *ptr_to_constant = &my_constant_variable;
  int *ptr_to_shared = &my_shared_variable;
  ...
  *ptr_to_global = *ptr_to_shared;
}
```

Pointers aren't typed on memory space.

## Don't Confuse the Compiler

Where does `ptr` point here?

```cuda
__shared__ int *ptr;
```

`ptr` is a `__shared__` pointer variable, not a pointer to a `__shared__` variable!

```cuda
__device__ int my_global_variable;

__global__ void foo(int *input)
{
  __shared__ int my_shared_variable;

  int *ptr = 0;
  if (input[threadIdx.x] % 2)
    ptr = &my_global_variable;
  else
    ptr = &my_shared_variable;

  // where does ptr point?
}
```

## Advice

- Prefer dereferencing pointers in simple, regular access patterns.
- Avoid propagating pointers.
- Avoid pointers to pointers: the GPU would rather not pointer chase, and linked lists will not perform well.
- Pay attention to compiler warning messages. "Warning: Cannot tell what pointer points to, assuming global memory space" is a crash waiting to happen.

## A Common Programming Strategy

Global memory resides in device memory (DRAM), so access to it is much slower than to shared memory. A profitable way of performing computation on the device is therefore to tile data to take advantage of fast shared memory, generalizing from the adjacent_difference example (divide and conquer):

1. Partition data into subsets that fit into shared memory.
2. Handle each data subset with one thread block as follows:
   1. Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.
   2. Perform the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element.
   3. Copy the results from shared memory back to global memory.

Constant memory also resides in device memory (DRAM), with much slower access than shared memory, but it is cached! Access is highly efficient for read-only data.

Carefully partition data according to access patterns:

- Read-only: `__constant__` memory (very fast if in cache)
- R/W and shared within block: `__shared__` memory (very fast)
- R/W within each thread: registers (very fast)
- Indexed R/W within each thread: local memory (slow)
- R/W inputs/results: cudaMalloc'ed global memory (very slow)

## Communication through Memory

```cuda
__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
}
```

This is a race condition; the result is undefined. The order in which threads access the variable is undefined without explicit coordination. There are two ways to enforce well-defined semantics.

First, use barriers (e.g., `__syncthreads`) to ensure data is ready for access:

```cuda
__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
}
```

The state of the entire data array is now well-defined for all threads in this block.

Second, use atomic operations (e.g., `atomicAdd`) to ensure exclusive access to a variable:

```cuda
// assume *result is initialized to 0
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
```

After this kernel exits, the value of `*result` will be the sum of the inputs. Atomic operations aren't cheap, though; they imply serialized access to a variable.

## Resource Contention

How many threads will contend for exclusive access to `result`?

```cuda
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
```

Launched over the whole input, every thread in the grid performs an atomicAdd on the same address, so all of them contend for exclusive access to `result`. Divide and conquer instead:

- Per-thread atomicAdd to a `__shared__` partial sum.
- Per-block atomicAdd to the total sum.

## Hierarchical Atomics

Each block accumulates a partial sum S_i in shared memory, and the partial sums are then combined into the total sum S:

```cuda
__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;

  // thread 0 is responsible for initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();

  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();

  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}
```

## Advice

- Use barriers such as `__syncthreads` to wait until `__shared__` data is ready.
- Prefer barriers to atomics when data access patterns are regular or predictable.
- Prefer atomics to barriers when data access patterns are sparse or unpredictable.
- Atomics to `__shared__` variables are much faster than atomics to global variables.
- Don't synchronize or serialize unnecessarily.

## Example: Matrix Multiplication using Shared Memory

Generalize the adjacent_difference example: compute AB = A * B, where each element AB_ij = dot(row(A,i), col(B,j)). Parallelization strategy: one thread per element AB_ij, in a 2D kernel.

## First Try: Matrix Multiply Kernel using Multiple Blocks

```cuda
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float result = 0;

  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row * width + k] * b[k * width + col];

  ab[row * width + col] = result;
}
```

## Better Implementation

Launch one TILE_WIDTH x TILE_WIDTH thread block per output tile:

```cuda
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
```

```cuda
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x, by = blockIdx.y;

  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

  // calculate the row & col index
  int row = by * blockDim.y + ty;
  int col = bx * blockDim.x + tx;

  float result = 0;

  // loop over the tiles of the input in phases
  for (int p = 0; p < width / TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__ memory
    s_a[ty][tx] = a[row * width + (p * TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p * TILE_WIDTH + ty) * width + col];
    __syncthreads();

    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }

  ab[row * width + col] = result;
}
```

There are two barriers per phase: a `__syncthreads` after all data is loaded into `__shared__` memory, and a `__syncthreads` after all data is read from `__shared__` memory. Note that the second `__syncthreads` in phase p guards the load in phase p+1.

## Summary: Typical Structure of a CUDA Program

- Global variables declaration: `__host__`, `__device__`, `__global__`, `__constant__`, `texture`
- Function prototypes:
  - `__global__ void kernelOne(...)`
  - `float handyFunction(...)`
- `main()` (repeat the following as needed):
  - allocate memory space on the device: `cudaMalloc(&d_GlblVarPtr, bytes)`
  - transfer data from host to device: `cudaMemcpy(d_GlblVarPtr, h_Gl...)`
  - execution configuration setup
  - kernel call: `kernelOne<<<execution configuration>>>(args...)`
  - transfer results from device to host: `cudaMemcpy(h_GlblVarPtr, ...)`
  - optional: compare against golden (host computed) solution
- Kernel `void kernelOne(type args, ...)`:
  - variables declaration: `__local__`, `__shared__`
  - automatic variables transparently assigned to registers or local memory
  - `__syncthreads()`
- Other functions: `float handyFunction(int inVar, ...)`

## Final Thoughts

- Effective use of the CUDA memory hierarchy decreases bandwidth consumption to increase throughput.
- Use `__shared__` memory to eliminate redundant loads from global memory:
  - Use `__syncthreads` barriers to protect `__shared__` data.
  - Use atomics if access patterns are sparse or unpredictable.
- Optimization comes with a development cost.
- Memory resources ultimately limit parallelism.

## End Credits

Reading: Chapter 5, *Programming Massively Parallel Processors* by Kirk and Hwu.

Based on original material from:

- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 8/9/2011.

[Embedded chart data: matrix multiply throughput (GFLOPS) versus TILE_SIZE]

| TILE_SIZE | GFLOPS |
|---|---|
| untiled | 10.6022 |
| 2x2 | 3.40486 |
| 4x4 | 15.9686 |
| 8x8 | 53.8047 |
| 12x12 | 58.2551 |
| 14x14 | 64.885 |
| 15x15 | 67.9359 |
| 16x16 | 183.179 |
