CUDA Lecture 8: CUDA Memories
Prepared 8/9/2011 by T. O'Neil for 3460:677, Fall 2011, The University of Akron.

Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory

CUDA Memories Slide 2
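As a minimal sketch of those five access rules (all names here are illustrative, not from the lecture; assumes blockDim.x <= 256):

#include <cuda_runtime.h>

__constant__ float scale;          // per-grid constant memory (read-only in kernels)
__device__   float bias;           // per-grid global memory

__global__ void touch_each_space(float *out)   // out: per-grid global memory
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int t = threadIdx.x;             // automatic scalar -> per-thread register
  float scratch[4];                // automatic array  -> per-thread local memory
  __shared__ float tile[256];      // per-block shared memory

  scratch[0] = bias;               // read global
  tile[t]    = scratch[0];         // write shared
  __syncthreads();                 // make shared writes visible to the block
  out[i]     = tile[t] * scale;    // read constant and shared, write global
}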

Hardware Implementation of CUDA Memories
[diagram: Grid containing Block (0, 0) and Block (1, 0), each with its own Shared Memory and per-thread Registers for Thread (0, 0) and Thread (1, 0); Global Memory and Constant Memory are shared by the whole grid and accessible from the Host]
__device__ is optional when used with __local__, __shared__ or __constant__

CUDA Memories Slide 3

CUDA Variable Type Qualifiers

Variable declaration                                    Memory    Scope   Lifetime
int LocalVar;                                           register  thread  thread
__device__ __local__ int LocalVar;  int ArrayVar[10];   local     thread  thread
__device__ __shared__ int SharedVar;                    shared    block   block
__device__ int GlobalVar;                               global    grid    application
__device__ __constant__ int ConstantVar;                constant  grid    application

Automatic scalar variables without any qualifier reside in a register; the compiler will spill to thread-local memory.
Automatic array variables without any qualifier reside in thread-local memory.

CUDA Memories Slide 4

CUDA Variable Type Qualifiers (cont.)

Variable declaration                                    Memory    Scope   Lifetime
int LocalVar;                                           register  thread  thread
__device__ __local__ int LocalVar;  int ArrayVar[10];   local     thread  thread
__device__ __shared__ int SharedVar;                    shared    block   block
__device__ int GlobalVar;                               global    grid    application
__device__ __constant__ int ConstantVar;                constant  grid    application

Scalar variables reside in fast, on-chip registers.
Shared variables reside in fast, on-chip memories.
Thread-local arrays and global variables reside in uncached off-chip memory.
Constant variables reside in cached off-chip memory.

CUDA Memories Slide 5

CUDA Variable Type Performance

Variable declaration                                    Memory    Penalty
int LocalVar;                                           register  1x
__device__ __local__ int LocalVar;  int ArrayVar[10];   local     100x
__device__ __shared__ int SharedVar;                    shared    1x
__device__ int GlobalVar;                               global    100x
__device__ __constant__ int ConstantVar;                constant  1x

100,000s of per-thread variables, each R/W by 1 thread.
100s of shared variables, each R/W by 100s of threads.
1 global variable is R/W by 100,000s of threads.
1 constant variable is readable by 100,000s of threads.

CUDA Memories Slide 6

CUDA Variable Type Scale

Variable declaration                                    Instances  Visibility
int LocalVar;                                           100,000s   1
__device__ __local__ int LocalVar;  int ArrayVar[10];   100,000s   1
__device__ __shared__ int SharedVar;                    100s       100s
__device__ int GlobalVar;                               1          100,000s
__device__ __constant__ int ConstantVar;                1          100,000s

CUDA Memories Slide 7

Where to declare variables?
Can the host access it?
- Yes: declare outside of any function
    __constant__ int ConstantVar;
    __device__ int GlobalVar;
- No: declare in the kernel
    int LocalVar;
    int ArrayVar[10];
    __shared__ int SharedVar;

CUDA Memories Slide 8

Example: Thread-local Variables
// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];

  // per-thread heap goes in off-chip memory
  float2 heap[10];

  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...

  // write out the contents of heap to result
  ...
}

CUDA Memories Slide 9

Example: Shared Variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

CUDA Memories Slide 10

Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // what are the bandwidth requirements of this kernel?
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
// Answer: two loads

CUDA Memories Slide 11

Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // how many times does this kernel load input[i]?
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
// Answer: once by thread i
//         and again by thread i+1

CUDA Memories Slide 12

Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // Idea: eliminate redundancy by sharing data
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

CUDA Memories Slide 13

Example: Shared Variables (cont.)
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;

  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];

  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];

  // avoid race condition: ensure all loads complete before continuing
  __syncthreads();

  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}

CUDA Memories Slide 14

Example: Shared Variables (cont.)
// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
// (num_blocks and block_size are illustrative names)
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

Experiment performed on a GT200 chip:
- Improvement likely better on an older architecture
- Improvement likely worse on a newer architecture
- Optimizations tend to come with a development cost

CUDA Memories Slide 15

Optimization Analysis

Implementation                 Original    Improved
Global loads                   2N          N + N/BLOCK_SIZE
Global stores                  N           N
Throughput                     36.8 GB/s   57.5 GB/s
Source lines of code (SLOCs)   18          35
Relative improvement           1x          1.57x
Improvement/SLOC               1x          0.81x

Pointers can only point to memory allocated or declared in global memory:

- Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr)

- Obtained as the address of a global variable: float* ptr = &GlobalVar;

CUDA Memories Slide 16

Variable Type Restrictions
So, in principle, you can use pointers to point at any memory space:

CUDA Memories Slide 17

Variable Type Restrictions (cont.)
__device__ int my_global_variable;
__constant__ int my_constant_variable = 13;

__global__ void foo(void)
{
  __shared__ int my_shared_variable;

  int *ptr_to_global = &my_global_variable;
  const int *ptr_to_constant = &my_constant_variable;
  int *ptr_to_shared = &my_shared_variable;

  ...

  *ptr_to_global = *ptr_to_shared;
}

Pointers aren't typed on memory space.

Where does ptr point?
ptr is a __shared__ pointer variable, not a pointer to a __shared__ variable!

CUDA Memories Slide 18

Variable Type Restrictions (cont.)
__shared__ int *ptr;

CUDA Memories Slide 19

Don't confuse the compiler!
__device__ int my_global_variable;

__global__ void foo(int *input)
{
  __shared__ int my_shared_variable;

  int *ptr = 0;
  if (input[threadIdx.x] % 2)
    ptr = &my_global_variable;
  else
    ptr = &my_shared_variable;
  // where does ptr point?
}

Prefer dereferencing pointers in simple, regular access patterns.
Avoid propagating pointers.
Avoid pointers to pointers:
- The GPU would rather not pointer chase
- Linked lists will not perform well
Pay attention to compiler warning messages:
- "Warning: Cannot tell what pointer points to, assuming global memory space"
- This is a crash waiting to happen

CUDA Memories Slide 20

Advice
Global memory resides in device memory (DRAM): much slower access than shared memory.
So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
- Generalize from the adjacent_difference example
- Divide and conquer

CUDA Memories Slide 21

A Common Programming Strategy
Partition data into subsets that fit into shared memory.

CUDA Memories Slide 22

A Common Programming Strategy (cont.)
Handle each data subset with one thread block as follows:

CUDA Memories Slide 23

A Common Programming Strategy (cont.)
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.

CUDA Memories Slide 24

A Common Programming Strategy (cont.)
Perform the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element.

CUDA Memories Slide 25

A Common Programming Strategy (cont.)
Copy the results from shared memory back to global memory.

CUDA Memories Slide 26

A Common Programming Strategy (cont.)
Constant memory also resides in device memory (DRAM): much slower access than shared memory.
But: cached! Highly efficient access for read-only data.

CUDA Memories Slide 27

A Common Programming Strategy (cont.)
Carefully partition data according to access patterns:
- Read-only -> __constant__ memory (very fast if in cache)
- R/W & shared within block -> __shared__ memory (very fast)
- R/W within each thread -> registers (very fast)
- Indexed R/W within each thread -> local memory (slow)
- R/W inputs/results -> cudaMalloc'ed global memory (very slow)

CUDA Memories Slide 28
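As an aside on the __constant__ case above: constant memory is written from the host rather than from kernels. A minimal hedged sketch (the names coeffs, apply_coeffs, and setup_coeffs are illustrative, not from the lecture):

#include <cuda_runtime.h>

// Illustrative sketch: 16 read-only coefficients placed in cached constant memory.
__constant__ float coeffs[16];

__global__ void apply_coeffs(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= coeffs[i % 16];            // read-only, served from the constant cache
}

void setup_coeffs(const float *h_coeffs)   // h_coeffs: 16 floats on the host
{
  // constant memory is filled from the host before kernel launch
  cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}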
A Common Programming Strategy (cont.)
This is a race condition; the result is undefined.
The order in which threads access the variable is undefined without explicit coordination.
Two ways to enforce well-defined semantics:

CUDA Memories Slide 29

Communication through Memory
__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
}

Use barriers (e.g., __syncthreads) to ensure data is ready for access.
The state of the entire data array is now well-defined for all threads in this block.

CUDA Memories Slide 30

Communication through Memory (cont.)
__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
}

Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable.

After this kernel exits, the value of *result will be the sum of the inputs.

CUDA Memories Slide 31

Communication through Memory (cont.)
// assume *result is initialized to 0

__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}

Atomic operations aren't cheap; they imply serialized access to a variable.

How many threads will contend for exclusive access to result?

CUDA Memories Slide 32

Resource Contention
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}

sum<<<num_blocks, block_size>>>(input, result);  // illustrative launch configuration

Divide and conquer:
- per-thread atomicAdd to a __shared__ partial sum
- per-block atomicAdd to the total sum

CUDA Memories Slide 33

Hierarchical Atomics
[diagram: per-block partial sums S0, S1, ..., Si combined into a single total sum S]

CUDA Memories Slide 34

Hierarchical Atomics (cont.)
__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;

  // thread 0 is responsible for initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();

  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();

  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}

Use barriers such as __syncthreads to wait until __shared__ data is ready.
Prefer barriers to atomics when data access patterns are regular or predictable.
Prefer atomics to barriers when data access patterns are sparse or unpredictable.
Atomics to __shared__ variables are much faster than atomics to global variables.
Don't synchronize or serialize unnecessarily.

CUDA Memories Slide 35

Advice
Generalize the adjacent_difference example:
- AB = A * B
- Each element ABij = dot(row(A,i), col(B,j))
Parallelization strategy:
- one thread per element ABij
- 2D kernel

CUDA Memories Slide 36

Example: Matrix Multiplication using Shared Memory
[figure: matrices A and B producing the product AB]

CUDA Memories Slide 37

First Try: Matrix Multiply Kernel using Multiple Blocks
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float result = 0;

  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row * width + k] * b[k * width + col];

  ab[row * width + col] = result;
}

CUDA Memories Slide 38

How will this perform?
- How many loads per term of the dot product? 2 (one from a, one from b) = 8 bytes
- How many floating point (FP) operations? 2 (multiply and addition)
- Global memory access to flop ratio (GMAC): 8 bytes / 2 ops = 4 B/op
- What is the peak FP performance of the GeForce GTX 260? 805 GFLOPS
- Lower bound on bandwidth required to reach peak FP performance: GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
- What is the actual memory bandwidth of the GeForce GTX 260? 112 GB/s
- Then what is an upper bound on the performance of our implementation? Actual BW / GMAC = 112 / 4 = 28 GFLOPS

All threads access global memory for their input matrix elements.
The actual code runs at about 15 GFLOPS.
We need to drastically cut down memory accesses to get closer to the peak 805 GFLOPS.

CUDA Memories Slide 39

How will this perform? (cont.)
[diagram: Grid containing Block (0, 0) and Block (1, 0), each with Shared Memory and per-thread Registers; Global Memory and Constant Memory shared by the grid and accessible from the Host]
Each input element is read by width threads.
Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth.

CUDA Memories Slide 40

Idea: Use __shared__ memory to reuse global data
[figure: matrices A, B, and AB of dimension width]
Partition the kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B.
Load a tile of both matrices into __shared__ memory each phase.

CUDA Memories Slide 41

Tiled Multiply
[figure: A, B, and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles]
Each phase:
- each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH
- each thread computes a partial result for one element of ABsub

CUDA Memories Slide 42

Tiled Multiply (cont.)
[figure: A, B, and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles]

CUDA Memories Slide 43

A Small Example
[figure: the elements of A (A0,0 ... A3,1), B (B0,0 ... B1,3), and AB (AB0,0 ... AB3,3)]
Every A and B element is used exactly twice in generating a 2-by-2 tile of AB.

CUDA Memories Slide 44

A Small Example (cont.)

Access order   AB0,0 (thread0,0)   AB1,0 (thread1,0)   AB0,1 (thread0,1)   AB1,1 (thread1,1)
1              A0,0 * B0,0         A0,0 * B1,0         A0,1 * B0,0         A0,1 * B1,0
2              A1,0 * B0,1         A1,0 * B1,1         A1,1 * B0,1         A1,1 * B1,1
3              A2,0 * B0,2         A2,0 * B1,2         A2,1 * B0,2         A2,1 * B1,2
4              A3,0 * B0,3         A3,0 * B1,3         A3,1 * B0,3         A3,1 * B1,3

CUDA Memories Slide 45

Breaking A and B into Tiles
[figure: the same matrices partitioned into 2-by-2 tiles]
Each phase of a thread block uses one tile from A and one from B.

CUDA Memories Slide 46

Breaking A and B into Tiles (cont.)

Thread   Phase 1                                          Phase 2
T0,0     A0,0 -> s_a0,0   B0,0 -> s_b0,0                  A2,0 -> s_a0,0   B0,2 -> s_b0,0
         AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1           AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1
T1,0     A1,0 -> s_a1,0   B1,0 -> s_b1,0                  A3,0 -> s_a1,0   B1,2 -> s_b1,0
         AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1           AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1
T0,1     A0,1 -> s_a0,1   B0,1 -> s_b0,1                  A2,1 -> s_a0,1   B0,3 -> s_b0,1
         AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1           AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1
T1,1     A1,1 -> s_a1,1   B1,1 -> s_b1,1                  A3,1 -> s_a1,1   B1,3 -> s_b1,1
         AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1           AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1
(time runs left to right across the phases)

Each phase:
- each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH
- each thread computes a partial result for one element of ABsub

CUDA Memories Slide 47

Tiled Multiply (cont.)
[figure: A, B, and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles]
Set up the execution configuration:

CUDA Memories Slide 48

Better Implementation
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
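Wrapping that configuration in a small host helper, a hedged sketch of the launch (d_a, d_b, d_ab are assumed device pointers from cudaMalloc; Width is assumed to be a multiple of TILE_WIDTH):

// Illustrative host-side launch of the tiled kernel defined below.
void launch_mat_mul(float *d_a, float *d_b, float *d_ab, int Width)
{
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  mat_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_ab, Width);
}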
CUDA Memories Slide 49

Better Implementation (cont.)
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x,  by = blockIdx.y;

  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

  // calculate the row & col index
  int row = by * blockDim.y + ty;
  int col = bx * blockDim.x + tx;
  float result = 0;

CUDA Memories Slide 50

Better Implementation (cont.)
  // loop over the tiles of the input in phases
  for (int p = 0; p < width/TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__
    s_a[ty][tx] = a[row * width + (p * TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p * TILE_WIDTH + ty) * width + col];
    __syncthreads();

    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }

  ab[row * width + col] = result;
}

Two barriers per phase:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
Note that the second __syncthreads in phase p guards the load in phase p+1.

Use barriers to guard data:
- Guard against using uninitialized data
- Guard against bashing live data

CUDA Memories Slide 51

Use of Barriers in mat_mul
Each thread block should have many threads:
- TILE_WIDTH = 16 -> 16*16 = 256 threads
There should be many thread blocks:
- 1024-by-1024 matrices -> 64*64 = 4096 thread blocks
- TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads -> full occupancy
Each thread block performs 2 * 256 = 512 32-bit loads from global memory for 256 * (2 * 16) = 8,192 FP operations.
Memory bandwidth is no longer a limiting factor.

CUDA Memories Slide 52

First Order Size Considerations
Experiment performed on a GT200.
This optimization was clearly worth the effort.
Better performance is still possible in theory.

CUDA Memories Slide 53

Optimization Analysis

Implementation            Original      Improved
Global loads              2N^3          2N^2 * (N/TILE_WIDTH)
Throughput                10.7 GFLOPS   183.9 GFLOPS
SLOCs                     20            44
Relative improvement      1x            17.2x
Improvement/SLOC          1x            7.8x

Effective use of different memory resources reduces the number of accesses to global memory.
These resources are finite!
The more memory locations each thread requires, the fewer threads an SM can accommodate.

CUDA Memories Slide 54

Memory Resources as Limit to Parallelism

Resource            Per GT200 SM   Full occupancy on GT200
Registers           16384          16384 / 768 threads = ~21 per thread
__shared__ memory   16 KB          16 KB / 8 blocks = 2 KB per block

Each SM in GT200 has 16 KB of shared memory (the size is implementation dependent!).
For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory, so up to 8 thread blocks can potentially be executing at the same time.
This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time.
Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16.
The 112 GB/s bandwidth can now support (112/4)*16 = 448 GFLOPS!
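A quick host-side check of that shared-memory arithmetic (illustrative helper, not from the slides; assumes GT200's 16 KB of shared memory per SM):

#include <cstdio>

// how many thread blocks fit in one SM's shared memory for a given tile width
int blocks_per_sm_by_smem(int tile_width, int smem_per_sm_bytes)
{
  // two tiles (s_a and s_b) of tile_width x tile_width floats per block
  int smem_per_block = 2 * tile_width * tile_width * (int)sizeof(float);
  return smem_per_sm_bytes / smem_per_block;
}

int main()
{
  const int smem = 16 * 1024;                              // GT200: 16 KB per SM
  printf("TILE_WIDTH 16: %d blocks/SM by shared memory\n",
         blocks_per_sm_by_smem(16, smem));                 // 16 KB / 2 KB = 8
  printf("TILE_WIDTH 32: %d blocks/SM by shared memory\n",
         blocks_per_sm_by_smem(32, smem));                 // 16 KB / 8 KB = 2
  return 0;
}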
CUDA Memories Slide 55

GT200 Shared Memory and Threading

CUDA Memories Slide 56

TILE_SIZE Effects
[chart: GFLOPS by TILE_SIZE; data tabulated at the end of the document]
- Global variable declarations
  - __host__, __device__... __global__, __constant__, __texture__
- Function prototypes
  - __global__ void kernelOne(...)
  - float handyFunction(...)
- main()
  - allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl...)
  - execution configuration setup
  - kernel call: kernelOne<<<execution configuration>>>(args...);
  - transfer results from device to host: cudaMemcpy(h_GlblVarPtr, ...)
  - optional: compare against golden (host-computed) solution
  (the main() steps repeat as needed)
- Kernel: void kernelOne(type args, ...)
  - variable declarations: __local__, __shared__
  - automatic variables transparently assigned to registers or local memory
  - __syncthreads()
- Other functions
  - float handyFunction(int inVar...);

CUDA Memories Slide 57

Summary: Typical Structure of a CUDA Program
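Tying the outline above together, a minimal end-to-end sketch (kernel and variable names are illustrative, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_by_two(int *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variable -> register
  if (i < n)
    data[i] = 2 * data[i];
}

int main()
{
  const int n = 256, bytes = n * sizeof(int);
  int h_data[n];
  for (int i = 0; i < n; ++i) h_data[i] = i;

  // allocate memory space on the device
  int *d_data;
  cudaMalloc((void**)&d_data, bytes);

  // transfer data from host to device
  cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

  // execution configuration setup + kernel call
  dim3 block(128), grid((n + block.x - 1) / block.x);
  scale_by_two<<<grid, block>>>(d_data, n);

  // transfer results from device to host
  cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

  // optional: compare against a golden (host-computed) solution
  printf("h_data[5] = %d (expected 10)\n", h_data[5]);

  cudaFree(d_data);
  return 0;
}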
Effective use of the CUDA memory hierarchy decreases bandwidth consumption to increase throughput.
Use __shared__ memory to eliminate redundant loads from global memory:
- Use __syncthreads barriers to protect __shared__ data
- Use atomics if access patterns are sparse or unpredictable
Optimization comes with a development cost.
Memory resources ultimately limit parallelism.
CUDA Memories Slide 58

Final Thoughts

Reading: Chapter 5, Programming Massively Parallel Processors by Kirk and Hwu.
Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 8/9/2011.

CUDA Memories Slide 59

End Credits

Chart data for the TILE_SIZE Effects slide (GFLOPS by tile size):

TILE_SIZE   GFLOPS
untiled     10.6022
2x2         3.40486
4x4         15.9686
8x8         53.8047
12x12       58.2551
14x14       64.885
15x15       67.9359
16x16       183.179
