Page 1: CUDA Lecture 8 CUDA Memories

Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 8: CUDA Memories

Page 2: CUDA Lecture 8 CUDA Memories

Each thread can:
  Read/write per-thread registers
  Read/write per-thread local memory
  Read/write per-block shared memory
  Read/write per-grid global memory
  Read (only) per-grid constant memory

CUDA Memories – Slide 2

Hardware Implementation of CUDA Memories

[Figure: CUDA memory model. The grid contains thread blocks; each block has its own shared memory and each of its threads has its own registers. The host and all threads in the grid can access global memory and constant memory.]

Page 3: CUDA Lecture 8 CUDA Memories

__device__ is optional when used with __local__, __shared__ or __constant__

CUDA Memories – Slide 3

CUDA Variable Type Qualifiers

Variable declaration                                   Memory    Scope   Lifetime
int LocalVar;                                          register  thread  thread
__device__ __local__ int LocalVar;  int ArrayVar[10];  local     thread  thread
__device__ __shared__ int SharedVar;                   shared    block   block
__device__ int GlobalVar;                              global    grid    application
__device__ __constant__ int ConstantVar;               constant  grid    application
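A minimal sketch (hypothetical names, not from the original slides) showing where each kind of declaration from the table can appear; assume the kernel is launched with 64 threads per block:

__constant__ float coeffs[16];        // constant memory: grid scope, application lifetime
__device__ float global_buffer[64];   // global memory: grid scope, application lifetime

__global__ void qualifier_demo(float *out)
{
  int tid = threadIdx.x;              // automatic scalar: placed in a register
  float scratch[4];                   // automatic array: placed in thread-local memory
  __shared__ float tile[64];          // shared memory: block scope, block lifetime

  scratch[0] = coeffs[tid % 16];      // coeffs would normally be filled via cudaMemcpyToSymbol
  tile[tid] = global_buffer[tid] + scratch[0];
  __syncthreads();
  out[tid] = tile[tid];
}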

Page 4: CUDA Lecture 8 CUDA Memories

Automatic scalar variables without any qualifier reside in a register; the compiler will spill them to thread-local memory if it runs out of registers.

Automatic array variables without any qualifier reside in thread-local memory.

CUDA Memories – Slide 4

CUDA Variable Type Qualifiers (cont.)

Variable declaration                                   Memory    Scope   Lifetime
int LocalVar;                                          register  thread  thread
__device__ __local__ int LocalVar;  int ArrayVar[10];  local     thread  thread
__device__ __shared__ int SharedVar;                   shared    block   block
__device__ int GlobalVar;                              global    grid    application
__device__ __constant__ int ConstantVar;               constant  grid    application

Page 5: CUDA Lecture 8 CUDA Memories

Scalar variables reside in fast, on-chip registers.
Shared variables reside in fast, on-chip memories.
Thread-local arrays and global variables reside in uncached off-chip memory.
Constant variables reside in cached off-chip memory.

CUDA Memories – Slide 5

CUDA Variable Type Performance

Variable declaration                                   Memory    Penalty
int LocalVar;                                          register  1x
__device__ __local__ int LocalVar;  int ArrayVar[10];  local     100x
__device__ __shared__ int SharedVar;                   shared    1x
__device__ int GlobalVar;                              global    100x
__device__ __constant__ int ConstantVar;               constant  1x

Page 6: CUDA Lecture 8 CUDA Memories

100,000s of per-thread variables, each read/written by 1 thread.
100s of shared variables, each read/written by 100s of threads.
1 global variable, read/written by 100,000s of threads.
1 constant variable, readable by 100,000s of threads.

CUDA Memories – Slide 6

CUDA Variable Type Scale

Variable declaration                                   Instances  Visibility
int LocalVar;                                          100,000s   1
__device__ __local__ int LocalVar;  int ArrayVar[10];  100,000s   1
__device__ __shared__ int SharedVar;                   100s       100s
__device__ int GlobalVar;                              1          100,000s
__device__ __constant__ int ConstantVar;               1          100,000s

Page 7: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 7

Where to declare variables? It depends on whether the host can access them.

If the host can access it (yes), declare it outside of any function:
  __constant__ int ConstantVar;
  __device__ int GlobalVar;

If the host cannot access it (no), declare it in the kernel:
  int LocalVar;
  int ArrayVar[10];
  __shared__ int SharedVar;

Page 8: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 8

Example: Thread-local Variables

// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];

  // per-thread heap goes in off-chip memory
  float2 heap[10];

  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...

  // write out the contents of heap to result
  ...
}

Page 9: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 9

Example: Shared Variables

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

Page 10: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 10

Example: Shared Variables (cont.)

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // what are the bandwidth requirements of this kernel?
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

Two loads.

Page 11: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 11

Example: Shared Variables (cont.)

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // how many times does this kernel load input[i]?
    int x_i = input[i];          // once by thread i
    int x_i_minus_one = input[i-1];  // again by thread i+1

    result[i] = x_i - x_i_minus_one;
  }
}

Page 12: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 12

Example: Shared Variables (cont.)

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // Idea: eliminate redundancy by sharing data
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

Page 13: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 13

Example: Shared Variables (cont.)

// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;

  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];

  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];

  // avoid race condition: ensure all loads complete before continuing
  __syncthreads();

  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}

Page 14: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 14

Example: Shared Variables (cont.)

// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

Page 15: CUDA Lecture 8 CUDA Memories

Experiment performed on a GT200 chip.
Improvement likely better on an older architecture.
Improvement likely worse on a newer architecture.

Optimizations tend to come with a development cost

CUDA Memories – Slide 15

Optimization Analysis

Implementation                 Original    Improved
Global loads                   2N          N + N/BLOCK_SIZE
Global stores                  N           N
Throughput                     36.8 GB/s   57.5 GB/s
Source lines of code (SLOCs)   18          35
Relative improvement           1x          1.57x
Improvement/SLOC               1x          0.81x

Page 16: CUDA Lecture 8 CUDA Memories

Pointers can only point to memory allocated or declared in global memory:

Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)

Obtained as the address of a global variable: float* ptr = &GlobalVar;
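Both cases can be combined in one small sketch (the buffer d_data and its size are hypothetical, not from the original slides):

__device__ float GlobalVar;

__global__ void KernelFunc(float *ptr)     // ptr was cudaMalloc'ed on the host
{
  float *p = &GlobalVar;                   // address of a global variable
  ptr[threadIdx.x] = *p;
}

// host side (error checking omitted):
// float *d_data;
// cudaMalloc((void**)&d_data, 256 * sizeof(float));
// KernelFunc<<<1, 256>>>(d_data);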

CUDA Memories – Slide 16

Variable Type Restrictions

Page 17: CUDA Lecture 8 CUDA Memories

In practice, though, you can declare pointers that point into any memory space:

CUDA Memories – Slide 17

Variable Type Restrictions (cont.)

__device__ int my_global_variable;
__constant__ int my_constant_variable = 13;

__global__ void foo(void)
{
  __shared__ int my_shared_variable;

  int *ptr_to_global = &my_global_variable;
  const int *ptr_to_constant = &my_constant_variable;
  int *ptr_to_shared = &my_shared_variable;
  ...
  *ptr_to_global = *ptr_to_shared;
}

Page 18: CUDA Lecture 8 CUDA Memories

Pointers aren’t typed on memory space

Where does ptr point?
ptr is a __shared__ pointer variable, not a pointer to a __shared__ variable!

CUDA Memories – Slide 18

Variable Type Restrictions (cont.)

__shared__ int *ptr;

Page 19: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 19

Don't confuse the compiler!

__device__ int my_global_variable;

__global__ void foo(int *input)
{
  __shared__ int my_shared_variable;

  int *ptr = 0;
  if (input[threadIdx.x] % 2)
    ptr = &my_global_variable;
  else
    ptr = &my_shared_variable;

  // where does ptr point?
}
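One way to avoid creating an ambiguous pointer at all (a sketch, not part of the original slides) is to select the value rather than the address, so every access stays in a known memory space:

__device__ int my_global_variable;

__global__ void foo_fixed(int *input)
{
  __shared__ int my_shared_variable;
  if (threadIdx.x == 0)
    my_shared_variable = 42;          // hypothetical initialization
  __syncthreads();

  // choose between values, not addresses; no ambiguous pointer is formed
  int value = (input[threadIdx.x] % 2) ? my_global_variable
                                       : my_shared_variable;
  input[threadIdx.x] = value;
}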

Page 20: CUDA Lecture 8 CUDA Memories

Prefer dereferencing pointers in simple, regular access patterns

Avoid propagating pointers.
Avoid pointers to pointers.
  The GPU would rather not pointer chase.
  Linked lists will not perform well.

Pay attention to compiler warning messages:
  Warning: Cannot tell what pointer points to, assuming global memory space
  This is a crash waiting to happen.

CUDA Memories – Slide 20

Advice

Page 21: CUDA Lecture 8 CUDA Memories

Global memory resides in device memory (DRAM), which has much slower access than shared memory.

So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  Generalize from the adjacent_difference example.
  Divide and conquer.

CUDA Memories – Slide 21

A Common Programming Strategy

Page 22: CUDA Lecture 8 CUDA Memories

Partition data into subsets that fit into shared memory.

CUDA Memories – Slide 22

A Common Programming Strategy (cont.)

Page 23: CUDA Lecture 8 CUDA Memories

Handle each data subset with one thread block as follows:

CUDA Memories – Slide 23

A Common Programming Strategy (cont.)

Page 24: CUDA Lecture 8 CUDA Memories

Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.

CUDA Memories – Slide 24

A Common Programming Strategy (cont.)

Page 25: CUDA Lecture 8 CUDA Memories

Perform the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element.

CUDA Memories – Slide 25

A Common Programming Strategy (cont.)

Page 26: CUDA Lecture 8 CUDA Memories

Copy the results from shared memory back to global memory

CUDA Memories – Slide 26

A Common Programming Strategy (cont.)
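Slides 22 through 26 describe a generic load / compute / store pattern. A minimal skeleton of that pattern (hypothetical names; BLOCK_SIZE is assumed to equal the number of threads per block, and the per-tile computation is just an illustrative sum):

#define BLOCK_SIZE 256

__global__ void tiled_kernel(float *result, const float *input)
{
  // one tile of the input per thread block
  __shared__ float tile[BLOCK_SIZE];
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

  // 1. load the subset from global memory into shared memory
  tile[threadIdx.x] = input[i];
  __syncthreads();

  // 2. compute on the subset in shared memory; each thread can
  //    re-read many tile elements without touching global memory
  float value = 0.0f;
  for (int k = 0; k < BLOCK_SIZE; ++k)
    value += tile[k];

  // 3. copy the result back to global memory
  result[i] = value;
}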

Page 27: CUDA Lecture 8 CUDA Memories

Constant memory also resides in device memory (DRAM), with much slower access than shared memory.
But... it is cached!
Highly efficient access for read-only data.

CUDA Memories – Slide 27

A Common Programming Strategy (cont.)

Page 28: CUDA Lecture 8 CUDA Memories

Carefully partition data according to access patterns:
  Read-only → __constant__ memory (very fast if in cache; see the sketch below)
  Read/write and shared within a block → __shared__ memory (very fast)
  Read/write within each thread → registers (very fast)
  Indexed read/write within each thread → local memory (slow)
  Read/write inputs/results → cudaMalloc'ed global memory (very slow)
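For the first category, a small sketch (hypothetical names, not from the original slides) of read-only coefficients placed in __constant__ memory and filled from the host with cudaMemcpyToSymbol:

__constant__ float filter[16];                 // cached, read-only on the device

__global__ void apply_filter(float *data)
{
  data[threadIdx.x] *= filter[threadIdx.x % 16];
}

// host side (error checking omitted):
// float h_filter[16] = { /* ... */ };
// cudaMemcpyToSymbol(filter, h_filter, sizeof(h_filter));
// apply_filter<<<1, 256>>>(d_data);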

CUDA Memories – Slide 28

A Common Programming Strategy (cont.)

Page 29: CUDA Lecture 8 CUDA Memories

This is a race condition; the result is undefined

The order in which threads access the variable is undefined without explicit coordination

There are two ways to enforce well-defined semantics.

CUDA Memories – Slide 29

Communication through Memory

__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
}

Page 30: CUDA Lecture 8 CUDA Memories

Use barriers (e.g., __syncthreads) to ensure data is ready for access

The state of the entire data array is now well-defined for all threads in this block.

CUDA Memories – Slide 30

Communication through Memory (cont.)

__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
}

Page 31: CUDA Lecture 8 CUDA Memories

Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable

After this kernel exits, the value of *result will be the sum of the inputs

CUDA Memories – Slide 31

Communication through Memory (cont.)

// assume *result is initialized to 0

__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
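A host-side sketch of how this kernel might be driven (hypothetical names h_input, h_result and N; error checking omitted):

int *d_input, *d_result;
cudaMalloc((void**)&d_input, N * sizeof(int));
cudaMalloc((void**)&d_result, sizeof(int));
cudaMemcpy(d_input, h_input, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemset(d_result, 0, sizeof(int));    // the kernel assumes *result starts at 0
sum<<<1, N>>>(d_input, d_result);        // one block of N threads
cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);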

Page 32: CUDA Lecture 8 CUDA Memories

Atomic operations aren’t cheap; they imply serialized access to a variable.

How many threads will contend for exclusive access to result?

CUDA Memories – Slide 32

Resource Contention

__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}

sum<<<B,N/B>>>(input,result);

Page 33: CUDA Lecture 8 CUDA Memories

Divide and conquer:
  Per-thread atomicAdd to a __shared__ partial sum.
  Per-block atomicAdd to the total sum.

CUDA Memories – Slide 33

Hierarchical Atomics

[Figure: per-block partial sums S0, S1, ..., Si are combined into the total sum S.]

Page 34: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 34

Hierarchical Atomics (cont.)

__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;

  // thread 0 is responsible for initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();

  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();

  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}

Page 35: CUDA Lecture 8 CUDA Memories

Use barriers such as __syncthreads to wait until __shared__ data is ready

Prefer barriers to atomics when data access patterns are regular or predictable

Prefer atomics to barriers when data access patterns are sparse or unpredictable

Atomics to __shared__ variables are much faster than atomics to global variables

Don’t synchronize or serialize unnecessarily

CUDA Memories – Slide 35

Advice

Page 36: CUDA Lecture 8 CUDA Memories

Generalize the adjacent_difference example to matrix multiplication: AB = A * B.

Each element ABij = dot(row(A,i), col(B,j)).

Parallelization strategy: one thread per element ABij, using a 2D kernel.

CUDA Memories – Slide 36

Example: Matrix Multiplication using Shared Memory

[Figure: matrices A and B and their product AB.]

Page 37: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 37

First Try: Matrix Multiply Kernel using Multiple Blocks

__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float result = 0;

  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row * width + k] * b[k * width + col];

  ab[row * width + col] = result;
}

Page 38: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 38

How will this perform?

How many loads per term of the dot product?
  2 (one from a, one from b) = 8 bytes

How many floating point (FP) operations per term?
  2 (multiply and addition)

Global memory access to flop ratio (GMAC)?
  8 bytes / 2 ops = 4 B/op

What is the peak FP performance of the GeForce GTX 260?
  805 GFLOPS

Lower bound on bandwidth required to reach peak FP performance?
  GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s

What is the actual memory bandwidth of the GeForce GTX 260?
  112 GB/s

So what is an upper bound on the performance of our implementation?
  Actual BW / GMAC = 112 / 4 = 28 GFLOPS

Page 39: CUDA Lecture 8 CUDA Memories

All threads access global memory for their input matrix elements

The actual code runs at about 15 GFLOPS

Need to drastically cut down memory accesses to get closer to the peak 805 GFLOPS

CUDA Memories – Slide 39

How will this perform? (cont.)

[Figure: CUDA memory model diagram (as on slide 2), illustrating that every thread fetches its input matrix elements from global memory.]

Page 40: CUDA Lecture 8 CUDA Memories

Each input element is read by width threads

Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth

CUDA Memories – Slide 40

Idea: Use __shared__ memory to reuse global data

[Figure: matrices A, B and AB, each of dimension width.]

Page 41: CUDA Lecture 8 CUDA Memories

Partition kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B

Load a tile of both matrices into __shared__ each phase

CUDA Memories – Slide 41

Tiled Multiply

[Figure: A, B and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles.]

Page 42: CUDA Lecture 8 CUDA Memories

Each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH.

In each phase, each thread computes a partial result for one element of ABsub.

CUDA Memories – Slide 42

Tiled Multiply (cont.)

[Figure: A, B and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles.]

Page 43: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 43

A Small Example

[Figure: a small example showing the elements of A (A0,0 ... A3,1), B (B0,0 ... B1,3) and the 4-by-4 result AB (AB0,0 ... AB3,3).]

Page 44: CUDA Lecture 8 CUDA Memories

Every A and B element is used exactly twice in generating a 2-by-2 tile of AB

CUDA Memories – Slide 44

A Small Example (cont.)

                 AB0,0          AB1,0          AB0,1          AB1,1
                 thread0,0      thread1,0      thread0,1      thread1,1

Access order     A0,0 * B0,0    A0,0 * B1,0    A0,1 * B0,0    A0,1 * B1,0
(top to bottom)  A1,0 * B0,1    A1,0 * B1,1    A1,1 * B0,1    A1,1 * B1,1
                 A2,0 * B0,2    A2,0 * B1,2    A2,1 * B0,2    A2,1 * B1,2
                 A3,0 * B0,3    A3,0 * B1,3    A3,1 * B0,3    A3,1 * B1,3

Page 45: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 45

Breaking A and B into Tiles

[Figure: the same small example with A, B and AB partitioned into 2-by-2 tiles.]

Page 46: CUDA Lecture 8 CUDA Memories

Each phase of a thread block uses one tile from A and one from B

CUDA Memories – Slide 46

Breaking A and B into Tiles (cont.)

Phase 1:
  T0,0:  A0,0 ↓ s_a0,0   B0,0 ↓ s_b0,0   AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1
  T1,0:  A1,0 ↓ s_a1,0   B1,0 ↓ s_b1,0   AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1
  T0,1:  A0,1 ↓ s_a0,1   B0,1 ↓ s_b0,1   AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1
  T1,1:  A1,1 ↓ s_a1,1   B1,1 ↓ s_b1,1   AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1

Phase 2:
  T0,0:  A2,0 ↓ s_a0,0   B0,2 ↓ s_b0,0   AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1
  T1,0:  A3,0 ↓ s_a1,0   B1,2 ↓ s_b1,0   AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1
  T0,1:  A2,1 ↓ s_a0,1   B0,3 ↓ s_b0,1   AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1
  T1,1:  A3,1 ↓ s_a1,1   B1,3 ↓ s_b1,1   AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1

(↓ means "is loaded into"; time runs from Phase 1 to Phase 2.)

Page 47: CUDA Lecture 8 CUDA Memories

Each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH.

In each phase, each thread computes a partial result for one element of ABsub.

CUDA Memories – Slide 47

Tiled Multiply (cont.)

[Figure: A, B and AB partitioned into TILE_WIDTH x TILE_WIDTH tiles.]

Page 48: CUDA Lecture 8 CUDA Memories

Set up the execution configuration

CUDA Memories – Slide 48

Better Implementation

dim3 dimBlock (TILE_WIDTH, TILE_WIDTH);

dim3 dimGrid (Width / TILE_WIDTH, Width / TILE_WIDTH);
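A host-side sketch of the surrounding setup and launch (hypothetical names h_a, h_b, h_ab and device buffers; assumes Width is a multiple of TILE_WIDTH and error checking is omitted):

size_t bytes = Width * Width * sizeof(float);
float *d_a, *d_b, *d_ab;
cudaMalloc((void**)&d_a, bytes);
cudaMalloc((void**)&d_b, bytes);
cudaMalloc((void**)&d_ab, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
mat_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_ab, Width);

cudaMemcpy(h_ab, d_ab, bytes, cudaMemcpyDeviceToHost);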

Page 49: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 49

Better Implementation (cont.)

__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x, by = blockIdx.y;

  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

  // calculate the row & col index
  int row = by * blockDim.y + ty;
  int col = bx * blockDim.x + tx;

  float result = 0;

Page 50: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 50

Better Implementation (cont.)

  // loop over the tiles of the input in phases
  for (int p = 0; p < width / TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__
    s_a[ty][tx] = a[row * width + (p * TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p * TILE_WIDTH + ty) * width + col];
    __syncthreads();

    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }

  ab[row * width + col] = result;
}

Page 51: CUDA Lecture 8 CUDA Memories

Two barriers per phase:
  __syncthreads after all data is loaded into __shared__ memory
  __syncthreads after all data is read from __shared__ memory

Note that the second __syncthreads in phase p guards the loads in phase p+1.

Use barriers to guard data:
  Guard against using uninitialized data.
  Guard against bashing live data.

CUDA Memories – Slide 51

Use of Barriers in mat_mul

Page 52: CUDA Lecture 8 CUDA Memories

Each thread block should have many threads:
  TILE_WIDTH = 16 gives 16 * 16 = 256 threads.

There should be many thread blocks:
  1024-by-1024 matrices give 64 * 64 = 4096 thread blocks.
  TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads: full occupancy.

Each thread block performs 2 * 256 = 512 32-bit loads from global memory for 256 * (2 * 16) = 8,192 FP operations.
  Memory bandwidth is no longer a limiting factor.

CUDA Memories – Slide 52

First Order Size Considerations

Page 53: CUDA Lecture 8 CUDA Memories

Experiment performed on a GT200.
This optimization was clearly worth the effort.
Better performance is still possible in theory.

CUDA Memories – Slide 53

Optimization Analysis

Implementation          Original      Improved
Global loads            2N^3          2N^2 * (N/TILE_WIDTH)
Throughput              10.7 GFLOPS   183.9 GFLOPS
SLOCs                   20            44
Relative improvement    1x            17.2x
Improvement/SLOC        1x            7.8x

Page 54: CUDA Lecture 8 CUDA Memories

Effective use of different memory resources reduces the number of accesses to global memory

These resources are finite!
The more memory locations each thread requires, the fewer threads an SM can accommodate.

CUDA Memories – Slide 54

Memory Resources as Limit to Parallelism

Resource             Per GT200 SM   Full occupancy on GT200
Registers            16384          ≤ 16384 / 768 threads = 21 per thread
__shared__ memory    16 KB          ≤ 16 KB / 8 blocks = 2 KB per block

Page 55: CUDA Lecture 8 CUDA Memories

Each SM in GT200 has 16 KB of shared memory (the SM size is implementation dependent!).

For TILE_WIDTH = 16, each thread block uses 2 * 256 * 4 B = 2 KB of shared memory.

So up to 8 thread blocks can potentially be actively executing per SM.
  This allows up to 8 * 512 = 4,096 pending loads (2 per thread, 256 threads per block).

The next TILE_WIDTH, 32, would lead to 2 * 32 * 32 * 4 B = 8 KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time.

Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16.
  The 112 GB/s bandwidth can now support (112/4) * 16 = 448 GFLOPS!

CUDA Memories – Slide 55

GT200 Shared Memory and Threading

Page 56: CUDA Lecture 8 CUDA Memories

CUDA Memories – Slide 56

TILE_SIZE Effects

Page 57: CUDA Lecture 8 CUDA Memories

Global variable declarations
  __host__
  __device__ ... __global__, __constant__, texture

Function prototypes
  __global__ void kernelOne(…)
  float handyFunction(…)

main()
  allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  execution configuration setup
  kernel call – kernelOne<<<execution configuration>>>(args…);
  transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
  optional: compare against golden (host-computed) solution
  (repeat these steps as needed)

Kernel – void kernelOne(type args, …)
  variable declarations – __local__, __shared__
  automatic variables transparently assigned to registers or local memory
  __syncthreads() …

Other functions
  float handyFunction(int inVar, …);

CUDA Memories – Slide 57

Summary: Typical Structure of a CUDA Program

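A minimal compilable program following this structure (a sketch with hypothetical names; error checking and the golden-solution comparison are omitted):

#include <cuda_runtime.h>
#include <stdio.h>

#define N 256

__constant__ float scale = 2.0f;                 // global-scope constant

__global__ void kernelOne(float *data)           // kernel definition
{
  __shared__ float s_data[N];                    // __shared__ declaration
  int i = blockIdx.x * blockDim.x + threadIdx.x; // automatic variable -> register
  s_data[threadIdx.x] = data[i];
  __syncthreads();
  data[i] = s_data[threadIdx.x] * scale;
}

int main(void)
{
  float h_data[N], *d_data;
  for (int i = 0; i < N; ++i) h_data[i] = (float)i;

  cudaMalloc((void**)&d_data, N * sizeof(float));                        // allocate on device
  cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice); // host -> device
  kernelOne<<<1, N>>>(d_data);                                           // kernel call
  cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost); // device -> host

  printf("h_data[1] = %f\n", h_data[1]);         // would be compared against a host result
  cudaFree(d_data);
  return 0;
}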

Page 58: CUDA Lecture 8 CUDA Memories

Effective use of CUDA memory hierarchy decreases bandwidth consumption to increase throughput

Use __shared__ memory to eliminate redundant loads from global memory.
Use __syncthreads barriers to protect __shared__ data.

Use atomics if access patterns are sparse or unpredictable

Optimization comes with a development cost.
Memory resources ultimately limit parallelism.

CUDA Memories – Slide 58

Final Thoughts

Page 59: CUDA Lecture 8 CUDA Memories

Reading: Chapter 5, “Programming Massively Parallel Processors” by Kirk and Hwu.

Based on original material from
  The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 8/9/2011.

CUDA Memories – Slide 59

End Credits

