Page 1

CUDA Memory Model

Some material © David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission)

Page 2

G80 Implementation of CUDA Memories

• Each thread can:
  • Read/write per-thread registers
  • Read/write per-thread local memory
  • Read/write per-block shared memory
  • Read/write per-grid global memory
  • Read-only per-grid constant memory

[Diagram: a grid containing blocks (0,0) and (1,0); each block has its own shared memory and per-thread registers for threads (0,0) and (1,0); global and constant memory are shared by the whole grid and accessible from the host.]

Page 3

CUDA Variable Type Qualifiers

  Variable declaration                            Memory     Scope    Lifetime
  __device__ __local__    int LocalVar;           local      thread   thread
  __device__ __shared__   int SharedVar;          shared     block    block
  __device__              int GlobalVar;          global     grid     application
  __device__ __constant__ int ConstantVar;        constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  • Except arrays, which reside in local memory
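A minimal sketch (not from the slides; the names coeff, globalScale and QualifierDemo are made up) showing where each kind of variable ends up:

__constant__ float coeff[16];        // __constant__: per-grid, read-only in kernels, set from the host

__device__ float globalScale;        // __device__: global memory, lives for the whole application

__global__ void QualifierDemo(float* out, int n) {
    __shared__ float tile[256];                    // __shared__: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = coeff[threadIdx.x % 16];           // automatic scalar -> register
    float scratch[8];                              // automatic array  -> local memory
    scratch[threadIdx.x % 8] = tmp;
    tile[threadIdx.x] = tmp + scratch[threadIdx.x % 8];   // assumes blockDim.x <= 256
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * globalScale;
}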


Page 4

Where to Declare Variables?

Can host access it?
  yes → global or constant memory → declare outside of any function
  no  → register/automatic, shared, or local → declare in the kernel


Page 5

Variable Type Restrictions

• Pointers can only point to memory allocated or declared in global memory:
  • Allocated in the host and passed to the kernel:
      __global__ void KernelFunc(float* ptr)
  • Obtained as the address of a global variable:
      float* ptr = &GlobalVar;
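A hedged sketch of the two cases (MyKernel, d_buf and the 256-element size are illustrative, not from the slides):

__device__ float GlobalVar;

__global__ void MyKernel(float* ptr) {   // case 1: pointer allocated on the host and passed in
    float* p = &GlobalVar;               // case 2: address of a variable declared in global memory
    ptr[threadIdx.x] = *p;
}

// Host side, inside some function:
float* d_buf = 0;
cudaMalloc((void**)&d_buf, 256 * sizeof(float));
MyKernel<<<1, 256>>>(d_buf);
cudaFree(d_buf);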


Page 6

A Common Programming Strategy

• Global memory resides in device memory (DRAM) - much slower access than shared memory
• Tile data to take advantage of fast shared memory:
  • Partition data into subsets that fit into shared memory
  • Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory to global memory

(A generic sketch of this load / compute / store pattern follows; slides 19-20 give the full matrix-multiplication version.)
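The skeleton below assumes a 1-D tile of TILE floats per block; the kernel name and the trivial neighbour-sum computation are placeholders, not from the slides:

#define TILE 256

__global__ void TiledPattern(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    // 1. Cooperative load: each thread brings one element into shared memory
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                  // the whole tile must be loaded before anyone computes

    // 2. Compute from the shared copy (a trivial neighbour sum as a stand-in)
    float result = tile[threadIdx.x];
    if (threadIdx.x + 1 < TILE) result += tile[threadIdx.x + 1];

    // 3. Write results back to global memory
    if (i < n) out[i] = result;
}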


Page 7

A Common Programming Strategy (Cont.)

• Constant memory also resides in device memory (DRAM) - much slower access than shared memory
• But... cached!
  • Highly efficient access for read-only data
• Carefully divide data according to access patterns:
  • Read-only                       → constant memory (very fast if in cache)
  • Read/write, shared within block → shared memory (very fast)
  • Read/write within each thread   → registers (very fast)
  • Read/write inputs/results       → global memory (very slow)
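As an illustration of the read-only case (a sketch; the symbol filterCoeffs and kernel ApplyFilter are made up), constant memory is declared at file scope and filled from the host with cudaMemcpyToSymbol:

__constant__ float filterCoeffs[16];

__global__ void ApplyFilter(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * filterCoeffs[i % 16];   // every thread reads the cached constants
}

// Host side, inside some function:
float h_coeffs[16] = { /* filter weights */ };
cudaMemcpyToSymbol(filterCoeffs, h_coeffs, sizeof(h_coeffs));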


Page 8

Matrix Multiplication using Shared Memory


Page 9

Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    // Calculate the row index of the Pd element (and of Md)
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd (and of Nd)
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}
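For reference, a hedged sketch of how this kernel might be driven from the host (h_M, h_N and h_P are assumed host arrays; Width is assumed to be a multiple of TILE_WIDTH, matching the execution configuration on slide 18):

// Inside some host function:
size_t bytes = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
cudaMalloc((void**)&Md, bytes);
cudaMalloc((void**)&Nd, bytes);
cudaMalloc((void**)&Pd, bytes);
cudaMemcpy(Md, h_M, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(Nd, h_N, bytes, cudaMemcpyHostToDevice);

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

cudaMemcpy(h_P, Pd, bytes, cudaMemcpyDeviceToHost);
cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd);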

Page 10

How about performance on G80?

• All threads access global memory for their input matrix elements
  • Two memory accesses (8 bytes) per floating-point multiply-add
  • 4 bytes of memory bandwidth needed per FLOP
  • 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  • The available 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Memory-hierarchy diagram repeated from slide 2.]
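The two bounds quoted above come from the same bytes-per-FLOP ratio, restated here:

\[
4\,\tfrac{\text{B}}{\text{FLOP}} \times 346.5\,\text{GFLOPS} = 1386\,\tfrac{\text{GB}}{\text{s}},
\qquad
\frac{86.4\,\text{GB/s}}{4\,\text{B/FLOP}} = 21.6\,\text{GFLOPS}.
\]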


Page 11

Idea: Use Shared Memory to reuse global memory data

• Each input element is read by Width threads.
• Load each element into shared memory and have several threads use the local copy to reduce memory bandwidth

[Diagram: WIDTH x WIDTH matrices M, N and P; thread (tx, ty) reads one row of M and one column of N to produce one element of P.]


Page 12

Tiled matrix

$$
A \;=\;
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots \\
a_{2,1} & a_{2,2} & \cdots \\
\vdots  & \vdots  & \ddots
\end{pmatrix}
=
\begin{pmatrix}
\begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix} & \cdots \\
\vdots & \ddots
\end{pmatrix}
=
\begin{pmatrix}
A_{1,1} & \cdots \\
\vdots  & \ddots
\end{pmatrix}
$$

$$
(ab)_{i,j} \;=\; \sum_{k=1}^{p} a_{i,k}\, b_{k,j},
\qquad
(AB)_{i,j} \;=\; \sum_{k=1}^{p/m} A_{i,k}\, B_{k,j}
$$


Page 13

Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Diagram: Md, Nd and Pd (each WIDTH x WIDTH) partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes the tile Pdsub, and thread (tx, ty) computes one element of it.]


Page 14

Tiled multiply

[Diagram: a 4x4 example - elements Md0,0..Md3,1, Nd0,0..Nd1,3 and Pd0,0..Pd3,3 arranged to show how 2x2 tiles of Pd are produced from tiles of Md and Nd.]


Page 15

Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

Access order (top to bottom):

  P0,0 (thread0,0)   P1,0 (thread1,0)   P0,1 (thread0,1)   P1,1 (thread1,1)
  M0,0 * N0,0        M0,0 * N1,0        M0,1 * N0,0        M0,1 * N1,0
  M1,0 * N0,1        M1,0 * N1,1        M1,1 * N0,1        M1,1 * N1,1
  M2,0 * N0,2        M2,0 * N1,2        M2,1 * N0,2        M2,1 * N1,2
  M3,0 * N0,3        M3,0 * N1,3        M3,1 * N0,3        M3,1 * N1,3


Page 16

Each phase of a thread block uses one tile from Md and one from Nd
(within each phase, every thread first loads one Md element and one Nd element into shared memory, then updates its running product; time runs left to right)

T0,0   Phase 1:  Md0,0 -> Mds0,0 | Nd0,0 -> Nds0,0 | PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
       Phase 2:  Md2,0 -> Mds0,0 | Nd0,2 -> Nds0,0 | PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1

T1,0   Phase 1:  Md1,0 -> Mds1,0 | Nd1,0 -> Nds1,0 | PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
       Phase 2:  Md3,0 -> Mds1,0 | Nd1,2 -> Nds1,0 | PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1

T0,1   Phase 1:  Md0,1 -> Mds0,1 | Nd0,1 -> Nds0,1 | PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
       Phase 2:  Md2,1 -> Mds0,1 | Nd0,3 -> Nds0,1 | PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1

T1,1   Phase 1:  Md1,1 -> Mds1,1 | Nd1,1 -> Nds1,1 | PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
       Phase 2:  Md3,1 -> Mds1,1 | Nd1,3 -> Nds1,1 | PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1


Page 17

First-order Size Considerations in G80

• Each thread block should have many threads
  • TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  • A 1024*1024 Pd gives 64*64 = 4096 thread blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
  • Memory bandwidth is no longer a limiting factor

Page 18

CUDA Code - Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
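With this configuration (and assuming Width is a multiple of TILE_WIDTH), the tiled kernel on the next two slides would be launched as:

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);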


Page 19

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of Pd to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;


Page 20

    // Loop over the Md and Nd tiles required to compute Pd
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m * TILE_WIDTH + ty) * Width];
        __syncthreads();                       // wait until both tiles are fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();                       // wait before the tiles are overwritten
    }

    Pd[Row * Width + Col] = Pvalue;
}


Page 21

G80 Shared Memory and Threading

• Each SM in G80 has 16 KB of shared memory
  • Shared memory size is implementation dependent!
• For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory
  • Each SM can potentially have up to 8 thread blocks actively executing
  • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  • The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce accesses to global memory by a factor of 16
  • The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
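Restating the arithmetic behind these figures (TILE_WIDTH = 16, 4-byte floats, 16 KB of shared memory per SM):

\[
2 \times 16 \times 16 \times 4\,\text{B} = 2\,\text{KB per block},
\qquad
\frac{16\,\text{KB per SM}}{2\,\text{KB per block}} = 8 \text{ blocks},
\qquad
\frac{86.4\,\text{GB/s}}{4\,\text{B/FLOP}} \times 16 = 345.6\,\text{GFLOPS}.
\]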


Page 22

Block size effect on throughput

[Chart: measured throughput as a function of block size (x-axis: Block Size, ticks 0-20).]

Page 23

Loop unrolling effects


Page 24

CUDA Memory Model


Page 25

Global Memory

• Memory is not cached, so using the right access pattern is crucial
• Memory access (to global memory) is very costly
• Variables must be aligned for efficient access (they usually are)
• Simultaneous memory accesses in a half-warp can be coalesced into one transaction
• Coalescing rules depend on the device's compute capability

(A small illustration of coalesced vs. strided access follows.)
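A rough illustration (a sketch, not from the slides): the first kernel generates coalesced accesses, while the second, with a stride, generally does not on these devices:

// Coalesced: consecutive threads in a half-warp touch consecutive 32-bit words
__global__ void CopyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Usually non-coalesced on compute capability 1.0/1.1: the stride breaks the sequential pattern
__global__ void CopyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}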


Page 26

CUDA Programming Guide Version 2.2.1, Chapter 5. Performance Guidelines (p. 82):

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

Global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a half-warp (during the execution of a single read or write instruction) can be coalesced into a single memory transaction of 32, 64, or 128 bytes.

The rest of this section describes the various requirements for memory accesses to coalesce based on the compute capability of the device. If a half-warp fulfills these requirements, coalescing is achieved even if the warp is divergent and some threads of the half-warp do not actually access memory.

For the purpose of the following discussion, global memory is considered to be partitioned into segments of size equal to 32, 64, or 128 bytes and aligned to this size.

Coalescing on Devices with Compute Capability 1.0 and 1.1

The global memory access by all threads of a half-warp is coalesced into one or two memory transactions if it satisfies the following three conditions:

• Threads must access
  • Either 32-bit words, resulting in one 64-byte memory transaction,
  • Or 64-bit words, resulting in one 128-byte memory transaction,
  • Or 128-bit words, resulting in two 128-byte memory transactions;
• All 16 words must lie in the same segment of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words);
• Threads must access the words in sequence: the kth thread in the half-warp must access the kth word.

If a half-warp does not fulfill all the requirements above, a separate memory transaction is issued for each thread and throughput is significantly reduced.

Figure 5-1 shows some examples of coalesced memory accesses, while Figure 5-2 and Figure 5-3 show some examples of memory accesses that are non-coalesced for devices of compute capability 1.0 or 1.1.

Coalesced 64-bit accesses deliver a little lower bandwidth than coalesced 32-bit accesses, and coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses. But, while bandwidth for non-coalesced accesses is around an order of magnitude lower than for coalesced accesses when these accesses are 32-bit, it is only around four times lower when they are 64-bit and around two times when they are 128-bit.

Coalescing on Devices with Compute Capability 1.2 and Higher

The global memory access by all threads of a half-warp is coalesced into a single memory transaction as soon as the words accessed by all threads lie in the same segment of size equal to:

• 32 bytes if all threads access 8-bit words,
• 64 bytes if all threads access 16-bit words,
• 128 bytes if all threads access 32-bit or 64-bit words.


Page 27

CUDA Programming Guide Version 2.2.1, Chapter 5. Performance Guidelines (p. 84):

Left: coalesced float memory access, resulting in a single memory transaction.

Right: coalesced float memory access (divergent warp), resulting in a single memory transaction.

Figure 5-1. Examples of Coalesced Global Memory Access Patterns

[Figure 5-1 diagrams: threads 0-15 each accessing one 4-byte word at addresses 128-188.]


Page 28

CUDA Programming Guide Version 2.2.1, Chapter 5. Performance Guidelines (p. 85):

Left: non-sequential float memory access, resulting in 16 memory transactions.

Right: access with a misaligned starting address, resulting in 16 memory transactions.

Figure 5-2. Examples of Global Memory Access Patterns That Are Non-Coalesced for Devices of Compute Capability 1.0 or 1.1

[Figure 5-2 diagrams: threads 0-15 accessing words at addresses 128-188, either out of sequence or from a misaligned starting address.]


Page 29

CUDA Programming Guide Version 2.2.1, Chapter 5. Performance Guidelines (p. 86):

Left: non-contiguous float memory access, resulting in 16 memory transactions.

Right: non-coalesced float3 memory access, resulting in 16 memory transactions.

Figure 5-3. Examples of Global Memory Access Patterns That Are Non-Coalesced for Devices of Compute Capability 1.0 or 1.1

[Figure 5-3 diagrams: threads accessing non-contiguous words and float3 elements at addresses 128-188.]


Page 30

(Same Programming Guide excerpt as slide 26: coalescing requirements for compute capability 1.0/1.1 and for 1.2 and higher.)


Page 31

CUDA Programming Guide Version 2.2.1, Chapter 5. Performance Guidelines (p. 87):

Left: random float memory access within a 64B segment, resulting in one memory transaction.

Center: misaligned float memory access, resulting in one transaction.

Right: misaligned float memory access, resulting in two transactions.

Figure 5-4. Examples of Global Memory Access by Devices with Compute Capability 1.2 and Higher

[Figure 5-4 diagrams: threads 0-15 mapped onto word addresses within 32-, 64- and 128-byte segments (addresses 96-256).]


Page 32

Shared memory

• As fast as register access
  • ...if there are no bank conflicts
• Divided into 16 equally-sized banks on compute capability 1.x
• n addresses falling in n different banks = one transaction
• n addresses falling in one bank = n transactions!
• A bank conflict can be avoided when several threads read the same word (broadcast)
• Successive 32-bit words are assigned to successive banks
• Each bank has a bandwidth of 32 bits per two clock cycles
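The word-to-bank mapping implied by the last two bullets can be written as (compute capability 1.x: 4-byte words, 16 banks):

\[
\text{bank}(\textit{byte address}) \;=\; \left\lfloor \frac{\textit{byte address}}{4} \right\rfloor \bmod 16 .
\]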


Page 33

More on bank conflicts

• If the warp size is 32 and the number of banks is 16, a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp
• So: there can be no bank conflict between threads belonging to different halves of the same warp!


Page 34

Typical case

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

• Do threads tid and tid+n conflict? Avoiding conflicts requires
  • s*m % 16 != s*n % 16
  • for any m != n with |m-n| < 16
• s = 1 (or, more generally, any odd s) is the safe stride
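A worked instance of the condition above (16 banks, half-warp of 16 threads):

\[
s = 2:\ \ 2m \equiv 2n \pmod{16} \iff m \equiv n \pmod{8},\ \text{so threads } tid \text{ and } tid+8 \text{ share a bank (2-way conflict)};
\qquad
s \text{ odd}:\ \gcd(s,16)=1,\ \text{so the 16 threads hit 16 distinct banks.}
\]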


Page 35

Beware of data size!

CUDA Programming Guide Version 2.2.1, Chapter 5. Performance Guidelines (p. 90):

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.

In the case of the shared memory space, the banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per two clock cycles.

For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16 (see Section 5.1); a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.

A common case is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

In this case, the threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks m or, equivalently, whenever n is a multiple of m/d where d is the greatest common divisor of m and s. As a consequence, there will be no bank conflict only if half the warp size is less than or equal to m/d. For devices of compute capability 1.x, this translates to no bank conflict only if d is equal to 1, or in other words, only if s is odd since m is a power of two.

Figure 5-5 and Figure 5-6 show some examples of conflict-free memory accesses while Figure 5-7 shows some examples of memory accesses that cause bank conflicts.

Other cases worth mentioning are when each thread accesses an element that is smaller or larger than 32 bits in size. For example, there are bank conflicts if an array of char is accessed the following way:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

char data = shared[BaseIndex + 4 * tid];

There are also 2-way bank conflicts for arrays of double:

__shared__ double shared[32];
double data = shared[BaseIndex + tid];

since the memory request is compiled into two separate 32-bit requests. One way to avoid bank conflicts in this case is to split the double operands like in the following sample code:

__shared__ int shared_lo[32];

Char is too small
