CUDA Memory Model
Some material © David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission)
G80 Implementation of CUDA Memories
• Each thread can:
• Read/write per-thread registers
• Read/write per-thread local memory
• Read/write per-block shared memory
• Read/write per-grid global memory
• Read-only per-grid constant memory
[Diagram: the Host reads/writes the Grid's Global Memory and Constant Memory; within the Grid, each Block (0,0), (1,0), ... has its own Shared Memory, and each Thread its own Registers.]
CUDA Variable Type Qualifiers

Variable declaration                            Memory     Scope    Lifetime
__device__ __local__    int LocalVar;           local      thread   thread
__device__ __shared__   int SharedVar;          shared     block    block
__device__              int GlobalVar;          global     grid     application
__device__ __constant__ int ConstantVar;        constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
• Except arrays, which reside in local memory
Where to Declare Variables?
Can the host access it?
• Yes (global, constant): declare it outside of any function
• No (register/automatic, shared, local): declare it in the kernel
Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
• Allocated on the host and passed to the kernel:
• __global__ void KernelFun (float* ptr)
• Obtained as the address of a global variable:
• float* ptr = &GlobalVar;
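A minimal sketch of both cases (KernelFunc, GlobalVar and d_data are illustrative names, not from the deck):

__device__ float GlobalVar;                      // statically declared in global memory

__global__ void KernelFunc(float* ptr, int n) {  // ptr points to host-allocated global memory
    float* p = &GlobalVar;                       // pointer obtained from a global variable
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) ptr[i] += *p;
}

// Host side:
//   float* d_data;
//   cudaMalloc((void**)&d_data, n * sizeof(float));
//   KernelFunc<<<(n + 255) / 256, 256>>>(d_data, n);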
A Common Programming Strategy
• Global memory resides in device memory (DRAM) - much slower access than shared memory
• Tile data to take advantage of fast shared memory:
• Partition data into subsets that fit into shared memory
• Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
• Copying results from shared memory back to global memory
(a generic skeleton of this pattern follows; the full matrix-multiply version appears later in the deck)
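A generic skeleton of that load/compute/store pattern, assuming a square problem whose width is a multiple of an (illustrative) tile size:

#define TILE 16                                           // illustrative tile size

__global__ void TiledKernel(const float* in, float* out, int width) {
    __shared__ float tile[TILE][TILE];                    // the subset that fits in shared memory
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // 1. cooperative load from global memory
    __syncthreads();                                      //    wait until the whole tile is loaded
    float result = 2.0f * tile[threadIdx.y][threadIdx.x]; // 2. placeholder computation on the tile
    __syncthreads();                                      //    everyone is done before the tile is reused
    out[y * width + x] = result;                          // 3. copy results back to global memory
}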
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM) - much slower access than shared memory
• But… cached!
• Highly efficient access for read-only data
• Carefully divide data according to access patterns:
• Read-only → constant memory (very fast if in cache)
• Read/write, shared within a block → shared memory (very fast)
• Read/write within each thread → registers (very fast)
• Read/write inputs/results → global memory (very slow)
(a small constant-memory sketch follows)
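A hedged sketch of the read-only case using constant memory (the coeffs table and kernel are made up for illustration):

__constant__ float coeffs[16];                       // read-only for kernels, cached on chip

__global__ void ScaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= coeffs[i % 16];            // every thread reads the cached constants
}

// Host side: fill constant memory before launching
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));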
Matrix Multiplication using Shared Memory
Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd,
                                float* Pd, int Width) {
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

  Pd[Row*Width+Col] = Pvalue;
}
How about performance on G80?
• All threads access global memory for their input matrix elements
• Two memory accesses (8 bytes) per floating-point multiply-add
• That is 4 bytes of memory bandwidth per FLOP
• 4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
• The available 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
Idea: Use Shared Memory to reuse global memory data
[Diagram: WIDTH x WIDTH matrices M, N and P; thread (tx, ty) computes one element of P.]
• Each input element is read by Width threads.
• Load each element into shared memory once and have several threads use the local copy, reducing the demand on global memory bandwidth.
Tiled matrix

Splitting a p x p matrix into m x m tiles turns the length-p scalar sum into a sum of p/m tile products:

A = \begin{pmatrix}
      a_{1,1} & a_{1,2} & \cdots \\
      a_{2,1} & a_{2,2} & \cdots \\
      \vdots  & \vdots  & \ddots
    \end{pmatrix}
  = \begin{pmatrix}
      \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix} & \cdots \\
      \vdots & \ddots
    \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} & \cdots \\
      \vdots  & \ddots
    \end{pmatrix}

(ab)_{i,j} = \sum_{k=1}^{p} a_{i,k}\, b_{k,j}
\qquad
(AB)_{i,j} = \sum_{k=1}^{p/m} A_{i,k}\, B_{k,j}
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Diagram: Pd is partitioned into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub; block (bx, by) and thread (tx, ty) work through one tile of Md and one tile of Nd per phase.]
Tiled multiply
[Diagram: 4x4 example - the elements Md0,0..Md3,1 and Nd0,0..Nd1,3 together with the resulting Pd0,0..Pd3,3; each 2x2 tile of Pd is produced by one thread block.]
Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

Access order   P0,0 (thread0,0)   P1,0 (thread1,0)   P0,1 (thread0,1)   P1,1 (thread1,1)
1              M0,0 * N0,0        M0,0 * N1,0        M0,1 * N0,0        M0,1 * N1,0
2              M1,0 * N0,1        M1,0 * N1,1        M1,1 * N0,1        M1,1 * N1,1
3              M2,0 * N0,2        M2,0 * N1,2        M2,1 * N0,2        M2,1 * N1,2
4              M3,0 * N0,3        M3,0 * N1,3        M3,1 * N0,3        M3,1 * N1,3
Each phase of a thread block uses one tile from Md and one from Nd

Phase 1 (steps 1-3):
  T0,0: Md0,0 → Mds0,0   Nd0,0 → Nds0,0   Pvalue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md1,0 → Mds1,0   Nd1,0 → Nds1,0   Pvalue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md0,1 → Mds0,1   Nd0,1 → Nds0,1   Pvalue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md1,1 → Mds1,1   Nd1,1 → Nds1,1   Pvalue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Phase 2 (steps 4-6):
  T0,0: Md2,0 → Mds0,0   Nd0,2 → Nds0,0   Pvalue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md3,0 → Mds1,0   Nd1,2 → Nds1,0   Pvalue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md2,1 → Mds0,1   Nd0,3 → Nds0,1   Pvalue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md3,1 → Mds1,1   Nd1,3 → Nds1,1   Pvalue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

(time runs left to right within each phase)
First-order Size Considerations in G80
• Each thread block should have many threads
• TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
• A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
• Memory bandwidth is no longer a limiting factor
CUDA Code – Kernel Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH,
Width / TILE_WIDTH);
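The slide stops at the configuration; the matching launch would look roughly like this (assuming Md, Nd and Pd are device pointers that have already been allocated and filled):

// Launch one thread per Pd element
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);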
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd,
                                float* Pd, int Width) {
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of Pd to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;
  // Loop over the Md and Nd tiles required to compute Pd
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
    __syncthreads();                      // wait until the whole tile is loaded

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();                      // all threads done with the tile before it is overwritten
  }
  Pd[Row*Width+Col] = Pvalue;
}
G80 Shared Memory and Threading
• Each SM in G80 has 16 KB of shared memory
• Shared memory size is implementation dependent! (see the query sketch after this slide)
• For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory
• So up to 8 thread blocks can potentially be executing actively per SM
• This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
• The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
• The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
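Because the shared memory size is implementation dependent, it can be queried at run time; a small host-side sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // properties of device 0
    std::printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("Registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}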
Block size effect on throughput
[Plot: effect of block size (0-20) on throughput.]
Loop unrolling effects
CUDA Memory Model
Global Memory
• Not cached on these (compute capability 1.x) devices, so using the right access pattern is crucial
• Accesses to global memory are very costly
• Variables must be aligned for efficient access (they usually are)
• Simultaneous accesses by the threads of a half-warp can be coalesced into one transaction
• The coalescing rules depend on the device's compute capability
(a small example follows; the guide excerpt below gives the exact rules)
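As a concrete illustration of the bullets above (kernel names are made up), a coalesced read next to a strided read that breaks the 1.0/1.1 rules:

// Half-warp reads 16 consecutive 32-bit words: coalesced into one transaction.
__global__ void CoalescedRead(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Half-warp reads words 32 elements apart: the words fall in different segments,
// so on compute capability 1.0/1.1 this costs 16 separate transactions.
__global__ void StridedRead(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (32 * i < n) out[i] = in[32 * i];
}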
(Excerpt from the CUDA Programming Guide 2.2.1, Chapter 5: Performance Guidelines)

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

However, global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a half-warp (during the execution of a single read or write instruction) can be coalesced into a single memory transaction of 32, 64, or 128 bytes.

The rest of this section describes the various requirements for memory accesses to coalesce based on the compute capability of the device. If a half-warp fulfills these requirements, coalescing is achieved even if the warp is divergent and some threads of the half-warp do not actually access memory.

For the purpose of the following discussion, global memory is considered to be partitioned into segments of size equal to 32, 64, or 128 bytes and aligned to this size.

Coalescing on Devices with Compute Capability 1.0 and 1.1

The global memory access by all threads of a half-warp is coalesced into one or two memory transactions if it satisfies the following three conditions:
• Threads must access
  - either 32-bit words, resulting in one 64-byte memory transaction,
  - or 64-bit words, resulting in one 128-byte memory transaction,
  - or 128-bit words, resulting in two 128-byte memory transactions;
• All 16 words must lie in the same segment of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words);
• Threads must access the words in sequence: the kth thread in the half-warp must access the kth word.

If a half-warp does not fulfill all the requirements above, a separate memory transaction is issued for each thread and throughput is significantly reduced.

Figure 5-1 shows some examples of coalesced memory accesses, while Figure 5-2 and Figure 5-3 show some examples of memory accesses that are non-coalesced for devices of compute capability 1.0 or 1.1.

Coalesced 64-bit accesses deliver a little lower bandwidth than coalesced 32-bit accesses, and coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses. But, while bandwidth for non-coalesced accesses is around an order of magnitude lower than for coalesced accesses when these accesses are 32-bit, it is only around four times lower when they are 64-bit and around two times lower when they are 128-bit.

Coalescing on Devices with Compute Capability 1.2 and Higher

The global memory access by all threads of a half-warp is coalesced into a single memory transaction as soon as the words accessed by all threads lie in the same segment of size equal to:
• 32 bytes if all threads access 8-bit words,
• 64 bytes if all threads access 16-bit words,
• 128 bytes if all threads access 32-bit or 64-bit words.
Left: coalesced float memory access, resulting in a single memory transaction.
Right: coalesced float memory access (divergent warp), resulting in a single memory transaction.
Figure 5-1. Examples of Coalesced Global Memory Access Patterns
Left: non-sequential float memory access, resulting in 16 memory transactions.
Right: access with a misaligned starting address, resulting in 16 memory transactions.
Figure 5-2. Examples of Global Memory Access Patterns That Are Non-Coalesced for Devices of Compute Capability 1.0 or 1.1
Left: non-contiguous float memory access, resulting in 16 memory transactions.
Right: non-coalesced float3 memory access, resulting in 16 memory transactions.
Figure 5-3. Examples of Global Memory Access Patterns That Are Non-Coalesced for Devices of Compute Capability 1.0 or 1.1
Left: random float memory access within a 64B segment, resulting in one memory transaction.
Center: misaligned float memory access, resulting in one transaction.
Right: misaligned float memory access, resulting in two transactions.
Figure 5-4. Examples of Global Memory Access by Devices with Compute Capability 1.2 and Higher
Shared memory
• As fast as register access, if there are no bank conflicts
• 16 equally-sized banks (modules) on compute capability 1.x
• n addresses falling in n different banks = one transaction
• n addresses falling in one bank = n transactions!
• A bank conflict can be avoided on read-only access: a word read by several threads is broadcast
• Successive 32-bit words are assigned to successive banks
• Each bank has a bandwidth of 32 bits per two clock cycles
(a small illustration follows)
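A tiny sketch of the bank mapping (the bank of 32-bit word i is i % 16 on these devices); the kernel is illustrative only:

__global__ void BankDemo(float* out) {          // launch with 256 threads per block
    __shared__ float buf[256];
    buf[threadIdx.x] = (float)threadIdx.x;      // consecutive words -> consecutive banks, no conflict
    __syncthreads();
    float a = buf[threadIdx.x];                 // 16 threads of a half-warp hit 16 different banks
    float b = buf[(2 * threadIdx.x) % 256];     // stride 2: threads 0 and 8 both hit bank 0 -> 2-way conflict
    out[threadIdx.x] = a + b;
}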
More on bank conflicts
• The warp size is 32 but there are only 16 banks, so a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half
• So: there can be no bank conflict between threads belonging to different halves of the same warp!
Typical case

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

• tid vs. tid+n: bank conflicts?
• No conflict requires s*m % 16 ≠ s*n % 16 for all m ≠ n with |m-n| < 16
• This holds exactly when s is odd (gcd(s, 16) = 1); s = 1 is the simplest safe stride
(a padded-array sketch follows)
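A related trick that is not on the slide but follows from the stride rule: when a 2D shared array is read down a column, the effective stride is the row length, so padding each row by one word makes the stride odd and removes the conflicts. A sketch for a single 16x16 block:

__global__ void TransposeDemo(const float* in, float* out) {
    // A 16x16 tile read column-wise has stride 16 -> every thread of a half-warp
    // would hit the same bank. Padding the rows to 17 words makes the stride odd.
    __shared__ float tile[16][17];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 16 + x];        // row-wise load: coalesced and conflict-free
    __syncthreads();
    out[y * 16 + x] = tile[x][y];       // column-wise shared read, conflict-free thanks to the padding
}

// Launch: TransposeDemo<<<1, dim3(16, 16)>>>(d_in, d_out);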
Beware of data size!

(Excerpt from the CUDA Programming Guide 2.2.1, Chapter 5: Performance Guidelines)

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.

In the case of the shared memory space, the banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per two clock cycles.

For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16; a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.

A common case is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

In this case, the threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks m, or equivalently, whenever n is a multiple of m/d where d is the greatest common divisor of m and s. As a consequence, there will be no bank conflict only if half the warp size is less than or equal to m/d. For devices of compute capability 1.x, this translates to no bank conflict only if d is equal to 1, or in other words, only if s is odd since m is a power of two.

Other cases worth mentioning are when each thread accesses an element that is smaller or larger than 32 bits in size. For example, there are bank conflicts if an array of char is accessed the following way (char is too small):

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

char data = shared[BaseIndex + 4 * tid];

There are also 2-way bank conflicts for arrays of double:

__shared__ double shared[32];
double data = shared[BaseIndex + tid];

since the memory request is compiled into two separate 32-bit requests. One way to avoid bank conflicts in this case is to split the double operands like in the following sample code:

__shared__ int shared_lo[32];
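The excerpt above is cut off in the middle of the guide's split-double listing; a hedged sketch of the idea it introduces, using the __double2loint, __double2hiint and __hiloint2double intrinsics (the exact original listing may differ):

__global__ void SplitDoubleDemo(const double* in, double* out) {   // launch with 32 threads
    __shared__ int shared_lo[32];
    __shared__ int shared_hi[32];
    int tid = threadIdx.x;
    double dataIn = in[tid];
    shared_lo[tid] = __double2loint(dataIn);   // each half is a 32-bit request -> no 2-way conflict
    shared_hi[tid] = __double2hiint(dataIn);
    __syncthreads();
    double dataOut = __hiloint2double(shared_hi[tid], shared_lo[tid]);
    out[tid] = dataOut;
}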