Weekly Report- Matrix multiplications Ph.D. Student: Leo Lee date: Oct. 16, 2009.

Weekly Report-Matrix multiplications

Ph.D. Student: Leo Leedate: Oct. 16, 2009

Outline

• Matrix multiplication

• Implementation

• Experiments

• Work plan

© David Kirk/NVIDIA and Wen-mei W. HwuTaiwan, June 30-July 2, 2008

Matrix Multiplication

• A: M*N

• B: N*P

• C=A*B:M*P

A

B

C

NM

N P

Matrix Multiplication• // Matrix multiplication on the (CPU) host • void MatrixMulOnHost (float* A, float* B, float* C, int hA, int wA, int wB)• { • for (int i = 0; i < hA; ++i)• {• for (int j = 0; j < wB; ++j)• {• double sum = 0;• for (int k = 0; k < wA; ++k) • {• double a = A[i * wA + k];• double b = B[k * wB + j];• sum += a * b;• }• P[i * wB + j] = sum;• }• }• }

Implementation_1

• One thread calculates one element of C– dim3 grid(1, 1);– dim3 thread(WC, HC);– __global__ void matrixMul_low( float* C, float* A, float* B, int wA,

int wB)– {– int tx = threadIdx.x;– int ty = threadIdx.y;– float Csub = 0;– for(int k=0; k<wA; ++k)– {– Csub += A[ty*wA+k] * B[k*wB+tx]; – }– C[ty*wB+tx] = Csub;– }

Experiments_1

10000 times

Experiments_1

Brief analysis• Less efficient than CPU;

• Data transfer occupies most of the time, each thread– Loads a row of matrix A– Loads a column of matrix B– Perform one multiply and addition for each pair of A and B elements– Compute to off-chip memory access ratio close to 1:1 (not very high)

• Size of matrix limited by the number of threads allowed in a thread block– 1*2*2 is not ok?

• Try to increase the Compute to off-chip memory access ratio !

Ad

Bd

Cd

Pdsub

TILE_WIDTH

WIDTHWIDTH

TILE_WIDTHTILE_WIDTH

bx

tx01 TILE_WIDTH-12

0 1 2

by ty 210

TILE_WIDTH-1

2

1

0

TILE_WIDTH

TILE_WIDTH

TILE_WIDTHE

WIDTH

WIDTH

Implementation_2

• Tiled Multiply– Each block computes one square sub-matrix

Pdsub of size TILE_BLOCK_SIZE

– Each thread computes one element of Csub

– Assume that the dimensions of A and B are multiples of TILE_BLOCK_SIZE

Implementation_2• dim3 thread(BLOCK_SIZE, BLOCK_SIZE);• dim3 grid(WC/thread.x, HC/thread.y);• In kernel function

– __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];– __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];– //Load the matrices from device memory to shared memory– AS(ty, tx) = A[a + ty*wA + tx];– BS(ty, tx) = B[b + ty*wB + tx];– //Synchronize to make sure the matrices are loaded– __syncthreads();– for(int k=0; k<BLOCK_SIZE; ++k)– {– Csub += AS(ty,k)*BS(k,tx);– }– __syncthreads();

– int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;– C[c + wB *ty +tx] = Csub;

Experiments_2

• Improvement by tile

Experiments_2

10000 times

Experiments_2

• Thanks for your listening

Experiments_2

Experiments_2

• Improvement by GPU compared with CPU

Experiments_2

Experiments_2

Experiments_2

Experiments_2

Experiments_2

Experiments_2WA, HA, WB

GPU

CPU

Comput time (ms) total time (ms)

16,16,16

GPU

CPU

45

15

24678

78

32,32,32

GPU

CPU

60

62

27250

203

48,80,128

GPU

CPU

225

861

26625

1203

128,256,512

GPU

CPU

4249

45829

35531

49328

512,512,512

GPU

CPU

27441

364232

70359

382062

Brief analysis

• Using shared memory to increase Compute to off-chip memory access ratio– 256 access, (16+16)*16*16 computations.

• Data transfer still occupies much time– Coalesced accesses

Implementation_3

• Transpose matrix B– Then read B is the same as read A;– C[i, j] = ∑ A[i, k]*B[j, k];

Experiments_3

Coalesced accesses Implementation_2

Brief analysis

• No big change– Review the code– Try a new method~

Work plan

• Further experiments on Matrix Multiplication

• Learn Reduction

Weekly Report- Matrix multiplications Ph.D. Student: Leo Lee date: Oct. 16, 2009.

Documents

int wb

int wa

int ha

int tx

j wb j

double b

wb j sum

n p slide