Weekly Report- Matrix multiplications Ph.D. Student: Leo Lee date: Oct. 16, 2009

Dec 20, 2015

Weekly Report: Matrix multiplications

Ph.D. Student: Leo Lee
Date: Oct. 16, 2009

Outline

• Matrix multiplication

• Implementation

• Experiments

• Work plan

© David Kirk/NVIDIA and Wen-mei W. Hwu, Taiwan, June 30-July 2, 2008

Matrix Multiplication

• A: M × N

• B: N × P

• C = A × B: M × P

[Figure: matrices A (M × N), B (N × P), and C (M × P)]

Matrix Multiplication

    // Matrix multiplication on the (CPU) host
    // A is hA x wA, B is wA x wB, C is hA x wB; all stored row-major
    void MatrixMulOnHost(float* A, float* B, float* C, int hA, int wA, int wB)
    {
        for (int i = 0; i < hA; ++i)
        {
            for (int j = 0; j < wB; ++j)
            {
                double sum = 0;
                for (int k = 0; k < wA; ++k)
                {
                    double a = A[i * wA + k];
                    double b = B[k * wB + j];
                    sum += a * b;
                }
                // Write to the output parameter C (the slide wrote to an undeclared P)
                C[i * wB + j] = sum;
            }
        }
    }
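As a quick check of the row-major indexing used by the host routine, the C sketch below multiplies a 2 × 3 matrix by a 3 × 2 matrix. The names `matmul_host` and `demo_2x3_times_3x2` and the sample values are assumptions made for illustration; only the loop structure follows the slide.

```c
#include <assert.h>

/* Hypothetical helper mirroring MatrixMulOnHost: C = A * B, all row-major.
   A is hA x wA, B is wA x wB, C is hA x wB. */
void matmul_host(const float *A, const float *B, float *C,
                 int hA, int wA, int wB)
{
    assert(A && B && C);
    for (int i = 0; i < hA; ++i) {
        for (int j = 0; j < wB; ++j) {
            double sum = 0.0;  /* accumulate in double, as on the slide */
            for (int k = 0; k < wA; ++k)
                sum += (double)A[i * wA + k] * (double)B[k * wB + j];
            C[i * wB + j] = (float)sum;
        }
    }
}

/* Worked example: fills out[4] with the 2x2 product of a 2x3 and a 3x2 matrix. */
void demo_2x3_times_3x2(float out[4])
{
    const float A[6] = {1, 2, 3,
                        4, 5, 6};
    const float B[6] = {7,  8,
                        9,  10,
                        11, 12};
    matmul_host(A, B, out, 2, 3, 2);
}
```

With these sample values the product is {58, 64, 139, 154} in row-major order, which can be confirmed by hand from the first entry: 1*7 + 2*9 + 3*11 = 58.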

Implementation_1

• One thread calculates one element of C

    dim3 grid(1, 1);
    dim3 thread(WC, HC);

    __global__ void matrixMul_low(float* C, float* A, float* B, int wA, int wB)
    {
        // Each thread computes the element of C at row ty, column tx
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        float Csub = 0;
        for (int k = 0; k < wA; ++k)
        {
            Csub += A[ty * wA + k] * B[k * wB + tx];
        }
        C[ty * wB + tx] = Csub;
    }

Experiments_1

10000 times


Brief analysis

• Less efficient than the CPU
• Data transfer occupies most of the time. Each thread:
– Loads a row of matrix A
– Loads a column of matrix B
– Performs one multiply and one addition for each pair of A and B elements
– Has a compute to off-chip memory access ratio close to 1:1 (not very high)
• The size of the matrix is limited by the number of threads allowed in a thread block
– 1*2*2 is not ok?
• Try to increase the compute to off-chip memory access ratio!
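The two limits named in this analysis can be put into numbers with a small C sketch. The function names are hypothetical, and the thread-block limit is passed in as a parameter rather than queried from a device; it was 512 on the compute capability 1.x hardware available in 2009 and is 1024 on current hardware.

```c
#include <assert.h>

/* Returns 1 if a wC x hC output matrix fits in a single thread block,
   given the device limit on threads per block (supplied by the caller). */
int fits_in_one_block(int wC, int hC, int max_threads_per_block)
{
    assert(wC > 0 && hC > 0 && max_threads_per_block > 0);
    return (long long)wC * hC <= max_threads_per_block;
}

/* Floating-point operations per global-memory load for one thread of the
   one-thread-per-element kernel. */
double naive_compute_to_load_ratio(int wA)
{
    assert(wA > 0);
    double flops = 2.0 * wA;   /* one multiply and one add per k */
    double loads = 2.0 * wA;   /* one element of A and one of B per k */
    return flops / loads;
}
```

For example, `fits_in_one_block(16, 16, 512)` holds while `fits_in_one_block(32, 32, 512)` does not, and `naive_compute_to_load_ratio` returns 1.0 for any `wA`, which matches the 1:1 ratio stated above.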

Implementation_2

• Tiled Multiply
– Each block computes one square sub-matrix Pdsub of size TILE_BLOCK_SIZE
– Each thread computes one element of Csub
– Assume that the dimensions of A and B are multiples of TILE_BLOCK_SIZE

[Figure: matrices Ad, Bd, and Cd with axis labels WIDTH and TILE_WIDTH; block (bx, by) computes the TILE_WIDTH × TILE_WIDTH sub-matrix Pdsub, with thread indices tx and ty running from 0 to TILE_WIDTH-1]

Implementation_2

• dim3 thread(BLOCK_SIZE, BLOCK_SIZE);
• dim3 grid(WC/thread.x, HC/thread.y);
• In kernel function

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory to shared memory
    // (a and b are presumably the offsets of the current tiles of A and B;
    // the excerpt does not show the loop that advances them)
    As[ty][tx] = A[a + ty * wA + tx];
    Bs[ty][tx] = B[b + ty * wB + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Csub += As[ty][k] * Bs[k][tx];
    }

    // Synchronize before the tiles are overwritten
    __syncthreads();

    // Write the result; c is the first element of this block's sub-matrix of C
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
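The index arithmetic of the tiled scheme can be checked on the host before running it on a GPU. The following is a minimal C sketch that emulates the blocks and threads with ordinary loops; `matmul_tiled_host`, the `TILE` constant, and the tile loop variable `t` are assumptions for illustration and are not taken from the slides, which show only the body of one tile step.

```c
#include <assert.h>

#define TILE 4  /* hypothetical tile width; the slides call it BLOCK_SIZE */

/* Host-side emulation of the tiled scheme: C = A * B, row-major.
   A is hA x wA, B is wA x wB; all dimensions must be multiples of TILE. */
void matmul_tiled_host(const float *A, const float *B, float *C,
                       int hA, int wA, int wB)
{
    assert(hA % TILE == 0 && wA % TILE == 0 && wB % TILE == 0);

    for (int by = 0; by < hA / TILE; ++by) {           /* block row */
        for (int bx = 0; bx < wB / TILE; ++bx) {       /* block column */
            float Csub[TILE][TILE] = {{0.0f}};         /* one value per "thread" */

            for (int t = 0; t < wA / TILE; ++t) {      /* walk the tiles along k */
                float As[TILE][TILE], Bs[TILE][TILE];  /* stand-ins for shared memory */
                int a = wA * TILE * by + TILE * t;     /* first element of the A tile */
                int b = wB * TILE * t + TILE * bx;     /* first element of the B tile */

                /* Load phase: each (ty, tx) copies one element of each tile. */
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        As[ty][tx] = A[a + ty * wA + tx];
                        Bs[ty][tx] = B[b + ty * wB + tx];
                    }
                /* On the GPU a __syncthreads() barrier separates the two phases. */

                /* Compute phase: partial dot products from the two tiles. */
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)
                            Csub[ty][tx] += As[ty][k] * Bs[k][tx];
            }

            int c = wB * TILE * by + TILE * bx;        /* first element of the C tile */
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    C[c + wB * ty + tx] = Csub[ty][tx];
        }
    }
}
```

Comparing its output with a plain triple loop on a small case, such as a 4 × 8 matrix times an 8 × 4 matrix, is a cheap way to catch offset mistakes in `a`, `b`, and `c`.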

Experiments_2

• Improvement by tile

10000 times

• Thank you for listening

• Improvement by GPU compared with CPU

Experiments_2

WA, HA, WB       Device   Compute time (ms)   Total time (ms)
16, 16, 16       GPU                     45             24678
                 CPU                     15                78
32, 32, 32       GPU                     60             27250
                 CPU                     62               203
48, 80, 128      GPU                    225             26625
                 CPU                    861              1203
128, 256, 512    GPU                   4249             35531
                 CPU                  45829             49328
512, 512, 512    GPU                  27441             70359
                 CPU                 364232            382062
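The timings above are easier to compare as speedups. This small C sketch computes the ratio of CPU time to GPU time; the function name `speedup` is hypothetical, and the pairing of the numbers is an assumption read off the extracted table (GPU value first, CPU value second, compute time before total time).

```c
#include <assert.h>

/* Speedup of the GPU over the CPU for one measurement pair (same units).
   Values above 1 mean the GPU was faster. */
double speedup(double cpu_ms, double gpu_ms)
{
    assert(cpu_ms > 0.0 && gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}
```

Under that pairing, the 512, 512, 512 case gives roughly 13.3 for compute time (`speedup(364232, 27441)`) and roughly 5.4 for total time (`speedup(382062, 70359)`), while the 16, 16, 16 case gives about 0.33 for compute time, meaning the GPU was slower on the smallest input.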

Brief analysis

• Using shared memory increases the compute to off-chip memory access ratio
– 256 accesses, (16+16)*16*16 computations
• Data transfer still occupies much of the time
– Coalesced accesses
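The arithmetic behind the shared-memory ratio can be sketched in C as follows. The function names are hypothetical, and the counting convention is an assumption: one tile step of a tile × tile thread block, with every thread loading one element of A and one element of B from global memory. Under that convention a 16 × 16 block performs 512 loads per step; the slide's figure of 256 may count the loads for a single tile.

```c
#include <assert.h>

/* Global-memory loads per block for one tile step: one A tile plus one B tile. */
long tile_global_loads(int tile)
{
    assert(tile > 0);
    return 2L * tile * tile;
}

/* Floating-point operations per block for one tile step:
   each of the tile*tile threads does `tile` multiplies and `tile` adds. */
long tile_flops(int tile)
{
    assert(tile > 0);
    return 2L * tile * tile * tile;
}

/* Operations per global load; grows linearly with the tile width. */
double tiled_compute_to_load_ratio(int tile)
{
    return (double)tile_flops(tile) / (double)tile_global_loads(tile);
}
```

For `tile = 16` this gives 8192 operations, which equals (16+16)*16*16, and a ratio of 16:1 compared with the 1:1 ratio of the first implementation.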

Implementation_3

• Transpose matrix B
– Reading B then follows the same pattern as reading A
– C[i, j] = ∑ A[i, k]*Bt[j, k], summed over k, where Bt is the transpose of B
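The change of indexing can be illustrated on the host with a short C sketch. The names `transpose` and `matmul_bt` are hypothetical; the sketch only shows that both operands are read along rows once B is transposed, and it makes no claim about whether global-memory accesses coalesce on a GPU.

```c
#include <assert.h>

/* Writes the transpose of the rows x cols row-major matrix M into Mt (cols x rows). */
void transpose(const float *M, float *Mt, int rows, int cols)
{
    assert(M && Mt && M != Mt);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            Mt[c * rows + r] = M[r * cols + c];
}

/* C = A * B where B is supplied already transposed: Bt is wB x wA.
   Both A and Bt are read with unit stride in k. */
void matmul_bt(const float *A, const float *Bt, float *C,
               int hA, int wA, int wB)
{
    assert(A && Bt && C);
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < wB; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < wA; ++k)
                sum += A[i * wA + k] * Bt[j * wA + k];
            C[i * wB + j] = sum;
        }
}
```

Calling `transpose(B, Bt, wA, wB)` followed by `matmul_bt(A, Bt, C, hA, wA, wB)` yields the same C as the ordinary product of A and B.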

Experiments_3

[Figure: coalesced accesses compared with Implementation_2]

Brief analysis

• No big change
– Review the code
– Try a new method

Work plan

• Further experiments on Matrix Multiplication

• Learn Reduction