Sandia Fast Matmul
Austin Benson
Jun 27, 2015

Slides from my talk at Sandia.
Transcript
Page 1: Sandia Fast Matmul

A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

Sandia National Laboratories, October 21, 2014

arXiv: 1409.2908

Page 2: Sandia Fast Matmul

Fast matrix multiplication: bridging theory and practice

• There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14]

• We show that they can achieve higher performance than Intel’s Math Kernel Library (MKL)

• We use code generation for extensive prototyping.

[Figure: the exponent of matrix multiplication over time: 3 (classical), 2.81 [Strassen69], ..., 2.37 [Williams12]]

Page 3: Sandia Fast Matmul

Strassen’s algorithm

Page 4: Sandia Fast Matmul

Key ingredients of Strassen’s algorithm

• 1. Block partitioning of matrices (<2, 2, 2>)
• 2. Seven linear combinations of sub-blocks of A
• 3. Seven linear combinations of sub-blocks of B
• 4. Seven matrix multiplies to form the Mr (recursive); the products are written out below
• 5. Linear combinations of the Mr to form the Cij
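
In the usual numbering, the seven products and the output combinations are:

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6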

Page 5: Sandia Fast Matmul

Key ingredients of fast matmul algorithms

• 1. Block partitioning of matrices (<M, K, N>)
• 2. R linear combinations of sub-blocks of A
• 3. R linear combinations of sub-blocks of B
• 4. R matrix multiplies to form the Mr (recursive)
• 5. Linear combinations of the Mr to form the Cij

R < MKN ⇒ faster than classical

Main idea: trade O(N^3) computation (matmul) for more O(N^2) computation (matrix additions)
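
Why R < MKN wins asymptotically: composing an <M, K, N> algorithm with its cyclic permutations gives an <MKN, MKN, MKN> algorithm with R^3 multiplies, so the recursion runs in O(n^ω) with ω = 3 log_{MKN}(R) < 3. For Strassen, <2, 2, 2> with R = 7 gives ω = log2(7) ≈ 2.81.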

Page 6: Sandia Fast Matmul

“Outer product” fast algorithm

• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32), i.e., 32/26 ≈ 1.23: a 23% speedup per recursive step (if everything else were free)
• Linear combinations of the Aij to form the Sr: 68 terms
• Linear combinations of the Bij to form the Tr: 52 terms
• Linear combinations of the Mr to form the Cij: 69 terms

Page 7: Sandia Fast Matmul

Discovering fast algorithms is a numerical challenge

• Low-rank tensor decompositions lead to fast algorithms
• The tensors are small, but we need exact decompositions, and computing them is NP-hard
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
• We have around 10 fast algorithms for <M, K, N> partitions. We also have their permutations, e.g., <K, M, N>.
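
The underlying connection, as used in the papers above: multiplying an M x K matrix by a K x N matrix corresponds to a fixed tensor T of size MK x KN x MN, and an exact rank-R decomposition

T = sum_{r=1}^{R} u_r ∘ v_r ∘ w_r

yields a fast algorithm with R multiplies: the u_r and v_r give the coefficients of the linear combinations Sr and Tr, and the w_r give the combinations of the Mr that form the Cij.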

Page 8: Sandia Fast Matmul

[Figure: example decompositions/algorithms from [Smirnov13] and [Strassen69]]

Page 9: Sandia Fast Matmul

Code generation lets us prototype algorithms quickly

• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of the Mr to form the Cij
• We use code generation to rapidly prototype fast algorithms (a sketch of the kind of code a generator emits is below)
• Our approach: test all algorithms on a bunch of different problem sizes and look for patterns
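
To illustrate, here is a minimal sketch of the shape of generated code for one algorithm (Strassen: <2, 2, 2>, R = 7). The Matrix type and the add/sub/block/put_block/classical helpers are hypothetical stand-ins, not the framework's actual classes, which call dgemm at the base case and handle the practical issues discussed below.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal dense matrix type (square, row-major).
struct Matrix {
    int n;
    std::vector<double> a;
    explicit Matrix(int n_) : n(n_), a(static_cast<size_t>(n_) * n_, 0.0) {}
    double& operator()(int i, int j) { return a[static_cast<size_t>(i) * n + j]; }
    double operator()(int i, int j) const { return a[static_cast<size_t>(i) * n + j]; }
};

// z = x + sy * y (sy = +1 or -1)
Matrix add(const Matrix& x, const Matrix& y, double sy = 1.0) {
    Matrix z(x.n);
    for (size_t i = 0; i < x.a.size(); ++i) z.a[i] = x.a[i] + sy * y.a[i];
    return z;
}
Matrix sub(const Matrix& x, const Matrix& y) { return add(x, y, -1.0); }

Matrix block(const Matrix& x, int bi, int bj) {  // copy out a half-size block
    int h = x.n / 2;
    Matrix z(h);
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) z(i, j) = x(bi * h + i, bj * h + j);
    return z;
}
void put_block(Matrix& x, int bi, int bj, const Matrix& z) {
    int h = x.n / 2;
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) x(bi * h + i, bj * h + j) = z(i, j);
}

// Base case: classical multiply (stand-in for a dgemm call).
Matrix classical(const Matrix& x, const Matrix& y) {
    Matrix z(x.n);
    for (int i = 0; i < x.n; ++i)
        for (int k = 0; k < x.n; ++k)
            for (int j = 0; j < x.n; ++j) z(i, j) += x(i, k) * y(k, j);
    return z;
}

// One algorithm (<2, 2, 2>, R = 7); the generator emits code of this shape
// from the compact (partition, Sr/Tr, Mr -> Cij) description.
Matrix strassen(const Matrix& A, const Matrix& B, int cutoff = 128) {
    assert(A.n == B.n);
    if (A.n <= cutoff || A.n % 2 != 0) return classical(A, B);
    Matrix A11 = block(A, 0, 0), A12 = block(A, 0, 1);
    Matrix A21 = block(A, 1, 0), A22 = block(A, 1, 1);
    Matrix B11 = block(B, 0, 0), B12 = block(B, 0, 1);
    Matrix B21 = block(B, 1, 0), B22 = block(B, 1, 1);
    Matrix M1 = strassen(add(A11, A22), add(B11, B22), cutoff);
    Matrix M2 = strassen(add(A21, A22), B11, cutoff);
    Matrix M3 = strassen(A11, sub(B12, B22), cutoff);
    Matrix M4 = strassen(A22, sub(B21, B11), cutoff);
    Matrix M5 = strassen(add(A11, A12), B22, cutoff);
    Matrix M6 = strassen(sub(A21, A11), add(B11, B12), cutoff);
    Matrix M7 = strassen(sub(A12, A22), add(B21, B22), cutoff);
    Matrix C(A.n);
    put_block(C, 0, 0, add(sub(add(M1, M4), M5), M7));   // C11 = M1 + M4 - M5 + M7
    put_block(C, 0, 1, add(M3, M5));                     // C12 = M3 + M5
    put_block(C, 1, 0, add(M2, M4));                     // C21 = M2 + M4
    put_block(C, 1, 1, add(sub(M1, M2), add(M3, M6)));   // C22 = M1 - M2 + M3 + M6
    return C;
}
```

A generator can emit one such routine per algorithm, with the S/T/M/C lines filled in from the coefficient lists, which is what makes prototyping many algorithms cheap.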

Page 10: Sandia Fast Matmul

Sequential performance

Effective GFLOPS for an M x P x Q multiply = 1e-9 * 2 * M * P * Q / (time in seconds)

(“Effective” because a fast algorithm performs fewer than 2MPQ flops; normalizing by the classical flop count makes all algorithms directly comparable.)

[Plot: effective GFLOPS vs. problem size, with the machine's true peak marked]

Page 11: Sandia Fast Matmul

Sequential performance

• All algorithms beat MKL on large problems
• Strassen’s algorithm is hard to beat

Page 12: Sandia Fast Matmul

Sequential performance

• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best

Page 13: Sandia Fast Matmul

Sequential performance

• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best

Page 14: Sandia Fast Matmul

Practical issues, or what is not obvious when

…you just think about the theory

…your performance models are too simple

Page 15: Sandia Fast Matmul

Practical issue #1: when to stop recursion? Look at the gemm curve.

Basic idea: take another recursive step if the sub-problems will still operate at high performance

[Plot: dgemm performance curve; example algorithm <M, K, N> = <4, 2, 3>]
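
A minimal sketch of this heuristic, assuming a hypothetical machine-dependent cutoff read off the dgemm curve (the name and value are illustrative):

```cpp
// Hypothetical cutoff: the smallest dimension at which dgemm still runs
// near peak on this machine (read off the gemm performance curve).
const int kGemmCutoff = 1000;

// Take another recursive step of an <M, K, N> algorithm only if the
// subproblems (m/M x k/K times k/K x n/N) will still run at high performance.
bool should_recurse(int m, int k, int n, int M, int K, int N) {
    return (m / M) >= kGemmCutoff &&
           (k / K) >= kGemmCutoff &&
           (n / N) >= kGemmCutoff;
}
```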

Page 16: Sandia Fast Matmul

Practical issue #2: matrix additions

[Diagram: forming S1, ..., S7 from the blocks A11, A12, A21, A22]

“Pairwise”: build each Sr with repeated DAXPY calls (y := αx + y), e.g., 2x DAXPY for a three-term combination
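
A minimal sketch of the pairwise strategy for a hypothetical three-term combination S = A11 + A21 - A22, using the standard CBLAS daxpy; contiguous n*n blocks are an assumed layout:

```cpp
#include <cblas.h>
#include <cstring>

// Pairwise: S = A11 + A21 - A22 via repeated DAXPYs (y := alpha*x + y).
// Each DAXPY re-reads and re-writes all of S.
void pairwise_S(const double* A11, const double* A21, const double* A22,
                double* S, int n) {
    std::memcpy(S, A11, sizeof(double) * n * n);  // S = A11
    cblas_daxpy(n * n,  1.0, A21, 1, S, 1);       // S += A21
    cblas_daxpy(n * n, -1.0, A22, 1, S, 1);       // S -= A22
}
```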

Page 17: Sandia Fast Matmul

Practical issue #2: matrix additions (continued)

[Diagram: forming S1, ..., S7 from the blocks A11, A12, A21, A22]

“Write once”: form each Sr with a custom fused “DAXPY”-style loop that writes each output entry exactly once
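
The same hypothetical combination in write-once form (one fused pass instead of two DAXPYs):

```cpp
// "Write once": S = A11 + A21 - A22 in a single fused loop; each entry
// of S is written once instead of being written and re-read repeatedly.
void write_once_S(const double* A11, const double* A21, const double* A22,
                  double* S, int n) {
    for (int i = 0; i < n * n; ++i)
        S[i] = A11[i] + A21[i] - A22[i];
}
```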

Page 18: Sandia Fast Matmul

Practical issue #2: matrix additions (continued)

[Diagram: forming S1, ..., S7 from the blocks A11, A12, A21, A22]

“Streaming”: entry-wise updates in a single pass over the input blocks, forming all of the Sr at once
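
A streaming sketch for two of Strassen's combinations (S1 = A11 + A22 and S6 = A21 - A11, from the formulas written out earlier); forming several Sr per pass means each input block is read only once:

```cpp
// "Streaming": one pass over the A blocks with entry-wise updates,
// forming several S matrices at once.
void streaming_S(const double* A11, const double* A21, const double* A22,
                 double* S1, double* S6, int n) {
    for (int i = 0; i < n * n; ++i) {
        S1[i] = A11[i] + A22[i];   // S1 = A11 + A22 (operand of M1)
        S6[i] = A21[i] - A11[i];   // S6 = A21 - A11 (operand of M6)
    }
}
```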

Page 19: Sandia Fast Matmul

Practical issue #3: redundant linear combinations

• Example in <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: T11 and T25 are linear combinations of B12, B22, B23, B24 that share a common subexpression]

Four additions, six reads, two writes

Page 20: Sandia Fast Matmul

Common subexpression elimination

• Example in <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: the shared subexpression is computed once, stored as Y, and reused for both T11 and T25]

Three additions, six reads, three writes: a net increase in communication!
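
A hypothetical instance consistent with these counts (the actual <4, 2, 4> coefficients differ): suppose T11 = B12 + B22 + B23 and T25 = B22 + B23 + B24.

Without CSE: T11 and T25 are computed independently: 4 additions, 6 reads (B12, B22, B23; B22, B23, B24), 2 writes (T11, T25).
With CSE: Y = B22 + B23; T11 = B12 + Y; T25 = Y + B24: 3 additions, 6 reads (B22, B23; B12, Y; Y, B24), 3 writes (Y, T11, T25).

CSE saves one addition but costs one extra matrix write; since the additions are bandwidth-bound, the extra memory traffic outweighs the saved flops.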

Page 21: Sandia Fast Matmul

Common subexpression elimination (CSE) does not really help

Page 22: Sandia Fast Matmul

Practical issue #4: parallelization on shared memory

Page 23: Sandia Fast Matmul

Fast matmul recursion tree

[Diagram: recursion tree. The root computes C; each node spawns subproblems M1, M2, ..., M7, whose results are combined (+) to form the parent.]

Page 24: Sandia Fast Matmul

DFS Parallelization

[Diagram: recursion tree traversed depth-first; all threads work on one subproblem at a time, using parallel MKL for the base-case multiplies.]

+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance

Page 25: Sandia Fast Matmul

BFS Parallelization

[Diagram: recursion tree traversed breadth-first; each subproblem runs as a task on one thread, with omp taskwait synchronizing each group of seven tasks.]

+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory
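
A minimal sketch of the BFS pattern with OpenMP tasks, reusing the hypothetical Matrix helpers and sequential strassen() from the code-generation sketch (the framework's generated code is more general):

```cpp
#include <vector>

// BFS step: the seven subproblems run as OpenMP tasks (one thread each);
// "#pragma omp taskwait" synchronizes before the Mr are combined.
Matrix strassen_bfs(const Matrix& A, const Matrix& B, int cutoff = 512) {
    if (A.n <= cutoff || A.n % 2 != 0) return classical(A, B);
    const Matrix A11 = block(A, 0, 0), A12 = block(A, 0, 1);
    const Matrix A21 = block(A, 1, 0), A22 = block(A, 1, 1);
    const Matrix B11 = block(B, 0, 0), B12 = block(B, 0, 1);
    const Matrix B21 = block(B, 1, 0), B22 = block(B, 1, 1);
    std::vector<Matrix> M(7, Matrix(A.n / 2));
    #pragma omp task shared(M)
    M[0] = strassen(add(A11, A22), add(B11, B22), cutoff);
    #pragma omp task shared(M)
    M[1] = strassen(add(A21, A22), B11, cutoff);
    #pragma omp task shared(M)
    M[2] = strassen(A11, sub(B12, B22), cutoff);
    #pragma omp task shared(M)
    M[3] = strassen(A22, sub(B21, B11), cutoff);
    #pragma omp task shared(M)
    M[4] = strassen(add(A11, A12), B22, cutoff);
    #pragma omp task shared(M)
    M[5] = strassen(sub(A21, A11), add(B11, B12), cutoff);
    #pragma omp task shared(M)
    M[6] = strassen(sub(A12, A22), add(B21, B22), cutoff);
    #pragma omp taskwait   // explicit synchronization before combining
    Matrix C(A.n);
    put_block(C, 0, 0, add(sub(add(M[0], M[3]), M[4]), M[6]));
    put_block(C, 0, 1, add(M[2], M[4]));
    put_block(C, 1, 0, add(M[1], M[3]));
    put_block(C, 1, 1, add(sub(M[0], M[1]), add(M[2], M[5])));
    return C;
}

// Typical driver: create the tasks from one thread inside a parallel region.
//   #pragma omp parallel
//   #pragma omp single
//   C = strassen_bfs(A, B);
```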

Page 26: Sandia Fast Matmul

HYBRID Parallelization

[Diagram: BFS tasks for most subproblems (one thread each); the leftover subproblem(s) use all threads via parallel MKL.]

+ Better load balancing
- Needs explicit synchronization, or else we can over-subscribe threads

Page 27: Sandia Fast Matmul

Page 28: Sandia Fast Matmul

Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write performance on 24 cores

[Diagram: recursion tree; the additions at every node stress memory bandwidth]

Page 29: Sandia Fast Matmul

Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: can sometimes beat MKL, but barely

Page 30: Sandia Fast Matmul

Parallel performance

[Plot annotation: region of bad MKL performance]

• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems

Page 31: Sandia Fast Matmul

Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: MKL usually the best

Page 32: Sandia Fast Matmul

High-level conclusions

• For square matrix multiplication, Strassen’s algorithm is hard to beat
• For rectangular matrix multiplication, use a fast algorithm that “matches the shape”
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory

Future work:
• Numerical stability
• Using fast matmul as a kernel for other algorithms in numerical linear algebra

Page 33: Sandia Fast Matmul

A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

Sandia National Laboratories, October 21, 2014

arXiv: 1409.2908