
A framework for practical fast matrix multiplication (BLIS retreat)

Jul 07, 2015


Austin Benson

Slides from my talk at the BLIS retreat in Austin, TX.
Transcript
Page 1: A framework for practical fast matrix multiplication (BLIS retreat)

[Figure: parallel performance of Strassen on <N,N,N>; effective GFLOPS/core vs. dimension N (0 to 15000). Curves: MKL, DFS, BFS, and HYBRID, each on 6 and 24 cores]

A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

BLIS Retreat, September 26, 2014

arXiv: 1409.2908


Page 2: A framework for practical fast matrix multiplication (BLIS retreat)

Fast matrix multiplication: bridging theory and practice

• There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14]

• We show that they can achieve higher performance than MKL (sequentially, and sometimes in parallel).

• We use code generation to do extensive prototyping. There are several practical issues, and there is plenty of room for improvement (lots of expertise at UT to help here!)

[Figure: the exponent of matrix multiplication: 3 (classical), 2.81 [Strassen69], 2.37 [Williams12]]

Page 3: A framework for practical fast matrix multiplication (BLIS retreat)

Strassen’s algorithm


Page 4: A framework for practical fast matrix multiplication (BLIS retreat)

Key ingredients of Strassen’s algorithm

• 1. Block partitioning of matrices (<2, 2, 2>)

• 2. Seven linear combinations of sub-blocks of A

• 3. Seven linear combinations of sub-blocks of B

• 4. Seven matrix multiplies to form Mr (recursive)

• 5. Linear combinations of Mr to form Cij
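The five ingredients above can be written down directly for one level of Strassen's <2, 2, 2> algorithm; a numpy sketch, assuming square matrices of even dimension (a full implementation would recurse on the seven products instead of calling the dense multiply):

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursive step of Strassen's <2, 2, 2> algorithm (sketch).

    Assumes square matrices with even dimension. The seven products
    M1..M7 call the dense multiply here; a real implementation would
    recurse until the base case.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Seven linear combinations of sub-blocks of A and B,
    # then seven multiplies to form the Mr
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Linear combinations of the Mr form the output blocks Cij
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```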


Page 5: A framework for practical fast matrix multiplication (BLIS retreat)

Key ingredients of fast matmul algorithms

• 1. Block partitioning of matrices (<M, K, N>)

• 2. R linear combinations of sub-blocks of A

• 3. R linear combinations of sub-blocks of B

• 4. R matrix multiplies to form Mr (recursive); R < MKN makes the algorithm faster than classical

• 5. Linear combinations of Mr to form Cij


Page 6: A framework for practical fast matrix multiplication (BLIS retreat)

“Outer product” fast algorithm

• <4, 2, 4> partitioning

• R = 26 multiplies (< 4 * 2 * 4 = 32)

23% speedup per recursive step (if everything else free)

• Linear combinations of Aij to form Sr: 68 terms

• Linear combinations of Bij to form Tr: 52 terms

• Linear combinations of Mr to form Cij: 69 terms
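The speedup figures come directly from the ratio of classical to fast multiply counts; a quick check:

```python
def speedup_per_step(M, K, N, R):
    """Ratio of classical multiply count (M*K*N) to fast-algorithm
    multiply count (R) for one recursive step of an <M, K, N>
    algorithm, ignoring the cost of the matrix additions."""
    return (M * K * N) / R

# Strassen <2,2,2>: 8/7, about 14% per step
# "Outer product" <4,2,4>: 32/26, the 23% per step quoted here
```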


Page 7: A framework for practical fast matrix multiplication (BLIS retreat)

Discovering fast algorithms is a numerical challenge

• Low-rank tensor decompositions lead to fast algorithms

• Tensors are small, but we need exact decompositions (NP-hard)

• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]

• We have around 10 fast algorithms for <M, K, N> decompositions. Also have permutations, e.g., <K, M, N>.


Page 8: A framework for practical fast matrix multiplication (BLIS retreat)

[Figure: example fast algorithms, from [Smirnov13] and [Strassen69]]

Page 9: A framework for practical fast matrix multiplication (BLIS retreat)

Code generation lets us prototype algorithms quickly

• We have a compact representation of many fast algorithms:

1. dimensions of block partitioning (<M, K, N>)

2. linear combinations of sub-blocks (Sr, Tr)

3. linear combinations of Mr to form Cij

• We use code generation to rapidly prototype fast algorithms

• Our approach: test all algorithms on a bunch of different problem sizes and look for patterns


Page 10: A framework for practical fast matrix multiplication (BLIS retreat)

Practical issues

• Best way to do matrix additions? (in paper)

• Can we eliminate redundant linear combinations? (in paper)

• Different problem shapes other than square (this talk)

• When to stop recursion? (this talk)

• How to parallelize? (this talk)


Page 11: A framework for practical fast matrix multiplication (BLIS retreat)

[Figures: sequential dgemm performance (GFLOPS vs. dimension N; curves: N x 800 x 800, N x 800 x N, N x N x N, peak) and parallel dgemm performance on 24 cores (GFLOPS/core vs. dimension N)]

Recursion cutoff: look at gemm curve

Basic idea: take another recursive step if the sub-problems will still operate at high performance


<M, K, N> = <4, 2, 3>
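The cutoff heuristic can be sketched as a simple loop; the function name and the 800 threshold below are illustrative assumptions, not values from the talk:

```python
def num_recursive_steps(m, k, n, M, K, N, cutoff=800):
    """Take another recursive step of an <M, K, N> algorithm only while
    every sub-problem dimension stays above `cutoff`, i.e., in the
    regime where dgemm is assumed to still run near peak.
    (The cutoff value 800 is illustrative.)"""
    steps = 0
    while m // M >= cutoff and k // K >= cutoff and n // N >= cutoff:
        m, k, n = m // M, k // K, n // N
        steps += 1
    return steps
```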

Page 12: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

[Figure: sequential performance on N x N x N; effective GFLOPS vs. dimension N. Curves: MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>]


Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / (time in seconds); the machine's true peak is marked on the plot.
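The effective-GFLOPS metric in code; note that the classical flop count 2*M*K*N is charged even to fast algorithms, which is why they can exceed the machine's true peak:

```python
def effective_gflops(M, K, N, seconds):
    """Effective GFLOPS for an M x K x N multiply. The classical flop
    count 2*M*K*N is used regardless of the algorithm, so a fast
    algorithm can report more than true peak."""
    return 1e-9 * 2 * M * K * N / seconds
```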

Page 13: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

[Figure: sequential performance on N x N x N; effective GFLOPS vs. dimension N. Curves: MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>]

• All algorithms beat MKL on large problems

• Strassen’s algorithm is hard to beat


Page 14: A framework for practical fast matrix multiplication (BLIS retreat)

[Figure: sequential performance on N x 1600 x N; effective GFLOPS vs. dimension N. Curves: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]

Sequential performance

• Almost all algorithms beat MKL

• <4, 2, 4> and <3, 2, 3> tend to perform the best


Page 15: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

• Almost all algorithms beat MKL

• <4, 3, 3> and <4, 2, 3> tend to perform the best

[Figure: sequential performance on N x 2400 x 2400; effective GFLOPS vs. dimension N. Curves: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]

Page 16: A framework for practical fast matrix multiplication (BLIS retreat)

Parallelization

[Diagram: recursion tree; C is formed from M1 + M2 + ... + M7, and each Mr recursively spawns seven subproblems of its own]

Page 17: A framework for practical fast matrix multiplication (BLIS retreat)

DFS Parallelization

[Diagram: recursion tree; all threads cooperate on every node, using parallel MKL at the base cases]

+ Easy to implement

+ Load balanced

+ Same memory footprint as sequential

- Need large base cases for high performance

Page 18: A framework for practical fast matrix multiplication (BLIS retreat)

BFS Parallelization

[Diagram: recursion tree; each of the subproblems M1, ..., M7 is a task run by 1 thread, joined with omp taskwait]

+ High performance for smaller base cases

- Sometimes harder to load balance: 24 threads, 49 subproblems

- More memory


Page 19: A framework for practical fast matrix multiplication (BLIS retreat)

HYBRID parallelization

[Diagram: recursion tree; the first subproblems are tasks run by 1 thread each, the last uses all threads, joined with omp taskwait]

+ Better load balancing

- Explicit synchronization or else we can over-subscribe threads
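The BFS task structure maps naturally onto a thread pool; a Python sketch in which futures stand in for OpenMP tasks, the dense multiply stands in for the recursive call, and the pool join plays the role of omp taskwait (numpy's dot releases the GIL, so the seven tasks can genuinely overlap):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def strassen_bfs(A, B, pool):
    """BFS-style parallelization (sketch): the 7 subproblems of one
    Strassen <2,2,2> step are submitted as independent tasks to `pool`,
    then joined, which is the role omp taskwait plays in the OpenMP
    version. Assumes square matrices with even dimension."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    tasks = [
        pool.submit(np.dot, A11 + A22, B11 + B22),
        pool.submit(np.dot, A21 + A22, B11),
        pool.submit(np.dot, A11, B12 - B22),
        pool.submit(np.dot, A22, B21 - B11),
        pool.submit(np.dot, A11 + A12, B22),
        pool.submit(np.dot, A21 - A11, B11 + B12),
        pool.submit(np.dot, A12 - A22, B21 + B22),
    ]
    # Join point: wait for all 7 tasks (cf. omp taskwait)
    M1, M2, M3, M4, M5, M6, M7 = (t.result() for t in tasks)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```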


Page 20: A framework for practical fast matrix multiplication (BLIS retreat)

[Figure: parallel performance of <4,2,4> on <N,2800,N>; effective GFLOPS/core vs. dimension N. Curves: MKL, DFS, BFS, and HYBRID, each on 6 and 24 cores]

Page 21: A framework for practical fast matrix multiplication (BLIS retreat)

Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions

• Parallel dgemm on 24 cores: easily get 50-75% of peak

• STREAM benchmark: < 6x speedup in read/write performance on 24 cores


Page 22: A framework for practical fast matrix multiplication (BLIS retreat)

Parallel performance

• 6 cores: similar performance to sequential

• 24 cores: can sometimes beat MKL, but barely

[Figures: performance on N x N x N with 6 cores and with 24 cores; effective GFLOPS/core vs. dimension N. Curves: MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>]

Page 23: A framework for practical fast matrix multiplication (BLIS retreat)

[Figures: performance on N x 2800 x N with 6 cores and with 24 cores; effective GFLOPS/core vs. dimension N. Curves: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]

Parallel performance (plot annotation: bad MKL performance)

• 6 cores: similar performance to sequential

• 24 cores: MKL best for large problems


Page 24: A framework for practical fast matrix multiplication (BLIS retreat)

Parallel performance

• 6 cores: similar performance to sequential

• 24 cores: MKL usually the best

[Figures: performance on N x 3000 x 3000 with 6 cores and with 24 cores; effective GFLOPS/core vs. dimension N. Curves: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]

Page 25: A framework for practical fast matrix multiplication (BLIS retreat)

High-level conclusions

• For square matrix multiplication, Strassen’s algorithm is hard to beat

• For rectangular matrix multiplication, use a fast algorithm that “matches the shape”

• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; it should be less of an issue in distributed memory

Future work:

• Numerical stability

• Using fast matmul as a kernel for other algorithms in numerical linear algebra


Page 26: A framework for practical fast matrix multiplication (BLIS retreat)

[Closing slide: repeats the title slide, including the Strassen parallel performance plot, authors, date, and arXiv: 1409.2908]

Page 27: A framework for practical fast matrix multiplication (BLIS retreat)

Matrix additions (linear combinations)

[Diagram: “pairwise” strategy; each of S1, ..., S7 is formed from A11, A12, A21, A22 by a chain of two-operand DAXPY calls]

Page 28: A framework for practical fast matrix multiplication (BLIS retreat)

Matrix additions (linear combinations)

[Diagram: “write once” strategy; each Sr is formed in a single pass by a custom generated “DAXPY”-like loop]

Page 29: A framework for practical fast matrix multiplication (BLIS retreat)

Matrix additions (linear combinations)

[Diagram: “streaming” strategy; all the Sr are formed by entry-wise updates while streaming over the inputs]
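The three strategies trade memory traffic differently; a numpy sketch contrasting "pairwise" with "write once" for an illustrative combination S = A11 + A21 - A22 (not an actual Sr from the algorithm):

```python
import numpy as np

def pairwise(A11, A21, A22):
    """'Pairwise': a chain of two-operand DAXPY-style additions; the
    intermediate S is written and then re-read between the two steps."""
    S = A11 + A21   # first daxpy-style pass writes S
    S -= A22        # second pass re-reads and rewrites S
    return S

def write_once(A11, A21, A22):
    """'Write once': a single fused pass that reads each input once and
    writes S exactly once, as a custom generated loop would."""
    return A11 + A21 - A22
```

Both compute the same S; the difference is how many times S moves through memory, which matters once additions are bandwidth-bound.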

Page 30: A framework for practical fast matrix multiplication (BLIS retreat)

Common subexpression elimination (CSE)

• Example in <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: T11 and T25 are both formed from sub-blocks B12, B22, B23, B24]

Four additions, six reads, two writes


Page 31: A framework for practical fast matrix multiplication (BLIS retreat)

Common subexpression elimination (CSE)

• Example in <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: with CSE, the shared combination is computed once as Y and reused in forming T11 and T25]

Three additions, six reads, three writes

Net increase in communication!


Page 32: A framework for practical fast matrix multiplication (BLIS retreat)

CSE does not really help

Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / (time in seconds)
