  • NUMA-aware Matrix-Matrix-Multiplication

    Max Reimann, Philipp Otto


  • About this talk

    • Objective: Show how to improve the performance of algorithms on a NUMA system, using MMM as an example

    • Code was written in C with numa.h and pthread.h

    • Tested on FSOC:

    – ubuntu-0101: 2 nodes, 24 cores

    – dl980: 8 nodes, 128 cores

    • Compiled with gcc -O3

  • Naïve Matrix-Matrix-Multiplication

    • We will examine MMM for large n x n matrices

    • Runtime complexity: 𝒪(n³)

    Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png

  • Naïve MMM implementation

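    The implementation slide itself is not reproduced in the transcript. A minimal sketch of the naive triple-loop multiplication in C (illustrative, not the original slide code; row-major matrices of doubles are assumed):

    /* Naive O(n^3) matrix-matrix multiplication: C = A * B */
    void naive_mmm(const double *A, const double *B, double *C, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }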

  • Performance of Naive vs. MKL

    Execution time in seconds (dl980 on one core):

    n       512     1024    2048
    Naive   0.38    11.79   98.14
    MKL     0.02    0.13    1.02

  • Intel Math Kernel Library (MKL)

    • BLAS: Basic Linear Algebra Subprograms

    – Standard for Linear Algebra

    • MKL:

    – Implements BLAS for Intel hardware

    – Vectorized and threaded for highest performance

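    MKL is typically called through the standard BLAS/CBLAS interface. A minimal sketch of such a call (assuming MKL's header and a row-major layout; not taken from the slides):

    #include <mkl.h>   /* provides the CBLAS interface, e.g. cblas_dgemm */

    /* C = 1.0 * A * B + 0.0 * C for n x n matrices of doubles */
    void mkl_mmm(const double *A, const double *B, double *C, int n)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,        /* M, N, K       */
                    1.0, A, n,      /* alpha, A, lda */
                    B, n,           /* B, ldb        */
                    0.0, C, n);     /* beta, C, ldc  */
    }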

  • Analysis of Naïve MMM

    • Test setup:

    – Use ubuntu-numa machine

    – No thread or memory pinning

    – Use numatop/pcm

    • Performance tools show:

    – Unused cores (obvious)

    – QPI cannot be fully loaded with one thread

  • Parallelization I

    • How can the work be divided?

    – 1. Partition the computation of matrix C by rows or columns

    • Problem: All threads need matrix A and matrix B

    • Solution:

    – Accept the overhead of remote memory access, or

    – Copy the input/output matrices to the other nodes (preprocessing)

  • Parallelization – Partition by rows

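    The code shown on this slide is not part of the transcript. A sketch of how the row partition could be expressed with pthreads (illustrative names and layout, not the original code):

    #include <pthread.h>

    typedef struct {
        const double *A, *B;
        double *C;
        int n, row_start, row_end;   /* this thread's block of rows of C */
    } worker_args;

    static void *mmm_rows(void *arg)
    {
        worker_args *w = arg;
        for (int i = w->row_start; i < w->row_end; i++)
            for (int j = 0; j < w->n; j++) {
                double sum = 0.0;
                for (int k = 0; k < w->n; k++)
                    sum += w->A[i * w->n + k] * w->B[k * w->n + j];
                w->C[i * w->n + j] = sum;
            }
        return NULL;
    }

    void parallel_rows(const double *A, const double *B, double *C, int n, int nthreads)
    {
        pthread_t tid[nthreads];
        worker_args args[nthreads];
        int per = (n + nthreads - 1) / nthreads;          /* rows per thread */
        for (int t = 0; t < nthreads; t++) {
            int start = t * per, end = start + per < n ? start + per : n;
            args[t] = (worker_args){ A, B, C, n, start, end };
            pthread_create(&tid[t], NULL, mmm_rows, &args[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }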

  • Parallelization – Partition by rows

    Execution time in seconds (dl980 on 128 cores):

    n                  512     1024    2048
    Naive Sequential   0.38    11.79   98.14
    Naive Parallel     0.05    0.26    2.54
    MKL Parallel       0.19    0.27    0.28

  • Parallelization II

    • How can the work be divided?

    – 2. Partition the computation of matrix C by summands

    • Benefit:

    – For computing the i-th summand, only the i-th row of matrix A / column of matrix B is needed

    – This allows copying only the needed parts to the other nodes

    • Disadvantage:

    – Matrix B has to be transposed to be able to partition the memory (preprocessing)

    – Locking or merging of matrix C is needed

  • Parallelization II

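    The implementation on this slide is not included in the transcript. One possible reading of the summand partition, sketched in C (illustrative; the exact data layout on the slides may differ): each thread accumulates a range of the summation index into its own private copy of C, and the private copies are merged afterwards instead of locking. B is assumed to be stored transposed (Bt[j*n+k] == B[k*n+j]) so the inner loop reads both operands with unit stride.

    /* Partial product over the summation range [k_start, k_end) into Cpriv. */
    void partial_sum(const double *A, const double *Bt, double *Cpriv,
                     int n, int k_start, int k_end)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = k_start; k < k_end; k++)
                    sum += A[i * n + k] * Bt[j * n + k];
                Cpriv[i * n + j] += sum;
            }
    }

    /* Merge step: C = sum of all thread-private copies. */
    void merge(double *C, double *const *Cprivs, int nthreads, int n)
    {
        for (int t = 0; t < nthreads; t++)
            for (long i = 0; i < (long)n * n; i++)
                C[i] += Cprivs[t][i];
    }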

  • Performance of „Parallel Sum“ Method

    Execution time in seconds (dl980 on 128 cores):

    n                512     1024    2048    4096     8192
    Parallel sum     1.59    2.81    3.34    14.91    218.84
    Naive Parallel   0.27    1.41    2.94    17.24    186.39
    MKL Parallel     0.19    0.27    0.28    0.43     2.41

  • Strassen

    • Runtime complexity:

    – Naive algorithm: 𝒪(n³)

    • Can we get better?

    – Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity

    – Runtime 𝒪(n^(log₂ 7)) ≈ 𝒪(n^2.8)

    – Uses only 7 multiplications instead of 8 per recursion step

    • Algorithms today get down to about 𝒪(n^2.37), but are not practical

  • Matrix definition

    For matrices A, B, C with dimension n = 4k, k ∈ ℕ, A, B, C can be viewed as 2x2 block matrices:

    A = | A1,1  A1,2 |    B = | B1,1  B1,2 |    C = | C1,1  C1,2 |
        | A2,1  A2,2 |        | B2,1  B2,2 |        | C2,1  C2,2 |

    The conventional algorithm uses 8 (expensive) multiplications:

    C1,1 = A1,1 ∙ B1,1 + A1,2 ∙ B2,1        C1,2 = A1,1 ∙ B1,2 + A1,2 ∙ B2,2

    C2,1 = A2,1 ∙ B1,1 + A2,2 ∙ B2,1        C2,2 = A2,1 ∙ B1,2 + A2,2 ∙ B2,2

  • Strassen’s algorithm

    Define temporary matrices:

    M1 := (A1,1 + A2,2) ∙ (B1,1 + B2,2)
    M2 := (A2,1 + A2,2) ∙ B1,1
    M3 := A1,1 ∙ (B1,2 − B2,2)
    M4 := A2,2 ∙ (B2,1 − B1,1)
    M5 := (A1,1 + A1,2) ∙ B2,2
    M6 := (A2,1 − A1,1) ∙ (B1,1 + B1,2)
    M7 := (A1,2 − A2,2) ∙ (B2,1 + B2,2)

    Compose the final matrix:

    C1,1 = M1 + M4 − M5 + M7
    C1,2 = M3 + M5
    C2,1 = M2 + M4
    C2,2 = M1 − M2 + M3 + M6

    Only 7 multiplications!

  • Strassen - Example

    Substituting the Mi by their terms gives back the original formula for

    | A1,1  A1,2 |   | B1,1  B1,2 |   | C1,1  C1,2 |
    | A2,1  A2,2 | ∙ | B2,1  B2,2 | = | C2,1  C2,2 |

    e.g. for C1,2:

    C1,2 = M3 + M5
         = A1,1 ∙ (B1,2 − B2,2) + (A1,1 + A1,2) ∙ B2,2
         = A1,1B1,2 − A1,1B2,2 + A1,1B2,2 + A1,2B2,2
         = A1,1B1,2 + A1,2B2,2

  • Strassen - Analysis

    • Cost: 7 multiplications and 18 additions per step

    – vs. 8 multiplications and 4 additions for the naïve block scheme

    • Usually only considered practical for large matrices (n > 1000)

    – Although our results indicate otherwise (see later)

    • Define a cutoff point for the recursion

    – If n is sufficiently small, do the naïve multiplication

  • Strassen - Implementation

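    The implementation slide is a code listing that is not reproduced in the transcript. A compact sketch of the recursion with the cutoff described above (illustrative, not the original code; written for n a power of two, reusing the naive_mmm sketch as base case; blocks are copied into separate buffers for clarity rather than performance):

    #include <stdlib.h>
    #include <string.h>

    #define BREAK 64   /* recursion cutoff used on the slides */

    static void madd(const double *X, const double *Y, double *Z, int n)
    { for (long i = 0; i < (long)n * n; i++) Z[i] = X[i] + Y[i]; }

    static void msub(const double *X, const double *Y, double *Z, int n)
    { for (long i = 0; i < (long)n * n; i++) Z[i] = X[i] - Y[i]; }

    void strassen(const double *A, const double *B, double *C, int n)
    {
        if (n <= BREAK) { naive_mmm(A, B, C, n); return; }   /* fall back to naive */

        int h = n / 2;
        size_t block = (size_t)h * h * sizeof(double);
        double *buf = malloc(17 * block), *q = buf;          /* 8 quadrants + 2 temps + M1..M7 */
        double *A11 = q, *A12 = q + h*h, *A21 = q + 2*h*h, *A22 = q + 3*h*h;
        double *B11 = q + 4*h*h, *B12 = q + 5*h*h, *B21 = q + 6*h*h, *B22 = q + 7*h*h;
        double *T1 = q + 8*h*h, *T2 = q + 9*h*h, *M[7];
        for (int m = 0; m < 7; m++) M[m] = q + (10 + m) * h*h;

        for (int i = 0; i < h; i++) {                        /* split A and B into quadrants */
            memcpy(&A11[i*h], &A[i*n],       h * sizeof(double));
            memcpy(&A12[i*h], &A[i*n+h],     h * sizeof(double));
            memcpy(&A21[i*h], &A[(i+h)*n],   h * sizeof(double));
            memcpy(&A22[i*h], &A[(i+h)*n+h], h * sizeof(double));
            memcpy(&B11[i*h], &B[i*n],       h * sizeof(double));
            memcpy(&B12[i*h], &B[i*n+h],     h * sizeof(double));
            memcpy(&B21[i*h], &B[(i+h)*n],   h * sizeof(double));
            memcpy(&B22[i*h], &B[(i+h)*n+h], h * sizeof(double));
        }

        madd(A11, A22, T1, h); madd(B11, B22, T2, h); strassen(T1, T2, M[0], h);  /* M1 */
        madd(A21, A22, T1, h);                        strassen(T1, B11, M[1], h); /* M2 */
        msub(B12, B22, T2, h);                        strassen(A11, T2, M[2], h); /* M3 */
        msub(B21, B11, T2, h);                        strassen(A22, T2, M[3], h); /* M4 */
        madd(A11, A12, T1, h);                        strassen(T1, B22, M[4], h); /* M5 */
        msub(A21, A11, T1, h); madd(B11, B12, T2, h); strassen(T1, T2, M[5], h);  /* M6 */
        msub(A12, A22, T1, h); madd(B21, B22, T2, h); strassen(T1, T2, M[6], h);  /* M7 */

        for (int i = 0; i < h; i++)                          /* compose C from M1..M7 */
            for (int j = 0; j < h; j++) {
                int k = i*h + j;
                C[i*n + j]         = M[0][k] + M[3][k] - M[4][k] + M[6][k];  /* C11 */
                C[i*n + j + h]     = M[2][k] + M[4][k];                      /* C12 */
                C[(i+h)*n + j]     = M[1][k] + M[3][k];                      /* C21 */
                C[(i+h)*n + j + h] = M[0][k] - M[1][k] + M[2][k] + M[5][k];  /* C22 */
            }
        free(buf);
    }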

  • Execution Time: Single-threaded

    Execution time in seconds (dl980 on 1 core, Strassen cutoff BREAK = 64):

    n          32     64     128    256    512    1024    2048
    Naive      0.00   0.00   0.01   0.05   0.38   11.79   98.14
    Strassen   0.00   0.00   0.00   0.02   0.12   0.87    6.12
    MKL        0.00   0.00   0.00   0.00   0.02   0.13    1.02

  • Parallelization of Strassen I

    • Data dependencies:

    – The additions within an Mi have to be done before its multiplication

    • e.g. M1 = (A1,1 + A2,2) ∙ (B1,1 + B2,2)

    – All Mi have to be calculated before calculating C

    • e.g. C1,2 = M3 + M5

    • Easiest solution:

    – Calculate the Mi in parallel

    – Then calculate the Ci,j in parallel

  • Parallelization of Strassen II

    • Level 1 can be scheduled to 7 threads, level n to 7^n threads

    – Most systems have a power-of-two number of processors

    • We used manual parallelization

    – 49 distinct functions for the Ms and 16 for the Cs

    – Code bloat and not scalable, BUT:

    • Automatic parallelization is hard

    – Thread load becomes very unbalanced

    – Every level needs 7 temporary matrices

    • Exponentially rising memory requirements

  • Execution Time – 49 Threads

    Execution time in seconds (dl980 on 49 cores):

    n          512    1024   2048   4096    8192
    Naive      0.05   0.26   2.54   27.61   228.57
    Strassen   0.05   0.14   0.49   2.06    13.53
    MKL        0.19   0.27   0.28   0.44    1.84

  • NUMA-Optimizations

    • Try to keep as much memory as possible local, to avoid remote memory access

    – Because remote access is slower by a factor of ~1.4

    • Partition data and work depending on #nodes and #cores

    • Pin threads to the nodes holding the memory they need

    • (Topology for other algorithms)

  • Distributing memory and threads

    Execution time of the parallel naive version in seconds (ubuntu-numa0101 on 24 cores):

    n                                1024   2048    4096
    Distributed memory and threads   0.34   11.39   101.12
    Neither distributed              0.35   18.34   182.45
    Distributed threads only         0.35   21.96   204.85
    Distributed memory only          0.37   14.33   143.44

  • DEMO


  • Application of NUMA-Optimizations

    • Copy all data to every node:

    – Duration of preprocessing: 11.11 s for an 8192x8192 matrix to 8 nodes

    • Partition data and move the parts to their corresponding nodes:

    – Duration of preprocessing: 1.03 s for an 8192x8192 matrix to 8 nodes

    • Pin threads to nodes:

    – int numa_run_on_node(int node);
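    A minimal sketch of how the last two points can be combined with libnuma (illustrative helper, not the original code): the calling thread is bound to a node and its partition of the input is copied into memory allocated on that node.

    #include <numa.h>
    #include <string.h>

    /* Pin the calling thread to 'node' and place 'bytes' of its input data there. */
    double *setup_partition(const double *part, size_t bytes, int node)
    {
        numa_run_on_node(node);                           /* pin this thread to the node   */
        double *local = numa_alloc_onnode(bytes, node);   /* allocate on the node's memory */
        if (local)
            memcpy(local, part, bytes);                   /* copy the partition (preprocessing) */
        return local;                                     /* release later with numa_free(local, bytes) */
    }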

  • Parallelization – Partition by rows Copying memory to different nodes


  • Strassen Memory Distribution Effects

    [Chart: execution time for a 16384x16384 multiplication, broken down into memory copy, multiplication and result combination, comparing copying the data to 6, 7 or 8 nodes against distributed data; labelled values 22.15, 19.61, 21.32 and 14.67 s; dl980 on 128 cores]

  • Other optimization techniques

    • Tiling

    • Vectorization

    • Scalar replacement

    • Precomputation of constants

    • (unrolling)


  • Tiling

    • Divide the computational work into tiles to leverage the cache

    • Tile size depends on the cache size

    – gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
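    A sketch of a tiled (blocked) version of the multiplication (illustrative, not the original code; TILE would be tuned to the cache size, e.g. derived from the cache-line size passed in via -DCLS as above, and C is assumed to be zero-initialized):

    #ifndef TILE
    #define TILE 64
    #endif

    void tiled_mmm(const double *A, const double *B, double *C, int n)
    {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    /* multiply one TILE x TILE block so the operands stay cache-resident */
                    for (int i = ii; i < ii + TILE && i < n; i++)
                        for (int k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }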

  • Performance of Tiling

    perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

    Execution time in seconds for n = 2048 (dl980 on 128 cores):

    Not tiled, not transposed   97
    Not tiled, transposed       39
    Tiled, not transposed       13
    Tiled, transposed           12

    [Chart: corresponding L1 data cache, LLC and dTLB miss counts for the four variants, log scale]

  • Vectorization

    • SIMD: Single Instruction, Multiple Data

    • All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)

    • An instruction is applied to multiple floats simultaneously

    • Can only operate efficiently on aligned data (16-byte aligned)

    • SSE operates on 128-bit registers

    – Newer Intel processors have Advanced Vector Extensions (AVX) with 256 bits

    – The dl980 machine only supports 128-bit operations

  • Auto-Vectorization

    • Can this be done automatically?

    – gcc -O3 tries to auto-vectorize

    • Only possible for simple statements
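    As a small illustration (not from the slides), a loop like the following is simple enough for gcc -O3 to auto-vectorize: the restrict-qualified pointers rule out aliasing and there are no loop-carried dependencies.

    /* Element-wise addition; each iteration is independent, so the compiler
     * can map it to SSE instructions on its own. */
    void add_arrays(float *restrict dst, const float *restrict a,
                    const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }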

  • Assembler


  • Aligned Malloc

    Example:

    • numa_alloc returns addr 0x1232, which is not 16-byte aligned

    • We add 15, so addr = 0x1241 = 0b1001001000001

    • Now we clear the last 4 bits by ANDing with ~0x0f (= 0xfff0)

    • => the result 0x1240 is 16-byte aligned
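    A sketch of this trick in C (illustrative helper, not the original code): over-allocate, round the pointer up to the next 16-byte boundary, and keep the original pointer so it can still be freed.

    #include <numa.h>
    #include <stddef.h>
    #include <stdint.h>

    double *alloc_aligned16(size_t bytes, int node, void **raw_out)
    {
        void *raw = numa_alloc_onnode(bytes + 15, node);  /* over-allocate by 15 bytes */
        *raw_out = raw;                                   /* keep for numa_free(raw, bytes + 15) */
        if (raw == NULL)
            return NULL;
        uintptr_t p = ((uintptr_t)raw + 15) & ~(uintptr_t)0x0f;  /* add 15, clear the low 4 bits */
        return (double *)p;
    }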

  • Intrinsics


    Example

    Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

  • Use Parallelism for MMM

    • We try to construct a 4x4 matrix multiplication kernel

    • How to process rows?

    [Diagram: a row of A lies in contiguous memory, but a column of B does not, so it can't be loaded in one instruction]

  • Use parallelism for MMM

    • We try to construct a 4x4 matrix multiplication kernel

    • How to process rows?

    • Idea: process all elements of a row of B in parallel

    [Diagram: A1,1 is broadcast and multiplied with the row (B1,1 B1,2 B1,3 B1,4); the same is done for A1,2, A1,3, A1,4 with the remaining rows of B, and the partial results are added up]

  • 4x4 Kernel

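    The kernel listing on this slide is not part of the transcript. A sketch of a 4x4 single-precision kernel following the idea from the previous slides (broadcast one element of A, multiply it with a contiguous, 16-byte aligned row of B, accumulate a row of C); illustrative, not the original code:

    #include <xmmintrin.h>   /* SSE intrinsics */

    void kernel4x4(const float *A, const float *B, float *C)
    {
        for (int i = 0; i < 4; i++) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < 4; k++) {
                __m128 a = _mm_set1_ps(A[i * 4 + k]);   /* broadcast A[i][k]          */
                __m128 b = _mm_load_ps(&B[k * 4]);      /* load row k of B (aligned)  */
                c = _mm_add_ps(c, _mm_mul_ps(a, b));    /* accumulate the partial row */
            }
            _mm_store_ps(&C[i * 4], c);                 /* write row i of C           */
        }
    }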

  • SSE – Single Threaded

    Execution time in seconds (dl980 on 1 core):

    n          1024   2048   4096
    naiveSSE   0.27   2      20
    tiledSSE   0.48   5      41
    tiled      2      24     213
    naive      11     97     879

  • Cache Misses of SSE Variants

    [Chart: L1 cache misses and dTLB misses of the naiveSSE and tiledSSE variants]

  • Performance for Small Matrices

    [Chart: execution time in seconds for n = 64, 128, 256 and 512, comparing naiveSSE, tiled, strassen and MKL; dl980 on 128 cores]

  • Performance for Large Matrices

    [Chart: execution time in seconds for n = 1024, 2048, 4096 and 8192, comparing naiveSSE, tiled, strassenSSE and MKL; dl980 on 128 cores]

  • Summary

    • Analyze the algorithm for bottlenecks

    – I/O optimization

    – Hardware-specific optimization

    • Cache size

    • NUMA architecture

    • Specific instructions (SSE)

    • Try to minimize remote memory access

    • Visualisations can facilitate understanding