  • NUMA-aware Matrix-Matrix-Multiplication

    Max Reimann, Philipp Otto


  • About this talk

    • Objective: Show how to improve the performance of algorithms on a NUMA system, using MMM as an example

    • Code was written in C with numa.h and pthread.h

    • Tested on FSOC:

    – ubuntu-0101: 2 nodes, 24 cores

    – dl980: 8 nodes, 128 cores

    • Compiled with gcc -O3

  • Naïve Matrix-Matrix-Multiplication

    • We will examine MMM for large n x n matrices

    • Runtime complexity: 𝒪(n³)

    Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png

  • Naïve MMM implementation

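    The implementation slide itself is not reproduced in the transcript. A minimal sketch of the naive triple-loop multiplication in C (illustrative, not the original slide code; row-major matrices of doubles are assumed):

    /* Naive O(n^3) matrix-matrix multiplication: C = A * B */
    void naive_mmm(const double *A, const double *B, double *C, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }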

  • Performance of Naive vs. MKL

    Execution time in seconds (dl980 on one core):

    n       512     1024    2048
    Naive   0.38    11.79   98.14
    MKL     0.02    0.13    1.02

  • Intel Math Kernel Library (MKL)

    • BLAS: Basic Linear Algebra Subprograms

    – Standard for Linear Algebra

    • MKL:

    – Implements BLAS for Intel hardware

    – Vectorized and threaded for highest performance

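    MKL is typically called through the standard BLAS/CBLAS interface. A minimal sketch of such a call (assuming MKL's header and a row-major layout; not taken from the slides):

    #include <mkl.h>   /* provides the CBLAS interface, e.g. cblas_dgemm */

    /* C = 1.0 * A * B + 0.0 * C for n x n matrices of doubles */
    void mkl_mmm(const double *A, const double *B, double *C, int n)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,        /* M, N, K       */
                    1.0, A, n,      /* alpha, A, lda */
                    B, n,           /* B, ldb        */
                    0.0, C, n);     /* beta, C, ldc  */
    }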

  • Analysis of Naïve MMM

    • Test setup:

    – Use ubuntu-numa machine

    – No thread or memory pinning

    – Use numatop/pcm

    • Performance tools show:

    – Unused cores (obvious)

    – QPI cannot be fully loaded with one thread

  • Parallelization I

    • How can the work be divided?

    – 1. Partition the computation of matrix C by rows or columns

    • Problem: All threads need matrix A and matrix B

    • Solution:

    – Accept the overhead of remote memory access, or

    – Copy the input/output matrices to the other nodes (preprocessing)

  • Parallelization – Partition by rows

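    The code shown on this slide is not part of the transcript. A sketch of how the row partition could be expressed with pthreads (illustrative names and layout, not the original code):

    #include <pthread.h>

    typedef struct {
        const double *A, *B;
        double *C;
        int n, row_start, row_end;   /* this thread's block of rows of C */
    } worker_args;

    static void *mmm_rows(void *arg)
    {
        worker_args *w = arg;
        for (int i = w->row_start; i < w->row_end; i++)
            for (int j = 0; j < w->n; j++) {
                double sum = 0.0;
                for (int k = 0; k < w->n; k++)
                    sum += w->A[i * w->n + k] * w->B[k * w->n + j];
                w->C[i * w->n + j] = sum;
            }
        return NULL;
    }

    void parallel_rows(const double *A, const double *B, double *C, int n, int nthreads)
    {
        pthread_t tid[nthreads];
        worker_args args[nthreads];
        int per = (n + nthreads - 1) / nthreads;          /* rows per thread */
        for (int t = 0; t < nthreads; t++) {
            int start = t * per, end = start + per < n ? start + per : n;
            args[t] = (worker_args){ A, B, C, n, start, end };
            pthread_create(&tid[t], NULL, mmm_rows, &args[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }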

  • Parallelization – Partition by rows

    Execution time in seconds (dl980 on 128 cores):

    n                  512     1024    2048
    Naive Sequential   0.38    11.79   98.14
    Naive Parallel     0.05    0.26    2.54
    MKL Parallel       0.19    0.27    0.28

  • Parallelization II

    • How can the work be divided?

    – 2. Partition the computation of matrix C by summands

    • Benefit:

    – For computing the i-th summand, only the i-th row of matrix A / column of matrix B is needed

    – This allows copying only the needed parts to the other nodes

    • Disadvantage:

    – Matrix B has to be transposed to be able to partition the memory (preprocessing)

    – Locking or merging of matrix C is needed

  • Parallelization II

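    The implementation on this slide is not included in the transcript. One possible reading of the summand partition, sketched in C (illustrative; the exact data layout on the slides may differ): each thread accumulates a range of the summation index into its own private copy of C, and the private copies are merged afterwards instead of locking. B is assumed to be stored transposed (Bt[j*n+k] == B[k*n+j]) so the inner loop reads both operands with unit stride.

    /* Partial product over the summation range [k_start, k_end) into Cpriv. */
    void partial_sum(const double *A, const double *Bt, double *Cpriv,
                     int n, int k_start, int k_end)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = k_start; k < k_end; k++)
                    sum += A[i * n + k] * Bt[j * n + k];
                Cpriv[i * n + j] += sum;
            }
    }

    /* Merge step: C = sum of all thread-private copies. */
    void merge(double *C, double *const *Cprivs, int nthreads, int n)
    {
        for (int t = 0; t < nthreads; t++)
            for (long i = 0; i < (long)n * n; i++)
                C[i] += Cprivs[t][i];
    }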

  • Performance of „Parallel Sum“ Method

    Execution time in seconds (dl980 on 128 cores):

    n                512     1024    2048    4096     8192
    Parallel sum     1.59    2.81    3.34    14.91    218.84
    Naive Parallel   0.27    1.41    2.94    17.24    186.39
    MKL Parallel     0.19    0.27    0.28    0.43     2.41

  • Strassen

    • Runtime complexity:

    – Naive algorithm: 𝒪(n³)

    • Can we get better?

    – Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity

    – Runtime 𝒪(n^(log₂ 7)) ≈ 𝒪(n^2.8)

    – Uses only 7 multiplications instead of 8 per recursion step

    • Algorithms today get down to about 𝒪(n^2.37), but are not practical

  • Matrix definition

    For matrices A, B, C with dimension n = 4k, k ∈ ℕ, A, B, C can be viewed as 2x2 block matrices:

    A = | A1,1  A1,2 |    B = | B1,1  B1,2 |    C = | C1,1  C1,2 |
        | A2,1  A2,2 |        | B2,1  B2,2 |        | C2,1  C2,2 |

    The conventional algorithm uses 8 (expensive) multiplications:

    C1,1 = A1,1 ∙ B1,1 + A1,2 ∙ B2,1        C1,2 = A1,1 ∙ B1,2 + A1,2 ∙ B2,2

    C2,1 = A2,1 ∙ B1,1 + A2,2 ∙ B2,1        C2,2 = A2,1 ∙ B1,2 + A2,2 ∙ B2,2

  • Strassen’s algorithm

    Define temporary matrices:

    M1 := (A1,1 + A2,2) ∙ (B1,1 + B2,2)
    M2 := (A2,1 + A2,2) ∙ B1,1
    M3 := A1,1 ∙ (B1,2 − B2,2)
    M4 := A2,2 ∙ (B2,1 − B1,1)
    M5 := (A1,1 + A1,2) ∙ B2,2
    M6 := (A2,1 − A1,1) ∙ (B1,1 + B1,2)
    M7 := (A1,2 − A2,2) ∙ (B2,1 + B2,2)

    Compose the final matrix:

    C1,1 = M1 + M4 − M5 + M7
    C1,2 = M3 + M5
    C2,1 = M2 + M4
    C2,2 = M1 − M2 + M3 + M6

    Only 7 multiplications!

  • Strassen - Example

    Substituting the Mi by their terms gives back the original formula for

    | A1,1  A1,2 |   | B1,1  B1,2 |   | C1,1  C1,2 |
    | A2,1  A2,2 | ∙ | B2,1  B2,2 | = | C2,1  C2,2 |

    e.g. for C1,2:

    C1,2 = M3 + M5
         = A1,1 ∙ (B1,2 − B2,2) + (A1,1 + A1,2) ∙ B2,2
         = A1,1B1,2 − A1,1B2,2 + A1,1B2,2 + A1,2B2,2
         = A1,1B1,2 + A1,2B2,2

  • Strassen - Analysis

    • Cost: 7 multiplications and 18 additions per step

    – vs. 8 multiplications and 4 additions for the naïve block scheme

    • Usually only considered practical for large matrices (n > 1000)

    – Although our results indicate otherwise (see later)

    • Define a cutoff point for the recursion

    – If n is sufficiently small, do the naïve multiplication

  • Strassen - Implementation

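    The implementation slide is a code listing that is not reproduced in the transcript. A compact sketch of the recursion with the cutoff described above (illustrative, not the original code; written for n a power of two, reusing the naive_mmm sketch as base case; blocks are copied into separate buffers for clarity rather than performance):

    #include <stdlib.h>
    #include <string.h>

    #define BREAK 64   /* recursion cutoff used on the slides */

    static void madd(const double *X, const double *Y, double *Z, int n)
    { for (long i = 0; i < (long)n * n; i++) Z[i] = X[i] + Y[i]; }

    static void msub(const double *X, const double *Y, double *Z, int n)
    { for (long i = 0; i < (long)n * n; i++) Z[i] = X[i] - Y[i]; }

    void strassen(const double *A, const double *B, double *C, int n)
    {
        if (n <= BREAK) { naive_mmm(A, B, C, n); return; }   /* fall back to naive */

        int h = n / 2;
        size_t block = (size_t)h * h * sizeof(double);
        double *buf = malloc(17 * block), *q = buf;          /* 8 quadrants + 2 temps + M1..M7 */
        double *A11 = q, *A12 = q + h*h, *A21 = q + 2*h*h, *A22 = q + 3*h*h;
        double *B11 = q + 4*h*h, *B12 = q + 5*h*h, *B21 = q + 6*h*h, *B22 = q + 7*h*h;
        double *T1 = q + 8*h*h, *T2 = q + 9*h*h, *M[7];
        for (int m = 0; m < 7; m++) M[m] = q + (10 + m) * h*h;

        for (int i = 0; i < h; i++) {                        /* split A and B into quadrants */
            memcpy(&A11[i*h], &A[i*n],       h * sizeof(double));
            memcpy(&A12[i*h], &A[i*n+h],     h * sizeof(double));
            memcpy(&A21[i*h], &A[(i+h)*n],   h * sizeof(double));
            memcpy(&A22[i*h], &A[(i+h)*n+h], h * sizeof(double));
            memcpy(&B11[i*h], &B[i*n],       h * sizeof(double));
            memcpy(&B12[i*h], &B[i*n+h],     h * sizeof(double));
            memcpy(&B21[i*h], &B[(i+h)*n],   h * sizeof(double));
            memcpy(&B22[i*h], &B[(i+h)*n+h], h * sizeof(double));
        }

        madd(A11, A22, T1, h); madd(B11, B22, T2, h); strassen(T1, T2, M[0], h);  /* M1 */
        madd(A21, A22, T1, h);                        strassen(T1, B11, M[1], h); /* M2 */
        msub(B12, B22, T2, h);                        strassen(A11, T2, M[2], h); /* M3 */
        msub(B21, B11, T2, h);                        strassen(A22, T2, M[3], h); /* M4 */
        madd(A11, A12, T1, h);                        strassen(T1, B22, M[4], h); /* M5 */
        msub(A21, A11, T1, h); madd(B11, B12, T2, h); strassen(T1, T2, M[5], h);  /* M6 */
        msub(A12, A22, T1, h); madd(B21, B22, T2, h); strassen(T1, T2, M[6], h);  /* M7 */

        for (int i = 0; i < h; i++)                          /* compose C from M1..M7 */
            for (int j = 0; j < h; j++) {
                int k = i*h + j;
                C[i*n + j]         = M[0][k] + M[3][k] - M[4][k] + M[6][k];  /* C11 */
                C[i*n + j + h]     = M[2][k] + M[4][k];                      /* C12 */
                C[(i+h)*n + j]     = M[1][k] + M[3][k];                      /* C21 */
                C[(i+h)*n + j + h] = M[0][k] - M[1][k] + M[2][k] + M[5][k];  /* C22 */
            }
        free(buf);
    }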

  • Execution Time: Single-threaded

    Execution time in seconds (dl980 on 1 core, Strassen cutoff BREAK = 64):

    n          32     64     128    256    512    1024    2048
    Naive      0.00   0.00   0.01   0.05   0.38   11.79   98.14
    Strassen   0.00   0.00   0.00   0.02   0.12   0.87    6.12
    MKL        0.00   0.00   0.00   0.00   0.02   0.13    1.02

  • Parallelization of Strassen I

    • Data dependencies:

    – The additions within an Mi have to be done before its multiplication

    • e.g. M1 = (A1,1 + A2,2) ∙ (B1,1 + B2,2)

    – All Mi have to be calculated before calculating C

    • e.g. C1,2 = M3 + M5

    • Easiest solution:

    – Calculate the Mi in parallel

    – Then calculate the Ci,j in parallel

  • Parallelization of Strassen II

    • Level 1 can be scheduled to 7 threads, level n to 7^n threads

    – Most systems have a power-of-two number of processors

    • We used manual parallelization

    – 49 distinct functions for the Ms and 16 for the Cs

    – Code bloat and not scalable, BUT:

    • Automatic parallelization is hard

    – Thread load becomes very unbalanced

    – Every level needs 7 temporary matrices

    • Exponentially rising memory requirements

  • Execution Time – 49 Threads

    Execution time in seconds (dl980 on 49 cores):

    n          512    1024   2048   4096    8192
    Naive      0.05   0.26   2.54   27.61   228.57
    Strassen   0.05   0.14   0.49   2.06    13.53
    MKL        0.19   0.27   0.28   0.44    1.84

  • NUMA-Optimizations

    • Try to keep as much memory as possible local, to avoid remote memory access

    – Because remote access is slower by a factor of ~1.4

    • Partition data and work depending on #nodes and #cores

    • Pin threads to the nodes holding the memory they need

    • (Topology for other algorithms)

  • Distributing memory and threads

    Execution time of the parallel naive version in seconds (ubuntu-numa0101 on 24 cores):

    n                                1024   2048    4096
    Distributed memory and threads   0.34   11.39   101.12
    Neither distributed              0.35   18.34   182.45
    Distributed threads only         0.35   21.96   204.85
    Distributed memory only          0.37   14.33   143.44

  • DEMO


  • Application of NUMA-Optimizations

    • Copy all data to every node:

    – Duration of preprocessing: 11.11 s for an 8192x8192 matrix to 8 nodes

    • Partition data and move the parts to their corresponding nodes:

    – Duration of preprocessing: 1.03 s for an 8192x8192 matrix to 8 nodes

    • Pin threads to nodes:

    – int numa_run_on_node(int node);
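    A minimal sketch of how the last two points can be combined with libnuma (illustrative helper, not the original code): the calling thread is bound to a node and its partition of the input is copied into memory allocated on that node.

    #include <numa.h>
    #include <string.h>

    /* Pin the calling thread to 'node' and place 'bytes' of its input data there. */
    double *setup_partition(const double *part, size_t bytes, int node)
    {
        numa_run_on_node(node);                           /* pin this thread to the node   */
        double *local = numa_alloc_onnode(bytes, node);   /* allocate on the node's memory */
        if (local)
            memcpy(local, part, bytes);                   /* copy the partition (preprocessing) */
        return local;                                     /* release later with numa_free(local, bytes) */
    }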

  • Parallelization – Partition by rows Copying memory to different nodes


  • Strassen Memory Distribution Effects

    [Chart: execution time for a 16384x16384 multiplication, broken down into memory copy, multiplication and result combination, comparing copying the data to 6, 7 or 8 nodes against distributed data; labelled values 22.15, 19.61, 21.32 and 14.67 s; dl980 on 128 cores]

  • Other optimization techniques

    • Tiling

    • Vectorization

    • Scalar replacement

    • Precomputation of constants

    • (unrolling)


  • Tiling

    • Divide the computational work into tiles to leverage the cache

    • Tile size depends on the cache size

    – gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
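    A sketch of a tiled (blocked) version of the multiplication (illustrative, not the original code; TILE would be tuned to the cache size, e.g. derived from the cache-line size passed in via -DCLS as above, and C is assumed to be zero-initialized):

    #ifndef TILE
    #define TILE 64
    #endif

    void tiled_mmm(const double *A, const double *B, double *C, int n)
    {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    /* multiply one TILE x TILE block so the operands stay cache-resident */
                    for (int i = ii; i < ii + TILE && i < n; i++)
                        for (int k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }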

  • Performance of Tiling

    perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

    Execution time in seconds for n = 2048 (dl980 on 128 cores):

    Not tiled, not transposed   97
    Not tiled, transposed       39
    Tiled, not transposed       13
    Tiled, transposed           12

    [Chart: corresponding L1 data cache, LLC and dTLB miss counts for the four variants, log scale]

  • Vectorization

    • SIMD: Single Instruction, Multiple Data

    • All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)

    • An instruction is applied to multiple floats simultaneously

    • Can only operate efficiently on aligned data (16-byte aligned)

    • SSE operates on 128-bit registers

    – Newer Intel processors have Advanced Vector Extensions (AVX) with 256 bits

    – The dl980 machine only supports 128-bit operations

  • Auto-Vectorization

    • Can this be done automatically?

    – gcc -O3 tries to auto-vectorize

    • Only possible for simple statements
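    As a small illustration (not from the slides), a loop like the following is simple enough for gcc -O3 to auto-vectorize: the restrict-qualified pointers rule out aliasing and there are no loop-carried dependencies.

    /* Element-wise addition; each iteration is independent, so the compiler
     * can map it to SSE instructions on its own. */
    void add_arrays(float *restrict dst, const float *restrict a,
                    const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }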

  • Assembler


  • Aligned Malloc

    Example:

    • numa_alloc returns addr 0x1232, which is not 16-byte aligned

    • We add 15, so addr = 0x1241 = 0b1001001000001

    • Now we clear the last 4 bits by ANDing with ~0x0f (= 0xfff0)

    • => the result 0x1240 is 16-byte aligned
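    A sketch of this trick in C (illustrative helper, not the original code): over-allocate, round the pointer up to the next 16-byte boundary, and keep the original pointer so it can still be freed.

    #include <numa.h>
    #include <stddef.h>
    #include <stdint.h>

    double *alloc_aligned16(size_t bytes, int node, void **raw_out)
    {
        void *raw = numa_alloc_onnode(bytes + 15, node);  /* over-allocate by 15 bytes */
        *raw_out = raw;                                   /* keep for numa_free(raw, bytes + 15) */
        if (raw == NULL)
            return NULL;
        uintptr_t p = ((uintptr_t)raw + 15) & ~(uintptr_t)0x0f;  /* add 15, clear the low 4 bits */
        return (double *)p;
    }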

  • Intrinsics


    Example

    Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

  • Use Parallelism for MMM

    • We try to construct a 4x4 matrix multiplication kernel

    • How to process rows?

    [Diagram: a row of A lies in contiguous memory, but a column of B does not, so it can't be loaded in one instruction]

  • Use parallelism for MMM

    • We try to construct a 4x4 matrix multiplication kernel

    • How to process rows?

    • Idea: process all elements of a row of B in parallel

    [Diagram: A1,1 is broadcast and multiplied with the row (B1,1 B1,2 B1,3 B1,4); the same is done for A1,2, A1,3, A1,4 with the remaining rows of B, and the partial results are added up]

  • 4x4 Kernel

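    The kernel listing on this slide is not part of the transcript. A sketch of a 4x4 single-precision kernel following the idea from the previous slides (broadcast one element of A, multiply it with a contiguous, 16-byte aligned row of B, accumulate a row of C); illustrative, not the original code:

    #include <xmmintrin.h>   /* SSE intrinsics */

    void kernel4x4(const float *A, const float *B, float *C)
    {
        for (int i = 0; i < 4; i++) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < 4; k++) {
                __m128 a = _mm_set1_ps(A[i * 4 + k]);   /* broadcast A[i][k]          */
                __m128 b = _mm_load_ps(&B[k * 4]);      /* load row k of B (aligned)  */
                c = _mm_add_ps(c, _mm_mul_ps(a, b));    /* accumulate the partial row */
            }
            _mm_store_ps(&C[i * 4], c);                 /* write row i of C           */
        }
    }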

  • SSE – Single Threaded

    Execution time in seconds (dl980 on 1 core):

    n          1024   2048   4096
    naiveSSE   0.27   2      20
    tiledSSE   0.48   5      41
    tiled      2      24     213
    naive      11     97     879

  • Cache Misses of SSE Variants

    [Chart: L1 cache misses and dTLB misses of the naiveSSE and tiledSSE variants]

  • Performance for Small Matrices

    [Chart: execution time in seconds for n = 64, 128, 256 and 512, comparing naiveSSE, tiled, strassen and MKL; dl980 on 128 cores]

  • Performance for Large Matrices

    [Chart: execution time in seconds for n = 1024, 2048, 4096 and 8192, comparing naiveSSE, tiled, strassenSSE and MKL; dl980 on 128 cores]

  • Summary

    • Analyze the algorithm for bottlenecks

    – I/O optimization

    – Hardware-specific optimization

    • Cache size

    • NUMA architecture

    • Specific instructions (SSE)

    • Try to minimize remote memory access

    • Visualisations can facilitate understanding