A Vector Implementation of Gaussian Elimination over GF(2): Exploring the Design-Space of Strassen’s Algorithm as a Case Study
Enric Morancho
Departament d’Arquitectura de Computadors
Universitat Politècnica de Catalunya, BarcelonaTech
23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Turku, Finland, March 4th - 6th, 2015
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 1 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Introduction
Gaussian Elimination (GE) is one of the key algorithms in linear algebra
We discuss a vector implementation of GE over GF(2)
We apply this implementation to a case study:
  Enumerating all matrix-multiply algorithms over GF(2) similar to Strassen’s algorithm
    Strassen’s algorithm: first known sub-cubic matrix-multiply algorithm
  The search engine relies on solving more than 10^12 Gaussian Eliminations over GF(2) matrices
Outline
1 Introduction
2 Background
  Gaussian Elimination
  Gaussian Elimination over GF(2)
  Vector extensions
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Gaussian Elimination (GE)
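One of the key algorithms in linear algebra
  Applications: solving linear systems of equations (LSEs), inverting nonsingular matrices, ...
GE transforms a matrix into a matrix in row (column) echelon form
  Forward elimination

$$
\begin{pmatrix} 2 & 1 & 4 & 3 \\ 2 & 4 & 10 & 3 \\ 6 & 6 & 18 & 9 \\ 4 & 2 & 8 & 3 \end{pmatrix}
\longrightarrow
\begin{pmatrix} 2 & 1 & 4 & 0 \\ 0 & 3 & 6 & 3 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 0 \end{pmatrix}
$$

Gauss-Jordan Elimination (GJE) transforms a matrix into a matrix in reduced row (column) echelon form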
Gaussian Elimination (GE)
GE iteratively applies three elementary transformations:
  Swapping rows (columns)
  Scaling rows (columns)
  Adding to a row (column) a scalar multiple of another row (column)
GE is defined over an algebraic field
  Infinite fields: Q, R, ...
    Computer arithmetic over infinite fields introduces round-off errors
    GE implementations use pivoting techniques to minimize them
  Finite fields: the Galois Field of two elements (GF(2)), ...
    Computer arithmetic over finite fields is always exact
Gaussian Elimination over GF(2)
GF(2) is the Galois field of two elements (aka F2, binary field)
  GF(2) = {0, 1}
  addition ≡ bitwise XOR
    subtraction and addition are the same operation (+1 = -1)
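
The arithmetic above can be sketched in a few lines. The function names are illustrative and do not come from the slides. Multiplication, which the surviving slide text does not list, is the bitwise AND in GF(2).

```python
def gf2_add(a, b):
    """Addition in GF(2): XOR of two bits."""
    return a ^ b


def gf2_mul(a, b):
    """Multiplication in GF(2): AND of two bits."""
    return a & b


# Check both operations against ordinary arithmetic modulo 2.
for a in (0, 1):
    for b in (0, 1):
        assert gf2_add(a, b) == (a + b) % 2
        assert gf2_mul(a, b) == (a * b) % 2
    # a + a = 0, so every element is its own additive inverse (+1 = -1).
    assert gf2_add(a, a) == 0
```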
Scalar processors may emulate vector instructions
  SWAR: SIMD within a register
    Interprets general-purpose registers as bitvectors
    Bitwise operations (AND, shifts, ...) can be seen as SIMD operations
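
A minimal sketch of the SWAR idea, assuming a 64-bit register modelled by a Python integer (`WORD_BITS`, `add_rows` and `add_rows_bitwise` are hypothetical names): a single XOR adds 64 GF(2) entries at once, and gives the same result as adding them one by one.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1  # emulates the width of a 64-bit register


def add_rows(row_a, row_b):
    """Add two packed GF(2) rows: one XOR performs 64 additions."""
    return (row_a ^ row_b) & MASK


def add_rows_bitwise(row_a, row_b):
    """Reference version that adds the rows entry by entry."""
    out = 0
    for i in range(WORD_BITS):
        bit = ((row_a >> i) & 1) ^ ((row_b >> i) & 1)
        out |= bit << i
    return out


a = 0b1011_0010
b = 0b0110_0111
assert add_rows(a, b) == add_rows_bitwise(a, b) == 0b1101_0101
```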
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
  Finding the pivot element
  Row swapping
  Forward elimination and Back substitution
4 Case study
5 Conclusions and Future work
Preliminaries
We focus on row echelon forms
Vector instructions allow us to exploit the parallelism available in the matrix transformations
We represent GF(2) matrices as vector registers
  Both row-major and column-major layouts
Main steps of GE
  For each column:
    Finding the pivot
    Row swapping
    Forward elimination and Back substitution
Finding the pivot element
For each column, GE must locate a pivot (an element ≠ 0)
  First: the column must be extracted
    Depends on matrix layout
  Second: the pivot must be located
    Naïve solution: iterating bit by bit
    Optimized solution: using bit-scanning machine instructions
      tzcnt (BMI1 extension, introduced by Intel in the same processor generation as AVX2)
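
The two steps can be sketched as follows, with Python integers standing in for registers. All function names are hypothetical. The trailing-zero count is emulated with `(x & -x).bit_length() - 1`, whereas a native implementation would use a bit-scanning instruction such as tzcnt.

```python
def extract_column(rows, c):
    """Row-major layout: gather bit c of every row into one bitvector.

    Bit r of the result is entry (r, c). In a column-major layout this
    step is free, because each column is already stored as one word.
    """
    column = 0
    for r, row in enumerate(rows):
        column |= ((row >> c) & 1) << r
    return column


def find_pivot_naive(column, first_row, nrows):
    """Scan bit by bit from first_row; return -1 if no entry is 1."""
    for r in range(first_row, nrows):
        if (column >> r) & 1:
            return r
    return -1


def find_pivot_bitscan(column, first_row):
    """Same result through a single trailing-zero count."""
    candidates = column >> first_row  # ignore rows that already hold a pivot
    if candidates == 0:
        return -1
    lowest_set_bit = candidates & -candidates
    return first_row + lowest_set_bit.bit_length() - 1


rows = [0b0110, 0b0001, 0b0101]  # bit c of each word is the entry in column c
col0 = extract_column(rows, 0)
assert find_pivot_naive(col0, 0, len(rows)) == find_pivot_bitscan(col0, 0)
```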
Row swapping
Example of swapping rows 3 and 5 on a 6 × 4 GF(2) matrix stored in column-major layout
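
The figure for this example is not part of the transcript. As a rough sketch of what a row swap means in column-major layout, assume each column is one bitvector whose bit r holds the entry of row r (0-based). Swapping two rows then amounts to swapping two bit positions in every column. The XOR-based bit swap below is a common trick and is not necessarily the instruction sequence used in the talk.

```python
def swap_rows_colmajor(columns, i, j):
    """Swap rows i and j of a GF(2) matrix stored as one bitvector per column."""
    swapped = []
    for col in columns:
        differ = ((col >> i) ^ (col >> j)) & 1  # 1 only if the two bits differ
        # Flipping both positions swaps them; equal bits are left untouched.
        swapped.append(col ^ ((differ << i) | (differ << j)))
    return swapped


# A 6 x 4 matrix: four columns, each holding six row bits.
columns = [0b001000, 0b100000, 0b101000, 0b000000]
assert swap_rows_colmajor(columns, 3, 5) == [0b100000, 0b001000, 0b101000, 0b000000]
```

Because every column is processed with the same shifts and masks, a vector implementation can apply the operation to all columns held in a vector register at once.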
Forward elimination and Back substitution
Add pivot row to the rows with non-zero entries in the pivot column
  We perform all additions at the same time
Example on a 4 × 6 GF(2) matrix stored in row-major layout
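
The three steps (finding the pivot, row swapping, elimination) can be combined into a scalar sketch of Gauss-Jordan elimination over a row-major matrix. The helper names and the 4 × 6 test matrix are made up for illustration and are not the example from the slide. The branch-free mask shows how the pivot row is added only to rows with a 1 in the pivot column. A vector implementation would apply that masked XOR to all rows simultaneously instead of looping.

```python
def pack(bits):
    """Pack a list of 0/1 entries into an int; entry k goes to bit k."""
    word = 0
    for k, bit in enumerate(bits):
        word |= (bit & 1) << k
    return word


def unpack(word, ncols):
    """Inverse of pack: list the first ncols bits of word."""
    return [(word >> k) & 1 for k in range(ncols)]


def gauss_jordan(rows, ncols):
    """Reduce a row-major GF(2) matrix to reduced row echelon form.

    `rows` holds one packed int per row. Returns (reduced_rows, rank).
    """
    rows = list(rows)
    rank = 0
    for c in range(ncols):
        # Finding the pivot: first row at or below `rank` with a 1 in column c.
        pivot = next((r for r in range(rank, len(rows)) if (rows[r] >> c) & 1), -1)
        if pivot < 0:
            continue  # no pivot in this column
        # Row swapping.
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        pivot_row = rows[rank]
        # Forward elimination and back substitution in one pass.
        for r in range(len(rows)):
            if r != rank:
                mask = -((rows[r] >> c) & 1)  # 0, or -1 (all ones)
                rows[r] ^= pivot_row & mask
        rank += 1
    return rows, rank


matrix = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
reduced, rank = gauss_jordan([pack(row) for row in matrix], ncols=6)
for row in reduced:
    print(unpack(row, 6))
print("rank:", rank)
```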
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case study
  Preliminaries
  Adaptation to the case study
  Experimental setup
  Results
5 Conclusions and Future work
Block-recursive matrix-multiply algorithms
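$$A \cdot B = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \cdot \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = C$$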
P1 = A11 · B11
P2 = A12 · B21
P3 = A11 · B12
P4 = A12 · B22
P5 = A21 · B11
P6 = A22 · B21
P7 = A21 · B12
P8 = A22 · B22
C11 = P1 + P2
C12 = P3 + P4
C21 = P5 + P6
C22 = P7 + P8
a) Conventional (n^3)
P1 = (A11 + A22) · (B11 + B22)
P2 = (A21 + A22) · B11
P3 = A11 · (B12 − B22)
P4 = A22 · (−B11 + B21)
P5 = (A11 + A12) · B22
P6 = (−A11 + A21) · (B11 + B12)
P7 = (A12 − A22) · (B21 + B22)
C11 = P1 + P4 − P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 − P2 + P3 + P6
b) Strassen’s algorithm (n^2.807)
P1 = A11 · B11
P2 = A12 · B21
P3 = A22 · (B11 − B12 − B21 + B22)
P4 = (A11 − A21) · (−B12 + B22)
P5 = (A21 + A22) · (−B11 + B12)
P6 = (A11 + A12 − A21 − A22) · B22
P7 = (A11 − A21 − A22) · (B11 − B12 + B22)
C11 = P1 + P2
C12 = P1 + P5 + P6 − P7
C21 = P1 − P3 + P4 − P7
C22 = P1 + P4 + P5 − P7
c) Winograd’s variant (n^2.807)
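
As a sanity check, the three formula sets above can be transcribed directly and compared on random inputs. The sketch below uses plain integers as the four blocks, with illustrative function names. The same identities hold for matrix blocks because each product keeps its A factor on the left. The exponent 2.807 is log2(7), which follows from making 7 recursive calls on half-size blocks instead of 8.

```python
import math
import random


def conventional(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    return ((a11 * b11 + a12 * b21, a11 * b12 + a12 * b22),
            (a21 * b11 + a22 * b21, a21 * b12 + a22 * b22))


def strassen(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p1 = (a11 + a22) * (b11 + b22)
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (-b11 + b21)
    p5 = (a11 + a12) * b22
    p6 = (-a11 + a21) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)
    return ((p1 + p4 - p5 + p7, p3 + p5),
            (p2 + p4, p1 - p2 + p3 + p6))


def winograd_variant(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p1 = a11 * b11
    p2 = a12 * b21
    p3 = a22 * (b11 - b12 - b21 + b22)
    p4 = (a11 - a21) * (-b12 + b22)
    p5 = (a21 + a22) * (-b11 + b12)
    p6 = (a11 + a12 - a21 - a22) * b22
    p7 = (a11 - a21 - a22) * (b11 - b12 + b22)
    return ((p1 + p2, p1 + p5 + p6 - p7),
            (p1 - p3 + p4 - p7, p1 + p4 + p5 - p7))


rng = random.Random(0)
for _ in range(1000):
    A = tuple(tuple(rng.randint(-9, 9) for _ in range(2)) for _ in range(2))
    B = tuple(tuple(rng.randint(-9, 9) for _ in range(2)) for _ in range(2))
    assert strassen(A, B) == conventional(A, B) == winograd_variant(A, B)

print(round(math.log2(7), 3))  # exponent obtained with 7 recursive calls
```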
Block-recursive matrix-multiply algorithms
Several matrix-multiply algorithms with sub-cubic complexity
  n^2.81 [Strassen, 1969]
  n^2.79 [Pan, 1979]
  n^2.78 [Bini, 1979]
  n^2.55 [Schönhage, 1981]
  n^2.376 [Coppersmith and Winograd, 1987]
  n^2.37286 [LeGall, 2014]
But Strassen’s algorithm is optimal for 2 × 2 matrices
  Are there other algorithms similar to Strassen’s?
Using a genetic algorithm, [Oh and Moon, 2010] searched for Strassen-like algorithms over R
  Their partial search discovered 608 Strassen-like algorithms
Case Study
Enumerating all the Strassen-like algorithms for 2 × 2 GF(2) matrices
Formulation
Strassen-like algorithms make 7 recursive calls
  Each call computes a bilinear form
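
The rest of the formulation is missing from this transcript, so the following is only one plausible way to test a candidate, not necessarily the encoding used in the paper. It assumes that each product (sum of A blocks) · (sum of B blocks) is expanded into a 16-entry GF(2) vector, one entry per elementary product Aij · Bkl, and that a candidate is valid when C11, C12, C21 and C22 all lie in the span of its seven product vectors. All names are hypothetical. The candidate shown is Strassen’s algorithm with its signs dropped, since +1 = -1 in GF(2).

```python
# Bit masks selecting the blocks X11, X12, X21, X22 of A or of B.
X11, X12, X21, X22 = 1, 2, 4, 8


def expand(a_mask, b_mask):
    """16-bit vector of the elementary products contained in one bilinear form."""
    vec = 0
    for i in range(4):
        for j in range(4):
            if (a_mask >> i) & 1 and (b_mask >> j) & 1:
                vec |= 1 << (4 * i + j)
    return vec


def reduce_against(basis, vec):
    """Eliminate leading bits of vec using an echelon basis {lead bit: vector}."""
    while vec:
        lead = vec.bit_length() - 1
        if lead not in basis:
            return vec
        vec ^= basis[lead]
    return 0


def in_span(vectors, target):
    """Gaussian elimination over GF(2): is target a sum of some of the vectors?"""
    basis = {}
    for v in vectors:
        r = reduce_against(basis, v)
        if r:
            basis[r.bit_length() - 1] = r
    return reduce_against(basis, target) == 0


TARGETS = [
    expand(X11, X11) ^ expand(X12, X21),  # C11 = A11*B11 + A12*B21
    expand(X11, X12) ^ expand(X12, X22),  # C12 = A11*B12 + A12*B22
    expand(X21, X11) ^ expand(X22, X21),  # C21 = A21*B11 + A22*B21
    expand(X21, X12) ^ expand(X22, X22),  # C22 = A21*B12 + A22*B22
]

# Strassen's seven products over GF(2), as (A-side mask, B-side mask) pairs.
STRASSEN_GF2 = [
    (X11 | X22, X11 | X22), (X21 | X22, X11), (X11, X12 | X22),
    (X22, X11 | X21), (X11 | X12, X22), (X11 | X21, X11 | X12),
    (X12 | X22, X21 | X22),
]


def is_valid(products):
    vectors = [expand(a, b) for a, b in products]
    return all(in_span(vectors, t) for t in TARGETS)


assert is_valid(STRASSEN_GF2)
assert not is_valid(STRASSEN_GF2[:6])  # without P7, A12*B21 cannot be formed
```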
Two search engines for each scenario
  Generic: applies GJE/GE to all candidate algorithms
  Specialized: implements specializations
  In M4RI scenarios, specialization includes only the filtering mechanism
Search engines differ in the implementation of Gauss-Jordan elimination
Impact of code specialization: almost 4X
  Larger than doubling or even quadrupling vector length
  Impact breakdown:
    Discarding algorithms if a pivot is not found: negligible
    Applying elimination just on seven columns: 1.4X - 1.6X
    Avoiding row interchange: 1.2X - 1.4X
    Filtering-out algorithms without eight subproducts: 1.9X
Case-study results
We have found 20 Strassen-like algorithms
  Excluding permutations and symmetric versions
  Detailed in our paper
Results consistent with [Oh and Moon, 2010]
  They found additional algorithms where coefficient 0.5 is involved
Conclusions
We have discussed a vector implementation of Gaussian Elimination
Our evaluation develops a case study
  Requires solving more than 10^12 GEs over 16 × 11 GF(2) matrices
  Vector implementations clearly outperform scalar implementations
We point out the impact of code specialization for our case study
  Speedup: almost 4X
  Impact larger than doubling or quadrupling vector-register length
Performance of analyzed algebra libraries is not competitive
  Libraries are optimized for larger matrices
Future work
Developing efficient implementations of Gaussian Elimination for larger matrices