Parallel Gaussian Elimination
1 LU and Cholesky Factorization
factoring a square matrix
tiled Cholesky factorization
2 Blocked LU Factorization
deriving blocked formulations of LU
right and left looking LU factorizations
tiled algorithm for LU factorization
3 The PLASMA Software Library
Parallel Linear Algebra Software for Multicore Architectures
running an example
MCS 572 Lecture 21
Introduction to Supercomputing
Jan Verschelde, 10 October 2016
solving Ax = b with the LU factorization
To solve an n-dimensional linear system Ax = b
we factor A as a product of two triangular matrices, A = LU:
L is lower triangular, $L = [\ell_{i,j}]$, with $\ell_{i,j} = 0$ if $j > i$ and $\ell_{i,i} = 1$.
U is upper triangular, $U = [u_{i,j}]$, with $u_{i,j} = 0$ if $i > j$.
Solving Ax = b is equivalent to solving L(Ux) = b:
1 Forward substitution: Ly = b.
2 Backward substitution: Ux = y.
Factoring A costs $O(n^3)$ operations, solving the triangular systems costs $O(n^2)$.
For numerical stability, we apply partial pivoting and compute PA = LU,
where P is a permutation matrix.
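As a concrete illustration, the two triangular solves translate directly into C. Below is a minimal sketch, not a library routine: it assumes the factorization is packed into one n-by-n row-major array lu, with L stored strictly below the diagonal (its unit diagonal implied) and U stored on and above it.

```c
/* Solve L(Ux) = b from a packed LU factorization: the strictly lower
   part of lu holds L (unit diagonal implied), the rest holds U. */
void lu_solve(int n, const double *lu, const double *b, double *x)
{
    double y[n];                       /* C99 variable length array */

    for (int i = 0; i < n; i++) {      /* forward substitution: Ly = b */
        y[i] = b[i];
        for (int j = 0; j < i; j++)
            y[i] -= lu[i*n + j] * y[j];
    }
    for (int i = n - 1; i >= 0; i--) { /* backward substitution: Ux = y */
        x[i] = y[i];
        for (int j = i + 1; j < n; j++)
            x[i] -= lu[i*n + j] * x[j];
        x[i] /= lu[i*n + i];
    }
}
```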
LU factorization of the matrix A
for column j = 1, 2, . . . , n − 1 in A do
    1 find the largest element $a_{i,j}$ in column j (for $i \geq j$);
    2 if $i \neq j$, then swap rows i and j;
    3 for i = j + 1, . . . , n, for k = j + 1, . . . , n do
          $a_{i,k} := a_{i,k} - \left(\frac{a_{i,j}}{a_{j,j}}\right) a_{j,k}$.
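The pseudocode above translates almost line for line into C. The sketch below is a minimal, unblocked version; the function name, row-major storage, and error convention are illustrative assumptions, not a library API.

```c
#include <math.h>

/* In-place LU factorization with partial pivoting: on return, a holds
   both factors of PA = LU and piv records the row swaps per column. */
int lu_factor(int n, double *a, int *piv)
{
    for (int j = 0; j < n - 1; j++) {
        int p = j;                           /* find the pivot row */
        for (int i = j + 1; i < n; i++)
            if (fabs(a[i*n + j]) > fabs(a[p*n + j])) p = i;
        piv[j] = p;
        if (p != j)                          /* swap rows p and j */
            for (int k = 0; k < n; k++) {
                double t = a[j*n + k];
                a[j*n + k] = a[p*n + k];
                a[p*n + k] = t;
            }
        if (a[j*n + j] == 0.0) return j + 1; /* zero pivot: singular */
        for (int i = j + 1; i < n; i++) {    /* eliminate below the pivot */
            double m = a[i*n + j] / a[j*n + j];
            a[i*n + j] = m;                  /* store the multiplier of L */
            for (int k = j + 1; k < n; k++)
                a[i*n + k] -= m * a[j*n + k];
        }
    }
    return 0;
}
```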
Cholesky factorization
If A is symmetric, $A^T = A$, and positive definite: $x^T A x > 0$ for all $x \neq 0$,
then it is better to compute a Cholesky factorization: $A = LL^T$,
where L is a lower triangular matrix.
Because A is positive definite, no pivoting is needed,
and we need about half as many operations as LU.
for j = 1, 2, . . . , n do
    for k = 1, 2, . . . , j − 1 do
        $a_{j,j} := a_{j,j} - a_{j,k}^2$;
    $a_{j,j} := \sqrt{a_{j,j}}$;
    for i = j + 1, . . . , n do
        for k = 1, 2, . . . , j − 1 do
            $a_{i,j} := a_{i,j} - a_{i,k}\,a_{j,k}$;
        $a_{i,j} := a_{i,j}/a_{j,j}$.
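A direct transcription of these loops into C might look as follows; this is a minimal sketch, assuming row-major storage and overwriting the lower triangle of A with L.

```c
#include <math.h>

/* In-place Cholesky factorization A = L L^T for a symmetric positive
   definite matrix: the lower triangle of a is overwritten by L. */
int cholesky(int n, double *a)
{
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < j; k++)          /* a[j][j] -= a[j][k]^2 */
            a[j*n + j] -= a[j*n + k] * a[j*n + k];
        if (a[j*n + j] <= 0.0) return j + 1; /* not positive definite */
        a[j*n + j] = sqrt(a[j*n + j]);
        for (int i = j + 1; i < n; i++) {
            for (int k = 0; k < j; k++)
                a[i*n + j] -= a[i*n + k] * a[j*n + k];
            a[i*n + j] /= a[j*n + j];        /* scale column j below diag */
        }
    }
    return 0;
}
```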
tiled matrices
Let A be a symmetric, positive definite n-by-n matrix.
For tile size b, let n = p × b and consider
$$
A =
\begin{pmatrix}
A_{1,1} & A_{2,1}^T & \cdots & A_{p,1}^T \\
A_{2,1} & A_{2,2}   & \cdots & A_{p,2}^T \\
\vdots  & \vdots    & \ddots & \vdots    \\
A_{p,1} & A_{p,2}   & \cdots & A_{p,p}
\end{pmatrix},
$$
where each $A_{i,j}$ is a b-by-b tile; by symmetry, the block in position (i, j) with j > i is $A_{j,i}^T$.
A crude classification of memory hierarchies distinguishes between
registers (small), cache (medium), and main memory (large).
To reduce data movements, we want to keep data in registers
and cache as much as possible.
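To make the tile layout concrete, here is a small sketch of the index arithmetic that locates an element when the tiles are stored contiguously one after another; the layout and function name are illustrative, and PLASMA's own data layout routines differ in detail.

```c
#include <stddef.h>

/* Return element (i, j) of an n-by-n matrix stored as p-by-p tiles of
   size b-by-b (n = p*b), with tiles contiguous in memory and row-major
   order both across tiles and inside each tile. */
double get_element(const double *a, int p, int b, int i, int j)
{
    int ti = i / b, tj = j / b;      /* which tile holds (i, j)   */
    int li = i % b, lj = j % b;      /* position inside that tile */
    const double *tile = a + ((size_t)ti * p + tj) * b * b;
    return tile[li * b + lj];
}
```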
tiled Cholesky factorization
for k = 1, 2, . . . , p do
    DPOTF2($A_{k,k}$, $L_{k,k}$); -- $L_{k,k} := \text{Cholesky}(A_{k,k})$
    for i = k + 1, . . . , p do
        DTRSM($L_{k,k}$, $A_{i,k}$, $L_{i,k}$); -- $L_{i,k} := A_{i,k} L_{k,k}^{-T}$
    end for;
    for i = k + 1, . . . , p do
        for j = k + 1, . . . , p do
            DGSMM($L_{i,k}$, $L_{j,k}$, $A_{i,j}$); -- $A_{i,j} := A_{i,j} - L_{i,k} L_{j,k}^T$
        end for;
    end for;
end for.
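The loop structure above can be exercised with naive stand-ins for the kernels. The sketch below is sequential and illustrative: the kernel bodies are simple triple loops, not PLASMA's optimized DPOTF2, DTRSM, and DGSMM, and it updates only the lower tiles (taking $j \leq i$ in the innermost loop), since the upper tiles follow by symmetry.

```c
#include <math.h>
#include <stddef.h>

/* The matrix is stored as p*p contiguous tiles of size b*b; tile (i,k)
   starts at a + (i*p + k)*b*b and is row-major inside. */
static double *tile(double *a, int p, int b, int i, int k)
{
    return a + ((size_t)i * p + k) * b * b;
}

/* DPOTF2 stand-in: in-place Cholesky of one b-by-b tile */
static void chol_tile(int b, double *t)
{
    for (int j = 0; j < b; j++) {
        for (int k = 0; k < j; k++)
            t[j*b + j] -= t[j*b + k] * t[j*b + k];
        t[j*b + j] = sqrt(t[j*b + j]);
        for (int i = j + 1; i < b; i++) {
            for (int k = 0; k < j; k++)
                t[i*b + j] -= t[i*b + k] * t[j*b + k];
            t[i*b + j] /= t[j*b + j];
        }
    }
}

/* DTRSM stand-in: X := X L^{-T}, i.e. solve X_new L^T = X for X_new */
static void trsm_tile(int b, const double *l, double *x)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++) {
            for (int k = 0; k < j; k++)
                x[i*b + j] -= x[i*b + k] * l[j*b + k];
            x[i*b + j] /= l[j*b + j];
        }
}

/* DGSMM stand-in: C := C - X Y^T */
static void gemm_tile(int b, const double *x, const double *y, double *c)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            for (int k = 0; k < b; k++)
                c[i*b + j] -= x[i*b + k] * y[j*b + k];
}

/* the tiled factorization, mirroring the pseudocode above */
void tiled_cholesky(int p, int b, double *a)
{
    for (int k = 0; k < p; k++) {
        chol_tile(b, tile(a, p, b, k, k));
        for (int i = k + 1; i < p; i++)
            trsm_tile(b, tile(a, p, b, k, k), tile(a, p, b, i, k));
        for (int i = k + 1; i < p; i++)
            for (int j = k + 1; j <= i; j++)
                gemm_tile(b, tile(a, p, b, i, k), tile(a, p, b, j, k),
                          tile(a, p, b, i, j));
    }
}
```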
blocked LU factorization
The optimal size of the blocks is machine dependent.
$$
\begin{pmatrix}
A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
=
\begin{pmatrix}
L_{1,1} &         &         \\
L_{2,1} & L_{2,2} &         \\
L_{3,1} & L_{3,2} & L_{3,3}
\end{pmatrix}
\begin{pmatrix}
U_{1,1} & U_{1,2} & U_{1,3} \\
        & U_{2,2} & U_{2,3} \\
        &         & U_{3,3}
\end{pmatrix}
$$
Expanding the right hand side and equating blocks with the matrix on the left yields formulas to compute the blocks of L and U.
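For example, the first block column and block row give
$$
A_{1,1} = L_{1,1} U_{1,1}, \qquad
A_{2,1} = L_{2,1} U_{1,1}, \qquad
A_{1,2} = L_{1,1} U_{1,2},
$$
so one first factors $A_{1,1} = L_{1,1} U_{1,1}$, then obtains $L_{2,1}$, $L_{3,1}$ and $U_{1,2}$, $U_{1,3}$ from triangular solves, and updates the trailing blocks, e.g. $A_{2,2} := A_{2,2} - L_{2,1} U_{1,2}$, before repeating the process on the trailing submatrix.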
the Basic Linear Algebra Subprograms (BLAS)
Blocked algorithms cast their work as calls to the BLAS, which come in three levels:
1 Level-1 BLAS: vector operations, $O(n)$ cost.
◮ $y = \alpha x + y$
◮ dot products and vector norms
2 Level-2 BLAS: matrix-vector operations, $O(mn)$ cost.
◮ $y = \alpha A x + \beta y$
◮ $A = A + \alpha x y^T$, rank-one update
◮ $x = T^{-1} b$, for $T$ a triangular matrix
3 Level-3 BLAS: matrix-matrix operations, $O(kmn)$ cost.
◮ $C = \alpha A B + \beta C$
◮ $C = \alpha A A^T + \beta C$, rank-$k$ update of a symmetric matrix
◮ $B = \alpha T B$, for $T$ a triangular matrix
◮ $B = \alpha T^{-1} B$, solve a linear system with many right hand sides
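For example, the Level-3 update $C = \alpha A B + \beta C$ is the routine DGEMM; through the C interface to the BLAS it can be called as below. This is a minimal sketch, assuming a CBLAS installation such as OpenBLAS provides the header and library.

```c
#include <cblas.h>  /* C interface to the BLAS; link with -lopenblas or -lcblas */

int main(void)
{
    /* C := alpha*A*B + beta*C with 2-by-2 row-major matrices */
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {5.0, 6.0, 7.0, 8.0};
    double c[4] = {0.0, 0.0, 0.0, 0.0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,      /* m, n, k */
                1.0, a, 2,    /* alpha, A, lda */
                b, 2,         /* B, ldb */
                0.0, c, 2);   /* beta, C, ldc */
    return 0;
}
```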
graph driven asynchronous execution
We view a blocked algorithm as a Directed Acyclic Graph (DAG):
nodes are computational tasks performed in kernel subroutines;
edges represent the dependencies among the tasks.
Given a DAG, tasks are scheduled asynchronously and independently,
considering the dependencies imposed by the edges in the DAG.
A critical path in the DAG is a longest chain of dependent tasks; its length bounds the parallel execution time.
The scheduling policy assigns higher priority to the tasks on the critical path, in particular to those with the highest number of outgoing edges, because finishing them releases the most waiting tasks.
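The mechanism can be illustrated with a toy scheduler: each task counts its unfinished predecessors and enters a ready queue when that count reaches zero. The sketch below is sequential and purely illustrative (PLASMA's runtime instead hands ready tasks to worker threads); the four-task chain mimics one step of the tiled Cholesky factorization.

```c
#include <stdio.h>

#define NTASKS 4

/* each task records its unfinished predecessors and its successors */
typedef struct {
    const char *name;
    int ndeps;           /* unfinished predecessors */
    int succ[NTASKS];    /* indices of successor tasks */
    int nsucc;
} task_t;

int main(void)
{
    /* a small DAG: DPOTF2 -> DTRSM -> DGSMM -> DPOTF2 on the trailing tile */
    task_t t[NTASKS] = {
        {"DPOTF2(A11)",     0, {1}, 1},
        {"DTRSM(L11,A21)",  1, {2}, 1},
        {"DGSMM(L21,A22)",  1, {3}, 1},
        {"DPOTF2(A22)",     1, {0}, 0},
    };
    int ready[NTASKS], nready = 0;

    for (int i = 0; i < NTASKS; i++)          /* seed the ready queue */
        if (t[i].ndeps == 0) ready[nready++] = i;

    while (nready > 0) {
        int i = ready[--nready];              /* a worker thread would   */
        printf("executing %s\n", t[i].name);  /* run the kernel here     */
        for (int s = 0; s < t[i].nsucc; s++)  /* release the successors  */
            if (--t[t[i].succ[s]].ndeps == 0)
                ready[nready++] = t[i].succ[s];
    }
    return 0;
}
```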