Parallel Numerical Algorithms
Chapter 4 – Sparse Linear Systems
Section 4.1 – Direct Methods

Michael T. Heath and Edgar Solomonik
Department of Computer Science, University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Michael T. Heath and Edgar Solomonik, Parallel Numerical Algorithms
-
Outline
1 Sparse Matrices
2 Sparse Triangular Solve
3 Cholesky Factorization
4 Sparse Cholesky Factorization
-
Sparse Matrices
Matrix is sparse if most of its entries are zero
For efficiency, store and operate on only the nonzero entries, e.g., the product ajk · xk need not be computed if ajk = 0
But the more complicated data structures required incur extra overhead in storage and arithmetic operations
Matrix is "usefully" sparse if it contains enough zero entries to be worth taking advantage of them to reduce the storage and work required
In practice, grid discretizations often yield matrices with Θ(n) nonzero entries, i.e., a (small) constant number of nonzeros per row or column
-
Graph Model
Adjacency graph G(A) of symmetric n × n matrix A is the undirected graph having n vertices, with an edge between vertices i and j if aij ≠ 0
Number of edges in G(A) is the number of nonzeros in A
For a grid-based discretization, G(A) is the grid
Adjacency graph provides a visual representation of algorithms and highlights connections between numerical and combinatorial algorithms
For nonsymmetric A, G(A) would be directed
Often convenient to think of aij as the weight of edge (i, j)
-
Sparse Matrix Representations
Coordinate (COO) (naive) format – store each nonzero along with its row and column index
Compressed-sparse-row (CSR) format
    Store value and column index for each nonzero
    Store index of first nonzero for each row
Storage for CSR is less than COO, and CSR ordering is often convenient
CSC (compressed-sparse-column), blocked versions (e.g., CSB), and other storage formats are also used
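The two formats above can be illustrated with a minimal Python sketch (illustrative, not from the slides); `coo_to_csr` is this sketch's own helper name:

```python
# COO vs. CSR storage for the 3x3 matrix
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]

# COO: one (row, col, value) triple per nonzero
coo = [(0, 0, 4.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 2.0), (2, 2, 5.0)]

# CSR: values and column indices in row order, plus row pointers;
# row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i
val = [4.0, 1.0, 3.0, 2.0, 5.0]
col = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

def coo_to_csr(coo, n):
    """Convert a list of (row, col, value) triples to CSR arrays."""
    triples = sorted(coo)                 # order by row, then column
    val = [v for _, _, v in triples]
    col = [c for _, c, _ in triples]
    row_ptr = [0] * (n + 1)
    for r, _, _ in triples:
        row_ptr[r + 1] += 1               # count nonzeros per row
    for i in range(n):
        row_ptr[i + 1] += row_ptr[i]      # prefix sum -> row offsets
    return val, col, row_ptr

assert coo_to_csr(coo, 3) == (val, col, row_ptr)
```

Note that CSR needs n + 1 + 2m words for m nonzeros versus 3m for COO, which is the storage saving mentioned above.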
-
Sparse Matrix Distributions
Dense matrix mappings can be adapted to sparse matrices
1-D blocked mapping – store all nonzeros in n/p consecutive rows on each processor
1-D cyclic or randomized mapping – store all nonzeros in some subset of n/p rows on each processor
2-D blocked mapping – store all nonzeros in an (n/√p) × (n/√p) block of the matrix
1-D blocked mappings are best for exploiting locality in the graph, especially when there are Θ(n) nonzeros
Row ordering matters for all mappings: randomization and cyclicity yield load balance, blocking can yield locality
-
Sparse Matrix Vector Multiplication
Sparse matrix-vector multiplication (SpMV) is
    y = Ax
where A is sparse and x is dense
CSR-based matrix-vector product: for all i (in parallel) do
    yi = ∑j ai,c(j) xc(j) = ∑_{j=1}^{n} aij xj
where c(j) is the index of the jth nonzero in row i
For random 1-D or 2-D mapping, cost of vector communication is the same as in the corresponding dense case
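The CSR product above can be sketched directly in Python (illustrative; each outer-loop iteration is independent, which is the row-wise parallelism the slide refers to):

```python
def spmv_csr(val, col, row_ptr, x):
    """y = A x with A in CSR form: y[i] accumulates a[i,c(j)] * x[c(j)]
    over the nonzeros of row i. Rows are independent, so the outer loop
    is the natural unit of parallelism."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[j] * x[col[j]]
    return y

# Same 3x3 example matrix [[4,0,1],[0,3,0],[2,0,5]] in CSR form
val = [4.0, 1.0, 3.0, 2.0, 5.0]
col = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
assert spmv_csr(val, col, row_ptr, [1.0, 1.0, 1.0]) == [5.0, 3.0, 7.0]
```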
-
SpMV with 1-D Mapping
For 1-D blocking (each processor owns n/p rows), the number of elements of x needed by a processor is the number of columns with a nonzero in the rows it owns
In general, want to order rows to minimize the maximum number of vector elements needed on any processor
Graphically, we want to partition the graph into p subsets of n/p nodes, to minimize the maximum number of nodes to which any subset is connected, i.e., for G(A) = (V, E),
    V = V1 ∪ · · · ∪ Vp,  |Vi| = n/p
is selected to minimize
    max_i |{v : v ∈ V \ Vi, ∃w ∈ Vi, (v, w) ∈ E}|
-
Surface Area to Volume Ratio in SpMV
The number of external vertices the maximum partition is adjacent to depends on the expansion of the graph
Expansion can be interpreted as a measure of the surface-area-to-volume ratio of the subgraphs
For example, for a k × k × k grid, a subvolume of (k/p^{1/3}) × (k/p^{1/3}) × (k/p^{1/3}) has surface area Θ(k²/p^{2/3})
Communication for this case becomes a neighbor halo exchange on a 3-D processor mesh
Thus, finding the best 1-D partitioning for SpMV often corresponds to domain partitioning and depends on the physical geometry of the problem
-
Other Sparse Matrix Products
SpMV is of critical importance to many numerical methods, but suffers from a low flop-to-byte ratio and a potentially high communication bandwidth cost
In graph algorithms SpMSpV (x and y are sparse) is prevalent, which is even harder to perform efficiently (e.g., to minimize work, need a layout other than CSR, like CSC)
SpMM (x becomes a dense matrix X) provides a higher flop-to-byte ratio and is much easier to do efficiently
SpGEMM (SpMSpM, matrix multiplication where all matrices are sparse) arises in, e.g., algebraic multigrid and graph algorithms; efficiency is highly dependent on sparsity
-
Solving Triangular Sparse Linear Systems
Given sparse lower-triangular matrix L and vector b, solve
    Lx = b
all nonzeros of L must be in its lower-triangular part
Sequential algorithm: take xi = bi/lii, then update
    bj = bj − lji xi for all j ∈ {i + 1, . . . , n}
If L has m > n nonzeros, this requires Q1 ≈ 2m operations
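The sequential column-oriented algorithm above can be sketched in Python, assuming L is stored in CSC form with the diagonal entry stored first in each column (an assumption of this sketch, not stated on the slide):

```python
def lower_solve_csc(val, row, col_ptr, b):
    """Solve L x = b for sparse lower-triangular L in CSC form,
    following the slide: x[i] = b[i]/l[i,i], then b[j] -= l[j,i]*x[i]
    for the remaining nonzeros of column i. Assumes row indices are
    sorted within each column, so the diagonal entry comes first."""
    n = len(col_ptr) - 1
    b = list(b)                        # work on a copy of b
    x = [0.0] * n
    for i in range(n):
        p = col_ptr[i]
        x[i] = b[i] / val[p]           # diagonal entry l[i,i]
        for q in range(p + 1, col_ptr[i + 1]):
            b[row[q]] -= val[q] * x[i] # update later right-hand sides
    return x

# L = [[2,0,0],[1,3,0],[0,4,5]] in CSC form; L x = [2,4,14] has x = [1,1,2]
val = [2.0, 1.0, 3.0, 4.0, 5.0]
row = [0, 1, 1, 2, 2]
col_ptr = [0, 2, 4, 5]
assert lower_solve_csc(val, row, col_ptr, [2.0, 4.0, 14.0]) == [1.0, 1.0, 2.0]
```

The work is one multiply-add per off-diagonal nonzero plus one division per row, giving the ≈ 2m operation count above.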
-
Parallelism in Sparse Triangular Solve
We can adapt any dense parallel triangular solve algorithm to the sparse case
Again have fan-in (left-looking) and fan-out (right-looking) variants
Communication cost stays the same, computational cost decreases
In fact there may be additional sources of parallelism, e.g., if l21 = 0, we can solve for x1 and x2 concurrently
More generally, can concurrently prune leaves of the directed acyclic adjacency graph (DAG) G(A) = (V, E), where (i, j) ∈ E if lij ≠ 0
Depth of the algorithm corresponds to the diameter of this DAG
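The leaf-pruning idea can be made concrete as a level-set computation on this DAG; a minimal Python sketch (illustrative, not from the slides) where `L_struct[i]` is the set of columns k < i with lik ≠ 0, i.e., the edges into vertex i:

```python
from collections import deque

def solve_levels(L_struct, n):
    """Assign each unknown i a level: all unknowns in the same level
    have no dependences among them and can be solved concurrently.
    The number of levels is the depth of the triangular solve."""
    indeg = [len(L_struct[i]) for i in range(n)]
    succ = [[] for _ in range(n)]
    for i in range(n):
        for k in L_struct[i]:
            succ[k].append(i)
    level = [0] * n
    q = deque(i for i in range(n) if indeg[i] == 0)  # current DAG leaves
    while q:
        k = q.popleft()
        for i in succ[k]:
            level[i] = max(level[i], level[k] + 1)
            indeg[i] -= 1
            if indeg[i] == 0:          # all predecessors solved: prune leaf
                q.append(i)
    return level

# l21 = 0: x1 and x2 share level 0 and can be solved concurrently
assert solve_levels([set(), set(), {0, 1}], 3) == [0, 0, 1]
```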
-
Parallel Algorithm for Sparse Triangular Solve
Partition: associate fine-grain tasks with each (i, j) such that lij ≠ 0
Communicate: task (i, i) communicates with tasks (j, i) and (i, j) for all possible j
Agglomerate: form coarse-grain tasks for each column of L, i.e., do 1-D agglomeration, combining fine-grain tasks (⋆, i) into agglomerated task i
Map: assign coarse-grain tasks (columns of L) to processors with blocking (for locality) and/or cyclicity (for load balance and concurrency)
-
Analysis of 1-D Parallel Sparse Triangular Solve
Cost of the 1-D algorithm will clearly be less than the corresponding algorithm for the dense case
Load balance depends on the distribution of nonzeros; cyclicity can help distribute dense blocks
Naive algorithm with 1-D column blocking exploits concurrency only in fan-out updates
Communication bandwidth cost depends on the surface-to-volume ratio of each subset of vertices associated with a block of columns
Higher concurrency and better performance possible with dynamic/adaptive algorithms
-
Cholesky Factorization
Symmetric positive definite matrix A has Cholesky factorization
    A = L Lᵀ
where L is a lower triangular matrix with positive diagonal entries
Linear system
    Ax = b
can then be solved by forward-substitution in the lower triangular system Ly = b, followed by back-substitution in the upper triangular system Lᵀx = y
-
Computing Cholesky Factorization
Algorithm for computing the Cholesky factorization can be derived by equating corresponding entries of A and L Lᵀ and generating them in the correct order
For example, in the 2 × 2 case

    [ a11 a21 ]   [ ℓ11   0  ] [ ℓ11 ℓ21 ]
    [ a21 a22 ] = [ ℓ21  ℓ22 ] [  0  ℓ22 ]

so we have

    ℓ11 = √a11,  ℓ21 = a21/ℓ11,  ℓ22 = √(a22 − ℓ21²)
-
Cholesky Factorization Algorithm
for k = 1 to n
    akk = √akk
    for i = k + 1 to n
        aik = aik/akk
    end
    for j = k + 1 to n
        for i = j to n
            aij = aij − aik ajk
        end
    end
end
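As a concrete (illustrative) rendering of this pseudocode, a dense in-place Cholesky in Python:

```python
import math

def cholesky_in_place(a):
    """Dense Cholesky with the slide's loop structure: A is overwritten
    with L in its lower triangle; the upper triangle is never touched."""
    n = len(a)
    for k in range(n):
        a[k][k] = math.sqrt(a[k][k])         # scale diagonal
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]               # scale column k
        for j in range(k + 1, n):
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k] # rank-1 update of submatrix
    return a

# A = L L^T with L = [[2, 0], [1, 3]], so A = [[4, 2], [2, 10]]
A = [[4.0, 2.0], [2.0, 10.0]]
cholesky_in_place(A)
assert A[0][0] == 2.0 and A[1][0] == 1.0 and A[1][1] == 3.0
```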
-
Cholesky Factorization Algorithm
All n square roots are of positive numbers, so the algorithm is well defined
Only the lower triangle of A is accessed, so the strict upper triangular portion need not be stored
Factor L is computed in place, overwriting the lower triangle of A
Pivoting is not required for numerical stability
About n³/6 multiplications and a similar number of additions are required (about half as many as for LU)
-
Parallel Algorithm
Partition
For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and computes and stores
    ℓij, if i ≥ j
    ℓji, if i < j
yielding a 2-D array of n² fine-grain tasks
Zero entries in the upper triangle of L need not be computed or stored, so for convenience in using a 2-D mesh network, ℓij can be redundantly computed as both task (i, j) and task (j, i) for i > j
-
Fine-Grain Tasks and Communication
[Figure: 6 × 6 array of fine-grain tasks; task (i, j) holds aij and ℓij, with ℓij held redundantly by tasks (i, j) and (j, i) for i > j, and broadcasts along task rows and columns.]
-
Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) − 1
    recv broadcast of akj from task (k, j)
    recv broadcast of aik from task (i, k)
    aij = aij − aik akj
end
if i = j then
    aii = √aii
    broadcast aii to tasks (k, i) and (i, k), k = i + 1, . . . , n
else if i < j then
    recv broadcast of aii from task (i, i)
    aij = aij/aii
    broadcast aij to tasks (k, j), k = i + 1, . . . , n
else
    recv broadcast of ajj from task (j, j)
    aij = aij/ajj
    broadcast aij to tasks (i, k), k = j + 1, . . . , n
end
-
Agglomeration Schemes
Agglomerate
Agglomeration of fine-grain tasks produces
    2-D
    1-D column
    1-D row
parallel algorithms analogous to those for LU factorization, with similar performance and scalability
-
Loop Orderings for Cholesky
Each choice of i, j, or k index in the outer loop yields a different Cholesky algorithm, named for the portion of the matrix updated by the basic operation in the inner loops
Submatrix-Cholesky (fan-out): with k in the outer loop, inner loops perform rank-1 update of the remaining unreduced submatrix using the current column
Column-Cholesky (fan-in): with j in the outer loop, inner loops compute the current column using a matrix-vector product that accumulates the effects of previous columns
Row-Cholesky (fan-in): with i in the outer loop, inner loops compute the current row by solving a triangular system involving previous rows
-
Memory Access Patterns
[Figure: read-only vs. read-and-write regions of the matrix for Submatrix-Cholesky, Column-Cholesky, and Row-Cholesky.]
-
Column-Oriented Cholesky Algorithms
Submatrix-Cholesky

for k = 1 to n
    akk = √akk
    for i = k + 1 to n
        aik = aik/akk
    end
    for j = k + 1 to n
        for i = j to n
            aij = aij − aik ajk
        end
    end
end

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        for i = j to n
            aij = aij − aik ajk
        end
    end
    ajj = √ajj
    for i = j + 1 to n
        aij = aij/ajj
    end
end
-
Column Operations
Column-oriented algorithms can be stated more compactly by introducing column operations

cdiv( j ): column j is divided by the square root of its diagonal entry

    ajj = √ajj
    for i = j + 1 to n
        aij = aij/ajj
    end

cmod( j, k ): column j is modified by a multiple of column k, with k < j

    for i = j to n
        aij = aij − aik ajk
    end
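The two column operations, and the left-looking algorithm built from them, can be sketched as (a minimal dense Python rendering, illustrative only):

```python
import math

def cdiv(a, j):
    """Divide column j by the square root of its diagonal entry."""
    a[j][j] = math.sqrt(a[j][j])
    for i in range(j + 1, len(a)):
        a[i][j] /= a[j][j]

def cmod(a, j, k):
    """Modify column j by a multiple of column k (k < j)."""
    for i in range(j, len(a)):
        a[i][j] -= a[i][k] * a[j][k]

def column_cholesky(a):
    """Left-looking Column-Cholesky: accumulate all prior-column
    updates into column j, then complete it with cdiv."""
    for j in range(len(a)):
        for k in range(j):
            cmod(a, j, k)
        cdiv(a, j)
    return a

A = [[4.0, 2.0], [2.0, 10.0]]
column_cholesky(A)
# the lower triangle of A now holds L = [[2, 0], [1, 3]]
```

Submatrix-Cholesky reorders the same cdiv/cmod calls: cdiv(k) first, then cmod(j, k) for all later columns j.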
-
Column-Oriented Cholesky Algorithms
Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j = k + 1 to n
        cmod( j, k)
    end
end

right-looking, immediate-update, data-driven, fan-out

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        cmod( j, k)
    end
    cdiv( j)
end

left-looking, delayed-update, demand-driven, fan-in
-
Data Dependences
[Figure: cdiv(k) depends on cmod(k, 1), cmod(k, 2), . . . , cmod(k, k − 1) below it, and enables cmod(k + 1, k), cmod(k + 2, k), . . . , cmod(n, k) above it.]
-
Data Dependences
cmod(k, ⋆) operations along the bottom can be done in any order, but they all have the same target column, so updating must be coordinated to preserve data integrity
cmod(⋆, k) operations along the top can be done in any order, and they all have different target columns, so updating can be done simultaneously
Performing cmods concurrently is the most important source of parallelism in column-oriented factorization algorithms
For a dense matrix, each cdiv(k) depends on the immediately preceding column, so cdivs must be done sequentially
-
Sparsity Structure
For sparse matrix M, let Mi∗ denote its ith row and M∗j its jth column
Define Struct(Mi∗) = {k < i | mik ≠ 0}, the nonzero structure of row i of the strict lower triangle of M
Define Struct(M∗j) = {k > j | mkj ≠ 0}, the nonzero structure of column j of the strict lower triangle of M
-
Sparse Cholesky Algorithms
Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j ∈ Struct(L∗k)
        cmod( j, k)
    end
end

right-looking, immediate-update, data-driven, fan-out

Column-Cholesky

for j = 1 to n
    for k ∈ Struct(Lj∗)
        cmod( j, k)
    end
    cdiv( j)
end

left-looking, delayed-update, demand-driven, fan-in
-
Graph Model
Recall that the adjacency graph G(A) of symmetric n × n matrix A is the undirected graph with an edge between vertices i and j if aij ≠ 0
At each step of the Cholesky factorization algorithm, the corresponding vertex is eliminated from the graph
Neighbors of the eliminated vertex in the previous graph become a clique (fully connected subgraph) in the modified graph
Entries of A that were initially zero may become nonzero entries, called fill
-
Example: Graph Model of Elimination
[Figure: nonzero patterns of a 9 × 9 matrix A and its factor L (fill entries marked +), with the sequence of elimination graphs as vertices 1 through 9 are eliminated in turn.]
-
Elimination Tree
parent( j) is the row index of the first offdiagonal nonzero in column j of L, if any, and j otherwise
Elimination tree T(A) is the graph having n vertices, with an edge between vertices i and j, for i > j, if i = parent( j)
If the matrix is irreducible, then the elimination tree is a single tree with root at vertex n; otherwise, it is more accurately termed an elimination forest
T(A) is a spanning tree for the filled graph F(A), which is G(A) with all fill edges added
Each column of the Cholesky factor L depends only on its descendants in the elimination tree
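Given the structure of L, the parent function above is immediate to compute; a minimal Python sketch (in practice the elimination tree is computed directly from A's structure with path compression, which this sketch does not attempt):

```python
def elimination_tree(L_struct, n):
    """parent[j] is the row index of the first off-diagonal nonzero in
    column j of L, or j itself for a root. L_struct[j] is the set of
    row indices i > j with l[i,j] != 0, i.e., Struct(L_*j)."""
    parent = list(range(n))
    for j in range(n):
        if L_struct[j]:
            parent[j] = min(L_struct[j])   # first subdiagonal nonzero
    return parent

# Bidiagonal L (1-D chain): the tree is a single path rooted at n-1
assert elimination_tree([{1}, {2}, {3}, set()], 4) == [1, 2, 3, 3]
```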
-
Example: Elimination Tree
[Figure: nonzero patterns of A and L (fill marked +) for the 9 × 9 example, with the corresponding graph G(A), filled graph F(A), and elimination tree T(A).]
-
Effect of Matrix Ordering
Amount of fill depends on the order in which variables are eliminated
Example: "arrow" matrix — if the first row and column are dense, then the factor fills in completely, but if the last row and column are dense, then they cause no fill

[Figure: nonzero patterns of the two arrow matrices and their Cholesky factors.]
-
Ordering Heuristics
General problem of finding an ordering that minimizes fill is NP-complete, but there are relatively cheap heuristics that limit fill effectively
Bandwidth or profile reduction: reduce distance of nonzero diagonals from the main diagonal (e.g., RCM)
Minimum degree: eliminate the node having fewest neighbors first
Nested dissection: recursively split the graph into pieces using a vertex separator, numbering separator vertices last
-
Symbolic Factorization
For symmetric positive definite (SPD) matrices, ordering can be determined in advance of numeric factorization
Only locations of nonzeros matter, not their numerical values, since pivoting is not required for numerical stability
Once an ordering is selected, locations of all fill entries in L can be anticipated, and an efficient static data structure can be set up to accommodate them prior to numeric factorization
Structure of column j of L is given by the union of the structures of the lower triangular portion of column j of A and of the prior columns of L whose first nonzero below the diagonal is in row j
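That column-structure rule can be sketched directly; `A_struct[j]` below denotes the strict lower-triangular structure of column j of A, and the function name is this sketch's own (real symbolic factorization codes use more refined data structures):

```python
def symbolic_factorization(A_struct, n):
    """Structure of each column j of L: union of the lower-triangular
    structure of A's column j with the structures of earlier columns k
    whose first subdiagonal nonzero (parent in the elimination tree)
    is row j. A_struct[j] = {i > j : a[i,j] != 0}."""
    L_struct = [set() for _ in range(n)]
    children = [[] for _ in range(n)]
    for j in range(n):
        s = set(A_struct[j])                # lower part of A's column j
        for k in children[j]:
            s |= L_struct[k] - {j}          # merge child columns of L
        L_struct[j] = s
        if s:
            children[min(s)].append(j)      # parent(j) = first nonzero row
    return L_struct

# Arrow matrix with dense FIRST row/column: L fills in completely
assert symbolic_factorization([{1, 2, 3}, set(), set(), set()], 4) == \
    [{1, 2, 3}, {2, 3}, {3}, set()]
# Arrow matrix with dense LAST row/column: no fill
assert symbolic_factorization([{3}, {3}, {3}, set()], 4) == \
    [{3}, {3}, {3}, set()]
```

The two asserts reproduce the "arrow" matrix example from the ordering slide.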
-
Solving Sparse SPD Systems
Basic steps in solving sparse SPD systems by Cholesky factorization
1 Ordering: Symmetrically reorder rows and columns of the matrix so the Cholesky factor suffers relatively little fill
2 Symbolic factorization: Determine locations of all fill entries and allocate data structures in advance to accommodate them
3 Numeric factorization: Compute numeric values of entries of the Cholesky factor
4 Triangular solve: Compute solution by forward- and back-substitution
-
Parallel Sparse Cholesky
In sparse submatrix- or column-Cholesky, if ajk = 0, then cmod( j, k) is omitted
Sparse factorization thus has an additional source of parallelism, since "missing" cmods may permit multiple cdivs to be done simultaneously
Elimination tree shows data dependences among columns of the Cholesky factor L, and hence identifies potential parallelism
At any point in the factorization process, all factor columns corresponding to leaves in the elimination tree can be computed simultaneously
-
Parallel Sparse Cholesky
Height of the elimination tree determines the longest serial path through the computation, and hence parallel execution time
Width of the elimination tree determines the degree of parallelism available
Short, wide, well-balanced elimination tree is desirable for parallel factorization
Structure of the elimination tree depends on the ordering of the matrix
So the ordering should be chosen both to preserve sparsity and to enhance parallelism
-
Levels of Parallelism in Sparse Cholesky
Fine-grain
    Task is one multiply-add pair
    Available in either dense or sparse case
    Difficult to exploit effectively in practice
Medium-grain
    Task is one cmod or cdiv
    Available in either dense or sparse case
    Accounts for most of speedup in dense case
Large-grain
    Task computes entire set of columns in subtree of elimination tree
    Available only in sparse case
-
Example: Band Ordering, 1-D Grid
[Figure: band ordering of a 7-point 1-D grid: nonzero patterns of A and L, graph G(A), and elimination tree T(A), which is a single chain.]
-
Example: Minimum Degree, 1-D Grid
[Figure: minimum degree ordering of the 7-point 1-D grid: nonzero patterns of A and L, graph G(A), and a shorter, branched elimination tree T(A).]
-
Example: Nested Dissection, 1-D Grid
[Figure: nested dissection ordering of the 7-point 1-D grid: nonzero patterns of A and L (fill marked +), graph G(A), and a balanced elimination tree T(A).]
-
Example: Band Ordering, 2-D Grid
[Figure: band ordering of a 3 × 3 2-D grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]
-
Example: Minimum Degree, 2-D Grid
[Figure: minimum degree ordering of the 3 × 3 2-D grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]
-
Example: Nested Dissection, 2-D Grid
[Figure: nested dissection ordering of the 3 × 3 2-D grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]
-
Mapping
Cyclic mapping of columns to processors works well for dense problems, because it balances load and communication is global anyway
To exploit locality in communication for sparse factorization, a better approach is to map the columns in a subtree of the elimination tree onto a local subset of processors
Still use cyclic mapping within dense submatrices ("supernodes")
-
Example: Subtree Mapping
[Figure: subtree-to-subset mapping of an elimination tree onto processors 0–3: each leaf subtree is assigned to a single processor, and columns near the root are shared cyclically among the processors.]
-
Fan-Out Sparse Cholesky

for j ∈ mycols
    if j is leaf node in T(A) then
        cdiv( j)
        send L∗j to processes in map(Struct(L∗j))
        mycols = mycols − { j }
    end
end
while mycols ≠ ∅
    receive any column of L, say L∗k
    for j ∈ mycols ∩ Struct(L∗k)
        cmod( j, k)
        if column j requires no more cmods then
            cdiv( j)
            send L∗j to processes in map(Struct(L∗j))
            mycols = mycols − { j }
        end
    end
end
-
Fan-In Sparse Cholesky

for j = 1 to n
    if j ∈ mycols or mycols ∩ Struct(Lj∗) ≠ ∅ then
        u = 0
        for k ∈ mycols ∩ Struct(Lj∗)
            u = u + ℓjk L∗k
        end
        if j ∈ mycols then
            incorporate u into factor column j
            while any aggregated update column for column j remains,
                receive one and incorporate it into factor column j
            end
            cdiv( j)
        else
            send u to process map( j)
        end
    end
end
-
Multifrontal Sparse Cholesky
Multifrontal algorithm operates recursively, starting from the root of the elimination tree for A
Dense frontal matrix Fj is initialized to have the nonzero entries from the corresponding row and column of A as its first row and column, and zeros elsewhere
Fj is then updated by extend_add operations with the update matrices from its children in the elimination tree
extend_add operation, denoted by ⊕, merges matrices by taking the union of their subscript sets and summing entries for any common subscripts
After updating of Fj is complete, its partial Cholesky factorization is computed, producing the corresponding row and column of L as well as update matrix Uj
-
Example: extend_add
[ a11 a13 a15 a18 ]   [ b11 b12 b15 b17 ]
[ a31 a33 a35 a38 ] ⊕ [ b21 b22 b25 b27 ]
[ a51 a53 a55 a58 ]   [ b51 b52 b55 b57 ]
[ a81 a83 a85 a88 ]   [ b71 b72 b75 b77 ]

    =

[ a11+b11  b12  a13  a15+b15  b17  a18 ]
[   b21    b22   0     b25    b27   0  ]
[   a31     0   a33    a35     0   a38 ]
[ a51+b51  b52  a53  a55+b55  b57  a58 ]
[   b71    b72   0     b75    b77   0  ]
[   a81     0   a83    a85     0   a88 ]
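A minimal Python sketch of extend_add on dense matrices indexed by global subscript lists (the function and argument names are this sketch's own):

```python
def extend_add(idx_a, Fa, idx_b, Fb):
    """Merge two dense frontal/update matrices: take the union of their
    global subscript sets and sum entries with common subscripts."""
    idx = sorted(set(idx_a) | set(idx_b))        # union of subscript sets
    pos = {g: i for i, g in enumerate(idx)}      # global index -> position
    F = [[0.0] * len(idx) for _ in idx]
    for idx_m, M in ((idx_a, Fa), (idx_b, Fb)):
        for r, gr in enumerate(idx_m):
            for c, gc in enumerate(idx_m):
                F[pos[gr]][pos[gc]] += M[r][c]   # scatter-add entry
    return idx, F

# Subscripts {1,3} merged with {1,2}: entries at (1,1) are summed
idx, F = extend_add([1, 3], [[1.0, 2.0], [3.0, 4.0]],
                    [1, 2], [[10.0, 20.0], [30.0, 40.0]])
assert idx == [1, 2, 3]
assert F == [[11.0, 20.0, 2.0], [30.0, 40.0, 0.0], [3.0, 0.0, 4.0]]
```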
-
Multifrontal Sparse Cholesky

Factor( j):
    Let {i1, . . . , ir} = Struct(L∗j)
    Let
        Fj = [ aj,j   aj,i1 . . . aj,ir ]
             [ ai1,j    0   . . .   0   ]
             [  ...    ...  . . .  ...  ]
             [ air,j    0   . . .   0   ]
    for each child i of j in elimination tree
        Factor(i)
        Fj = Fj ⊕ Ui
    end
    Perform one step of dense Cholesky:
        Fj = [ ℓj,j      0 ] [ 1  0  ] [ ℓj,j  ℓi1,j . . . ℓir,j ]
             [ ℓi1,j       ] [ 0  Uj ] [  0          I          ]
             [  ...      I ]
             [ ℓir,j       ]
-
Advantages of Multifrontal Method
Most arithmetic operations performed on dense matrices, which
reduces indexing overhead and indirect addressing
Can take advantage of loop unrolling, vectorization, and
optimized BLAS to run at near peak speed on many types of
processors
Data locality good for memory hierarchies, such as cache,
virtual memory with paging, or explicit out-of-core solvers
Naturally adaptable to parallel implementation by processing
multiple independent fronts simultaneously on different
processors
Parallelism can also be exploited in dense matrix computations
within each front
-
Summary for Parallel Sparse Cholesky
Principal ingredients in efficient parallel algorithm for sparse
Cholesky factorization
Reordering matrix to obtain relatively short and well balanced
elimination tree while also limiting fill
Multifrontal or supernodal approach to exploit dense
subproblems effectively
Subtree mapping to localize communication
Cyclic mapping of dense subproblems to achieve good load
balance
2-D algorithm for dense subproblems to enhance scalability
-
Scalability of Sparse Cholesky
Performance and scalability of sparse Cholesky depend on
sparsity structure of particular matrix
Sparse factorization with nested dissection requires
factorization of dense matrix of dimension Θ(√n) for 2-D
grid problem with n grid points (√n is the size of the root
vertex separator), for which unconditional weak scalability is
possible
However, efficiency often deteriorates as a result of the rest of
the sparse factorization taking more time
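As a rough check of the claim above (a standard back-of-the-envelope estimate, not stated on the slide): the dense root separator has Θ(√n) vertices, so factoring it costs
\[
\Theta\big((\sqrt{n}\,)^{3}\big) = \Theta\big(n^{3/2}\big)
\]
flops, which matches the Θ(n^{3/2}) total operation count for 2-D nested dissection; the dense root factorization thus accounts for a constant fraction of the work and governs achievable scalability.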
-
References – Dense Cholesky
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz,
Communication-optimal parallel and sequential Cholesky
decomposition, SIAM J. Sci. Comput. 32:3495-3523, 2010
J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel
numerical linear algebra, Acta Numerica 2:111-197, 1993
D. O'Leary and G. W. Stewart, Data-flow algorithms for parallel
matrix computations, Comm. ACM 28:840-853, 1985
D. O'Leary and G. W. Stewart, Assignment and scheduling in
parallel matrix factorization, Linear Algebra Appl. 77:275-299,
1986
-
References – Sparse Cholesky
M. T. Heath, Parallel direct methods for sparse linear systems,
D. E. Keyes, A. Sameh, and V. Venkatakrishnan, eds., Parallel
Numerical Algorithms, pp. 55-90, Kluwer, 1997
M. T. Heath, E. Ng, and B. W. Peyton, Parallel algorithms for
sparse linear systems, SIAM Review 33:420-460, 1991
J. Liu, Computational models and task scheduling for parallel
sparse Cholesky factorization, Parallel Computing 3:327-342,
1986
J. Liu, Reordering sparse matrices for parallel elimination,
Parallel Computing 11:73-91, 1989
J. Liu, The role of elimination trees in sparse factorization,
SIAM J. Matrix Anal. Appl. 11:134-172, 1990
-
References – Multifrontal Methods
I. S. Duff, Parallel implementation of multifrontal schemes,
Parallel Computing 3:193-204, 1986
A. Gupta, Parallel sparse direct methods: a short tutorial, IBM
Research Report RC 25076, November 2010
J. Liu, The multifrontal method for sparse matrix solution:
theory and practice, SIAM Review 34:82-109, 1992
J. A. Scott, Parallel frontal solvers for large sparse linear
systems, ACM Trans. Math. Software 29:395-417, 2003
-
References – Scalability
A. George, J. Liu, and E. Ng, Communication results for parallel
sparse Cholesky factorization on a hypercube, Parallel
Computing 10:287-298, 1989
A. Gupta, G. Karypis, and V. Kumar, Highly scalable parallel
algorithms for sparse matrix factorization, IEEE Trans. Parallel
Distrib. Systems 8:502-520, 1997
T. Rauber, G. Rünger, and C. Scholtes, Scalability of sparse
Cholesky factorization, Internat. J. High Speed Computing
10:19-52, 1999
R. Schreiber, Scalability of sparse direct solvers, A. George,
J. R. Gilbert, and J. Liu, eds., Graph Theory and Sparse Matrix
Computation, pp. 191-209, Springer-Verlag, 1993
-
References – Nonsymmetric Sparse Systems
I. S. Duff and J. A. Scott, A parallel direct solver for large
sparse highly unsymmetric linear systems, ACM Trans. Math.
Software 30:95-117, 2004
A. Gupta, A shared- and distributed-memory parallel general
sparse direct solver, Appl. Algebra Engrg. Commun. Comput.
18(3):263-277, 2007
X. S. Li and J. W. Demmel, SuperLU_DIST: A scalable
distributed-memory sparse direct solver for unsymmetric
linear systems, ACM Trans. Math. Software 29:110-140, 2003
K. Shen, T. Yang, and X. Jiao, S+: Efficient 2D sparse LU
factorization on parallel machines, SIAM J. Matrix Anal. Appl.
22:282-305, 2000
1 Sparse Matrices: Sparse Matrix Definitions; Sparse Matrix
Products
2 Sparse Triangular Solve: Sequential Sparse Triangular Solve;
Parallel Sparse Triangular Solve
3 Cholesky Factorization: Cholesky Factorization; Computing
Cholesky; Cholesky Algorithm; Parallel Algorithm
4 Sparse Cholesky Factorization: Sparse Elimination; Matrix
Orderings; Parallel Algorithms