Introduction into parallel computations
Miroslav Tůma
Institute of Computer Science
Academy of Sciences of the Czech Republic
and Technical University in Liberec
Presentation supported by the project
“Information Society” of the Academy of Sciences of the Czech Republic
under No. 1ET400300415
MFF UK, February, 2006
Pre-introduction

Preliminaries
General knowledge of the involved basic algorithms of NLA
Simple ideas from direct and iterative solvers for large sparse linear systems
Complexities of algorithms

Not covered
Vectorization of basic linear algebra algorithms
Parallelization of combinatorial algorithms
FFT, parallel FFT, vectorized FFT
Multigrid, multilevel algorithms
Tools like PETSc etc.
Eigenvalue problems
Outline

Part I. A basic sketch on parallel processing
1. Why to use parallel computers
2. Classification (a very brief sketch)
3. Some terminology; basic relations
4. Parallelism for us
5. Uniprocessor model
6. Vector processor model
7. Multiprocessor model

Part II. Parallel processing and numerical computations
8. Basic parallel operations
9. Parallel solvers of linear algebraic systems
10. Approximate inverse preconditioners
11. Polynomial preconditioners
12. Element-by-element preconditioners
13. Vector / parallel preconditioners
14. Solving nonlinear systems
1. Why to use parallel computers?

It might seem that
always better technologies
computers are still faster: Moore's law

"The number of transistors per square inch on integrated circuits doubles every year since the integrated circuit was invented." The observation was made in 1965 by Gordon Moore, co-founder of Intel (G.E. Moore, Electronics, April 1965).
finite signal speed (speed of light: 300000 km/s)
implies limits on the cycle time (clock rate), in MHz or ns:
100 MHz ↔ 10 ns
cycle time of 1 ns ⇒ a signal travels 30 cm per cycle
Cray-1 (1976): 80 MHz
in any case: the size of atoms and quantum effects seem to be the ultimate limits
1. Why to use parallel computers? IV.

Further motivation: important and very time-consuming problems to be solved
reentry into the terrestrial atmosphere ⇒ Boltzmann equations
combustion ⇒ large ODE systems
deformations, crash-tests ⇒ large systems of nonlinear equations
turbulent flows ⇒ large systems of PDEs in 3D

⇓ acceleration of computations is still needed
⇓ parallel processing
1. Why to use parallel computers? V.

High-speed computing seems to be cost efficient
"The power of computer systems increases as the square of their cost" (Grosch's law; H.A. Grosch, High speed arithmetic: The digital computer as a research tool, J. Opt. Soc. Amer. 43 (1953); H.A. Grosch, Grosch's law revisited, Computerworld 8 (1975), p. 24)
2. Classification: a very brief sketch

a) How deep can we go: levels of parallelism

running jobs in parallel for reliability: IBM AN/FSQ-31 (1958) – a purely duplex machine (time for operations 2.5 µs – 63.5 µs; a computer connected with the history of the word byte)
running parts of jobs on independent specialized units: UNIVAC LARC (1960) – the first I/O processor
running jobs in parallel for speed: Burroughs D-825 (1962) – more modules, a job scheduler
running parts of programs in parallel: Bendix G-21 (1963), CDC 6600 (1964) – a nonsymmetric multiprocessor
2. Classification: a very brief sketch II.

a) How deep can we go: levels of parallelism (continued)

running matrix-intensive stuff separately: development of IBM 704x/709x (1963), ASC TI (1965)
parallelizing instructions: IBM 709 (1957), IBM 7094 (1963)
data synchronizer units (DSU → channels) – enable simultaneous read/write/compute
3. Some terminology; basic relations: VI.

Amdahl's law expresses natural surprise at the following fact:
if a process performs part of the work quickly and part of the work slowly, then the overall performance (speedup, efficiency) is strongly limited by the part performed slowly.

Notation:
f: fraction of the slow (sequential) part
(1 − f): the rest (parallelized, vectorized)
t: overall time

Then:

S = (f·t + (1 − f)·t) / (f·t + (1 − f)·(t/p)) ≤ 1/f

E.g.: f = 1/10 ⇒ S ≤ 10

[Figure: a bar of work split into the sequential fraction f and the parallel fraction 1 − f.]
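A minimal Python sketch of the bound above (the function name is illustrative; t cancels out of the formula):

    def amdahl_speedup(f, p):
        # S = (f*t + (1-f)*t) / (f*t + (1-f)*t/p)
        return 1.0 / (f + (1.0 - f) / p)

    for p in (2, 8, 64, 1024):
        print(p, amdahl_speedup(0.1, p))   # tends to 1/f = 10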
3. Some terminology; basic relations: VII.

Amdahl's law (continued)
Described in: Gene Amdahl, Interpretation of Amdahl's theorem, advertisement of IBM, 1967.
Gene Myron Amdahl (1922–) worked on the IBM 704/709, the IBM/360 series, Amdahl V470 (1975).

Amdahl's law relevancy
Only a simple approximation of computer processing: the dependence f(n) is not considered
Fully applies when there are absolute constraints on the solution time (weather prediction, financial transactions)
An algorithm is effectively parallel if f → 0 for n → ∞.

Speedup / efficiency anomalies:
More processors may have more memory/cache
Increasing chances to find a lucky solution in parallel combinatorial algorithms
3. Some terminology; basic relations: VIII.

Scalability
A program is scalable if larger efficiency comes with a larger number of processors or a longer pipeline.

Isoefficiency function: size = K·T_o(size, p) such that E is constant, K = E/(1 − E)
Adding n numbers on p processors: size = Θ(p log p).
3. Some terminology; basic relations: IX.

Load balancing

Techniques to minimize T_p on multiprocessors by approximately equalizing the tasks of the individual processors.

static load balancing
array distribution schemes (block, cyclic, block-cyclic, randomized block)
graph partitioning
hierarchical mappings

dynamic load balancing
centralized schemes
distributed schemes
Will be discussed later.
3. Some terminology; basic relations: IX.

Semaphores
Signals operated by individual processes, not by a central control.
A feature of shared memory computers.
Introduced by Dijkstra.

Message passing
A mechanism to transfer data from one process to another.
A feature of distributed memory computers.
Blocking × non-blocking communication.
4. Parallelism for us

Mathematician's point of view
We need to convert algorithms into state-of-the-art codes:
algorithms → codes → computers

[Diagram with the notions: Algorithm, Idealized computer, Implementation/Code, Computer.]

What is the idealized computer?
4. Parallelism for us

Idealized computer:
an idealized vector processor
an idealized uniprocessor
idealized computers with more processors
5. Uniprocessor model

[Diagram: CPU connected to Memory and I/O.]
5. Uniprocessor model: II.

Example: model and reality
Even a simple Pentium III has on-chip:
a pipeline (at least 11 stages for each instruction)
data parallelism (SIMD type) like MMX (64-bit) and SSE (128-bit)
instruction-level parallelism (up to 3 instructions)
more threads at the system level, based on bus communication
5. Uniprocessor model: III.

How to ...?: pipelined superscalar CPU: not for us
(pipelines; the ability to issue more instructions at the same time)

detecting true data dependencies: dependencies in the processing order
detecting resource dependencies: competition of data for computational resources
— reordering instructions; most microprocessors enable out-of-order scheduling
solving branch dependencies
— speculative scheduling across branches; typically every 5th-6th instruction is a branch
VLIW – compile-time scheduling
5. Uniprocessor model: IV.

How to ...?: memory and its connection to the CPU
(should be considered by us)

1. memory latency — the delay between a memory request and data retrieval
2. memory bandwidth — the rate at which data can be transferred from/to memory
5. Uniprocessor model: V.

Memory latency and performance
Example: a 2 GHz processor, DRAM with 0.1 µs latency; two FMA units on the processor, 4-way superscalar (4 instructions per cycle, e.g., two adds and two multiplies)

cycle time: 0.5 ns
maximum processor rate: 8 GFLOPS
for every memory request: 0.1 µs of waiting
that is: 200 cycles wasted for each operation
dot product: two data fetches for each multiply-add (2 ops)
consequently: one op per fetch
resulting rate: 10 MFLOPS
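The slide's arithmetic, reproduced in a short Python check (variable names are illustrative):

    cycle = 0.5e-9                 # 2 GHz -> 0.5 ns cycle time
    peak = 4 / cycle               # 4 instructions per cycle -> 8e9 FLOPS
    latency = 0.1e-6               # DRAM latency: 0.1 us
    print(latency / cycle)         # 200 cycles wasted per request
    print(1 / latency)             # one op per fetch -> 1e7 = 10 MFLOPS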
5. Uniprocessor model: VI.

Hiding / improving memory latency (I.)

a) Using cache
The same example, with a cache of size 64 kB
it can store matrices A, B and C of dimension 50
matrix multiplication A·B = C
matrix fetch: 5000 words: 500 µs
ops: 2n³; time for ops: 2 × 64³ × 0.5 ns = 262 µs
total: 762 µs
resulting rate: 688 MFLOPS
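Again a small Python check of the numbers on the slide:

    fetch = 5000 * 0.1e-6              # 5000 words at 0.1 us each = 500 us
    ops = 2 * 64**3                    # the slide's 2n^3 operation count
    compute = ops * 0.5e-9             # ~262 us of arithmetic
    print(ops / (fetch + compute) / 1e6)   # ~688 MFLOPS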
5. Uniprocessor model: VII.

Hiding / improving memory latency (II.)

b) Using multithreading
(Thread: a sequence of instructions in a program which runs a certain

Additional communication (with respect to the uniprocessor's P-M link): P-P

store-and-forward routing via l links between two processors
– t_comm = t_s + l(m·t_w + t_h)
– t_s: transfer startup time (includes the startups of both nodes)
– m: message size
– t_h: node latency (header latency)
– t_w: time to transfer a word
– simplification: t_comm = t_s + l·m·t_w
typically: poor efficiency of communication
7. Multiprocessor model: III.

Communication (continued)

[Figure: a single message sent over the links, versus the message broken into two parts and pipelined over the links.]
7. Multiprocessor model: IV.

Communication (continued 2)

packet routing: routing r packets via l links between two processors
subsequent sends start after a part of the message (a packet) is received
– t_comm = t_s + t_h·l + t_w1·m + t_w2·(m/r)(r + s)
– t_s: transfer startup time (includes the startups of both nodes)
– t_w1: time for packetizing the message, t_w2: time to transfer a word, s: size of the packetizing info
– finally: t_comm = t_s + t_h·l + m·t_w
– stores overlapped by transfer

cut-through routing: message broken into flow control digits (fixed size units)
– t_comm = t_s + t_h·l + m·t_w
supported by most current parallel machines and local networks
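A Python sketch of the two cost models above (times in seconds, message size m in words; the sample values are illustrative only):

    def store_and_forward(ts, th, tw, m, l):
        return ts + l * (m * tw + th)

    def cut_through(ts, th, tw, m, l):
        return ts + th * l + m * tw

    # with many links l, cut-through pays the per-word cost only once
    print(store_and_forward(1e-5, 1e-6, 1e-8, 10000, 8))
    print(cut_through(1e-5, 1e-6, 1e-8, 10000, 8))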
7. Multiprocessor model: V.

Communication (shared memory issues)

avoid cache thrashing (degradation of performance due to insufficient caches); much more important on multiprocessor architectures ⇒ typical deterioration of performance when a code is transferred to a parallel computer
prefetching is more difficult to model
spatial locality is difficult to get and to model because of cache issues
cache sharing (data for different processors sharing the same cache lines)
remote access latencies (data for a processor updated in the cache of another processor)
7. Multiprocessor model: VI.

Optimizing communication

minimize the amount of transferred data: better algorithms
message aggregation, communication granularity, communication regularity: implementation
minimize the distance of data transfers: efficient routing, physical platform organization (not treated here, but tacitly used in some very general and realistic assumptions)
7. Multiprocessor model: VII.

Granularity of algorithms, implementation, computation

A rough classification by the size of the program sections executed without additional communication:
fine grain
medium grain
coarse grain
7. Multiprocessor model: VIII.

Fine grain example 1: pointwise Jacobi iteration

x⁺ = (I − D⁻¹A)x + D⁻¹b

where A is the block tridiagonal matrix A = tridiag(−I, B, −I) with B = tridiag(−1, 4, −1), and D = diag(A) = 4I.

7. Multiprocessor model: IX.

Fine grain example 1: pointwise Jacobi iteration (continued)

Blocks of size n/p, the matrix striped block-rowwise, the vectors x and y split into subvectors of length n/p.
Communication: all-to-all communication among the p processors (P_0 ... P_{p−1}): time t_s log p + t_w (n/p)(p − 1) ≈ t_s log p + t_w n (using a rather general assumption on the implementation of collective communications).
Multiplication: n²/p
Altogether: n²/p + t_s log p + t_w n parallel time; cost optimal for p = O(n) (asymptotically the same number of operations as in the sequential case).
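A sequential Python sketch of the pointwise Jacobi sweep for this model problem (the grid shape and function name are assumptions; a parallel version would split the rows into blocks of n/p):

    import numpy as np

    def jacobi_poisson(b, iters):
        # pointwise Jacobi x+ = D^{-1} b + (I - D^{-1} A) x for the 5-point
        # Laplacian above: B = tridiag(-1, 4, -1), off-diagonal blocks -I
        b = np.asarray(b, dtype=float)
        m = int(round(len(b) ** 0.5))       # n = m*m unknowns on an m x m grid
        X = np.zeros((m, m)); B = b.reshape(m, m)
        for _ in range(iters):
            Xn = B.copy()                   # gather b and the four neighbours
            Xn[1:, :] += X[:-1, :]; Xn[:-1, :] += X[1:, :]
            Xn[:, 1:] += X[:, :-1]; Xn[:, :-1] += X[:, 1:]
            X = Xn / 4.0                    # divide by the diagonal D = 4I
        return X.ravel()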
Operations, each of O(n) time complexity:
Communication of O(n) entries
Division of O(n) entries by a scalar
Elimination step on O(n) entries
Parallel time: O(n²); total time: O(n³).
Not the same constant in the asymptotic complexity as in the sequential case: some processors are idle.
8. Basic parallel operations: XII.

Gaussian elimination: further issues

2-D partitioning: Θ(n³) total time for n² processes.
Block 2-D partitioning: Θ(n³/p) total time for p processes.
2-D partitionings: generally more scalable (they allow efficient use of more processors)
Pivoting: changes the layout of the elimination
Partial pivoting: no problem in 1-D rowwise partitioning: O(n) search in each row
It might seem better with 1-D columnwise partitioning: O(log p) search. But this imposes strong restrictions on pipelining.
Weaker variants of pivoting (e.g., pairwise pivoting) may result in strong degradation of the numerical quality of the algorithm.
8. Basic parallel operations: XIII.

Solving triangular systems: back-substitution

Sequential back-substitution for U*x = y (one possible order of operations):

do k=n,1,-1          ! (backwards)
  x(k)=y(k)/U(k,k)
  do i=k-1,1,-1      ! (backwards)
    y(i)=y(i)-x(k)*U(i,k)
  end do
end do

Sequential complexity: n²/2 + O(n)
Rowwise block 1-D partitioning: constant communication, O(n/p) computation per step, Θ(n) steps: total time Θ(n²/p).
Block 2-D partitioning: Θ(n²√p) total time.
8. Basic parallel operations: XIV.

Solving linear recurrences: a case of the parallel prefix operation

Parallel prefix operation:
Get y_0 = x_0, y_1 = x_0 ♥ x_1, ..., y_i = x_0 ♥ x_1 ♥ ... ♥ x_i for an associative operation ♥.

[Figure: a binary tree combining the inputs 0..7 into the prefixes 0:1, 0:2, ..., 0:7 in logarithmically many rounds.]

8. Basic parallel operations: XIV.

Parallel prefix operation (continued)

Application to z_{i+1} = a_i z_i + b_i:
Get p_i = a_0 ··· a_i using the parallel prefix operation.
Compute β_i = b_i/p_i in parallel.
Compute s_i = β_0 + ... + β_{i−1} using the parallel prefix operation.
Compute z_i = s_i p_{i−1} in parallel.
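A Python sketch of both steps, simulated sequentially; each round of the scan is a fully parallel elementwise step. The recipe assumes z_0 = 0 (an assumption of this sketch; the slide's indexing is equivalent up to a shift):

    from operator import add, mul

    def prefix_scan(x, op=add):
        # inclusive scan: y_i = x_0 op x_1 op ... op x_i in ~log2(n) rounds
        y, n, d = list(x), len(x), 1
        while d < n:
            y = [op(y[i - d], y[i]) if i >= d else y[i] for i in range(n)]
            d *= 2
        return y

    def solve_recurrence(a, b):
        # z_{i+1} = a_i z_i + b_i with z_0 = 0, following the recipe above
        p = prefix_scan(a, mul)                     # p_i = a_0 * ... * a_i
        beta = [bi / pi for bi, pi in zip(b, p)]    # beta_i = b_i / p_i
        s = prefix_scan(beta)                       # s_i = beta_0 + ... + beta_i
        return [0.0] + [si * pi for si, pi in zip(s, p)]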
8. Basic parallel operations: XV.

Conclusions for basic parallel operations

Still far from contemporary scientific computing.
There are large dense matrices in practical problems, but a lot can be done by ready-made scientific software like ScaLAPACK.
Problems are:
Sparse: O(n) sequential steps may be too many. But contemporary sparse matrix software strongly relies on dense blocks connected by a general sparse structure.
Very often unstructured: operations with general graphs and specialized combinatorial routines should be implemented efficiently, so that they are generally efficient on a wide spectrum of computer architectures.
Not homogeneous, in the sense that completely different parallelization techniques should be used in implementations.
9. Parallel solvers of linear algebraic equations

Basic classification of (sequential) solvers

Ax = b

Our case of interest:
A is large
A is, fortunately, most often sparse
Different classes of methods for solving the system, with various advantages and disadvantages:
Gaussian elimination → direct methods
CG method → Krylov space iterative methods
(+) multilevel information transfer
9. Parallel solvers of linear algebraic equations: II.

Hunt for extreme parallelism: the algorithm by Csanky

Compute the powers of A: A², A³, ..., A^{n−1} (O(log² n) complexity)
Compute the traces s_k = tr(A^k) of the powers (O(log n) complexity)
Solve the Newton identities for the coefficients a_i of the characteristic polynomial
Compute the inverse using the Cayley-Hamilton theorem (O(log² n))
Horribly unstable.
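A dense, sequential Python sketch of the four steps, for illustration only (the slide rightly calls the method horribly unstable; use it on tiny matrices at most):

    import numpy as np

    def csanky_inverse(A):
        n = A.shape[0]
        pw = [np.eye(n)]
        for _ in range(n - 1):
            pw.append(pw[-1] @ A)              # powers A^0 ... A^{n-1}
        s = [np.trace(P @ A) for P in pw]      # s_k = tr(A^k), k = 1..n
        c = np.zeros(n)                        # lambda^n + c_1 lambda^{n-1} + ... + c_n
        for k in range(1, n + 1):              # Newton identities (sequentially here)
            c[k - 1] = -(s[k - 1] + sum(c[j - 1] * s[k - j - 1]
                                        for j in range(1, k))) / k
        # Cayley-Hamilton: A^{-1} = -(A^{n-1} + c_1 A^{n-2} + ... + c_{n-1} I)/c_n
        acc = pw[n - 1].copy()
        for j in range(1, n):
            acc += c[j - 1] * pw[n - 1 - j]
        return -acc / c[n - 1]

    A = np.array([[2.0, 1.0], [0.0, 3.0]])
    print(csanky_inverse(A) @ A)               # ~ identity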
9. Parallel solvers of linear algebraic equations: III.

Typical key operations in the skeleton of Krylov subspace methods

1. Matrix-vector multiplication (one right-hand side) with a sparse matrix.
2. Matrix-matrix multiplications (more right-hand sides), where the first matrix is sparse.
3. Sparse matrix-matrix multiplications.
4. The preconditioning operation (we will explain preconditioning later).
5. Orthogonalization in some algorithms (GMRES).
6. Some standard dense stuff (saxpys, dot products, norm computations).
7. Overlapping communication and computation. It sometimes changes the numerical properties of the implementation.
9. Parallel solvers of linear algebraic equations: IV.

The system matrix goes sparse: more possible data structures

[Figure: the same sparsity pattern stored as a band (bandwidth 6), as a profile (profile 6), and by the frontal method – a dynamic band with a moving window.]
9. Parallel solvers of linear algebraic equations: V.

A general sparsity structure can be treated reasonably.

Banded and envelope paradigms often lead to slower algorithms, e.g., when the matrices have to be decomposed.
Machines often support gather/scatter, useful for the indirect addressing connected to sparse matrices.
Generally sparse data structures are typically preferred.

[Figure: a sparsity pattern before and after factorization; the entries marked f denote fill-in.]
9. Parallel solvers of linear algebraic equations: VI.

A general sparsity structure can be treated reasonably.
Scheduling for parallel computation is not straightforward.

(Some) issues useful for linear algebraic solvers:

1. Sparse fill-in minimizing reorderings
2. Graph partitioning
3. Reordering the matrix for matvecs with 1-D / 2-D partitioning
4. Sparse matrix-matrix multiplication
5. Some ideas from preconditioning
9. Parallel solvers of linear algebraic equations: VII.

Sparse fill-in minimizing reorderings

static: this distinguishes them from dynamic reordering strategies (pivoting)
two basic types:
local reorderings: based on a local greedy criterion
global reorderings: taking into account the whole graph / matrix
9. Parallel solvers of linear algebraic equations: VIII.

Local fill-in minimizing reorderings: MD, the basic algorithm.

G = G(A)
for i = 1 to n do
  find v such that deg_G(v) = min_{u∈V} deg_G(u)
  G = G_v
end i

The order of the found vertices induces their new numbering.
Here deg(v) = |Adj(v)|, the superscript G determines the current graph, and G_v is the elimination graph: v is removed and its neighbourhood becomes a clique.
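A plain Python sketch of this greedy loop (the adjacency-dict representation is an assumption of the sketch):

    def minimum_degree_order(adj):
        # adj: dict vertex -> set of neighbours; returns the elimination order
        adj = {v: set(nb) for v, nb in adj.items()}
        order = []
        while adj:
            v = min(adj, key=lambda u: len(adj[u]))   # vertex of minimum degree
            nbrs = adj.pop(v)
            for u in nbrs:
                adj[u].discard(v)
                adj[u] |= (nbrs - {u})                # fill-in: clique on Adj(v)
            order.append(v)
        return order

    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(minimum_degree_order(adj))   # 4 has minimum degree and goes first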
9. Parallel solvers of linear algebraic equations: IX.

MD, the basic algorithm: an example.

[Figure: a graph G with a minimum degree vertex v, and the elimination graph G_v after eliminating v.]
9. Parallel solvers of linear algebraic equations: X.

Global reorderings: the ND algorithm (George, 1973)

Find a separator
Reorder the matrix, numbering the nodes in the separator last
Do it recursively

[Figure: a vertex separator S dividing the graph into the components C_1 and C_2.]
9. Parallel solvers of linear algebraic equations: XI.

The ND algorithm after one level of recursion

[Figure: the mesh split into C_1, C_2 and S, and the corresponding block structure of the reordered matrix.]
9. Parallel solvers of linear algebraic equations: XII.

The ND algorithm after more levels of recursion

[Figure: a 7×7 grid with the nested dissection numbering of its 49 nodes.]
9. Parallel solvers of linear algebraic equations: XIII.

Static reorderings: summary

the most useful strategy: combining local and global reorderings
modern nested dissections are based on graph partitioners: partition a graph such that
the components have very similar sizes
the separator is small
can be formulated and solved correctly for a general graph
theoretical estimates for the fill-in and the number of operations
modern local reorderings: used after a few steps of an incomplete nested dissection
9. Parallel solvers of linear algebraic equations: XIV.

Graph partitioning

The goal: separate a given graph into pieces of similar sizes having small separators.

TH: Let G = (V, E) be a planar graph. Then we can find a vertex separator S = (V_S, E_S) which divides V into two disjoint sets V_1 and V_2 such that max(|V_1|, |V_2|) ≤ (2/3)|V| and |V_S| ≤ 2√(3|V|).

Many different strategies for general cases
Recursive bisections or k-sections
Sometimes for weighted graphs
9. Parallel solvers of linear algebraic equations: XV.

Graph partitioning: classification of a few basic approaches

1. Kernighan-Lin algorithm
2. Level-structure partitioning
3. Inertial partitioning
4. Spectral partitioning
5. Multilevel partitioning
9. Parallel solvers of linear algebraic equations: XVI.

Graph partitioning: Kernighan-Lin (1970)

Partitioning by local searches.
Often used for improving partitions provided by other algorithms.
More efficient implementation by Fiduccia and Mattheyses, 1982.

The intention
Start with a graph G = (V, E) with edge weights w : E → R⁺ and a partitioning V = V_A ∪ V_B.
Find X ⊂ V_A and Y ⊂ V_B such that the new partition V = (V_A ∪ Y \ X) ∪ (V_B ∪ X \ Y) reduces the total cost of the edges between V_A and V_B, given by

COST = Σ_{a∈V_A, b∈V_B} w(a, b).
9. Parallel solvers of linear algebraic equations: XVII.

Graph partitioning: Kernighan-Lin: II.

Monitoring the gains in COST when exchanging a pair of vertices:
gain(a, b) for a ∈ V_A and b ∈ V_B is given by

E(a) − I(a) + E(b) − I(b) − 2w(a, b),

where E(x) and I(x) denote the external and internal cost of x ∈ V, respectively.

[Figure: vertices a ∈ V_A and b ∈ V_B; I(a) counts the edges from a inside V_A, E(a) the edges from a into V_B.]
9. Parallel solvers of linear algebraic equations: XVIII.

Algorithm 2: Kernighan-Lin
compute COST of the initial partition
until GAIN ≤ 0
  for all nodes x compute E(x) and I(x)
  unmark all nodes
  while there are unmarked nodes do
    find a suitable pair a, b of vertices from different partitions maximizing gain(a, b)
    mark a, b
  end while
  find GAIN maximizing the partial sums of the gains computed in the loop
  if GAIN > 0 then update the partition
end until
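A simplified Python sketch of one pass of this loop (a simplification, not the full KL scheme: marked vertices are removed from the cost sums instead of updating E/I as if already swapped; A and B must be sets, which are modified in place; w is a symmetric edge-weight dict, missing edges have weight 0):

    def kernighan_lin_pass(w, A, B):
        wt = lambda x, y: w.get((x, y), w.get((y, x), 0.0))
        cost = lambda x, S: sum(wt(x, y) for y in S)          # E(x) or I(x)
        A2, B2, log = set(A), set(B), []
        for _ in range(min(len(A), len(B))):
            g, a, b = max((cost(a, B2) - cost(a, A2) + cost(b, A2) - cost(b, B2)
                           - 2 * wt(a, b), a, b) for a in A2 for b in B2)
            A2.remove(a); B2.remove(b)                         # mark a, b
            log.append((g, a, b))
        k = max(range(len(log) + 1),
                key=lambda k: sum(g for g, _, _ in log[:k]))   # GAIN: best prefix
        for _, a, b in log[:k]:                                # swap if GAIN > 0
            A.remove(a); B.remove(b); A.add(b); B.add(a)
        return A, B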
9. Parallel solvers of linear algebraic equations: XIX.

Graph partitioning: level structure algorithms

based on breadth-first search
simple, but often not very good
can be improved, for example, by the KL algorithm
9. Parallel solvers of linear algebraic equations: XVI.

Graph partitioning: the inertial algorithm

deals with graphs and their coordinates
divides the set of graph nodes by a line (2D) or a plane (3D)

The strategy in 2D
Choose a line a(x − x_0) + b(y − y_0) = 0, a² + b² = 1. It has slope −a/b and goes through (x_0, y_0).
Compute the distances d_i = a(y_i − y_0) − b(x_i − x_0) of the projections of the nodes (x_i, y_i) from (x_0, y_0).
Find the median d of these distances.
Divide the nodes according to this median into two groups.
How to choose the line?
9. Parallel solvers of linear algebraic equations: XVII.

Graph partitioning: the inertial algorithm: II.
9. Parallel solvers of linear algebraic equations: XVIII.

Graph partitioning: the inertial algorithm: III.

Some more explanation for the 2D case
We are finding a line such that the sum of squares of the distances c_i of the nodes from it is minimized.
This is a total least squares problem.
Considering the nodes as mass units, the line taken as the axis should minimize the moment of inertia among all possible lines.
Mathematically:

Σ_{i=1}^n c_i² = Σ_{i=1}^n ((x_i − x_0)² + (y_i − y_0)² − (a(y_i − y_0) − b(x_i − x_0))²) = (a, b) M (a, b)ᵀ

That is, a small (2×2) eigenvalue problem.
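A Python sketch of the whole 2D procedure; taking (x_0, y_0) as the centroid and the function name are assumptions of the sketch:

    import numpy as np

    def inertial_bisection(xy):
        # xy: n-by-2 array of node coordinates; returns a boolean mask of one half
        d = xy - xy.mean(axis=0)              # shift by the centroid (x0, y0)
        M = d.T @ d                           # the 2x2 matrix of the quadratic form
        _, V = np.linalg.eigh(M)
        a, b = V[:, 0]                        # eigenvector of the smallest eigenvalue
        dist = a * d[:, 1] - b * d[:, 0]      # d_i = a(y_i - y0) - b(x_i - x0)
        return dist <= np.median(dist)        # split at the median distance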
9. Parallel solvers of linear algebraic equations: XIX.

Spectral partitioning

DF: The Laplacian matrix of an undirected unweighted graph G = (V, E) is given by

L(G) = Aᵀ A

where A is its incidence (edge-by-vertex) matrix. Namely,

L(G)_ij = degree of node i for i = j,
L(G)_ij = −1 for (i, j) ∈ E, i ≠ j,
L(G)_ij = 0 otherwise.

Then

xᵀ L x = xᵀ Aᵀ A x = Σ_{(i,j)∈E} (x_i − x_j)².

L is positive semidefinite.
9. Parallel solvers of linear algebraic equations: XX.

Spectral partitioning: examples of Laplacians

[Figure: a 5-node example graph with its incidence matrix and its Laplacian; the Laplacian has diagonal (2, 2, 3, 3, 2) and entries −1 at the edges of the graph.]
9. Parallel solvers of linear algebraic equations: XX.

Spectral partitioning

The Laplacian corresponding to the graph of a connected mesh has the eigenvalue 0.
The eigenvector corresponding to this eigenvalue is (1, ..., 1)ᵀ/√n.
Denote by µ the second smallest eigenvalue of L(G).
Let V be partitioned into V⁺ and V⁻. Let v be a vector with v(x) = 1 for x ∈ V⁺ and v(x) = −1 otherwise. TH: Then the number of edges connecting V⁺ and V⁻ is (1/4) vᵀ L(G) v, since

vᵀ L(G) v = Σ_{(i,j)∈E} (v_i − v_j)² = Σ_{(i,j)∈E, i∈V⁺, j∈V⁻} (v_i − v_j)² = 4 × (number of edges between V⁺ and V⁻).
9. Parallel solvers of linear algebraic equations: XXI.

Spectral partitioning

Find the second eigenvector of the Laplacian.
Dissect by its values.
This is an approximation to the discrete optimization problem.
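A dense Python sketch of spectral bisection (nodes numbered 0..n−1 and the edge-list input are assumptions; a dense eigensolver is used for brevity):

    import numpy as np

    def spectral_bisection(edges, n):
        # build L directly from the definition: L_ii = deg(i), L_ij = -1 on edges
        L = np.zeros((n, n))
        for i, j in edges:
            L[i, i] += 1; L[j, j] += 1
            L[i, j] -= 1; L[j, i] -= 1
        _, V = np.linalg.eigh(L)              # eigenvalues in ascending order
        fiedler = V[:, 1]                     # eigenvector of the 2nd smallest one
        return fiedler <= np.median(fiedler)  # dissect by its values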
Multilevel partitioning: acceleration of the basic procedures

9. Parallel solvers of linear algebraic equations: XXII.

The matrix is, of course, sparse.
The matrix should be distributed, based, e.g., on a distributed read of the row lengths.
First step: what are my rows?

do i=1,n
  determine start of row i
  find end of row i
  compute length of row i
end do

parallel gather / parallel sort / parallel merge
finally, the processes know what their rows are
at least at the beginning: static load balancing
for example: distribution of matrix rows into groups of approximately nnz(A)/p nonzeros.
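A Python sketch of this static balancing step; it uses contiguous groups of rows (one simple variant of the idea, an assumption of the sketch), with the row lengths coming from the first-step read above:

    def assign_rows(row_lengths, p):
        # give every process about nnz(A)/p nonzeros
        total, owner, acc, proc = sum(row_lengths), [], 0, 0
        for ln in row_lengths:
            owner.append(proc)
            acc += ln
            if p * acc >= total * (proc + 1) and proc < p - 1:
                proc += 1
        return owner                  # owner[i] = process holding row i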
9. Parallel solvers of linear algebraic equations: XXII.

Natural assumption: the matrix is processed only as distributed.
Second step: distributed read; in MPI, all processors check for their rows concurrently.

do i=1,n
  if this is my row then
    find start of row i
    find end of row i
    read / process the row: if (myid.eq.xxx) then read
  end if
end do
10. Approximate inverse preconditioners: VII.

Frobenius norm minimization: special cases II.

Changing the sparsity pattern in outer iterations: SPAI (Cosgrove, Díaz, Griewank, 1992; Grote, Huckle, 1997)
Evaluating a new pattern by estimating the norms of possible new residuals
More exact evaluations of the residuals (Gould, Scott, 1995)
Procedurally parallel, but data parallelism is difficult
Need to have high-quality pattern predictions (Huckle, 1999, 2001; Chow, 2000)

A simple stationary iterative method for the individual columns c_i by solving

A c_i = e_i

(Chow, Saad, 1994)
Simple, but not very efficient
A "Gauss-Seidel" variant: sometimes much better, sometimes much worse
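A dense Python sketch of the core SPAI step, the least-squares solve for one column over a fixed pattern (the adaptive pattern-growing loop of SPAI is omitted; dense numpy arrays and the function name are assumptions):

    import numpy as np

    def spai_column(A, i, pattern):
        # min ||A m_i - e_i||_2 over the nonzero positions in `pattern`
        J = sorted(pattern)                                  # allowed nonzeros of m_i
        rows = np.nonzero(np.abs(A[:, J]).sum(axis=1))[0]    # rows touched by columns J
        rhs = (rows == i).astype(float)                      # e_i restricted to those rows
        m, *_ = np.linalg.lstsq(A[np.ix_(rows, J)], rhs, rcond=None)
        col = np.zeros(A.shape[1]); col[J] = m
        return col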
10. Approximate inverse preconditioners: VIII.

Frobenius norm minimization: special cases III.

Factorized inverse preconditioners based on approximate Frobenius norm minimization for SPD matrices (Kolotilina, Yeremin, 1993):

Z = arg min_{X∈S} F_I(Xᵀ, L) = arg min_{X∈S} ‖I − Xᵀ L‖²_F, where A = L Lᵀ.

The procedure: first get Z from the problem

‖I − Xᵀ L‖²_F = Σ_{i=1}^n ‖e_iᵀ − x_iᵀ L‖²₂,

then set D = (diag(Z))⁻¹ and Z̄ = Z D^{1/2}.
Then A⁻¹ ≈ Z̄ Z̄ᵀ.

Extended to the nonsymmetric case
Rather robust, often underestimated
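A dense Python sketch of a factorized inverse of this type, written row-wise for G ≈ L⁻¹ (so Z = Gᵀ in the notation above); the small-system formulation and scaling convention are assumptions of this sketch:

    import numpy as np

    def fsai(A, patterns):
        # patterns[i]: set J of allowed positions in row i, J subset of {0..i}, i in J
        n = A.shape[0]
        G = np.zeros((n, n))
        for i, J in enumerate(patterns):
            J = sorted(J)
            y = np.linalg.solve(A[np.ix_(J, J)], (np.array(J) == i).astype(float))
            G[i, J] = y / np.sqrt(y[-1])     # scale so that diag(G A G^T) = I
        return G                             # A^{-1} ~ G^T G

For example, patterns = [{0}, {0, 1}, {1, 2}, ...] selects a tridiagonal-like lower pattern.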
10. Approximate inverse preconditioners: IX.

A-orthogonalization: AINV

For an SPD matrix A: find an upper triangular Z and a diagonal matrix D such that

Zᵀ A Z = D → A⁻¹ = Z D⁻¹ Zᵀ

The algorithm: conjugate Gram-Schmidt, i.e. GS with a different inner product (x, y)_A
Origins of A-orthogonalization for solving linear systems in several papers from the 1940s
A more detailed treatment of A-orthogonalization: in the first Wilkinson paper (with Fox and Huskey, 1948)
Extended to the nonsymmetric case
Breakdown-free modification for SPD A (Benzi, Cullum, T., 2001)
10. Approximate inverse preconditioners: X.

A-orthogonalization: AINV
Algorithm H-S I.

z_i = e_i − Σ_{k=1}^{i−1} ((e_iᵀ A z_k)/(z_kᵀ A z_k)) z_k,  i = 1, ..., n;  Z = [z_1, ..., z_n]

left-looking
stabilized diagonal entries (in exact arithmetic e_iᵀ A z_k ≡ z_iᵀ A z_k for i ≤ k)
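A dense Python sketch of this left-looking loop with simple dropping (the absolute drop rule is an assumption of the sketch, not the slide's exact rule; drop = 0 gives the exact A-orthogonalization):

    import numpy as np

    def ainv(A, drop=0.1):
        n = A.shape[0]
        Z = np.eye(n)
        for i in range(1, n):
            for k in range(i):
                zk = Z[:, k]
                Z[:, i] -= ((A[i, :] @ zk) / (zk @ A @ zk)) * zk  # e_i^T A z_k / z_k^T A z_k
            small = np.abs(Z[:, i]) < drop
            small[i] = False                   # keep the unit diagonal of Z
            Z[small, i] = 0.0                  # dropping -> incomplete process
        D = np.diag(Z.T @ A @ Z).copy()        # then A^{-1} ~ Z D^{-1} Z^T
        return Z, D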
10. Approximate inverse preconditioners: XI.

Possibility of breakdowns

AINV: modified incomplete A-orthogonalization; exists for M-matrices and H-matrices (Benzi, Meyer, T., 1996)
Possibility of breakdown in A-orthogonalization for non-H matrices
Possibly poor approximate inverses for these matrices

A-orthogonalization in historical perspective

Fox, Huskey, Wilkinson, 1948: H-S I.
The escalator method by Morris, 1946: a variation of H-S I. (non-stabilized computation of D)
The vector method by Purcell, 1952: basically H-S II.
Approximate inverse by bordering (Saad, 1996) is equivalent to H-S I. (Benzi, T., 2002)
Bridson, Tang, 1998 – (nonsymmetric) algorithms equivalent to H-S I.
10. Approximate inverse preconditioners: XII.

Other possible stabilization attempts:

Pivoting
Look-ahead
DCR
Block algorithms
11. Polynomial preconditioners: I.

The problem

Find a preconditioner M such that M⁻¹ is a polynomial in A of a given degree k, that is,

M⁻¹ = P_k(A) = Σ_{j=0}^{k} α_j Aʲ.

First proposed by Cesari, 1937 (for the Richardson iteration).
Naturally motivated, since by the Cayley-Hamilton theorem we have

Q_k(A) ≡ Σ_{j=0}^{k} β_j Aʲ = 0

for the characteristic polynomial of A, k ≤ n. Therefore we have

A⁻¹ = −(1/β₀) Σ_{j=1}^{k} β_j A^{j−1}.
11. Polynomial preconditioners: II.

Polynomial preconditioners and Krylov space methods

Recall: CG forms an approximation to the solution vector for k ≥ 1:

x_{k+1} = x_0 + P_k(A) r_0

The polynomial P_k(A) is optimal in minimizing

(x − x*)ᵀ A (x − x*)

among all polynomials of degree at most k with P_k(0) = 1.
Therefore, why polynomial preconditioners?
The number of CG iterations can still be decreased.
They can be useful when the bottleneck is in scalar products, message passing, or the memory hierarchy.
They can strongly enhance vector processing.
Neumann series preconditioners for SPD systems (Dubois,Greenbaum, Rodrigue, 1979)
M. Tuma 158
11. Polynomial preconditioners: III.
Basic classes of polynomial preconditioners:I.
Neumann series preconditioners for SPD systems (Dubois,Greenbaum, Rodrigue, 1979) Let A = M1 −N such that M is nonsingular and G = M−1
1 N satisfiesρ(G) < 1. Then
A−1 = (I −G)−1M−11 =
+∞∑
j=1
Gj
M−1
1
M. Tuma 158
11. Polynomial preconditioners: III.
Basic classes of polynomial preconditioners:I.
Neumann series preconditioners for SPD systems (Dubois,Greenbaum, Rodrigue, 1979) Let A = M1 −N such that M is nonsingular and G = M−1
1 N satisfiesρ(G) < 1. Then
A−1 = (I −G)−1M−11 =
+∞∑
j=1
Gj
M−1
1
The preconditioner: truncating the series
M−1 =
k∑
j=1
Gj
M−1
1 , k > 0
M. Tuma 158
11. Polynomial preconditioners: III.
Basic classes of polynomial preconditioners: I.
Neumann series preconditioners for SPD systems (Dubois, Greenbaum, Rodrigue, 1979). Let A = M_1 - N such that M_1 is nonsingular and G = M_1^{-1} N satisfies \rho(G) < 1. Then
A^{-1} = (I - G)^{-1} M_1^{-1} = \sum_{j=0}^{+\infty} G^j M_1^{-1}
The preconditioner: truncating the series,
M^{-1} = \sum_{j=0}^{k} G^j M_1^{-1}, \quad k > 0
Preconditioners P_k of odd degree are sufficient (P_k is not less efficient than P_{k+1}).
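A minimal sketch (my illustration, assuming the Jacobi splitting M_1 = diag(A); the slides do not fix a particular M_1) of applying the truncated Neumann series to a vector:

```python
import numpy as np

def neumann_apply(A, v, k):
    """Apply M^{-1} v = sum_{j=0}^{k} G^j M1^{-1} v for the Jacobi splitting
    A = M1 - N with M1 = diag(A), so G = M1^{-1} N = I - M1^{-1} A.
    Horner-style recursion: k matvecs with A and no inner products."""
    d = np.diag(A)            # assumes a nonzero diagonal (e.g., the SPD case)
    z = v / d                 # the j = 0 term, M1^{-1} v
    y = z.copy()
    for _ in range(k):
        y = z + y - (A @ y) / d   # y <- M1^{-1} v + G y, since G y = y - M1^{-1} A y
    return y

# Tiny check on a diagonally dominant SPD matrix (there rho(G) < 1)
n = 50
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
v = np.ones(n)
print(np.linalg.norm(A @ neumann_apply(A, v, 15) - v))   # small residual
```

Note that the application costs only matvecs, which is the parallel/vector appeal mentioned on the previous slide.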
M. Tuma 159
11. Polynomial preconditioners: IV.
Basic classes of polynomial preconditioners: II.
Generalized Neumann series preconditioners for SPD systems (Johnson, Micchelli, Paul, 1983). Parametrize the approximation of (I - G)^{-1} as
I + \gamma_1 G + \gamma_2 G^2 + \ldots + \gamma_k G^k
The added degrees of freedom may be used to optimize the approximation to A^{-1}.
Let
\mathcal{Q}_k = \{ R_k \mid R_k(0) = 0; \; R_k(\lambda) > 0 \ \forall \lambda \ \text{from an inclusion set } IS \}
Find R_k \in \mathcal{Q}_k that attains the min-max value on this inclusion set: the minmax polynomial.
Apply to residual polynomials 1 - Q_k(\lambda) = 1 - \lambda P_k(\lambda) for the polynomial preconditioner P_k.
This polynomial can be expressed in terms of Chebyshev polynomials of the first kind.
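A sketch of the Chebyshev flavor of such preconditioners (mine, assuming eigenvalue bounds 0 < a ≤ λ(A) ≤ b are known; the recurrence is the classical Chebyshev iteration, cf. Saad's textbook):

```python
import numpy as np

def cheby_prec_apply(A, v, a, b, k):
    """k-step Chebyshev approximation z = p(A) v to A^{-1} v (deg p = k-1),
    via the classical three-term recurrence on the inclusion interval
    [a, b] for the spectrum of A. Only matvecs, no dot products."""
    theta, delta = (b + a) / 2.0, (b - a) / 2.0
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    d = v / theta                 # first direction (starting from x_0 = 0)
    z = d.copy()
    r = v - A @ z                 # residual of A z = v
    for _ in range(k - 1):
        rho_new = 1.0 / (2.0 * sigma1 - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        z = z + d
        r = r - A @ d
        rho = rho_new
    return z

# Tiny check: 1D Laplacian with crude spectral bounds
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
v = np.random.default_rng(1).standard_normal(n)
z = cheby_prec_apply(A, v, a=4 * np.sin(np.pi / (2 * (n + 1)))**2, b=4.0, k=30)
print(np.linalg.norm(v - A @ z) / np.linalg.norm(v))   # clearly below 1
```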
M. Tuma 160
11. Polynomial preconditioners: V.
Basic classes of polynomial preconditioners: III.
Least-squares preconditioners for SPD systems (Johnson, Micchelli, Paul, 1983). The min-max polynomial may map small eigenvalues of A to large eigenvalues of M^{-1}A; this seems to degrade the convergence rate, and its quality seems to depend strongly on the inclusion set estimate.
This approach: minimize a quadratic norm of the residual polynomial,
\int_{IS} (1 - Q(\lambda))^2 w(\lambda) \, d\lambda
Jacobi weights (w(\lambda) = (b - \lambda)^{\alpha} (\lambda - a)^{\beta}, \alpha, \beta > -1) for IS = \langle a, b \rangle, or Legendre weights (w \equiv 1) for simple integration.
Computing the polynomials from three-term recurrences (Stiefel, 1958), or by kernel polynomials (Stiefel, 1958), or from normal equations (Saad, 1983).
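A rough sketch (mine; it discretizes the integral by sampling instead of using the recurrences or kernel polynomials named above) of a least-squares residual polynomial with the Legendre weight:

```python
import numpy as np

def lsq_poly_prec_coeffs(a, b, k, m=400):
    """Coefficients c_0..c_k of P_k minimizing a sampled version of
    int_IS (1 - lambda*P_k(lambda))^2 w(lambda) dlambda with w = 1
    (Legendre weight) on IS = <a, b>."""
    lam = np.linspace(a, b, m)
    # Q(lam) = lam*P_k(lam) = sum_j c_j lam^{j+1}; fit Q ~ 1 in least squares
    V = np.vander(lam, k + 2, increasing=True)[:, 1:]  # columns lam^1..lam^{k+1}
    c, *_ = np.linalg.lstsq(V, np.ones(m), rcond=None)
    return c  # P_k(lam) = c[0] + c[1]*lam + ... + c[k]*lam^k

def apply_poly_prec(A, v, c):
    """M^{-1} v = P_k(A) v by Horner's rule: k matvecs, no dot products."""
    y = c[-1] * v
    for cj in c[-2::-1]:
        y = A @ y + cj * v
    return y

c = lsq_poly_prec_coeffs(a=0.05, b=4.0, k=8)
lam = np.linspace(0.05, 4.0, 6)
# Q = lam*P_k stays close to 1 on most of IS (it is forced to 0 at the origin)
print((lam * np.polyval(c[::-1], lam)).round(2))
```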
M. Tuma 161
11. Polynomial preconditioners: VI.
Preconditioning symmetric indefinite systems
DeBoor and Rice polynomials solve the minmax problem for general inclusion sets composed of two parts: IS = \langle a, b \rangle \cup \langle c, d \rangle, b < 0 < c (DeBoor, Rice, 1982).
For equal lengths (b - a = d - c) they can be expressed in terms of Chebyshev polynomials of the first kind; this is the case with the best behavior.
Grcar polynomials solve a slightly modified minmax approximation problem, formulated for residual polynomials. But: more oscillatory behavior.
Both mentioned possibilities: a positive definite preconditioned matrix, not explicitly computable.
Clustering eigenvalues around \mu < 0 and 1 (Freund, 1991; bilevel polynomial of Ashby, 1991): best behavior for nonequal intervals and b \approx -c.
M. Tuma 162
11. Polynomial preconditioners: VII.
Further achievements
Different weights for least-squares polynomials for solving symmetric indefinite systems (Saad, 1983).
Adapting polynomials based on information from the CG method (Ashby, 1987, 1990; Ashby, Manteuffel, Saylor, 1989; see also Fischer, Freund, 1994; O'Leary, 1991).
Double use of the minmax polynomial can bring some improvement (Perlot, 1995).
Polynomial preconditioners for solving nonsymmetric systems are possible but typically not a method of choice (Manteuffel, 1977, 1978; Saad, 1986; Smolarski, Saylor, 1988).
M. Tuma 163
12. Element-by-element preconditioners: I.
Basic notation
Assume that A is given as
A = \sum_e A_e
Consider
M_e = (D_A)_e + (A_e - D_e),
where (D_A)_e is the part of D_A corresponding to A_e. Set
M = \prod_{e=1}^{n_e} M_e
Introduced by Hughes, Levit, Winget, 1983 (and formulated for a Jacobi-scaled A).
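A minimal sketch (my illustration; the element list, its index maps, and the invertibility of each M_e are assumed) of applying such a product preconditioner, M z = y solved element by element:

```python
import numpy as np

def ebe_solve(elements, y):
    """Solve M z = y for M = prod_e M_e, where each M_e acts as a small
    dense matrix on the dofs of element e and as the identity elsewhere.
    `elements` is a list of (dofs, Me) pairs: global dof indices and the
    local (regularized, invertible) element matrix. Applying the factors
    in element order realizes M^{-1} = M_{ne}^{-1} ... M_1^{-1}."""
    z = y.copy()
    for dofs, Me in elements:
        z[dofs] = np.linalg.solve(Me, z[dofs])   # local small dense solve
    return z

# Toy example: two overlapping 2x2 "elements" on 3 dofs
elements = [(np.array([0, 1]), np.array([[2.0, -1.0], [-1.0, 2.0]])),
            (np.array([1, 2]), np.array([[2.0, -1.0], [-1.0, 2.0]]))]
print(ebe_solve(elements, np.ones(3)))
```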
M. Tuma 164
12. Element-by-element preconditioners: II.
Other possibilities
Simple application to solving nonsymmetric systems M z = y (as the product of easily invertible matrices).
For solving SPD systems, the M_e matrices can be decomposed as
M_e = L_e L_e^T
Another approach (Gustafsson, Lindskog, 1986):
M = \sum_{e=1}^{n_e} L_e,
where the L_e can be modified to be positive definite (the individual A_e need not be regular).
Parallelization by domains can solve some hard problems.
M. Tuma 174
13. Vector / Parallel preconditioners: X.
Parallel preconditioning: distributed parallelism
[Figure: matrix and mesh partitioned across processors P0, P1, P2; subdomain boundary nodes are shared between neighbors.]
Matrix-vector product: overlapping communication and computation
1) Initialize sends and receives of boundary nodes
2) Perform local matvecs
3) Complete receives of boundary data
4) Finish the computation
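A schematic sketch of these four steps (mine, using mpi4py; the local blocks A_loc and A_bnd, the neighbor lists, and the boundary index sets are assumed to come from the partitioning):

```python
# Sketch only: overlapped distributed matvec with mpi4py. Assumed setup:
# A_loc acts on the owned part of x, A_bnd on received neighbor boundary
# values, send_idx[p] lists owned boundary entries wanted by neighbor p,
# and recv_len[p] gives the count of values expected back from p.
import numpy as np
from mpi4py import MPI

def dist_matvec(comm, A_loc, A_bnd, x, neighbors, send_idx, recv_len):
    reqs, recv_bufs, send_bufs = [], {}, {}
    # 1) initialize sends and receives of boundary nodes
    for p in neighbors:
        recv_bufs[p] = np.empty(recv_len[p])
        reqs.append(comm.Irecv(recv_bufs[p], source=p))
        send_bufs[p] = np.ascontiguousarray(x[send_idx[p]])  # keep buffer alive
        reqs.append(comm.Isend(send_bufs[p], dest=p))
    # 2) perform the local matvec while messages are in flight
    y = A_loc @ x
    # 3) complete the receives of boundary data
    MPI.Request.Waitall(reqs)
    x_bnd = np.concatenate([recv_bufs[p] for p in neighbors])
    # 4) finish the computation with the boundary contributions
    y += A_bnd @ x_bnd
    return y
```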
M. Tuma 175
13. Vector / Parallel preconditioners: XI.
Multicolorings
Faster convergence vs. parallelism: these effects need to be balanced (Doi, 1991; Doi, Lichnewsky, 1991; Doi, Hoshi, 1992; Wang, Hwang, 1995).
Nodes with the same colour: mutually as far apart as possible.
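A minimal greedy multicoloring sketch (my illustration; production codes balance the number of colours against convergence, as the references above study):

```python
def greedy_multicolor(adj):
    """Greedy colouring of an undirected graph given as an adjacency
    list {node: set(neighbors)}. Nodes of one colour are mutually
    non-adjacent, so a Gauss-Seidel/ILU sweep can update a whole
    colour class in parallel."""
    color = {}
    for v in adj:                          # a smarter order improves quality
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

# 2D 3x3 grid (5-point stencil) -> the classic red-black 2-colouring
adj = {(i, j): {(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= i + di < 3 and 0 <= j + dj < 3}
       for i in range(3) for j in range(3)}
print(sorted(set(greedy_multicolor(adj).values())))   # [0, 1]
```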
M. Tuma 176
14. Solving nonlinear systems: I.
Newton-Krylov paradigm
F(x) = 0
⇓
Sequences of linear systems of the form
J(x_k) \Delta x = -F(x_k), \quad J(x_k) \approx F'(x_k)
are solved for k = 1, 2, \ldots until
\| F(x_k) \| < tol
J(x_k) may change at points influenced by nonlinearities.
M. Tuma 177
14. Solving nonlinear systems: II.
Much easier if matrix approximations are readily available.
But: matrices are often given only implicitly.
For example: linear solvers in the Newton-Krylov framework (see, e.g., Knoll, Keyes, 2004)
J(x_k) \Delta x = -F(x_k), \quad J(x_k) \approx F'(x_k)
Typically only matvecs F'(x_k) v for a given vector v are performed. Finite differences can be used to get such products:
\frac{F(x_k + \epsilon v) - F(x_k)}{\epsilon} \approx F'(x_k) v
Matrices are always present in a more or less implicit form; the tradeoff implicitness × fast execution appears in many algorithms.
For strong algebraic preconditioners we need matrix approximations.
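A minimal sketch (mine) of this matrix-free Jacobian-vector product; the step-size heuristic is a common choice, not prescribed by the slides:

```python
import numpy as np

def jac_vec(F, x, v, Fx=None):
    """Approximate J(x) v = F'(x) v by the forward difference
    (F(x + eps*v) - F(x)) / eps, without ever forming J."""
    if Fx is None:
        Fx = F(x)
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return np.zeros_like(x)
    # common heuristic: balances truncation error against rounding error
    eps = np.sqrt(np.finfo(float).eps) * (1.0 + np.linalg.norm(x)) / nv
    return (F(x + eps * v) - Fx) / eps

# Tiny check against the exact Jacobian of F(x) = (x1^2 - x2, x1 + x2^3)
F = lambda x: np.array([x[0]**2 - x[1], x[0] + x[1]**3])
x = np.array([1.0, 2.0]); v = np.array([0.3, -0.5])
J = np.array([[2 * x[0], -1.0], [1.0, 3 * x[1]**2]])
print(jac_vec(F, x, v), J @ v)   # nearly equal
```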
M. Tuma 178
14. Solving nonlinear systems: III.
To summarize
Jacobian J often provided only implicitly
Parallel functional evaluations
Efficient preconditioning of the linearized system
Efficient evaluation of the products Jx knowing the structure of J
M. Tuma 179
14. Solving nonlinear systems: IV.
Efficient preconditioning of the linearized system
Can strongly simplify the problem to be parallelized:
Approximate inverse Jacobians
Jacobians of related discretizations (convection-diffusion preconditioned by diffusion, Brown, Saad, 1980)
Operator-split Jacobians:
J^{-1} = (\alpha I + S + R)^{-1} \approx (\alpha I + R)^{-1} (I + \alpha^{-1} S)^{-1}
Jacobians formed from only "strong" entries
Jacobians of low-order discretizations
Jacobians with frozen values for expensive terms
Jacobians with frozen and updated values
M. Tuma 180
14. Solving nonlinear systems: V.
Getting a matrix approximation stored implicitly: cases
Get the matrix A_{i+k} by n matvecs A e_j, j = 1, \ldots, n (inefficient)
A sparse A_{i+k} can often be obtained via significantly fewer matvecs than n by grouping the computed columns, if we know its pattern:
the pattern (stencil) is often known (e.g., given by the problem grid in PDE problems); often used in practice
But for approximating A_{i+k} we do not need that much: it might be enough to use an approximate pattern of a different but structurally similar matrix
M. Tuma 181
14. Solving nonlinear systems: VI.
How to approximate a matrix by a small number of matvecs if we know the matrix pattern:
Example 1: Efficient estimation of a banded (tridiagonal) matrix

♠ ∗
♠ ∗ ∗
  ∗ ∗ ♠
    ∗ ♠ ∗
      ♠ ∗ ∗
        ∗ ∗ ♠
          ∗ ♠ ∗
            ♠ ∗

Columns with the spades ♠ can be computed at the same time in one matvec, since the sparsity patterns of their rows do not overlap. Namely,
A(e_1 + e_4 + e_7) computes the entries in columns 1, 4 and 7.
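A small demo of exactly this grouping (mine): a tridiagonal matrix is recovered from three matvecs.

```python
import numpy as np

def estimate_tridiagonal(matvec, n):
    """Recover a tridiagonal n x n matrix from 3 matvecs. Columns
    j, j+3, j+6, ... have non-overlapping row patterns, so one matvec
    with e_j + e_{j+3} + ... yields all of them at once."""
    B = np.zeros((n, n))
    for g in range(3):                       # three colour groups
        cols = range(g, n, 3)
        s = np.zeros(n)
        s[list(cols)] = 1.0
        y = matvec(s)                        # one matvec per group
        for j in cols:                       # unpack rows j-1, j, j+1
            lo, hi = max(j - 1, 0), min(j + 1, n - 1)
            B[lo:hi + 1, j] = y[lo:hi + 1]
    return B

n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
print(np.allclose(estimate_tridiagonal(lambda v: A @ v, n), A))  # True
```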
M. Tuma 182
14. Solving nonlinear systems: VII.
How to approximate a matrix by a small number of matvecs if we know the matrix pattern:
Example 2: Efficient estimation of a general matrix
[6×6 sparsity pattern, three nonzeros per row; ♠ marks columns 1, 3 and 6, whose row sparsity patterns do not overlap.]
Again, the columns whose row sparsity patterns do not overlap can be computed by one matvec.
For example, A(e_1 + e_3 + e_6) computes the entries in columns 1, 3 and 6.
[The same pattern with all entries grouped: four groups of structurally orthogonal columns cover the matrix.]
All entries of A can thus be computed by four matvecs; in each matvec we need structurally orthogonal columns.
M. Tuma 183
14. Solving nonlinear systems: VIII.
Efficient matrix estimation: a well-established field
Structurally orthogonal columns can be grouped
Finding the minimum number of groups: a combinatorially difficult (NP-hard) problem
A classical field; a (very restricted) selection of references: Curtis, Powell, Reid, 1974; Coleman, Moré, 1983; Coleman, Moré, 1984; Coleman, Verma, 1998; Gebremedhin, Manne, Pothen, 2003.
extensions to SPD (Hessian) approximations
extensions using both A and A^T in automatic differentiation
not only direct determination of the resulting entries (substitution methods)
M. Tuma 184
14. Solving nonlinear systems: IX.
Efficient matrix estimation: graph coloring problem
[Figure: the 6×6 pattern and the corresponding intersection graph on column vertices 1-6.]
In other words, columns which form an independent set in the graph of A^T A (called the intersection graph) can be grouped ⇒ a graph coloring problem for the graph of A^T A.
Problem: Find a coloring of the vertices of the graph of A^T A (G(A^T A)) with the minimum number of colors such that edges connect only vertices of different colors.
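A sketch (mine) of the resulting column grouping: build the intersection-graph adjacency from the column patterns and colour greedily (the minimum colouring being NP-hard, as noted above):

```python
def group_columns(col_rows):
    """col_rows[j] = set of row indices of the nonzeros in column j.
    Columns sharing a row are adjacent in G(A^T A); greedy colouring
    gives each column the smallest colour unused by such neighbors."""
    n = len(col_rows)
    group = [None] * n
    for j in range(n):
        taken = {group[i] for i in range(j) if col_rows[i] & col_rows[j]}
        g = 0
        while g in taken:
            g += 1
        group[j] = g
    return group

# Tridiagonal pattern, n = 8: column j touches rows j-1, j, j+1
col_rows = [set(range(max(j - 1, 0), min(j + 1, 7) + 1)) for j in range(8)]
print(group_columns(col_rows))   # [0, 1, 2, 0, 1, 2, 0, 1] -> 3 groups
```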
M. Tuma 185
14. Solving nonlinear systems: X.
Our matrix is defined only implicitly.
⇓
[Figure: the pattern with some entries marked ♣ instead of ♠.]
Consider a new pattern: e.g., if the entries denoted by ♣ are small, the number of groups can be decreased.
M. Tuma 186
14. Solving nonlinear systems: XI.
Our matrix is defined only implicitly.
[Figure: the pattern with small entries marked ♣.]
But: the computation of entries from matvecs is inexact.
M. Tuma 187
14. Solving nonlinear systems: XII.
Computational procedure I.
Step 1: Compute the pattern of A_i or M_i, e.g., for A_i as a sparsification of A_i:
[Figure: the full pattern, with small entries marked ♣, reduced to a sparser pattern without them.]
Step 2: Solve the graph coloring problem for the graph G(pattern^T pattern) to get the groups.
[Figure: the sparsified pattern and the coloring of vertices 1-6 of its intersection graph.]
Step 3: Use matvecs to get A_{i+k} for more indices k ≥ 0, as if the entries outside the pattern were not present.
Notes:
getting the entries from the matvecs is spoiled by errors
the approximation error for any estimated entry a_{ij} in A is
\sum_{k:\,(i,k) \in A \backslash P} |a_{ik}|
where A \backslash P denotes the entries outside the given pattern
the error distribution can be strongly influenced by the column grouping: balancing the error
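A small sketch (mine) of Step 1 together with the error expression above: sparsify by dropping small entries and compute the per-row contamination bound:

```python
import numpy as np

def sparsify_pattern(A, tol):
    """Step 1 sketch: keep only the entries with |a_ij| > tol."""
    return np.abs(A) > tol

def row_error_bounds(A, P):
    """For each row i, sum |a_ik| over entries outside the pattern P:
    the worst-case contamination of an entry estimated in row i when
    matvecs are unpacked as if entries outside P were absent."""
    return np.where(P, 0.0, np.abs(A)).sum(axis=1)

A = np.array([[ 2.0, -1.0,  0.05],
              [-1.0,  2.0, -1.0],
              [ 0.05, -1.0, 2.0]])
P = sparsify_pattern(A, tol=0.1)
print(P.astype(int))
print(row_error_bounds(A, P))   # [0.05, 0.0, 0.05]
```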
M. Tuma 188
14. Solving nonlinear systems: XIII.
Computational procedure II. Preconditioner based on exact estimation of the off-diagonals of A_i (a diagonal partial coloring problem).
[Figure: the full pattern and the pattern with small entries marked ♣.]
Consider a new pattern: e.g., if the entries denoted by ♣ are small, the number of groups can be decreased, since all off-diagonals in columns 4 and 5 are computed precisely.