
Introduction into parallel computations

Miroslav Tuma

Institute of Computer Science

Academy of Sciences of the Czech Republic

and Technical University in Liberec

Presentation supported by the project

“Information Society” of the Academy of Sciences of the Czech Republic

under No. 1ET400300415

MFF UK, February, 2006


Pre-introduction

Preliminaries

General knowledge of the involved basic algorithms of numerical linear algebra (NLA)

Simple ideas from direct and iterative solvers for large sparse linear systems

Complexities of algorithms

Not covered

Vectorization of basic linear algebra algorithms

Parallelization of combinatorial algorithms

FFT, parallel FFT, vectorized FFT

Multigrid, multilevel algorithms

Tools like PETSC etc.

Eigenvalue problems


Outline

Part I. A basic sketch on parallel processing
1. Why to use parallel computers
2. Classification (a very brief sketch)
3. Some terminology; basic relations
4. Parallelism for us
5. Uniprocessor model
6. Vector processor model
7. Multiprocessor model

Part II. Parallel processing and numerical computations
8. Basic parallel operations
9. Parallel solvers of linear algebraic systems
10. Approximate inverse preconditioners
11. Polynomial preconditioners
12. Element-by-element preconditioners
13. Vector / parallel preconditioners
14. Solving nonlinear systems


1. Why to use parallel computers?

It might seem that

always better technologies

computers are still faster: Moore’s law

The number of transistors per square inch on integrated circuits doubles every year since the integrated circuit was invented. The observation was made in 1965 by Gordon Moore, co-founder of Intel (G.E. Moore, Electronics, April 1965).

really:

1971: chip 4004 : 2.3k transistors

1978: chip 8086 : 31k transistors (2 micron technology)

1982: chip 80286: 110k transistors (HMOS technology)

1985: chip 80386: 280k transistors (0.8 micron CMOS)


1. Why to use parallel computers? II.

Further on

1989: chip 80486: 1.2M transistors

1993: Pentium: 3.1M transistors (0.8 micron biCMOS)

1995: Pentium Pro: 5.5M (0.6 micron)

1997: Pentium II: 7.5M transistors

1999: Pentium III: 24M transistors

2000: Pentium 4: 42M transistors

2002: Itanium: 220M transistors

2003: Itanium 2: 410M transistors


1. Why to use parallel computers? III.

But: Physical limitations

finite signal speed (speed of light: 300,000 km/s)

implies limits on the cycle time (clock rate), given in MHz or ns:

100 MHz <----> 10 ns

cycle time of 1 ns ⇒ the signal travels 30 cm per cycle

Cray-1 (1976): 80 MHz

in any case: the size of atoms and quantum effects seem to be the ultimate limits


1. Why to use parallel computers? IV.

Further motivation: important and very time-consuming problems to be solved

reentry into the terrestrial atmosphere ⇒ Boltzmann equations

combustion ⇒ large ODE systems

deformations, crash tests ⇒ large systems of nonlinear equations

turbulent flows ⇒ large systems of PDEs in 3D

⇓ accelerations of computations are still needed

⇓ parallel processing


1. Why to use parallel computers? V.

High-speed computing seems to be cost efficient

“The power of computer systems increases as the square of their cost” (Grosch’s law; H.A. Grosch, High speed arithmetic: The digital computer as a research tool, J. Opt. Soc. Amer. 43 (1953); H.A. Grosch, Grosch’s law revisited, Computerworld 8 (1975), p. 24)


2. Classification: a very brief sketch

a) How deep can we go: levels of parallelism

running jobs in parallel for reliability
IBM AN/FSQ-31 (1958) – a purely duplex machine
(time for operations 2.5 µs – 63.5 µs; a computer connected with the history of the word byte)

running parts of jobs on independent specialized units
UNIVAC LARC (1960) – the first I/O processor

running jobs in parallel for speed
Burroughs D-825 (1962) – more modules, a job scheduler

running parts of programs in parallel
Bendix G-21 (1963), CDC 6600 (1964) – a nonsymmetric multiprocessor


2. Classification: a very brief sketch II.

a) How deep can we go: levels of parallelism (continued)

running matrix-intensive work separately
development of IBM 704x/709x (1963), ASC TI (1965)

parallelizing instructions
IBM 709 (1957), IBM 7094 (1963)
(data synchronizer units, DSU → channels) – enables simultaneous read/write/compute
overlap of computational instructions with loads and stores
IBR (instruction backup registers), instruction pipeline


2. Classification: a very brief sketch III.

a) How deep can we go: levels of parallelism (continued, 3rdpart)

parallelizing arithmetic (bit level): fewer clocks per instruction
superscalar in RISCs (CDC 6600), static superscalar (VLIW)

– check dependencies
– schedule operations


2. Classification: a very brief sketch III.

b) Macro view based on Flynn classification

[Diagram: processor/memory organization – SISD (simple processor), SIMD (vector processor, array processor), MISD, MIMD (shared memory: cache coherent or non cache coherent; distributed memory)]

SISD: single instruction – single data stream

MIMD: multiple instruction – multiple data streams


2. Classification: a very brief sketch IV.

b) Macro view based on Flynn classification – MIMD message passing examples

Caltech Cosmic Cube (1980s) (maximum 64 processors; hypercube organization)

[picture of Caltech Cosmic Cube]

commercial microprocessors + MPP support; examples: transputers, ncube-1, ncube-2

[picture of transputer A100]

standard microprocessors + network support; examples: Intel Paragon (i860), Meiko CS-2 (Sun SPARC), TMC CM-5 (Sun SPARC), IBM SP2-4 (RS6000)

some vector supercomputers: Fujitsu VPP machines

loosely coupled cluster systems


2. Classification: a very brief sketch IV.

b) Macro view based on Flynn classification – shared memory machines examples

no hardware cache coherence (hardware maintaining synchronization between the cache and the rest of the memory); examples: BBN Butterfly (end of the 70s), Cray T3D (1993) / T3E (1996), vector supercomputers: Cray X-MP (1983), Cray Y-MP (1988), Cray C-90 (1990)

hardware cache coherence; examples: SGI Origin (1996), Sun Fire (2001)


2. Classification: a very brief sketch V.

of course, there are other possible classifications

by memory access (local/global caches, shared memory cases (UMA, NUMA, cache-only memory), distributed memory, distributed shared memory)

MIMD by topology (master/slave, pipe, ring, array, torus, tree, hypercube, ...)

features at various levels


2. Classification: a very brief sketch VI.

c) Miscellaneous: features making the execution faster:

FPU and ALU work in parallel

mixing index evaluations and floating-point operations is natural now; it was not always like that: Cray-1 had rather weak integer arithmetic

multiple functional units (for different operations, or for the same operation); first for CDC 6600 (1964) – 10 independent units

pipeline for instructions; IBM 7094 (1969) – IBR (instruction backup registers)

generic example of a floating-point addition pipeline (stages 1–5):
1. check exponents
2. possibly swap operands
3. shift one of the mantissas by the number of bits determined by the difference in exponents
4. compute the new mantissa
5. normalize the result


2. Classification: a very brief sketch VII.

c) Miscellaneous: features making the execution faster (continued):

pipeline for operations (example later); CDC 7600 (1969) – first vector processor

overlapping operations generalizes pipelining:

– possible dependencies between evaluations

– possibly different numbers of stages

– time per stage may differ

processor arrays; ILLIAC IV (1972) – 64 elementary processors

memory interleaving; first for CDC 6600 (1964) – 64 memory banks; Cray-2 efficiency relies on that


3. Some terminology; basic relations

Definitions describing “new” features of the computers

time model

speedup – how fast we are

efficiency – how fast we are with respect to our resources

granularity (of algorithm, implementation) – how large the blocks of code we consider


3. Some terminology; basic relations: II.

Simplified time models

sequential time: t_seq = t_sequential_startup_latency + t_operation_time

vector pipeline time: t_vec = t_vector_startup_latency + n * t_operation_time

communication time: t_transfer_n_words = t_startup_latency + n * t_transfer_word

startup latency: delay time to start the transfer

more complicated relations between the data and the computer are invisible from our standpoint, but we should be aware of them
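A small worked example of the communication model (with illustrative, assumed parameter values, not from the slides): for t_startup_latency = 10 µs and t_transfer_word = 0.01 µs, transferring n = 1000 words costs 10 + 1000 * 0.01 = 20 µs; the startup latency alone accounts for half of the total, which is why many short transfers are far worse than one long one.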


3. Some terminology; basic relations: III.

Speedup S

Ratio Ts/Tp, where Ts is the time for a non-enhanced run and Tp is the time for the enhanced run. Typically:

– Ts: sequential time

– Tp: time for a parallel or vectorized run

for a multiprocessor run with p processors: 0 < S ≤ p

vector pipeline: next slide


3. Some terminology; basic relations: IV.

Speedup S (continued)

time | op-1  op-2  op-3  op-4  op-5
  1  |  a1
  2  |  a2    a1
  3  |  a3    a2    a1
  4  |  a4    a3    a2    a1
  5  |  a5    a4    a3    a2    a1
 ... |  ...

for processing p entries the pipeline needs length + p − 1 ≈ length + p clock cycles, versus length * p cycles without pipelining (here length = 5)

Speedup: S = length * p / (length + p) ≤ p

speedup (better): S = length * p * t_seq / (t_vec_latency + (length + p) * t_vec_op)
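A quick numerical check (a worked example, not on the original slide): with length = 5 and p = 100, S = 5 * 100 / (5 + 100) ≈ 4.8; the bound S ≤ p is far from tight here, the pipeline length itself is the real limit.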


3. Some terminology; basic relations: V.

Efficiency E

Ratio S/p, where S is the speedup and p characterizes the enhancement.

if p is the number of processors: 0 < E ≤ 1

if p is the pipeline length: 0 < E ≤ 1

Relative speedup and efficiency for multiprocessors: Sp and Ep

Sp = T1/Tp,

where T1 is the time for running the parallel code on one processor; typically T1 ≥ Ts

other similar definitions of E and S (e.g., taking into account the relation parallel code × best sequential code)

memory hierarchy effects (e.g., the SGI2000 2-processor effect; large memory on parallel machines)


3. Some terminology; basic relations: VI.

Amdahl’s law expresses the natural surprise at the following fact:

if a process performs part of the work quickly and part of the work slowly, then the overall speedup (efficiency) is strongly limited by the part performed slowly

Notation:

f: fraction of the slow (sequential) part

(1 − f): the rest (parallelized, vectorized)

t: overall time

Then: S = (f*t + (1 − f)*t) / (f*t + (1 − f)*(t/p)) ≤ 1/f

E.g.: f = 1/10 ⇒ S ≤ 10

[Diagram: the work split into a sequential fraction f and a parallelized fraction 1 − f]
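A worked numerical example (values chosen for illustration, not from the slides): with f = 0.1 and p = 100 the formula gives S = 1 / (0.1 + 0.9/100) ≈ 9.2, already close to the asymptotic bound 1/f = 10, so increasing p further hardly helps.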


3. Some terminology; basic relations: VII.

Amdahl’s law (continued)

Described in: (Gene Amdahl: Interpretation of AMDAHL’s theorem, advertisement of IBM, 1967)

Gene Myron Amdahl (1922– ) worked on IBM 704/709, the IBM/360 series, Amdahl V470 (1975)

Amdahl’s law relevancy

Only a simple approximation of computer processing: the dependence f(n) is not considered

fully applies when there are absolute constraints on solution time (weather prediction, financial transactions)

an algorithm is effectively parallel if f → 0 for n → ∞

Speedup / efficiency anomalies:

more processors may have more memory/cache

increasing chances to find a lucky solution in parallel combinatorial algorithms


3. Some terminology; basic relations: VIII.

Scalability

A program is scalable if larger efficiency comes with a larger number of processors or a longer pipeline.

multiprocessors: linear, sublinear, superlinear S/E

different specialized definitions: for a growing number of processors / pipeline length, for growing time

Isoefficiency

Overhead function: To(size, p) = p * Tp(size, p) − Ts

Efficiency: E = 1 / (1 + To(size, p)/size)

Isoefficiency function: size = K * To(size, p) such that E is constant, K = E/(1 − E)

Adding n numbers on p processors: size = Θ(p log p).
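Where the last line comes from (a standard worked example, assuming the usual model in which one addition and one word transfer each cost one time unit and a reduction over p processors takes about 2 log p steps): Ts = n, Tp = n/p + 2 log p, hence To = p*Tp − Ts = 2p log p, and the isoefficiency condition size = K*To becomes n = 2*K*p*log p, i.e. size = Θ(p log p).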


3. Some terminology; basic relations: IX.

Load balancing

Techniques to minimize Tp on multiprocessors by approximately equalizing the tasks of the individual processors.

static load balancing

– array distribution schemes (block, cyclic, block-cyclic, randomized block)

– graph partitioning

– hierarchical mappings

dynamic load balancing

– centralized schemes

– distributed schemes

Will be discussed later.


3. Some terminology; basic relations: IX.

Semaphores

Signals operated by individual processes and not by central control

Shared memory computers’ feature

Introduced by Dijkstra.

Message passing

Mechanism to transfer data from one process to another.

Distributed memory computers’ feature

Blocking versus non-blocking communication


4. Parallelism for us

Mathematician’s point of view

We need to: convert algorithms into state-of-the-art codes

algorithms → codes → computers

[Diagram: Algorithm → Idealized computer → Implementation, Code → Computer]

What is the idealized computer?


4. Parallelism for us

Idealized computer

idealized vector processor

idealized uniprocessor

idealized computers with more processors


5. Uniprocessor model

[Diagram: uniprocessor model – CPU connected to Memory and I/O]


5. Uniprocessor model: II.

Example: model and reality

Even simple Pentium III has on-chip

pipeline (at least 11 stages for each instruction)

data parallelism (SIMD type) like MMX (64-bit) and SSE (128-bit)

instruction-level parallelism (up to 3 instructions)

more threads at the system level, based on bus communication


5. Uniprocessor model: III.

How to ...?: pipelined superscalar CPU: not for us

( pipelines; ability to issue more instructions at the same time)

detecting true data dependencies: dependencies in processing order

detecting resource dependencies: competition of data for computational resources

– reordering instructions; most microprocessors enable out-of-order scheduling

solving branch dependencies

– speculative scheduling across branches; typically every 5th-6th instruction is a branch

VLIW – compile time scheduling


5. Uniprocessor model: IV.

How to ...?: memory and its connection to CPU

(should be considered by us)

1. memory latency – the delay between a memory request and data retrieval

2. memory bandwidth – the rate at which data can be transferred from/to memory


5. Uniprocessor model: V.

Memory latency and performance

Example: a 2 GHz processor, DRAM with latency 0.1 µs; two FMA units on the processor and 4-way superscalar execution (4 instructions in a cycle, e.g., two adds and two multiplies)

cycle time: 0.5ns

maximum processor rate: 8 GFLOPs

for every memory request: 0.1 µs waiting

that is: 200 cycles wasted for each memory access

dot product: two data fetches for each multiply-add (2 ops)

consequently: one op for one fetch

resulting rate: 10 MFLOPs
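Checking the numbers of this example (a worked recomputation): the peak rate is 2 FMA units × 2 flops × 2 GHz = 8 GFLOPS; in the dot product each multiply-add (2 flops) needs two operands from memory, and with a 100 ns latency per access the processor completes 2 flops per 200 ns, i.e. about 10 MFLOPS.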


5. Uniprocessor model: VI.

Hiding / improving memory latency (I.)

a) Using cache

The same example, now with a cache of size 64 kB

it can store matrices A, B and C of dimension 50

matrix multiplication A * B = C

matrix fetch: 5000 words: 500 µs

ops: 2n³; time for ops: 2 * 64³ * 0.5 ns ≈ 262 µs

total: 762 µs

resulting rate: 688 MFLOPs
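As a consistency check of the final number (keeping the slide’s own figures): 2 * 64³ ≈ 5.2 * 10⁵ operations performed in 762 µs is roughly 688 MFLOPS, i.e. the cache raises the effective rate by almost two orders of magnitude compared with the 10 MFLOPS of the previous, uncached example.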


5. Uniprocessor model: VII.

Hiding / improving memory latency (II.)

b) Using multithreading

(Thread: a sequence of instructions in a program which runs a certain procedure.)

dot products of rows of A with b

do i = 1, n
   r(i) = A(i,:)' * b
end do


5. Uniprocessor model: VIII.

Hiding / improving memory latency (III.) (continued)

multithreaded version of the dot product

do i = 1, n
   r(i) = new_thread(dot_product, double, A(i,:), b)
end do

processing more threads: able to hide memory latency

important condition: fast switches of threads

HEP or Tera can switch in each cycle


5. Uniprocessor model: IX.

Hiding / improving memory latency (III.)

c) Prefetching

advancing data loads

like some other techniques, it can bring the rate in our example up to one operation per clock cycle


5. Uniprocessor model: X.

Memory bandwidth

data transfer rate / peak versus average

improvement of memory bandwidth: increase the size of the communicated memory blocks

sending consecutive words from memory

requires spatial locality of data

column versus row major data access: the physical access should be compatible with the logical access from the programming language


5. Uniprocessor model: XI.

Memory bandwidth (continued)

summing columns of A

do i = 1, n
   sum(i) = 0.0d0
   do j = 1, n
      sum(i) = sum(i) + A(i,j)
   end do
end do

matrix is stored columnwise: good spatial locality

matrix is stored rowwise: bad spatial locality

of course, code can be rewritten for row major data access

C, Pascal (rowwise), Fortran (columnwise)


5. Uniprocessor model: XII.

Memory bandwidth and latency: conclusions

The other side of hiding memory latency: an increase in memory bandwidth

memory bandwidth improvements if the vectors are long: breaking the iteration space into blocks: tiling

exploit any possible spatial and temporal locality to amortize memory latency and increase effective memory bandwidth

the ratio q: ops / number of memory accesses: a good indicator of tolerance to memory bandwidth

memory layout and the organization of the computation are a significant challenge for users


5. Uniprocessor model: XIII.

How to improve the ratio q: ops / number of memory accesses? How to standardize the improvement?

⇓ more levels of Basic Linear Algebra Subroutines (BLAS)

basic linear algebraic operations with vectors

basic linear algebraic operations with matrices

closer to “matlab elegance”

in fact, the first Matlab started with LINPACK (1979) kernels with a clever implementation of vector and matrix operations


5. Uniprocessor model: XIV.

BLAS

operation  |  ops        |  comms        |  q = ops/comms
αx + y     |  2n         |  3n + 1       |  ≈ 2/3
αAx + y    |  2n² + n    |  n² + 3n + 1  |  ≈ 2
αAB + C    |  2n³ + n²   |  4n² + 1      |  ≈ n/2

BLAS1 (1979): SAXPY (αx + y), dot_product (xᵀy), vector_norm, plane rotations, ...

BLAS2 (1988): matvecs (αAx + βy), rank-1 updates, rank-2 updates, triangular eqs, ...

BLAS3 (1990): matmats et al.: SGEMM (C = AB)
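To make the BLAS3 level concrete, here is a minimal sketch of a call to DGEMM, the double precision counterpart of the SGEMM mentioned above (this example is not from the slides; it assumes a standard BLAS library is linked in):

c     minimal BLAS3 example: C = alpha*A*B + beta*C via DGEMM
c     (assumes a standard BLAS library providing DGEMM is linked)
      program blas3demo
      integer n
      parameter (n = 4)
      double precision A(n,n), B(n,n), C(n,n), alpha, beta
      integer i, j
      alpha = 1.0d0
      beta  = 0.0d0
c     fill A and B with some test data, clear C
      do j = 1, n
         do i = 1, n
            A(i,j) = dble(i + j)
            B(i,j) = dble(i - j)
            C(i,j) = 0.0d0
         end do
      end do
c     DGEMM(transA, transB, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
      call dgemm('N', 'N', n, n, n, alpha, A, n, B, n, beta, C, n)
      write(*,*) 'C(1,1) = ', C(1,1)
      end

The q ≈ n/2 row of the table above is what such a call exploits: the 2n³ operations of one DGEMM reuse each transferred matrix entry about n/2 times.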


5. Uniprocessor model: XIV.

BLAS pros and cons

BLAS (pros): for most of available computers

– increase effective memory bandwidth

– portability

– modularity

– clarity

– much simpler software maintenance

BLAS (cons): time-consuming interface for simple ops

– further possible improvements based on problem knowledge (distinguishing cases with specific treatment like loop unrolling)


5. Uniprocessor model: XV.

Standardization at the higher level: LAPACK

covers solvers for dense and banded

– systems of linear equations

– eigenvalue problems

– least-squares solutions of overdetermined systems

associated factorizations: LU, Cholesky, QR, SVD, Schur, generalized Schur

additional routines: estimates of condition numbers, factorization reorderings by pivoting

based on LINPACK (1979) and EISPACK (1976) projects


6. Vector processor model

Founding father: Seymour Cray

chief designer of the last CDC models (computers with some parallel features)

Cray computers: one of the most successful chapters in the history of the development of parallel computers

the first CRAYs: vector computers


6. Vector processor model: II.

Vector processing principles

1. Vector computers’ basics

pipelined instructions

pipelined data: vector registers

typically different vector processing units for different operations

[Diagram: vector registers V1 and V2 and scalar register S1 feeding the multiply (*) and add (+) pipelines]


6. Vector processor model: III.

Vector processing principles

1. Vector computers’ basics (continued)

important breakthrough: efficient vectorization of sparse data ⇒ enormous influence on scientific computing

instructions: compress, expand, scatter, gather

scatter b:

do i = 1, n
   a(index(i)) = b(i)
end do

[Diagram: compress – vector x with a mask that is 1 at positions 1, 3, 6, 10; the compressed result is (x1, x3, x6, x10)]

Cyber-205 (late seventies): efficient software support (in microcode)

since Cray X-MP: performed by hardware


6. Vector processor model: IV.

Vector processing principles

2. Chaining

overlapping of vector instructions: introduced in Cray-1 (1976)

results in c + length clock cycles, for a small constant c, to process a vector operation on vectors of length length

the longer the vector chain, the better the speedup

the effect is called supervector performance

[Diagram: chaining of the multiply (*) and add (+) pipelines operating on vector registers V1, V2 and scalar register S1]


6. Vector processor model: V.

Vector processing principles

Stripmining: splitting long vectors

still a saw-like curve of the speedup as a function of the vector length

[Plot: speedup S versus vector length – saw-like curve]

Stride: distance between vector elements

Fortran matrices: column major

C, Pascal matrices: row major


6. Vector processor model: VII.

Vector processing and us

Prepare data to be easily vectorized: II.

loop unrolling: prepare new possibilities for vectorization by a more detailed description

in some cases: predictable sizes of blocks: efficient processing of loops of fixed size

subroutine dscal(n,da,dx,incx)
   do 50 i = mp1, n, 5
      dx(i)     = da*dx(i)
      dx(i + 1) = da*dx(i + 1)
      dx(i + 2) = da*dx(i + 2)
      dx(i + 3) = da*dx(i + 3)
      dx(i + 4) = da*dx(i + 4)
50 continue


6. Vector processor model: VIII.

Vector processing and us

Prepare data to be easily vectorized: III.

loop interchanges: 1. recursive doubling for polynomial evaluation

Horner's rule: p^(k) = a_{n-k} + p^(k-1) * x for getting p^(n); strictly recursive and non-vectorizable

[v1, v2] ← [x, x²]

[v3, v4] ← v2 * [v1, v2]

and so on
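In other words (a short worked continuation of the scheme above): each step doubles the set of available powers of x – after [v1, v2] = [x, x²] one gets [v3, v4] = [x³, x⁴], then v4 * [v1, v2, v3, v4] = [x⁵, x⁶, x⁷, x⁸], and so forth; after about log₂ n such steps all powers x, x², ..., xⁿ are available, and the polynomial value is obtained as a dot product of the coefficient vector with the vector of powers, which vectorizes well.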


6. Vector processor model: IX.

Vector processing and us

Prepare data to be easily vectorized: IV.

loop interchanges: 2. cyclic reduction

demonstrated for solving tridiagonal systems; other "parallel" tridiagonal solvers: later (twisted factorization)

even-odd rearrangement of rows

original tridiagonal matrix (d: diagonal, e: subdiagonal, f: superdiagonal):

   d0 f0
   e1 d1 f1
      e2 d2 f2
         e3 d3 f3
            e4 d4 f4
               e5 d5 f5
                  e6 d6


after the even-odd rearrangement (rows and columns reordered 0, 2, 4, 6, 1, 3, 5):

   d0          f0
      d2       e2 f2
         d4       e4 f4
            d6       e6
   e1 f1       d1
      e3 f3       d3
         e5 f5       d5

more vectorizable than standard Gaussian elimination, but more operations and worse cache behavior
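For completeness, one elimination step of cyclic reduction written out (a standard formulation; the slides only show the reordering): take the equations for rows i−1, i, i+1 of the tridiagonal system, e_j x_{j−1} + d_j x_j + f_j x_{j+1} = b_j. Subtracting (e_i/d_{i−1}) times row i−1 and (f_i/d_{i+1}) times row i+1 from row i eliminates x_{i−1} and x_{i+1}, leaving an equation that couples only x_{i−2}, x_i and x_{i+2}. Doing this for all even i at once is a vector operation; it halves the system, and the process is repeated recursively.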


7. Multiprocessor model

Basic items (some of them emphasized once more)

communication

– in addition to memory latency and memory bandwidth we consider latencies and bandwidths connected to mutual communication

granularity

– how large the independent computational tasks should be

load balancing

– balancing work in the whole system

resulting measure: parallel efficiency / scalability


7. Multiprocessor model: II.

Communication

Additional communication (with respect to the uniprocessor processor-memory (P-M) path): processor-processor (P-P)

store-and-forward routing via l links between two processors

– t_comm = t_s + l(m*t_w + t_h)

– t_s: transfer startup time (includes startups for both nodes)

– m: message size

– t_h: node latency (header latency)

– t_w: time to transfer a word

– simplification: t_comm = t_s + l*m*t_w

typically: poor efficiency of communication
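A worked plug-in of numbers (illustrative values, not from the slides): with t_s = 10 µs, t_h = 1 µs, t_w = 0.1 µs, sending a message of m = 100 words over l = 3 links by store-and-forward costs t_comm = 10 + 3*(100*0.1 + 1) = 43 µs, roughly three times the single-link cost; avoiding this multiplication by l is the point of the routing schemes on the next slides.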


7. Multiprocessor model: III.

Communication (continued)

[Diagram: a single message versus the same message broken into two parts]


7. Multiprocessor model: IV.

Communication (continued 2)

packet routing: routing r packets via l links between two processors

subsequent sends after a part of the message (a packet) has been received

– t_comm = t_s + t_h*l + t_w1*m + (m/r)*t_w2*(r + s)

– t_s: transfer startup time (includes startups for both nodes)

– t_w1: time for packetizing the message, t_w2: time to transfer a word, s: size of the packetizing info

– finally: t_comm = t_s + t_h*l + m*t_w

– stores overlapped by the transfer

cut-through routing: message broken into flow control digits (flits, fixed-size units)

– t_comm = t_s + t_h*l + m*t_w

supported by most current parallel machines and local networks


7. Multiprocessor model: V.

Communication (shared memory issues)

avoid cache thrashing (degradation of performance due to insufficient caches); much more important on multiprocessor architectures ⇒ typical deterioration of performance when a code is moved to a parallel computer

more difficult to model prefetching

difficult to get and model spatial locality because of cache issues

cache sharing (data for different processors sharing the same cache lines)

remote access latencies (data for a processor updated in a cache of another processor)


7. Multiprocessor model: VI.

Optimizing communication

minimize amount of transferred data: better algorithms

message aggregation, communication granularity, communication regularity: implementation

minimize distance of data transfer: efficient routing, physical platform organizations (not treated here) (but tacitly used in some very general and realistic assumptions)


7. Multiprocessor model: VII.

Granularity of algorithms, implementation, computation

A rough classification by the size of the program sections executed without additional communication

fine grain

medium grain

coarse grain


7. Multiprocessor model: VIII.

Fine grain example 1: pointwise Jacobi iteration

x⁺ = (I − D⁻¹A) x + D⁻¹ b

A = blocktridiag(−I, B, −I) (block tridiagonal),   B = tridiag(−1, 4, −1),   D = diag(4, 4, ..., 4)


7. Multiprocessor model: IX.

Fine grain example 1: pointwise Jacobi iteration (continued)

x⁺(i,j) = x(i,j) + (b(i,j) + x(i−1,j) + x(i,j−1) + x(i+1,j) + x(i,j+1) − 4*x(i,j)) / 4

[Figure: 5×5 grid of unknowns (indices i, j); in the Jacobi sweep every grid point carries the label 1, i.e. all points can be updated simultaneously]


7. Multiprocessor model: X.

Fine grain example 1: pointwise Gauss-Seidel iteration

x⁺ = (I − (D + L)⁻¹A) x + (D + L)⁻¹ b

x⁺(i,j) = x(i,j) + (b(i,j) + x⁺(i−1,j) + x⁺(i,j−1) + x(i+1,j) + x(i,j+1) − 4*x(i,j)) / 4

[Figure: 5×5 grid of unknowns (indices i, j); in the Gauss-Seidel sweep the points are labeled 1–9 along anti-diagonal wavefronts (1, 2, 2, 3, 3, 3, ...), and only points with the same label can be updated in parallel]


7. Multiprocessor model: XI.

Granularity

The concept of granularity can be generalized to: decomposition of the computation

Problem decomposition (I/IV)

recursive decomposition: divide and conquer strategy

– example: the sorting algorithm quicksort

– select an entry (the pivot) in the sequence to be sorted

– partition the sequence into two subsequences


7. Multiprocessor model: XII.

Problem decomposition (II/IV)

One step of quicksort – basic scheme

3 1 7 2 5 8 6 4 3

1 2 3 7 5 8 6 4 3


7. Multiprocessor model: XIII.

Problem decomposition (III/IV)

data decomposition: split the problem data

– example: matrix multiplication

( A11  A12 ) ( B11  B12 )  →  ( C11  C12 )
( A21  A22 ) ( B21  B22 )     ( C21  C22 )


7. Multiprocessor model: XIV.

Problem decomposition (IV/IV)

exploratory decomposition: split the search space

– used, e.g., in the approximate solution of NP-hard combinatorial optimization problems

speculative, random decompositions

– example: evaluating branch instructions before the branch condition is evaluated

hybrid decomposition: first recursive decomposition into large chunks, later data decomposition


7. Multiprocessor model: XV.

Load balancing

static mappings

– 1. data block distribution schemes

– example: matrix multiplication

– n: matrix dimension; p: number of processors

– 1D block distribution: processors own row blocks of the matrix; each one has n/p of the rows

– 2D block distribution: processors own blocks of size (n/√p) × (n/√p), partitioned by both rows and columns

– input, intermediate, output block data distributions


7. Multiprocessor model: XVI.

Load balancing: 1D versus 2D matrix distribution for a matmat (matrix-matrix multiplication)

[Diagram: 1D partitioning versus 2D partitioning of the matrix among the processes]

shared data: 1D: n²/p + n², 2D: O(n²/√p)


7. Multiprocessor model: XVII.

Load balancing: data block distribution schemes for matrix algorithms with nonuniform work with respect to the ordering of indices

example: LU decomposition

cyclic and block-cyclic distributions

1D and 2D block cyclic distribution


7. Multiprocessor model: XVIII.

Load balancing: other static mappings

randomized block distributions

– useful, e.g. for sparse or banded matrices

graph partitioning

– an application based input block data distribution

hierarchical static mappings

task-based partitionings


7. Multiprocessor model: XIX.

Load balancing: dynamic mappings

centralized schemes

– master: a special process managing pool of available tasks

– slave: processors performing tasks from the pool

– self-scheduling (choosing tasks in independent demands)

– controlled scheduling (the master is involved in providing tasks)

– chunk scheduling (slaves take a block of tasks)

distributed schemes

– more freedom, more duties

– synchronization between sender and receiver

– initiation of tasks


7. Multiprocessor model: XX.

User point of view: tools

the most widespread message passing model: the MPI paradigm

– supports execution of different programs on each of the processors

– enables an easy description using the SPMD approach: a way to make the job of program writing efficient

– simple parallelization with calls to a library

other message passing model: PVM

– some enhancements but less efficient

Posix Thread API

Shared-memory OpenMP API


7. Multiprocessor model: XXI.

Example: basic MPI routines

MPI_init(ierr)

MPI_finalize(ierr)

MPI_comm_rank(comm,rank,ierr)

MPI_comm_size(comm,size,ierr)

MPI_send(buf,n,type,dest,tag,comm,ierr)

MPI_recv(buf,n,type,srce,tag,comm,status,ierr)

MPI_bcast(buf,n,type,root,comm,ierr)

MPI_REDUCE(sndbuf,rcvbuf,1,type,op,0,comm,ierr)


7. Multiprocessor model: XXII.

c******************************************************************
c pi.f - compute pi by integrating f(x) = 4/(1 + x**2)

c (rewritten from the example program from MPICH, ANL)

c

c Each node:

c 1) receives the number of rectangles used in the approximation.

c 2) calculates the areas of its rectangles.

c 3) Synchronizes for a global summation.

c Node 0 prints the result.

c

program main

include ’mpif.h’

double precision PI25DT

parameter (PI25DT = 3.141592653589793238462643d0)

double precision mypi, pi, h, sum, x, f, a

integer n, myid, numprocs, i, rc


7. Multiprocessor model: XXIII.

c function

f(a) = 4.d0 / (1.d0 + a*a)

c init

call MPI_INIT( ierr )

c who am I?

call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )

c how many of us?

call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

print *, "Process ", myid, " of ", numprocs, " is alive"

c

10 if ( myid .eq. 0 ) then

write(*,*) ’Enter the number of intervals: (0 quits)’

read(*,*) n

endif


7. Multiprocessor model: XXIV.

c distribute dimension

call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)

c check for the quit signal (n = 0 quits, cf. the prompt above)

if ( n .le. 0 ) go to 30

c calculate the interval size

h = 1.0d0/n

c

sum = 0.0d0

do i = myid+1, n, numprocs

x = h * (dble(i) - 0.5d0)

sum = sum + f(x)

end do

mypi = h * sum

c collect all the partial sums

call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,

$ MPI_COMM_WORLD,ierr)


7. Multiprocessor model: XXV.

c node 0 prints the answer.

if (myid .eq. 0) then

write(6, 97) pi, abs(pi - PI25DT)

97 format(’ pi is approximately: ’, F18.16,

+ ’ Error is: ’, F18.16)

endif

go to 10

30 call MPI_FINALIZE(rc)

stop

end


7. Multiprocessor model: XXVI.

Linear algebra standardization and multiprocessor model

BLACS: Basic Linear Algebra Communication Subroutines (a low level of concurrent programming)

PBLAS: Parallel BLAS: the "parallel" info is transferred via a descriptor array

ScaLAPACK: a library of high-performance linear algebra for message passing architectures

All of these are based on the message-passing primitives


7. Multiprocessor model: XXVII.

Dependency tree for high-performance linear algebra software

[Diagram: dependency tree relating ScaLAPACK, PBLAS, BLACS, LAPACK, BLAS and MPI/PVM]


8. Basic parallel operations

Dense matrix-vector multiplication

Algorithm 1: sequential matrix-vector multiplication y = Ax
for i = 1, ..., n
   y_i = 0
   for j = 1, ..., n
      y_i = y_i + a_ij * x_j
   end j
end i

a) rowwise 1-D partitioning

b) 2-D partitioning


8. Basic parallel operations: II.

Dense matrix-vector multiplication: rowwise 1-D partitioning

[Diagram: rowwise 1-D partitioning – processes P0, ..., P5 each own one row of A and one entry of x; the whole vector x must be made available to every process]

Communication: all-to-all communication among n processors (P0, ..., P_{n-1}); Θ(n) for a piece of communication

Multiplication: Θ(n)

Altogether: Θ(n) parallel time, Θ(n²) process time: cost optimal (asymptotically the same number of operations when sequentialized)


8. Basic parallel operations: III.

Dense matrix-vector multiplication: block-rowwise 1-Dpartitioning

Blocks of size n/p, the matrix striped block-rowwise, the vectors x and y split into subvectors of length n/p.

Communication: all-to-all communication among p processors (P0, ..., P_{p-1}): time ts log(p) + tw (n/p)(p − 1) ≈ ts log(p) + tw n (using a rather general assumption on the implementation of collective communications).

Multiplication: n²/p

Altogether: n²/p + ts log(p) + tw n parallel time; cost optimal for p = O(n) (asymptotically the same number of operations as in the sequential case).
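Why p = O(n) gives cost optimality here (a worked check of the claim): the total cost is p times the parallel time, i.e. n² + ts*p*log(p) + tw*n*p; for p = O(n) the last two terms are O(n log n) and O(n²), so the cost remains Θ(n²), the same as the sequential operation count.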


8. Basic parallel operations: IV.

Dense matrix-vector multiplication: 2-D partitioning

[Diagram: 2-D partitioning with one matrix element per process; the processes form an n × n grid and the vector x is aligned along one of the dimensions]

Communication I: align the vector x

Communication II: one-to-all broadcast among the n processors of each column: Θ(log n)

Communication III: all-to-one reduction in the rows: Θ(log n)

Multiplication: 1

Altogether: Θ(log n) parallel time; process time Θ(n² log n). The algorithm is not cost optimal.

M. Tuma 85

8. Basic parallel operations: V.

Dense matrix-vector multiplication: block 2-D partitioning

Figure: the matrix distributed in √p × √p blocks of size n/√p among the processes P0, P1, P2, ..., with the vector x distributed along the process grid.

Multiplication: n²/p
Aligning the vector: ts + tw n/√p
Columnwise one-to-all broadcast: (ts + tw n/√p) log(√p)
All-to-one reduction: (ts + tw n/√p) log(√p)

Altogether: n²/p + ts log p + tw (n/√p) log p parallel time

The algorithm is cost optimal for p = O(n).

M. Tuma 86

8. Basic parallel operations: VI.

Dense matrix-matrix multiplication: 2-D partitioning

Figure: both matrices distributed in √p × √p blocks among the processes P0, P1, P2, ...

Communication: two all-to-all broadcast steps.
Each with √p concurrent broadcasts among groups of √p processes.
Total communication time: 2(ts log(√p) + tw (n²/p)(√p − 1)) ≈ ts log p + 2 tw n²/√p
Multiplications of matrices of dimension n/√p, √p times.

Altogether: n³/p + ts log p + 2 tw n²/√p parallel time. The algorithm is cost optimal for p = O(n²).

Large memory consumption: each process has √p blocks of size Θ(n²/p).

M. Tuma 87

8. Basic parallel operations: VII.

Dense matrix-matrix multiplication: Cannon's algorithm

Figure: the 4 × 4 block decompositions of A and B, shown before and after the initial alignment of Cannon's algorithm (blocks of A shifted cyclically within their block rows, blocks of B within their block columns).

Memory-efficient version of matrix-matrix multiplication.

Parallel time and cost-optimality asymptotically the same as for the simple 2-D algorithm.

It is possible to use n³/log n processes to get Θ(log n) parallel time (Dekel, Nassimi, Sahni) (not cost optimal).

There exists also a fast cost-optimal variant
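The following Python/numpy sketch simulates Cannon's algorithm on one machine: the √p × √p process grid is represented by a 2-D array of blocks and the cyclic shifts by list rotations. Block sizes and names (q, to_blocks, ...) are illustrative assumptions; no real communication is performed.

import numpy as np

def cannon_matmul(A, B, q):
    """C = A B via Cannon's algorithm on a simulated q x q process grid."""
    n = A.shape[0]
    assert n % q == 0
    b = n // q                                       # block size

    def to_blocks(M):
        return [[M[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)]
                for i in range(q)]

    Ab, Bb = to_blocks(A), to_blocks(B)
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]

    # Initial alignment: shift block row i of A left by i,
    # and block column j of B up by j (cyclically).
    for i in range(q):
        Ab[i] = Ab[i][i:] + Ab[i][:i]
    for j in range(q):
        col = [Bb[i][j] for i in range(q)]
        col = col[j:] + col[:j]
        for i in range(q):
            Bb[i][j] = col[i]

    # q steps: local multiply-add, then shift A left and B up by one block.
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        for i in range(q):
            Ab[i] = Ab[i][1:] + Ab[i][:1]
        for j in range(q):
            col = [Bb[i][j] for i in range(q)]
            col = col[1:] + col[:1]
            for i in range(q):
                Bb[i][j] = col[i]

    return np.block(Cb)

if __name__ == "__main__":
    n, q = 8, 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(cannon_matmul(A, B, q), A @ B)

The memory efficiency is visible in the sketch: each simulated process keeps exactly one block of A and one block of B at any time, instead of the √p blocks needed by the simple 2-D algorithm.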

M. Tuma 88

8. Basic parallel operations: VIII.

Gaussian elimination (here the kij case of LU factorization)

Figure: the active part of the matrix at step k, with the pivot (k,k), the pivot row entries (k,j), the column entries (i,k), and the updated entries (i,j).

a(k,j) = a(k,j) / a(k,k)
a(i,j) = a(i,j) − a(i,k) * a(k,j)

Sequential time complexity: 2/3 n³ + O(n²)
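For reference, a short Python/numpy sketch of the sequential kij loop exactly as written above (pivot row scaled by the pivot, no pivoting); it produces a Crout-style factorization with a unit upper triangular factor. Names are illustrative.

import numpy as np

def lu_kij(A):
    """kij elimination following the update formulas above
    (no pivoting); returns the overwritten matrix of factor entries."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k, k+1:] /= A[k, k]                    # a(k,j) = a(k,j) / a(k,k)
        for i in range(k + 1, n):                # a(i,j) = a(i,j) - a(i,k)*a(k,j)
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
    return A

if __name__ == "__main__":
    A = np.array([[4., 2., 1.],
                  [2., 5., 3.],
                  [1., 3., 6.]])
    F = lu_kij(A)
    U = np.triu(F, 1) + np.eye(len(A))           # unit upper triangular factor
    L = np.tril(F)                               # lower triangular factor
    assert np.allclose(L @ U, A)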

M. Tuma 89

8. Basic parallel operations: IX.

Standard Gaussian elimination: 1-D partitioning

Figure: snapshots of the matrix during the elimination with rowwise 1-D partitioning; at each step the current pivot row is communicated to the processes holding the remaining rows.

Computation: 3 Σ_{k=0}^{n−1} (n − k − 1) = 3n(n − 1)/2

Parallel time: 3n(n − 1)/2 + ts n log n + (1/2) tw n(n − 1) log n

This is not cost-optimal, since the total time is Θ(n³ log n).

M. Tuma 90

8. Basic parallel operations: X.

Pipelined Gaussian elimination: 1-D partitioning

Figure: successive snapshots of the matrix during the pipelined elimination; communication and elimination steps belonging to different pivot rows overlap.

M. Tuma 91

8. Basic parallel operations: XI.

Pipelined Gaussian elimination: 1-D partitioning (continued)

Total number of steps: Θ(n)
Operations, each of O(n) time complexity:
- communication of O(n) entries
- division of O(n) entries by a scalar
- elimination step on O(n) entries
Parallel time: O(n²); total time: O(n³).
Not the same constant in the asymptotic complexity as in the sequential case: some processors are idle.

M. Tuma 92

8. Basic parallel operations: XII.

Gaussian elimination: further issues

2-D partitioning: Θ(n³) total time for n² processes.
Block 2-D partitioning: Θ(n³/p) total time for p processes.
2-D partitionings: generally more scalable (allow efficient use of more processors).
Pivoting: changes the layout of the elimination.
Partial pivoting: no problem in 1-D rowwise partitioning: O(n) search in each row.
It might seem that it is better with 1-D columnwise partitioning: O(log p) search. But this puts strong restrictions on pipelining.
Weaker variants of pivoting (e.g., pairwise pivoting) may result in strong degradation of the numerical quality of the algorithm.

M. Tuma 93

8. Basic parallel operations: XIII.

Solving triangular systems: back-substitution

sequential back-substitution for U*x = y (one possible order of operations):

      do k = n, 1, -1              ! backwards over the unknowns
         x(k) = y(k) / U(k,k)
         do i = k-1, 1, -1         ! update the remaining right-hand side
            y(i) = y(i) - x(k) * U(i,k)
         end do
      end do

Sequential complexity: n²/2 + O(n)

Rowwise block 1-D partitioning: constant communication, O(n/p) computation per step, Θ(n) steps: total time Θ(n²/p).

Block 2-D partitioning: Θ(n²√p) total time.

M. Tuma 94

8. Basic parallel operations: XIV.

Solving linear recurrences: a case of the parallel prefix operation

Parallel prefix operation:

Get y0 = x0, y1 = x0 ♥ x1, ..., yi = x0 ♥ x1 ♥ ... ♥ xi for an associative operation ♥.

Figure: the prefix computation over 8 elements organized as a tree; pairwise results 0:1, 2:3, 4:5, 6:7 are combined into 0:3 and 4:7 and then into the prefixes 0:2, 0:4, 0:5, 0:6, 0:7.

M. Tuma 95

8. Basic parallel operations: XIV.

Parallel prefix operation (continued)

Application to the recurrence z_{i+1} = a_i z_i + b_i:

Get p_i = a_0 ⋯ a_i using the parallel prefix operation.
Compute β_i = b_i / p_i in parallel.
Compute s_i = β_0 + ... + β_{i−1} using the parallel prefix operation.
Compute z_i = s_i p_{i−1} in parallel.
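A small Python sketch of these four steps, assuming z_0 = 0 and nonzero coefficients a_i (assumptions made here, not stated on the slide). The prefix operation is written as a recursive-doubling sweep, i.e. O(log n) parallel steps if the updates of one sweep were executed concurrently; the result is checked against the direct recurrence.

from operator import add, mul

def prefix(xs, op):
    """Inclusive prefix (scan) of xs under the associative operation op,
    organized as recursive doubling sweeps."""
    ys = list(xs)
    d = 1
    while d < len(ys):
        # in a parallel setting, all updates of one sweep happen at once
        ys = [ys[i] if i < d else op(ys[i - d], ys[i]) for i in range(len(ys))]
        d *= 2
    return ys

def solve_recurrence(a, b):
    """z_{i+1} = a_i z_i + b_i with z_0 = 0, via two prefix operations."""
    p = prefix(a, mul)                            # p_i = a_0 * ... * a_i
    beta = [bi / pi for bi, pi in zip(b, p)]      # beta_i = b_i / p_i
    s = prefix(beta, add)                         # s[i] = beta_0 + ... + beta_i
    # slide: z_i = s_i p_{i-1}; with the inclusive scan, z_{i+1} = s[i] * p[i]
    return [0.0] + [s[i] * p[i] for i in range(len(a))]

if __name__ == "__main__":
    a = [2.0, 0.5, 3.0, 1.5]
    b = [1.0, 4.0, -2.0, 0.5]
    z = solve_recurrence(a, b)
    zz = [0.0]                                    # direct recurrence for comparison
    for ai, bi in zip(a, b):
        zz.append(ai * zz[-1] + bi)
    assert all(abs(x - y) < 1e-12 for x, y in zip(z, zz))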

M. Tuma 96

8. Basic parallel operations: XV.

Conclusion for basic parallel operations

Still far from contemporary scientific computing.
There are large dense matrices in practical problems, but a lot can be performed by ready-made scientific software like ScaLAPACK.
Problems are:
- sparse: O(n) sequential steps may be too many. But contemporary sparse matrix software strongly relies on dense blocks connected by a general sparse structure.
- very often unstructured: operations with general graphs and specialized combinatorial routines should be implemented efficiently, and in a way that is efficient on a wide spectrum of computer architectures.
- not homogeneous, in the sense that completely different parallelization techniques should be used in the implementations.

M. Tuma 97

9. Parallel solvers of linear algebraic equations

Basic classification of (sequential) solvers

Ax = b

Our case of interest:

A is large.

A is, fortunately, most often, sparse.

Different classes of methods for solving the system, with various advantages and disadvantages:

Gaussian elimination → direct methods
CG method → Krylov space iterative methods
(+) multilevel information transfer

M. Tuma 98

9. Parallel solvers of linear algebraic equations: II.

Hunt for extreme parallelism: Algorithm by Csanky

Compute the powers of A: A², A³, ..., Aⁿ⁻¹ (O(log² n) complexity).
Compute the traces s_k = tr(Aᵏ) of the powers (O(log n) complexity).
Solve the Newton identities for the coefficients p_k of the characteristic polynomial (O(log² n)):

    k p_k = s_k − p_1 s_{k−1} − ... − p_{k−1} s_1,   k = 1, ..., n

(a lower triangular linear system for p_1, ..., p_n).
Compute the inverse using the Cayley-Hamilton theorem (O(log² n)).
Horribly unstable.
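A dense numpy sketch of Csanky's construction (sequential here; the point of the algorithm is that every stage can be executed in polylogarithmic parallel time). The sign convention follows the slide: writing the characteristic polynomial as λⁿ − p_1 λⁿ⁻¹ − ... − p_n, the Newton identities give the p_k from the traces and the Cayley-Hamilton theorem yields A⁻¹. The example matrix is chosen small and well conditioned on purpose; as stated above, the process is numerically useless in general.

import numpy as np

def csanky_inverse(A):
    """Invert A via traces of powers, Newton identities and Cayley-Hamilton.
    Numerically unstable; for illustration only."""
    n = A.shape[0]
    powers = [np.eye(n), A.copy()]               # powers[k] = A^k
    for _ in range(n - 1):
        powers.append(powers[-1] @ A)
    s = [np.trace(powers[k]) for k in range(n + 1)]   # s_k = tr(A^k)

    # Newton identities: k p_k = s_k - p_1 s_{k-1} - ... - p_{k-1} s_1
    p = [0.0] * (n + 1)
    for k in range(1, n + 1):
        p[k] = (s[k] - sum(p[j] * s[k - j] for j in range(1, k))) / k

    # Cayley-Hamilton: A^n = p_1 A^{n-1} + ... + p_n I, hence
    # A^{-1} = (A^{n-1} - p_1 A^{n-2} - ... - p_{n-1} I) / p_n
    Ainv = powers[n - 1].copy()
    for k in range(1, n):
        Ainv -= p[k] * powers[n - 1 - k]
    return Ainv / p[n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((5, 5)) + 5 * np.eye(5)       # small, well conditioned
    assert np.allclose(csanky_inverse(A), np.linalg.inv(A))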

M. Tuma 99

9. Parallel solvers of linear algebraic equations: III.

Typical key operations in the skeleton of Krylov subspace methods

1. Matrix-vector multiplication (one right-hand side) with a sparse matrix.
2. Matrix-matrix multiplications (more right-hand sides), the first matrix is sparse.
3. Sparse matrix-matrix multiplications.
4. Preconditioning operation (we will explain preconditioning later).
5. Orthogonalization in some algorithms (GMRES).
6. Some standard dense stuff (saxpys, dot products, norm computations).
7. Overlapping communication and computation. It sometimes changes the numerical properties of the implementation.

M. Tuma 100

9. Parallel solvers of linear algebraic equations: IV.

The system matrix becomes sparse: more possible data structures

Figure: the same sparsity pattern stored as a band (band 6), as a profile/envelope (profile 6), and by the frontal method (a dynamic band, i.e. a moving window).

M. Tuma 101

9. Parallel solvers of linear algebraic equations: V.

General sparsity structure can be reasonably treated.

Banded and envelope paradigms often lead to slower algorithms, e.g., when the matrices have to be decomposed.

Machines often support gather/scatter, useful with the indirect addressing connected to sparse matrices.

Generally sparse data structures are typically preferred.

Figure: sparsity pattern of a matrix; entries marked f denote fill-in created during a decomposition.
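As one concrete example of a generally sparse data structure, here is a Python sketch of the compressed sparse row (CSR) format and of a CSR matrix-vector product. The slides do not fix a particular format, so this is just a common choice with illustrative names; note the gather of the x entries via the column indices, which is exactly where hardware gather/scatter support helps.

import numpy as np

def dense_to_csr(A, tol=0.0):
    """Return the CSR arrays (row pointers, column indices, values) of A."""
    indptr, indices, values = [0], [], []
    for row in A:
        for j, aij in enumerate(row):
            if abs(aij) > tol:
                indices.append(j)
                values.append(aij)
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(values)

def csr_matvec(indptr, indices, values, x):
    """y = A x using only the stored nonzero entries."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = values[lo:hi] @ x[indices[lo:hi]]   # gather of x entries
    return y

if __name__ == "__main__":
    A = np.array([[4., 0., 1., 0.],
                  [0., 3., 0., 0.],
                  [1., 0., 5., 2.],
                  [0., 0., 2., 6.]])
    x = np.array([1., 2., 3., 4.])
    indptr, indices, values = dense_to_csr(A)
    assert np.allclose(csr_matvec(indptr, indices, values, x), A @ x)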

M. Tuma 102

9. Parallel solvers of linear algebraic equations: VI.

General sparsity structure can be reasonably treated.
Scheduling for parallel computation is not straightforward.

(Some) issues useful for linear algebraic solvers:

1. Sparse fill-in minimizing reorderings
2. Graph partitioning
3. Reordering the matrix for matvecs with 1-D / 2-D partitioning
4. Sparse matrix-matrix multiplication
5. Some ideas from preconditioning

M. Tuma 103

9. Parallel solvers of linear algebraic equations: VII.

Sparse fill-in minimizing reorderings

Static: this distinguishes them from dynamic reordering strategies (pivoting).

Two basic types:
- local reorderings: based on a local greedy criterion
- global reorderings: taking into account the whole graph / matrix


M. Tuma 104

9. Parallel solvers of linear algebraic equations: VIII.

Local fill-in minimizing reorderings: MD: the basic algorithm.

G = G(A)
for i = 1 to n do
   find v such that deg_G(v) = min_{u ∈ V} deg_G(u)
   G = G_v
end i

The order of the found vertices induces their new numbering.

deg(v) = |Adj(v)|; the graph G attached to deg determines the current graph, and G_v is the elimination graph obtained by eliminating v.
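A small Python sketch of this basic loop, with the graph kept as a dictionary of adjacency sets; eliminating v makes Adj(v) a clique, which is what forming the elimination graph G_v means here. The names are illustrative, and real minimum degree codes rely on much more elaborate machinery (quotient graphs, mass elimination, approximate degrees).

def minimum_degree_ordering(adj):
    """Basic minimum degree ordering.

    adj: dict mapping each vertex to the set of its neighbours.
    Returns the vertices in elimination order (their new numbering)."""
    g = {v: set(nb) for v, nb in adj.items()}    # work on a copy
    order = []
    while g:
        v = min(g, key=lambda u: len(g[u]))      # vertex of minimum current degree
        nbrs = g[v]
        for u in nbrs:                           # form G_v: neighbours become a clique
            g[u].update(nbrs - {u})
            g[u].discard(v)
        del g[v]                                 # ... and v is removed
        order.append(v)
    return order

if __name__ == "__main__":
    # a small example graph given by its adjacency sets
    adj = {1: {2}, 2: {1, 3, 5}, 3: {2, 4}, 4: {3, 5}, 5: {2, 4}}
    print(minimum_degree_ordering(adj))          # prints the elimination order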

M. Tuma 105

9. Parallel solvers of linear algebraic equations: IX.

MD: the basic algorithm: example.

Figure: a graph G with a minimum degree vertex v, and the elimination graph G_v obtained after eliminating v (the neighbours of v become pairwise adjacent).

M. Tuma 106

9. Parallel solvers of linear algebraic equations: X.

Global reorderings: the ND (nested dissection) algorithm (George, 1973)

Find a separator.
Reorder the matrix, numbering the nodes in the separator last.
Do it recursively.

Figure: a vertex separator S splitting the graph into components C_1 and C_2.
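A toy Python sketch of the recursion for an m × m grid graph, using the middle row or column of the current subgrid as the separator at every level; this choice of separator is an assumption made here for simplicity, whereas general nested dissection uses graph partitioners. The printed array is the new numbering of the grid nodes.

def nested_dissection_order(rows, cols):
    """Recursive nested dissection ordering of a grid graph given by the
    index lists rows x cols; separator nodes are numbered last."""
    if len(rows) <= 2 and len(cols) <= 2:
        return [(i, j) for i in rows for j in cols]
    if len(cols) >= len(rows):                   # split across the longer side
        m = len(cols) // 2
        left, sep, right = cols[:m], [cols[m]], cols[m + 1:]
        return (nested_dissection_order(rows, left)
                + nested_dissection_order(rows, right)
                + [(i, j) for i in rows for j in sep])
    m = len(rows) // 2
    top, sep, bottom = rows[:m], [rows[m]], rows[m + 1:]
    return (nested_dissection_order(top, cols)
            + nested_dissection_order(bottom, cols)
            + [(i, j) for i in sep for j in cols])

if __name__ == "__main__":
    m = 7
    order = nested_dissection_order(list(range(m)), list(range(m)))
    numbering = {node: k + 1 for k, node in enumerate(order)}
    for i in range(m):                           # print the new numbering of the grid
        print(" ".join(f"{numbering[(i, j)]:2d}" for j in range(m)))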

M. Tuma 107

9. Parallel solvers of linear algebraic equations: XI.

ND algorithm after one level of recursion

Figure: the graph split by the separator S into C_1 and C_2, and the correspondingly reordered matrix with the blocks of C_1 and C_2 first and the separator block S last.

M. Tuma 108

9. Parallel solvers of linear algebraic equations: XII.

ND algorithm after more levels of recursion

Figure: nested dissection numbering of a 7 × 7 grid graph; the top-level separator (the middle column, numbers 43-49) is numbered last.

M. Tuma 109

9. Parallel solvers of linear algebraic equations: XIII.

Static reorderings: summary

The most useful strategy: combining local and global reorderings.
Modern nested dissections are based on graph partitioners: partition a graph such that
- the components have very similar sizes,
- the separator is small.
Can be correctly formulated and solved for a general graph.
Theoretical estimates for fill-in and number of operations.
Modern local reorderings: used after a few steps of an incomplete nested dissection.

M. Tuma 110

9. Parallel solvers of linear algebraic equations: XIV

Graph partitioning

The goal: separate a given graph into pieces of similar sizes having small separators.

TH: Let G = (V, E) be a planar graph. Then we can find a vertex separator S = (V_S, E_S) which divides V into two disjoint sets V_1 and V_2 such that max(|V_1|, |V_2|) ≤ (2/3)|V| and |V_S| ≤ 2√(3|V|).

Many different strategies for general cases.
Recursive bisections or k-sections.
Sometimes for weighted graphs.

M. Tuma 111

9. Parallel solvers of linear algebraic equations: XV

Graph partitioning: classification of a few basic approaches

1. Kernighan-Lin algorithm
2. Level-structure partitioning
3. Inertial partitioning
4. Spectral partitioning
5. Multilevel partitioning

M. Tuma 112

9. Parallel solvers of linear algebraic equations: XVI.

Graph partitioning: Kernighan-Lin (1970)

Partitioning by local searches.
Often used for improving partitions provided by other algorithms.
More efficient implementation by Fiduccia and Mattheyses, 1982.

The intention

Start with a graph G = (V, E) with edge weights w : E → R+ and a partitioning V = V_A ∪ V_B.

Find X ⊂ V_A and Y ⊂ V_B such that the new partition V = (V_A ∪ Y \ X) ∪ (V_B ∪ X \ Y) reduces the total cost of edges between V_A and V_B given by

    COST = Σ_{a ∈ V_A, b ∈ V_B} w(a, b).

M. Tuma 113

9. Parallel solvers of linear algebraic equations: XVII.

Graph partitioning: Kernighan-Lin: II.

Monitoring the gains in COST when exchanging a pair of vertices: gain(a, b) for a ∈ V_A and b ∈ V_B is given by

    gain(a, b) = E(a) − I(a) + E(b) − I(b) − 2 w(a, b),

where E(x) and I(x) denote the external and internal cost of x ∈ V, respectively.

Figure: a vertex a ∈ V_A and a vertex b ∈ V_B; I(a) is the weight of the edges connecting a to vertices inside V_A, and E(a) the weight of the edges connecting a to V_B.

M. Tuma 114

9. Parallel solvers of linear algebraic equations: XVIII.

Graph partitioning: Kernighan-Lin: III. The algorithm

Algorithm 2 Kernighan-Lin
compute COST of the initial partition
until GAIN ≤ 0
   for all nodes x compute E(x) and I(x)
   unmark all nodes
   while there are unmarked nodes do
      find a suitable pair a, b of unmarked vertices from different partitions maximizing gain(a, b)
      mark a, b
   end while
   find GAIN maximizing the partial sums of the gains computed in the loop
   if GAIN > 0 then update the partition
end until
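A small Python sketch of the quantities used by the algorithm: the COST of a partition, the external and internal costs E(x), I(x), and gain(a, b) for one candidate swap, with the weighted graph stored as a dictionary of edge weights. Names are illustrative; a full Kernighan-Lin pass would wrap these routines in the loops of Algorithm 2.

def cost(w, VA, VB):
    """Total weight of the edges crossing the partition (VA, VB)."""
    return sum(wij for (i, j), wij in w.items() if (i in VA) != (j in VA))

def ext_int(w, x, own, other):
    """External and internal cost E(x), I(x) of a vertex x."""
    E = sum(wij for (i, j), wij in w.items()
            if x in (i, j) and (j if i == x else i) in other)
    I = sum(wij for (i, j), wij in w.items()
            if x in (i, j) and (j if i == x else i) in own)
    return E, I

def gain(w, a, b, VA, VB):
    """gain(a, b) = E(a) - I(a) + E(b) - I(b) - 2 w(a, b)."""
    Ea, Ia = ext_int(w, a, VA, VB)
    Eb, Ib = ext_int(w, b, VB, VA)
    wab = w.get((a, b), w.get((b, a), 0.0))
    return Ea - Ia + Eb - Ib - 2.0 * wab

if __name__ == "__main__":
    # undirected weighted graph; each edge stored once
    w = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 3.0, (2, 3): 1.0}
    VA, VB = {1, 2}, {3, 4}
    g = gain(w, 2, 3, VA, VB)
    # exchanging 2 and 3 changes COST by exactly -gain(2, 3)
    assert abs((cost(w, VA, VB) - cost(w, {1, 3}, {2, 4})) - g) < 1e-12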

M. Tuma 115

9. Parallel solvers of linear algebraic equations: XIX.

Graph partitioning: Level structure algorithms

Based on the breadth-first search.
Simple, but often not very good.
Can be improved, for example, by the KL algorithm.

M. Tuma 116

9. Parallel solvers of linear algebraic equations: XVI.

Graph partitioning: Inertial algorithm

Deals with graphs and their node coordinates.
Divides the set of graph nodes by a line (2D) or a plane (3D).

The strategy in 2D:
Choose a line a(x − x_0) + b(y − y_0) = 0, a² + b² = 1. It has slope −a/b and goes through (x_0, y_0).
Compute the distances c_i of the nodes (x_i, y_i) from the line.
Compute the positions d_i = a(y_i − y_0) − b(x_i − x_0) of the projections of the nodes (x_i, y_i) onto the line, measured from (x_0, y_0).
Find the median d of these distances d_i.
Divide the nodes according to this median into two groups.
How to choose the line?

M. Tuma 117

9. Parallel solvers of linear algebraic equations: XVII.

Graph partitioning: Inertial algorithm: II.

Figure: illustration of the 2D inertial partitioning.

M. Tuma 118

9. Parallel solvers of linear algebraic equations: XVIII.

Graph partitioning: Inertial algorithm: III.

Some more explanation for the 2D case

Finding a line such that the sum of squares of the distances of the nodes from it is minimized.
This is a total least squares problem.
Considering the nodes as mass units, the line taken as the axis should minimize the moment of inertia among all possible lines.
Mathematically:

    Σ_{i=1}^{n} c_i² = Σ_{i=1}^{n} ( (x_i − x_0)² + (y_i − y_0)² − (a(y_i − y_0) − b(x_i − x_0))² ) = (a, b) M (a, b)ᵀ

That is, a small eigenvalue problem.
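A numpy sketch of this computation: center the coordinates, assemble the 2 × 2 moment matrix M, take the eigenvector belonging to its smallest eigenvalue as the normal (a, b) of the line, and split the nodes at the median of the projections d_i. Names are illustrative, and (x_0, y_0) is taken here as the centroid of the nodes.

import numpy as np

def inertial_partition(xy):
    """Split 2-D points into two halves by the inertial method.

    xy: (n, 2) array of node coordinates; returns a boolean mask."""
    center = xy.mean(axis=0)                     # (x0, y0): the centroid
    X = xy - center
    # 2 x 2 moment matrix: (a, b) M (a, b)^T = sum_i (a x_i + b y_i)^2
    M = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(M)
    a, b = eigvecs[:, 0]                         # eigenvector of the smallest eigenvalue
    d = a * X[:, 1] - b * X[:, 0]                # d_i = a(y_i - y0) - b(x_i - x0)
    return d <= np.median(d)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.normal(size=(100, 2)) * np.array([5.0, 1.0])   # elongated point cloud
    mask = inertial_partition(pts)
    print(mask.sum(), (~mask).sum())             # two groups of (roughly) equal size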

M. Tuma 119

9. Parallel solvers of linear algebraic equations: XIX.

Spectral partitioning

DF: The Laplacian matrix of an undirected unweighted graph G = (V, E) is given by

    L(G) = AᵀA,

where A is its incidence (edge by vertex) matrix. Namely,

    L(G)_{ij} = degree of node i   for i = j,
                −1                 for (i, j) ∈ E, i ≠ j,
                0                  otherwise.

Then

    xᵀ L x = xᵀ AᵀA x = Σ_{(i,j) ∈ E(G)} (x_i − x_j)².

L is positive semidefinite.

M. Tuma 120

9. Parallel solvers of linear algebraic equations: XX.

Spectral partitioning: examples of Laplacians

Figure: an example graph on 5 nodes (node degrees 2, 2, 3, 3, 2) together with its incidence matrix A and the resulting Laplacian L(G) = AᵀA; the node degrees appear on the diagonal of L(G) and each edge contributes a pair of −1 entries off the diagonal.

M. Tuma 121

9. Parallel solvers of linear algebraic equations: XX.

Spectral partitioning

The Laplacian corresponding to the graph of a connected mesh has the eigenvalue 0.
The eigenvector corresponding to this eigenvalue is (1, ..., 1)ᵀ/√n.
Denote by µ the second smallest eigenvalue of L(G). Then, from the Courant-Fischer theorem:

    µ = min { xᵀ L x | x ∈ Rⁿ, xᵀx = 1, xᵀ(1, ..., 1)ᵀ = 0 }.

Let V be partitioned into V⁺ and V⁻, and let x be the vector with x_i = 1 for i ∈ V⁺ and x_i = −1 otherwise.
TH: The number of edges connecting V⁺ and V⁻ is (1/4) xᵀ L(G) x:

    xᵀ L(G) x = Σ_{(i,j) ∈ E} (x_i − x_j)² = Σ_{(i,j) ∈ E, i ∈ V⁺, j ∈ V⁻} (x_i − x_j)² = 4 × (number of edges between V⁺ and V⁻).

M. Tuma 122

9. Parallel solvers of linear algebraic equations: XXI.

Spectral partitioning

Find the second eigenvector of the Laplacian.
Dissect by its values.
This is an approximation to the discrete optimization problem.

Multilevel partitioning: acceleration of the basic procedures

Multilevel nested dissection, multilevel spectral partitioning:
Approximate the initial graph G by a (simpler, smaller, cheaper) graph G'.
Partition G'.
Refine the partition from G' back to G.
Perform these steps recursively.
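A numpy sketch of the basic spectral step: assemble L(G), take the eigenvector belonging to the second smallest eigenvalue and dissect the vertices by its values (here at the median). This is only an illustration with assumed names; a multilevel implementation would apply the same step to a coarsened graph and refine the result.

import numpy as np

def laplacian(n, edges):
    """Laplacian of an undirected, unweighted graph on vertices 0..n-1."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

def spectral_bisection(n, edges):
    """Split the vertex set in two according to the second eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(laplacian(n, edges))  # ascending order
    second = eigvecs[:, 1]                                  # second smallest eigenvalue
    return second <= np.median(second)

if __name__ == "__main__":
    # two 4-cliques joined by a single edge; the spectral split separates them
    edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    edges += [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
    edges += [(3, 4)]
    mask = spectral_bisection(8, edges)
    print(np.nonzero(mask)[0], np.nonzero(~mask)[0])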

M. Tuma 123

9. Parallel solvers of linear algebraic equations: XXII.

Graph partitioning: problems with our model

Edge cuts are not proportional to the total communication volume.

Latencies of messages are typically more important than the volume.

In many cases, a minmax problem should be considered (minimizing the maximum communication cost).

Nonsymmetric partitions might be considered (bipartite graph model; hypergraph model).

The general rectangular problem should be considered.

Partitioning in parallel (there are papers and codes).

M. Tuma 124

9. Parallel solvers of linear algebraic equations:

Iterative methods

Stationary iterative methods:
- used in some previous examples
- typically not the methods of choice
- useful as auxiliary methods

Krylov space methods: see the course by Zdenek Strakos.

Simple iterative schemes driven by data decomposition: Schwarz methods.

Added hierarchical principle: not treated here.

M. Tuma 125

9. Parallel solvers of linear algebraic equations:

Iterative Schwarz methods

Ω = ⋃ Ω_i, d domains

Ω_i ∩ Ω_j ≠ ∅ : overlap

M. Tuma 126

9. Parallel solvers of linear algebraic equations:

Iterative Schwarz methods: II.

z^(0), z^(1), ...

A|_{Ω_j} = A_j = R_jᵀ A R_j
(R_j extracts the columns of I corresponding to the nodes in Ω_j)

r_j ≡ (b − A z^(k))|_{Ω_j} = R_jᵀ (b − A z^(k))

z^(k+i/d) = z^(k+(i−1)/d) + R_j A_j⁻¹ R_jᵀ (b − A z^(k+(i−1)/d)) ≡ z^(k+(i−1)/d) + B_j r^(k+(i−1)/d)

This is the Multiplicative Schwarz procedure.
Less parallel, more powerful.

M. Tuma 127

9. Parallel solvers of linear algebraic equations:

Iterative Schwarz methods: III.

z^(0), z^(1), ...

g groups of domains that do not overlap

x^(k+1/g)  = x^(k)          + Σ_{j ∈ group 1} B_j r^(k)
x^(k+2/g)  = x^(k+1/g)      + Σ_{j ∈ group 2} B_j r^(k+1/g)
...
x^(k+1)    = x^(k+(g−1)/g)  + Σ_{j ∈ group g} B_j r^(k+(g−1)/g)

z^(k+1) = z^(k) + Σ_j B_j r^(k)

This is the Additive Schwarz procedure: more parallel, less powerful per iteration.
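A matching NumPy sketch of the additive update: all local corrections are computed from the same residual, so the loop body is embarrassingly parallel. As a plain stationary iteration additive Schwarz generally needs damping (θ below); in practice it is mostly used as a preconditioner inside a Krylov method. Names and data are illustrative assumptions.

    import numpy as np

    def additive_schwarz_update(A, b, z, subdomains, theta=0.5):
        """One damped additive Schwarz update: all corrections use the same residual,
        so they can be computed independently (in parallel) and summed afterwards."""
        r = b - A @ z
        dz = np.zeros_like(z)
        for idx in subdomains:                      # independent local solves
            Aj = A[np.ix_(idx, idx)]
            dz[idx] += np.linalg.solve(Aj, r[idx])  # B_j r = R_j A_j^{-1} R_j^T r
        return z + theta * dz                       # damping; theta = 1 is the undamped update

    # same 1D Laplacian example as above
    n = 10
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    subdomains = [list(range(0, 6)), list(range(4, 10))]
    z = np.zeros(n)
    for _ in range(60):
        z = additive_schwarz_update(A, b, z, subdomains)
    print(np.linalg.norm(b - A @ z))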

M. Tuma 128

9. Parallel solvers of linear algebraic equations:

FETI (Finite Element Tearing and Interconnecting)

non-overlapping domain decomposition scheme

numerically scalable for a wide class of PDE problems (e.g., some 2nd-order elasticity, plate and shell problems)

successful parallel implementations

problem: the subdomain matrices need not be regular (nonsingular)

here, an example for two subdomains

M. Tuma 129

9. Parallel solvers of linear algebraic equations:

FETI (Finite Element Tearing and Interconnecting): II.

[Figure: two non-overlapping subdomains Ω^(1) and Ω^(2) glued along the interface Γ_I]

K^(1) u^(1) = f^(1) + B^(1)T λ
K^(2) u^(2) = f^(2) + B^(2)T λ
B^(1) u^(1) = B^(2) u^(2)

M. Tuma 130

9. Parallel solvers of linear algebraic equations:

FETI (Finite Element Tearing and Interconnecting): III.

K^(1) u^(1) = f^(1) + B^(1)T λ
K^(2) u^(2) = f^(2) + B^(2)T λ
B^(1) u^(1) = B^(2) u^(2)

If we can substitute (regular subdomain matrices) we get

u^(1) = K^(1)^{-1} (f^(1) + B^(1)T λ)
u^(2) = K^(2)^{-1} (f^(2) + B^(2)T λ)

(B^(1) K^(1)^{-1} B^(1)T + B^(2) K^(2)^{-1} B^(2)T) λ = B^(1) K^(1)^{-1} f^(1) + B^(2) K^(2)^{-1} f^(2)

In general we have (K^(j)+ a pseudoinverse of K^(j), R^(j) spanning the null space of K^(j))

u^(1) = K^(1)+ (f^(1) + B^(1)T λ) + R^(1) α
u^(2) = K^(2)+ (f^(2) + B^(2)T λ) + R^(2) α

( B^(1)K^(1)+B^(1)T + B^(2)K^(2)+B^(2)T    −B^(2)R^(2) ) ( λ )   ( B^(1)K^(1)+f^(1) + B^(2)K^(2)+f^(2) )
( −R^(2)T B^(2)T                            0          ) ( α ) = ( −R^(2)T f^(2)                        )

interface system
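A small NumPy sketch of the regular (invertible) two-subdomain case: the interface operator F_I = B^(1)K^(1)^{-1}B^(1)T + B^(2)K^(2)^{-1}B^(2)T is assembled explicitly and the system for λ is solved directly. This is only illustrative: the toy data, the names and the chosen sign convention (local problems K^(j)u^(j) = f^(j) − B^(j)Tλ with the constraint B^(1)u^(1) + B^(2)u^(2) = 0; conventions vary in the literature) are assumptions, and a real FETI code applies F_I matrix-free inside a projected CG and treats singular K^(j) via pseudoinverses.

    import numpy as np

    rng = np.random.default_rng(0)

    # two SPD subdomain matrices and toy data
    n1, n2, m = 5, 6, 3                        # m = number of interface constraints
    K1 = np.eye(n1) + 0.1 * np.ones((n1, n1))
    K2 = np.eye(n2) + 0.1 * np.ones((n2, n2))
    f1, f2 = rng.standard_normal(n1), rng.standard_normal(n2)
    B1, B2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))

    # interface operator and right-hand side (regular case of slide 130)
    F = B1 @ np.linalg.solve(K1, B1.T) + B2 @ np.linalg.solve(K2, B2.T)
    d = B1 @ np.linalg.solve(K1, f1) + B2 @ np.linalg.solve(K2, f2)
    lam = np.linalg.solve(F, d)

    # recover subdomain solutions (sign convention: K_j u_j = f_j - B_j^T lambda)
    u1 = np.linalg.solve(K1, f1 - B1.T @ lam)
    u2 = np.linalg.solve(K2, f2 - B2.T @ lam)
    print(np.linalg.norm(B1 @ u1 + B2 @ u2))   # continuity constraint is satisfied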

M. Tuma 131

9. Parallel solvers of linear algebraic equations:

FETI (Finite Element Tearing and Interconnecting): IV.

( F_I     −G_I ) ( λ )   (  d )
( −G_I^T    0  ) ( α ) = ( −e )

Solution: a general augmented system (constrained minimization)

conjugate gradients projected to the null-space of G_I^T

initial λ satisfying the constraint, e.g. λ^(0) = G_I (G_I^T G_I)^{-1} e

explicit projector P = I − G_I (G_I^T G_I)^{-1} G_I^T

reorthogonalizations

closely related method: balancing domain decomposition (balancing residuals by adding a coarse problem from equilibrium conditions for possibly singular problems; Mandel, 1993)

M. Tuma 132

9. Parallel solvers of linear algebraic equations: XXII.

"Universal" matrix operation – parallel aspects: 1.

Matrix, of course, sparse

Matrix should be distributed, based, e.g., on a distributed read of row lengths

first step: what are my rows?

      do i=1,n
        find start of the row i
        find end of the row i
        compute length of the row
      end do

parallel gather / parallel sort / parallel merge

finally, processes know what their rows are

at least at the beginning: static load balancing

for example: cyclic distribution of matrix rows into groups of approximately nnz(A)/p nonzeros
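One possible reading of this balancing rule, as a small Python sketch: walk through the row lengths (obtained from the distributed read) and start a new group whenever the running nonzero count reaches nnz(A)/p. The function and variable names are illustrative assumptions.

    def partition_rows_by_nnz(row_lengths, p):
        """Split rows 0..n-1 into p contiguous groups of roughly nnz/p nonzeros each.
        row_lengths[i] = number of nonzeros in row i."""
        total = sum(row_lengths)
        target = total / p
        owner, acc, proc = [], 0.0, 0
        for length in row_lengths:
            owner.append(proc)
            acc += length
            if acc >= target and proc < p - 1:   # close the current group
                acc -= target
                proc += 1
        return owner                             # owner[i] = process that stores row i

    # small example: 8 rows with varying numbers of nonzeros, 3 processes
    print(partition_rows_by_nnz([5, 1, 1, 4, 2, 2, 6, 3], 3))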

M. Tuma 133

9. Parallel solvers of linear algebraic equations: XXII.

"Universal" matrix operation – parallel aspects: 2.

Natural assumption: the matrix is processed only as distributed.

second step: distributed read

in MPI: all processors check for their rows concurrently

      do i=1,n
        if this is my row then
          find start of the row i
          find end of the row i
          read / process the row:  if (myid.eq.xxx) then read
        end if
      end do

M. Tuma 134

9. Parallel solvers of linear algebraic equations: XXII.

"Universal" matrix operation – parallel aspects: 3.

How to efficiently merge sets of sparse vectors?

entries stored with local indices

[Figure: two sparse vectors with nonzeros in global positions 1, 3, 11, 13 and 1, 4, 9, 11 of a vector of length 16; the merged set of nonzero positions is renumbered by consecutive local indices 1–6]

M. Tuma 135

9. Parallel solvers of linear algebraic equations: XXII.

"Universal" matrix operation – parallel aspects: 4.

[Figure continued: global indices 1, 3, 11, 13 and 1, 4, 9, 11 mapped to local indices 1–6]

local to global mapping: direct indexing

global to local mapping: e.g., hash tables
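A small Python sketch of the two mappings for the example above (0-based here, 1–6 on the slide); a plain dict plays the role of the hash table, and the names are illustrative.

    # global indices owned locally, in the order they are stored locally
    local_to_global = [1, 3, 4, 9, 11, 13]          # direct indexing: an array lookup

    # hash table for the opposite direction
    global_to_local = {g: l for l, g in enumerate(local_to_global)}

    print(local_to_global[4])        # local index 4 -> global index 11
    print(global_to_local[11])       # global index 11 -> local index 4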

M. Tuma 136

9. Parallel solvers of linear algebraic equations: XXIII.

Sparse matrix-matrix multiplications

A natural routine when dealing with more blocks

Useful even for forming Schur complements in the sequential case

case 1: C = AB, all matrices stored by rows
case 2: C − AB, A stored by columns, B stored by rows

c
c -- clear wn01; set links
c
      do i=1,max(p,n)
        wn01(i)=0
        link(i)=0
        head(i)=0
        first(i)=ia(i)
      end do

M. Tuma 137

9. Parallel solvers of linear algebraic equations: XXIV.

Sparse matrix-matrix multiplications: II.

c
c -- initialize pointers first
c
      do i=1,p
        j=first(i)
        if(j.lt.ia(i+1)) then
          k=ja(j)-shift
          if(head(k).eq.0) then
            link(i)=0
          elseif(head(k).ne.0) then
            link(i)=head(k)
          end if
          head(k)=i
        end if
      end do
      indc=1
      ic(1)=indc

M. Tuma 138

9. Parallel solvers of linear algebraic equations: XXV.

Sparse matrix-matrix multiplications: III.

c
c -- loop of rows of a
c
      do i=1,m
        newj=head(i)
        ind2=0
 200    continue
        j=newj
        if(j.eq.0) go to 400
        newj=link(j)
        jfirst=first(j)
        first(j)=jfirst+1

M. Tuma 139

9. Parallel solvers of linear algebraic equations: XXVI.

Sparse matrix-matrix multiplications: IV.

c
c -- if indices of j-th column are not processed
c
        if(jfirst+1.lt.ia(j+1)) then
          l=ja(jfirst+1)-shift
          if(head(l).eq.0) then
            link(j)=0
          elseif(head(l).ne.0) then
            link(j)=head(l)
          end if
          head(l)=j
        end if

M. Tuma 140

9. Parallel solvers of linear algebraic equations: XXVII.

Sparse matrix-matrix multiplications: V.

c
c -- coded loop search through the row of b
c
        temp=aa(jfirst)
        kstrt=ib(j)
        kstop=ib(j+1)-1
c
c -- search the row of b
c
        do k=kstrt,kstop
          k1=jb(k)
          if(wn01(k1).eq.0) then
            ind2=ind2+1
            wn02(ind2)=k1
            wr02(ind2)=temp*ab(k)
            wn01(k1)=ind2
          else
            wr02(wn01(k1))=wr02(wn01(k1))+temp*ab(k)
          end if
        end do

M. Tuma 141

9. Parallel solvers of linear algebraic equations: XXVIII.

Sparse matrix-matrix multiplications: VI.

c
c -- end of coded loop in j
c
        go to 200
 400    continue
c
c -- rewrite indices and elements to ic/jc/ac
c
        do j=1,ind2
          k=wn02(j)
          jc(indc)=k
          wn01(k)=0
          ac(indc)=wr02(j)
          indc=indc+1
        end do
        ic(i+1)=indc
      end do
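For orientation, a compact Python sketch of the same accumulation idea for case 1 (C = AB, all matrices in CSR, 0-based indexing): each result row is gathered with a dense marker array playing the role of wn01/wn02/wr02 above. It is an illustrative re-implementation, not a transcription of the Fortran.

    def csr_matmat(n_rows, n_cols, ia, ja, aa, ib, jb, ab):
        """C = A*B for CSR matrices A (n_rows x k) and B (k x n_cols)."""
        marker = [0] * n_cols          # role of wn01: 1 + position of a column in the current row
        cols, vals = [], []            # role of wn02 / wr02
        ic, jc, ac = [0], [], []
        for i in range(n_rows):
            cols.clear(); vals.clear()
            for p in range(ia[i], ia[i + 1]):          # nonzeros a_ij of row i of A
                j, a_ij = ja[p], aa[p]
                for q in range(ib[j], ib[j + 1]):      # row j of B
                    k, b_jk = jb[q], ab[q]
                    if marker[k] == 0:                 # first contribution to c_ik
                        cols.append(k); vals.append(a_ij * b_jk)
                        marker[k] = len(cols)
                    else:                              # accumulate into an existing entry
                        vals[marker[k] - 1] += a_ij * b_jk
            for k in cols:                             # reset markers for the next row
                marker[k] = 0
            jc.extend(cols); ac.extend(vals); ic.append(len(jc))
        return ic, jc, ac

    # 2x2 example: A = [[1,2],[0,3]], B = [[4,0],[5,6]]  ->  C = [[14,12],[15,18]]
    print(csr_matmat(2, 2, [0,2,3], [0,1,1], [1,2,3], [0,1,3], [0,0,1], [4,5,6]))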

M. Tuma 142

9. Parallel solvers of linear algebraic equations: XXIX.

Preconditioners: approximations M to A: M ≈ A

Within a stationary (linear, consistent) method: need to solve a system with M

  x_+ = x − M^{-1}(Ax − b)     (6)

Desired properties of M:

  good approximation to A
    in the sense of a norm of (M − A)
    in the sense of a norm of (I − M^{-1}A)
    if factorized, then with stable factors

  systems with M should be easy to solve

  applicable to a wide spectrum of computer architectures
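A minimal NumPy sketch of iteration (6) with the simplest possible choice M = diag(A), just to fix the roles of A and M; the test matrix is an illustrative assumption.

    import numpy as np

    n = 50
    A = 4*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # diagonally dominant test matrix
    b = np.ones(n)
    M_diag = np.diag(A)                                  # M = diag(A): trivially easy to solve with

    x = np.zeros(n)
    for it in range(100):
        r = A @ x - b
        x = x - r / M_diag                               # x_+ = x - M^{-1}(Ax - b)
    print(np.linalg.norm(A @ x - b))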

M. Tuma 143

9. Parallel solvers of linear algebraic equations: XXX.

Having M as a preconditioner in (1) is equivalent to transforming the linear system to

  M^{-1} A x = M^{-1} b     (preconditioner applied from left)

Other transformations are possible, obtained by a change of variables and/or by supporting matrix symmetry:

  A M^{-1} y = b,  x = M^{-1} y     (preconditioner applied from right)

  M_1^{-1} A M_2^{-1} y = M_1^{-1} b,  x = M_2^{-1} y     (split preconditioner, M = M_1 M_2)

In all these cases, corresponding stationary iterations can be written down.

Two basic approaches how to plug in preconditioning:

  Write directly the recursions for the transformed system. Mostly in case of stationary iterative methods.

  Use it only inside a procedure to get M^{-1}z (or similar operations) for a given z. This is more flexible and useful also for non-stationary iterative methods.

M. Tuma 144

10. Approximate inverse preconditioners: I.

M ≈ A^{-1}

[Figure: example grid to show the local character of fill-in — a vertex separator S splits the grid into components C_1 and C_2]

M. Tuma 145

10. Approximate inverse preconditioners: II.

Some properties of approximate inverses

Fill-in not only local

Even more stress to stay sparse

Provide reasonably precise info on the exact matrix inverse

Explicit – potential for parallelism

Why the fill-in may be non-local

A −→ nonzeros determined by nonzeros in the adjacency graph G(A)

A^{-1} −→ nonzeros determined by nonzeros in the transitive closure of G(A) (paths in G(A) ↔ edges in the transitive closure)
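A tiny NumPy illustration of the non-local fill: a tridiagonal matrix couples only neighbours, yet its inverse is completely dense, because every pair of nodes is joined by a path in G(A).

    import numpy as np

    n = 6
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)    # tridiagonal: only local couplings
    Ainv = np.linalg.inv(A)
    print(np.count_nonzero(np.abs(A) > 1e-12))            # 16 nonzeros
    print(np.count_nonzero(np.abs(Ainv) > 1e-12))         # 36 nonzeros: the inverse is full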

M. Tuma 146

10. Approximate inverse preconditioners: III.

Summarizing motivation

Approximate inverses have specific features not shared with other preconditioners.

AI are sometimes pretty efficient as preconditioners. Can help to solve some hard problems.

AI can lead to the development of some other algorithms.

Especially helpful on parallel computer architectures.

A lot of features still have to be developed.

In short (PDE terms): hope to capture by approximate inverses also some basic non-local features of discrete Green functions.

M. Tuma 147

10. Approximate inverse preconditioners: IV.

Some basic techniques

Frobenius norm minimization (Benson, 1973)

  minimize F_W(X, A) = ‖I − XA‖²_W = tr[(I − XA) W (I − XA)^T]

Global matrix iterations (Schulz, 1933)

  Iterate G_{i+1} = G_i (2I − A G_i)

A-orthogonalization (Benzi, T., 1996)

  Get W, Z, D from Z^T A W = D  ≡  A^{-1} = W D^{-1} Z^T

Approximate inverses as auxiliary procedures, e.g. in block algorithms (Axelsson, Brinkkemper, Il'in, 1984; Concus, Golub, Meurant, 1985)

M. Tuma 148

10. Approximate inverse preconditioners: V.

Other approaches

Approximate inverse smoothers in geometric and algebraic multigrids: Chow (2000); Tang, Wan, 2000; Bröker, Grote, Mayer, Reusken, 2002; Bröker, Grote, 2002.

Inverted direct incomplete decompositions, Alvarado, Dag, 1992

Approximate inverses by bordering, Saad, 1996

  ( Z^T   0 ) ( A    v ) ( Z   −y )   ( D   0 )
  ( −y^T  1 ) ( v^T  α ) ( 0    1 ) = ( 0   δ )

Sherman-Morrison formula based preconditioners (Bru, Cerdán, Marín, Mas, 2002)

M. Tuma 149

10. Approximate inverse preconditioners: VI.

Frobenius norm minimization: special cases I.

Least-squares approximate inverse (AI): W = I (Benson, 1973)

  Minimize F_I(X, A) = ‖I − XA‖²_F = Σ_{i=1}^{n} ‖e_i^T − x_i A‖²_2,   x_i: rows of X

  It leads to n simple least-squares problems

Direct block method (DB): W = A^{-1} (Benson, 1973):

  Solve [GA]_{ij} = δ_{ij} for (i, j) ∈ S,

  where S is the sparsity pattern for the inverse

In both LS and DB: sparsity pattern assumption.
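A small NumPy sketch of the least-squares case: for a prescribed pattern S, each row x_i of X is obtained from an independent small least-squares problem restricted to the allowed positions. The pattern, data and names below are illustrative assumptions.

    import numpy as np

    def ls_approx_inverse(A, pattern):
        """pattern[i] = column indices where row i of X may be nonzero."""
        n = A.shape[0]
        X = np.zeros((n, n))
        for i in range(n):
            J = pattern[i]
            # restricted problem: minimize || e_i - x_i[J] * A[J,:] ||_2
            sol, *_ = np.linalg.lstsq(A[J, :].T, np.eye(n)[i], rcond=None)
            X[i, J] = sol
        return X

    n = 6
    A = 3*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    pattern = [[j for j in (i-1, i, i+1) if 0 <= j < n] for i in range(n)]  # tridiagonal pattern
    X = ls_approx_inverse(A, pattern)
    print(np.linalg.norm(np.eye(n) - X @ A))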

M. Tuma 150

10. Approximate inverse preconditioners: VII.

Frobenius norm minimization: special cases II.

Changing sparsity patterns in outer iterations: SPAI (Cosgrove, Díaz, Griewank, 1992; Grote, Huckle, 1997)

  Evaluating a new pattern by estimating norms of possible new residuals

  More exact evaluations of residuals (Gould, Scott, 1995)

  Procedurally parallel, but data parallelism difficult

  Need to have high-quality pattern predictions (Huckle, 1999, 2001; Chow, 2000)

Simple stationary iterative method for individual columns c_i by solving

  A c_i = e_i

  (Chow, Saad, 1994)

  Simple, but not very efficient

  "Gauss-Seidel" variant: sometimes much better, sometimes much worse
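A sketch in the spirit of this idea: a few minimal-residual steps on A c_i = e_i per column, with small entries dropped to keep the columns sparse. The step choice and the dropping rule here are illustrative assumptions, not the exact published algorithm.

    import numpy as np

    def approx_inverse_columns(A, steps=5, droptol=0.05):
        """Approximate inverse M ~ A^{-1}, column by column, via a few minimal-residual
        steps on A c_i = e_i with numerical dropping."""
        n = A.shape[0]
        M = np.zeros((n, n))
        for i in range(n):
            c = np.zeros(n)
            e = np.zeros(n); e[i] = 1.0
            for _ in range(steps):
                r = e - A @ c                      # residual of the i-th column problem
                Ar = A @ r
                denom = Ar @ Ar
                if denom == 0.0:
                    break
                alpha = (r @ Ar) / denom           # minimal-residual step length
                c = c + alpha * r
                c[np.abs(c) < droptol] = 0.0       # dropping keeps the column sparse
            M[:, i] = c
        return M

    A = 4*np.eye(8) - np.eye(8, k=1) - np.eye(8, k=-1)
    M = approx_inverse_columns(A)
    print(np.linalg.norm(np.eye(8) - A @ M))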

M. Tuma 151

10. Approximate inverse preconditioners: VIII.

Frobenius norm minimization: special cases III.

Factorized inverse preconditioners based on approximate Frobenius norm minimization for SPD matrices (Kolotilina, Yeremin, 1993)

  Z = arg min_{X∈S} F_I(X^T, L) = arg min_{X∈S} ‖I − X^T L‖²_F,   where A = L L^T.

The procedure:

  first get Z from the problem ‖I − X^T L‖²_F = Σ_{i=1}^{n} ‖e_i^T − x_i^T L‖²_2

  set D = (diag(Z))^{-1}, Z = Z D^{1/2}

  Then A^{-1} ≈ Z Z^T

Extended to the nonsymmetric case

Rather robust, often underestimated

M. Tuma 152

10. Approximate inverse preconditioners: IX.

A-orthogonalization: AINV

For an SPD matrix A: find an upper triangular Z and a diagonal matrix D such that

  Z^T A Z = D  →  A^{-1} = Z D^{-1} Z^T     (7)

The algorithm: conjugate Gram-Schmidt, i.e. GS with a different inner product: (x, y)_A

Origins of the A-orthogonalization for solving linear systems in several papers from the 1940s

A more detailed treatment of the A-orthogonalization: in the first Wilkinson paper (with Fox and Huskey, 1948)

Extended to the nonsymmetric case

Breakdown-free modification for SPD A (Benzi, Cullum, T., 2001)

M. Tuma 153

10. Approximate inverse preconditioners: X.

A-orthogonalization: AINV

Algorithm H-S I.

  z_i = e_i − Σ_{k=1}^{i−1} [(e_i^T A z_k) / (z_k^T A z_k)] z_k,   i = 1, ..., n;   Z = [z_1, ..., z_n]

left-looking

stabilized diagonal entries (in exact arithmetic e_i^T A z_k ≡ z_i^T A z_k, i ≤ k)
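A compact NumPy sketch of the exact (dense) A-orthogonalization behind this algorithm; the incomplete preconditioner would in addition drop small entries of z_i after each step, which is omitted here, and the test matrix is illustrative.

    import numpy as np

    def a_orthogonalize(A):
        """Return Z (unit upper triangular) and diagonal d with Z^T A Z = diag(d),
        hence A^{-1} = Z diag(d)^{-1} Z^T (exact, dense version of the AINV process)."""
        n = A.shape[0]
        Z = np.eye(n)
        d = np.zeros(n)
        for i in range(n):
            zi = np.eye(n)[:, i].copy()
            for k in range(i):
                zk = Z[:, k]
                zi = zi - ((A[i, :] @ zk) / d[k]) * zk   # coefficient e_i^T A z_k / z_k^T A z_k
            Z[:, i] = zi
            d[i] = zi @ A @ zi                           # z_i^T A z_i
        return Z, d

    A = 4*np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
    Z, d = a_orthogonalize(A)
    print(np.linalg.norm(np.linalg.inv(A) - Z @ np.diag(1/d) @ Z.T))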

M. Tuma 154

10. Approximate inverse preconditioners: XI.

Possibility of breakdowns

AINV: modified incomplete A-orthogonalization; exists for M-matrices and H-matrices (Benzi, Meyer, T., 1996)

Possibility of breakdown in A-orthogonalization for non-H matrices

Possibly poor approximate inverses for these matrices

A-orthogonalization in historical perspective

Fox, Huskey, Wilkinson, 1948: H-S I.

Escalator method by Morris, 1946: a variation of H-S I. (non-stabilized computation of D)

Vector method by Purcell, 1952: basically H-S II.

Approximate inverse by bordering (Saad, 1996) is equivalent to H-S I. (Benzi, T., 2002)

Bridson, Tang, 1998 – (nonsymmetric) algorithms equivalent to H-S I.

M. Tuma 155

10. Approximate inverse preconditioners: XII.

Other possible stabilization attempts:

Pivoting

Look-ahead

DCR

Block algorithms

M. Tuma 156

11. Polynomial preconditioners: I.

The problem

Find a preconditioner M such that M^{-1} is a polynomial in A of a given degree k, that is

  M^{-1} = P_k(A) = Σ_{j=0}^{k} α_j A^j.

First proposed by Cesari, 1937 (for the Richardson iteration)

Naturally motivated, since by the Cayley-Hamilton theorem we have

  Q_k(A) ≡ Σ_{j=0}^{k} β_j A^j = 0

for the characteristic polynomial of A, k ≤ n.

Therefore, we have

  A^{-1} = − (1/β_0) Σ_{j=1}^{k} β_j A^{j−1}
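Applying such a preconditioner never forms P_k(A) explicitly: M^{-1}v = P_k(A)v is evaluated with k matrix-vector products, e.g. by Horner's scheme. A minimal sketch with arbitrary illustrative coefficients follows.

    import numpy as np

    def apply_poly_preconditioner(A, coeffs, v):
        """Return P_k(A) v for P_k(t) = coeffs[0] + coeffs[1]*t + ... + coeffs[k]*t^k,
        using Horner's scheme: only matrix-vector products with A are needed."""
        y = coeffs[-1] * v
        for a in reversed(coeffs[:-1]):
            y = a * v + A @ y
        return y

    A = 4*np.eye(6) - np.eye(6, k=1) - np.eye(6, k=-1)
    v = np.ones(6)
    print(apply_poly_preconditioner(A, [1.0, 0.5, 0.25], v))   # (I + 0.5 A + 0.25 A^2) v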

11. Polynomial preconditioners: II.

Polynomial preconditioners and Krylov space methods

Recall: CG forms an approximation to the solution vector, for k ≥ 1,

x_{k+1} = x_0 + P_k(A) r_0.

The polynomial P_k is optimal: it minimizes the error norm

(x − x*)^T A (x − x*)

over all polynomials of degree at most k (equivalently, over all residual polynomials 1 − λ P_k(λ) with value 1 at the origin). Therefore: why polynomial preconditioners at all?

The number of CG iterations can still be decreased.
Useful when the bottleneck is in scalar products, message passing, or the memory hierarchy.
Can strongly enhance vector processing.
Simplicity; matrix-free computations.

11. Polynomial preconditioners: III.

Basic classes of polynomial preconditioners: I.

Neumann series preconditioners for SPD systems (Dubois, Greenbaum, Rodrigue, 1979).
Let A = M_1 − N such that M_1 is nonsingular and G = M_1^{-1} N satisfies ρ(G) < 1. Then

A^{-1} = (I − G)^{-1} M_1^{-1} = ∑_{j=0}^{+∞} G^j M_1^{-1}.

The preconditioner: truncate the series,

M^{-1} = ∑_{j=0}^{k} G^j M_1^{-1},   k > 0.

Preconditioners P_k of odd degree suffice (P_k is not less efficient than P_{k+1}).
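A small sketch of applying the truncated Neumann series preconditioner, assuming the common Jacobi splitting M_1 = diag(A) (the slides leave the choice of M_1 open). The sum ∑_{j=0}^{k} G^j M_1^{-1} r is accumulated without ever forming G.

import numpy as np

def neumann_prec_apply(A, r, k):
    """z = M^{-1} r for the truncated Neumann series preconditioner.

    Splitting A = M1 - N with M1 = diag(A) (Jacobi choice, assumed here),
    G = M1^{-1} N = I - M1^{-1} A, and z = sum_{j=0}^{k} G^j M1^{-1} r."""
    d = np.diag(A)                  # M1 = diag(A), assumed nonzero diagonal
    y = r / d                       # j = 0 term: M1^{-1} r
    z = y.copy()
    for _ in range(k):              # y <- G y = y - M1^{-1} (A y)
        y = y - (A @ y) / d
        z += y
    return z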

11. Polynomial preconditioners: IV.

Basic classes of polynomial preconditioners: II.

Generalized Neumann series preconditioners for SPD systems (Johnson, Micchelli, Paul, 1983).
Parametrize the approximate inverse (I − G)^{-1} as

I + γ_1 G + γ_2 G^2 + . . . + γ_k G^k;

the added degrees of freedom may be used to optimize the approximation to A^{-1}.

Let

R_k = { R_k | R_k(0) = 0; R_k(λ) > 0 ∀ λ in an inclusion set IS }.

Find the polynomial in this class with the min-max value on the inclusion set: the minmax polynomial.
Apply this to the residual polynomials 1 − Q_k(λ) = 1 − λ P_k(λ) to obtain the polynomial preconditioner P_k.
This polynomial can be expressed in terms of Chebyshev polynomials of the first kind.

11. Polynomial preconditioners: V.

Basic classes of polynomial preconditioners: III.

Least-squares preconditioners for SPD systems (Johnson, Micchelli, Paul, 1983).
The min-max polynomial may map small eigenvalues of A to large eigenvalues of M^{-1}A, which seems to degrade the convergence rate; its quality seems to depend strongly on the inclusion-set estimate.
This approach instead minimizes a quadratic norm of the residual polynomial:

∫_{IS} (1 − Q(λ))^2 w(λ) dλ.

Jacobi weights (w(λ) = (b − λ)^α (λ − a)^β, α, β > −1, for IS = 〈a, b〉) or Legendre weights (w ≡ 1) for simple integration.
Computing the polynomials from three-term recurrences (Stiefel, 1958), by kernel polynomials (Stiefel, 1958), or from normal equations (Saad, 1983).
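For illustration only, a crude way to obtain such coefficients: replace the weighted integral by a discrete sum over sample points of the inclusion set 〈a, b〉 and solve a small linear least-squares problem for the coefficients of P_k. The Legendre weight w ≡ 1 is assumed; the recurrence and kernel-polynomial formulations cited above are the numerically preferable route.

import numpy as np

def ls_poly_coeffs(a, b, k, nsample=200):
    """Coefficients alpha_0..alpha_k of P_k minimizing a discretization of
    int_a^b (1 - lam * P_k(lam))^2 dlam   (Legendre weight w = 1 assumed)."""
    lam = np.linspace(a, b, nsample)
    # column j holds lam^(j+1), so that V @ alpha approximates 1 = lam * P_k(lam)
    V = np.column_stack([lam ** (j + 1) for j in range(k + 1)])
    alpha, *_ = np.linalg.lstsq(V, np.ones(nsample), rcond=None)
    return alpha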

11. Polynomial preconditioners: VI.

Preconditioning symmetric indefinite systems

DeBoor and Rice polynomials solve the minmax problem for general inclusion sets composed of two parts: IS = 〈a, b〉 ∪ 〈c, d〉, b < 0 < c (DeBoor, Rice, 1982).
For equal lengths (b − a = d − c): can be expressed in terms of Chebyshev polynomials of the first kind. Best behavior in this case.
Grcar polynomials solve a slightly modified minmax approximation problem, formulated for residual polynomials. But: more oscillatory behavior.
Both mentioned possibilities give a positive definite preconditioned matrix that is not explicitly computable.
Clustering the eigenvalues around µ < 0 and 1 (Freund, 1991; bilevel polynomial of Ashby, 1991). Best behavior for nonequal intervals and b ≈ −c.

11. Polynomial preconditioners: VII.

Further achievements

Different weights for least-squares polynomials for solving symmetric indefinite systems (Saad, 1983).
Adapting polynomials based on information from the CG method (Ashby, 1987, 1990; Ashby, Manteuffel, Saylor, 1989; see also Fischer, Freund, 1994; O'Leary, 1991).
Double use of the minmax polynomial can bring some improvement (Perlot, 1995).
Polynomial preconditioners for solving nonsymmetric systems are possible but, typically, not a method of choice (Manteuffel, 1977, 1978; Saad, 1986; Smolarski, Saylor, 1988).

12. Element-by-element preconditioners: I.

Basic notation

Assume that A is given as a sum of element matrices,

A = ∑_e A_e.

Consider

M_e = (D_A)_e + (A_e − D_e),

where (D_A)_e is the part of the diagonal D_A of A corresponding to A_e and D_e is the diagonal of A_e. Set

M = ∏_{e=1}^{n_e} M_e.

Introduced by Hughes, Levit, Winget, 1983 (and formulated for Jacobi-scaled A).
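A sketch of how the product form is applied, under the assumption that each global factor M_e acts as a small dense block on the element's degrees of freedom and as the identity elsewhere; Me_blocks and dofs are illustrative names for the element data, not part of the formulation above.

import numpy as np

def ebe_apply(Me_blocks, dofs, y):
    """Apply z = M^{-1} y for M = M_1 M_2 ... M_ne, assuming (see lead-in) that
    each global M_e equals the small dense block Me_blocks[e] on the index set
    dofs[e] and the identity elsewhere.

    Since M^{-1} = M_ne^{-1} ... M_1^{-1}, the local solves are done with M_1
    first and M_ne last; each solve only changes the element's dofs."""
    z = y.copy()
    for Me, idx in zip(Me_blocks, dofs):
        z[idx] = np.linalg.solve(Me, z[idx])
    return z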

12. Element-by-element preconditioners: II.

Other possibilities

Simple application to solving nonsymmetric systems M z = y (as the product of easily invertible matrices).
For solving SPD systems, the M_e matrices can be decomposed as

M_e = L_e L_e^T.

Another approach (Gustafsson, Lindskog, 1986):

M = ∑_{e=1}^{n_e} L_e,

where L_e can be modified to be positive definite (the individual A_e do not need to be regular).
Parallel implementations (van Gijzen, 1994; Daydé, L'Excellent, Gould, 1997).

13. Vector / Parallel preconditioners: I.

Decoupling parts of triangular factors

Forced a posteriori annihilation in triangular factors (Seager, 1986).

[Figure: a bidiagonal triangular factor in which a few off-diagonal entries are annihilated, decoupling the factor into independent blocks that can be solved in parallel.]

Can lead to slow convergence.

13. Vector / Parallel preconditioners: II.

Partial vectorization

Exploiting the vector potential of a special matrix structure. Example: a factor from the 5-point stencil.

[Figure: the sparsity pattern of the triangular factor for the 5-point stencil on a regular grid; besides the main diagonal there are only two regular off-diagonals, so the entries along each of them form long vectors.]

So nice only for regular grids.

13. Vector / Parallel preconditioners: III.

Generalized partial vectorization: jagged diagonal formats, modified jagged diagonal formats, stripes (Melhem, 1988; Anderson, 1988; Paolini, Di Brozolo, 1989)

Storing the matrix as a small number of long jagged diagonals.

[Figure: a small sparse matrix with rows (a), (b c), (d e), (f g h); after compressing each row and sorting the rows by decreasing length, the entries are read off column by column as the jagged diagonals (f b d a), (g c e), (h).]

Construction:
1) row compression
2) sorting the rows
3) considering the matrix as a set of columns

Other sophisticated variations: cf. Heroux, Vu, Yang, 1991.
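A sketch of the three construction steps and of the resulting matrix-vector product, working from a simple row-wise representation; the data layout and names are illustrative, not a standard library format.

import numpy as np

def build_jds(rows_idx, rows_val):
    """Jagged diagonal storage: 1) each row is given compressed (indices/values),
    2) rows are sorted by decreasing length, 3) the j-th remaining entry of every
    row forms one long jagged diagonal."""
    n = len(rows_idx)
    perm = sorted(range(n), key=lambda i: -len(rows_idx[i]))
    maxlen = len(rows_idx[perm[0]]) if n else 0
    jdiags = []
    for j in range(maxlen):
        rows = [i for i in perm if len(rows_idx[i]) > j]   # rows long enough
        cols = np.array([rows_idx[i][j] for i in rows])
        vals = np.array([rows_val[i][j] for i in rows])
        jdiags.append((np.array(rows), cols, vals))
    return jdiags

def jds_matvec(jdiags, x, n):
    """y = A x with one long vector operation per jagged diagonal."""
    y = np.zeros(n)
    for rows, cols, vals in jdiags:
        y[rows] += vals * x[cols]          # row indices within a diagonal are distinct
    return y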

13. Vector / Parallel preconditioners: IV.

Wavefront processing for 5-point stencil in 2D

Generalization to 7-point stencils in 3D: the hyperplane approach.
Block chequer-board distribution of processors.

13. Vector / Parallel preconditioners: V.

Generalized wavefront / hyperplane processing: level scheduling

The structure of L or U can be described by a directed acyclic graph.
Level scheduling is an a posteriori reordering applied to the graphs of the triangular factors of A (Anderson, 1988).

[Figure: the sparsity pattern of a triangular factor and its directed acyclic graph; the nodes are grouped into levels (in the example {2, 5, 7, 10}, {4, 9}, {1, 3}, {6}, {8}), and all unknowns within one level can be processed simultaneously.]

Suitable for unstructured matrices.
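A sketch of level scheduling for a sparse lower triangular factor stored by rows (the column-index and value lists are illustrative names): rows within one level are mutually independent, so each level can be processed as one parallel or vector step.

import numpy as np

def level_schedule(L_cols):
    """Group the rows of a lower triangular factor into levels.

    L_cols[i] lists the column indices of the strictly lower part of row i.
    level(i) = 1 + max over those columns, so the rows of one level are
    mutually independent in the triangular solve."""
    n = len(L_cols)
    level = [0] * n
    for i in range(n):
        level[i] = 1 + max((level[j] for j in L_cols[i]), default=0)
    nlev = max(level, default=0)
    return [[i for i in range(n) if level[i] == lev] for lev in range(1, nlev + 1)]

def lower_solve_by_levels(L_cols, L_vals, diag, b, levels):
    """Solve L x = b level by level; rows inside one level could run in parallel."""
    x = np.zeros(len(b))
    for lev in levels:
        for i in lev:                         # independent within the level
            s = sum(v * x[j] for j, v in zip(L_cols[i], L_vals[i]))
            x[i] = (b[i] - s) / diag[i]
    return x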

13. Vector / Parallel preconditioners: VI.

Twisted factorization

Concurrent factorization from both ends of the domain (Babuška, 1972; Meurant, 1984; van der Vorst, 1987).

[Figure: the sparsity pattern of a twisted factorization of a tridiagonal matrix; elimination proceeds from the first and the last row simultaneously and the two sweeps meet in the middle.]

Only two-way parallelism.
Can be performed in a nested way (van der Vorst, 1987).

13. Vector / Parallel preconditioners: VII.

Ordering from corners for regular grids in 2D

Can be generalized to 3D

13. Vector / Parallel preconditioners: VIII.

Generalized ordering from corners: reorderings based on domains

Useful for general domains (matrices).
Sophisticated graph partitioning algorithms.

13. Vector / Parallel preconditioners: IX.

Generalized ordering from corners: reorderings based on domains: additional ideas

ILU with overlapped diagonal blocks (Radicati, Robert, 1987).
Chan, Goovaerts, 1990: ILU by domains can provide faster iterative methods even sequentially.
Tang (1992); Tan (1995): enhanced interface conditions for better coupling.
Karypis, Kumar, 1996: but the convergence rate can be strongly deteriorated.
Benzi, Marín, T., 1997: parallel approximate inverse preconditioners + parallelization by domains can solve some hard problems.

13. Vector / Parallel preconditioners: X.

Parallel preconditioning: distributed parallelism

[Figure: a rectangular domain split into strips owned by processors P0, P1, P2; the corresponding block rows of the matrix are distributed in the same way, and values at the subdomain boundary nodes must be exchanged with the neighbouring processors.]

Matrix-vector product: overlapping communication and computation (see the sketch below).
1) Initialize sends and receives of boundary nodes.
2) Perform local matvecs.
3) Complete receives of boundary data.
4) Finish the computation.
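A sketch of the four steps with mpi4py (assumed available); A_loc holds the columns needing only local data, A_bnd the columns multiplying received boundary values, and send_idx, recv_counts, neighbors describe an illustrative communication layout, not a fixed API.

# illustrative layout; mpi4py is an assumption of this sketch
from mpi4py import MPI
import numpy as np

def dist_matvec(comm, A_loc, A_bnd, x_loc, send_idx, recv_counts, neighbors):
    # keep the send buffers alive until the requests complete
    send_bufs = {p: np.ascontiguousarray(x_loc[send_idx[p]]) for p in neighbors}
    recv_bufs = {p: np.empty(recv_counts[p]) for p in neighbors}
    reqs = []
    for p in neighbors:                       # 1) start the boundary exchange
        reqs.append(comm.Isend(send_bufs[p], dest=p))
        reqs.append(comm.Irecv(recv_bufs[p], source=p))
    y = A_loc @ x_loc                         # 2) local part of the matvec
    MPI.Request.Waitall(reqs)                 # 3) complete receives of boundary data
    if neighbors:                             # 4) finish with the interface columns
        y += A_bnd @ np.concatenate([recv_bufs[p] for p in neighbors])
    return y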

13. Vector / Parallel preconditioners: XI.

Multicolorings

Faster convergence vs. parallelism: these two effects need to be balanced (Doi, 1991; Doi, Lichnewsky, 1991; Doi, Hoshi, 1992; Wang, Hwang, 1995).
Nodes with the same colour should be mutually as far apart as possible.

14. Solving nonlinear systems: I.

Newton-Krylov paradigm

F(x) = 0

⇓

Sequences of linear systems of the form

J(x_k) Δx = −F(x_k),   J(x_k) ≈ F′(x_k),

solved for k = 1, 2, . . . until

‖F(x_k)‖ < tol.

J(x_k) may change at points influenced by nonlinearities.
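A compact sketch of the resulting outer/inner iteration; the inner solve is delegated here to GMRES from SciPy (an assumption, any Krylov solver fits), and the Jacobian is available only through a user-supplied matvec.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def newton_krylov(F, jac_matvec, x0, tol=1e-8, maxit=20):
    """Inexact Newton: at each step solve J(x_k) dx = -F(x_k) with GMRES,
    where J is available only through matvecs jac_matvec(x_k, v)."""
    x = x0.copy()
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:          # stop when ||F(x_k)|| < tol
            break
        J = LinearOperator((x.size, x.size),
                           matvec=lambda v, xk=x: jac_matvec(xk, v))
        dx, _ = gmres(J, -Fx)                 # inner Krylov solve
        x = x + dx
    return x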

14. Solving nonlinear systems: II.

Much easier if matrix approximations are readily available.
But: matrices are often given only implicitly.

For example: linear solvers in the Newton-Krylov framework (see, e.g., Knoll, Keyes, 2004),

J(x_k) Δx = −F(x_k),   J(x_k) ≈ F′(x_k).

Only matvecs F′(x_k)v for a given vector v are typically performed. Finite differences can be used to get such products:

(F(x_k + ε v) − F(x_k)) / ε ≈ F′(x_k) v.

Matrices are always present in a more or less implicit form: a tradeoff between implicitness and fast execution appears in many algorithms.
For strong algebraic preconditioners we need matrix approximations.
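A minimal sketch of the finite-difference product above; the step-size choice is one common heuristic (an assumption, the slides do not prescribe it).

import numpy as np

def fd_jacvec(F, x, v, eps=None):
    """Matrix-free approximation  J(x) v ~ (F(x + eps*v) - F(x)) / eps.

    eps scales with the sizes of x and v (a common heuristic, assumed here)."""
    if eps is None:
        eps = (np.sqrt(np.finfo(float).eps) * (1.0 + np.linalg.norm(x))
               / max(np.linalg.norm(v), 1e-30))
    return (F(x + eps * v) - F(x)) / eps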

14. Solving nonlinear systems: III.

To summarize

Jacobian J often provided only implicitly

Parallel functional evaluations

Efficient preconditioning of the linearized system

Efficient evaluation of the products Jx knowing the structure of J

14. Solving nonlinear systems: IV.

Efficient preconditioning of the linearized system

Can strongly simplify the problem to be parallelized:
Approximate inverse Jacobians.
Jacobians of related discretizations (convection-diffusion preconditioned by diffusion, Brown, Saad, 1980).
Operator-split Jacobians: for J = αI + S + R,

J^{-1} = (αI + S + R)^{-1} ≈ (αI + R)^{-1} (I + α^{-1} S)^{-1}.

Jacobians formed from only "strong" entries.
Jacobians of low-order discretizations.
Jacobians with frozen values for expensive terms.
Jacobians with frozen and updated values.

14. Solving nonlinear systems: V.

Getting a matrix approximation stored implicitly: cases

Get the matrix A_{i+k} by n matvecs A e_j, j = 1, . . . , n (inefficient).
A sparse A_{i+k} can often be obtained via significantly fewer than n matvecs by grouping the computed columns, if we know its pattern.
The pattern (stencil) is often known (e.g., given by the problem grid in PDE problems); this is often used in practice.
But for approximating A_{i+k} we do not need that much: it might be enough to use the approximate pattern of a different but structurally similar matrix.

14. Solving nonlinear systems: VI.

How to approximate a matrix by a small number of matvecs if we know the matrix pattern:

Example 1: Efficient estimation of a banded matrix

    ♠ ∗
    ♠ ∗ ∗
      ∗ ∗ ♠
        ∗ ♠ ∗
          ♠ ∗ ∗
            ∗ ∗ ♠
              ∗ ♠ ∗
                ♠ ∗

Columns marked by the (red) spades can be computed at the same time in one matvec, since the sparsity patterns of their rows do not overlap. Namely, A(e_1 + e_4 + e_7) computes the entries in columns 1, 4 and 7.
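A small numerical check of this idea (illustrative code, not from the slides): for a tridiagonal matrix the column groups {1, 4, 7, ...}, {2, 5, 8, ...}, {3, 6, 9, ...} are structurally orthogonal, so three matvecs with sums of unit vectors recover every entry.

import numpy as np

def estimate_tridiagonal(matvec, n):
    """Recover a tridiagonal n x n matrix from 3 matvecs (0-based indexing).

    Columns j, j+3, j+6, ... have non-overlapping row patterns, so
    A @ (e_j + e_{j+3} + ...) delivers each of them untouched."""
    A_est = np.zeros((n, n))
    for g in range(3):                        # the three colour groups
        cols = list(range(g, n, 3))
        s = np.zeros(n)
        s[cols] = 1.0                         # e_g + e_{g+3} + ...
        y = matvec(s)                         # one matvec per group
        for j in cols:
            lo, hi = max(j - 1, 0), min(j + 2, n)
            A_est[lo:hi, j] = y[lo:hi]        # the only rows column j can touch
    return A_est

# quick check against a random tridiagonal matrix
n = 8
A = (np.diag(np.random.rand(n)) + np.diag(np.random.rand(n - 1), 1)
     + np.diag(np.random.rand(n - 1), -1))
assert np.allclose(estimate_tridiagonal(lambda v: A @ v, n), A)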

14. Solving nonlinear systems: VII.

Example 2: Efficient estimation of a general matrix

[Figure: a 6 × 6 sparse matrix with three nonzeros per row; the marked columns (e.g. 1, 3 and 6) are structurally orthogonal.]

Again, one matvec can compute all columns whose row sparsity patterns do not overlap. For example, A(e_1 + e_3 + e_6) computes the entries in columns 1, 3 and 6.
All entries of A can be computed by four matvecs; in each matvec we need a group of structurally orthogonal columns.

14. Solving nonlinear systems: VIII.

Efficient matrix estimation: well established field

Structurally orthogonal columns can be grouped

Finding the minimum number of groups: a combinatorially difficult problem (NP-hard).
Classical field; a (very restricted) selection of references: Curtis, Powell, Reid, 1974; Coleman, Moré, 1983; Coleman, Moré, 1984; Coleman, Verma, 1998; Gebremedhin, Manne, Pothen, 2003.
Extensions to SPD (Hessian) approximations.
Extensions using both A and A^T in automatic differentiation.
Not only direct determination of the resulting entries (substitution methods).

14. Solving nonlinear systems: IX.

Efficient matrix estimation: graph coloring problem

[Figure: the 6 × 6 example matrix and its column intersection graph with vertices 1, . . . , 6.]

In other words, columns which form an independent set in the graph of A^T A (called the intersection graph) can be grouped ⇒ a graph coloring problem for the graph of A^T A.

Problem: Find a coloring of the vertices of the graph of A^T A (G(A^T A)) with the minimum number of colors such that edges connect only vertices of different colors.
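A sketch of the standard practical compromise: build the intersection graph from the (possibly sparsified) pattern and color it greedily. The minimum coloring is NP-hard, as noted above, but a greedy coloring in some column order is usually good enough.

def column_groups(pattern_rows, n_cols):
    """Greedy colouring of the intersection graph G(A^T A).

    pattern_rows: one list of column indices per row of the pattern.
    Two columns are adjacent iff they appear together in some row; columns
    of one colour are structurally orthogonal and form one matvec group."""
    adj = [set() for _ in range(n_cols)]
    for row in pattern_rows:                        # build G(A^T A)
        for c in row:
            adj[c].update(x for x in row if x != c)
    colour = [-1] * n_cols
    for c in range(n_cols):                         # greedy, natural column order
        used = {colour[x] for x in adj[c] if colour[x] >= 0}
        colour[c] = next(k for k in range(n_cols) if k not in used)
    groups = {}
    for c, k in enumerate(colour):
        groups.setdefault(k, []).append(c)
    return list(groups.values())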

14. Solving nonlinear systems: X.

Our matrix is defined only implicitly.
⇓

[Figure: the 6 × 6 example pattern; some of its entries, denoted by ♣, are small.]

Consider a new pattern: e.g., if the entries denoted by ♣ are small, the number of groups can be decreased (columns that were incompatible become structurally orthogonal with respect to the reduced pattern).

14. Solving nonlinear systems: XI.

Our matrix is defined only implicitly.

[Figure: the reduced pattern with the small entries ♣ dropped.]

But: the computation of the entries from matvecs is then inexact.

14. Solving nonlinear systems: XII.

Computational procedure I.

Step 1: Compute the pattern of A_i or M_i, e.g., for A_i as a sparsification of A_i.

[Figure: the pattern of A_i with the small entries ♣ dropped, giving a sparser working pattern.]

Step 2: Solve the graph coloring problem for the graph G(pattern^T pattern) to get the groups.

[Figure: the intersection graph of the sparsified pattern with vertices 1, . . . , 6 and its coloring.]

Step 3: Use matvecs to get A_{i+k} for further indices k ≥ 0, as if the entries outside the pattern were not present.

Notes:

Getting the entries from the matvecs is spoiled by errors.
The approximation error for any estimated entry a_{i,j} of A:

∑_{k : (i,k) ∈ A\P} |a_{ik}|,

where A\P denotes the entries outside the given pattern P.
The error distribution can be strongly influenced by the column grouping: balancing the error.

14. Solving nonlinear systems: XIII.

Computational procedure II.
Preconditioner based on exact estimation of the off-diagonals of A_i (a diagonal partial coloring problem).

[Figure: the full 6 × 6 pattern and the pattern with the small entries marked by ♣.]

Consider a new pattern: e.g., if the entries denoted by ♣ are small, the number of groups can be decreased.
In the example, one marked entry may be treated this way (♠ → ♠), since all off-diagonals in columns 4 and 5 are computed precisely; another may not (♠ not→ ♠), because of row 1.