Parallel Linear Algebra
Our goals: Fast and efficient parallel algorithms for
- the matrix-vector product,
- the matrix-matrix product,
- solving systems of linear equations,
- applying finite difference systems,
- and computing the fast Fourier Transform.
The matrix-vector product is the basis of most of our algorithms.
Decomposing a matrix
How to distribute an m × n matrix A to p processes?
- Rowwise decomposition: each process is responsible for m/p contiguous rows.
- Columnwise decomposition: each process is responsible for n/p contiguous columns.
- Checkerboard decomposition: assume that k divides m, that l divides n, and moreover that k · l = p. Imagine that the processes form a k × l mesh. Process (i, j) obtains the submatrix of A consisting of the i-th row interval of length m/k and the j-th column interval of length n/l.
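The index arithmetic behind these decompositions is straightforward. As a small illustration (plain C; the helper names are our own, and we assume the divisibility conditions stated above):

```c
#include <stdio.h>

/* Sketch of the index arithmetic: given process coordinates, compute the
   [lo, hi) row/column ranges of the owned block. We assume k | m, l | n,
   k * l = p, and 0-based process numbering; all names are illustrative. */
typedef struct { int row_lo, row_hi, col_lo, col_hi; } Block;

Block rowwise(int rank, int m, int n, int p) {
    Block b = { rank * (m / p), (rank + 1) * (m / p), 0, n };
    return b;
}

Block columnwise(int rank, int m, int n, int p) {
    Block b = { 0, m, rank * (n / p), (rank + 1) * (n / p) };
    return b;
}

Block checkerboard(int i, int j, int m, int n, int k, int l) {
    Block b = { i * (m / k), (i + 1) * (m / k),
                j * (n / l), (j + 1) * (n / l) };
    return b;
}

int main(void) {
    Block b = checkerboard(1, 2, 8, 12, 2, 3); /* process (1,2) in a 2 x 3 mesh */
    printf("rows [%d,%d), cols [%d,%d)\n", b.row_lo, b.row_hi, b.col_lo, b.col_hi);
    return 0;
}
```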
The Matrix-Vector Product
Our goal: Compute y = A · x for an m × n matrix A and a vector x with n components.
Assumptions:
- The matrix A has already been distributed to the various processes.
- Process 1 knows the vector x and has to determine the vector y.
The conventional sequential algorithm determines y by setting

y_i = ∑_{j=1}^{n} A[i, j] · x_j.

- To compute y_i we perform n multiplications and n − 1 additions.
- Overall, m · n multiplications and m · (n − 1) additions suffice.
The Rowwise Decomposition
- Replicate x: broadcast x to all processes in time O(n · log₂ p).
- Each process determines its m/p vector-vector products in time O(m · n / p).
- Process 1 performs a Gather operation in time O(m): p − 1 messages of length m/p are involved.

Performance analysis:
- Communication time is proportional to n · log₂ p + m, and overall time Θ(m · n/p + n · log₂ p + m) is sufficient.
- Efficiency is Θ(m · n / (m · n + p · (n · log₂ p + m))).
- Constant efficiency follows if m · n = Ω(p · (n · log₂ p + m)) = Ω(p · log₂ p · n + m · p).
- Hence we get constant efficiency for m = Ω(p · log₂ p) and n = Ω(p).
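As an illustration, here is a minimal MPI sketch of the rowwise product, assuming p divides m and writing the text's "process 1" as MPI rank 0; the function name and buffer layout are our own:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the rowwise matrix-vector product. A_local holds the m/p rows
   owned by this process (row-major). On rank 0, x_root holds the input
   vector and y_root receives the result; both may be NULL elsewhere. */
void rowwise_matvec(const double *A_local, int m, int n,
                    double *x_root, double *y_root, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int rows = m / p;

    /* replicate x: every process needs the full vector, O(n log p) */
    double *x = (rank == 0) ? x_root : malloc(n * sizeof(double));
    MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);

    /* m/p local inner products, O(mn/p) */
    double *y_local = malloc(rows * sizeof(double));
    for (int i = 0; i < rows; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A_local[i * n + j] * x[j];
        y_local[i] = s;
    }

    /* collect the row blocks of y at rank 0, O(m) */
    MPI_Gather(y_local, rows, MPI_DOUBLE, y_root, rows, MPI_DOUBLE, 0, comm);

    if (rank != 0) free(x);
    free(y_local);
}
```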
The Columnwise Decomposition
- Apply MPI_Scatter to distribute the blocks of x to "their" processes. Since this involves p − 1 messages of length n/p, time O(n) is sufficient.
- Each process i computes the matrix-vector product y^i = A^i · x^i for its block A^i of columns. Time O(m · n/p) is sufficient.
- Process 1 applies a Reduce operation to sum up y^1, y^2, . . . , y^p in time O(m · log₂ p).

Performance analysis:
- The run time is bounded by O(m · n/p + n + m · log₂ p).
- We have constant efficiency if computing time dominates communication time; this requires m = Ω(p) and n = Ω(p · log₂ p).
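A minimal sketch of the columnwise variant under the analogous assumptions (p divides n, rank 0 plays the role of process 1; names are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the columnwise matrix-vector product. A_local holds the
   m x (n/p) column block owned by this process (row-major). On rank 0,
   x_root holds the input and y_root receives the result. */
void columnwise_matvec(const double *A_local, int m, int n,
                       double *x_root, double *y_root, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int cols = n / p;

    /* scatter the blocks of x to "their" processes, O(n) */
    double *x_local = malloc(cols * sizeof(double));
    MPI_Scatter(x_root, cols, MPI_DOUBLE, x_local, cols, MPI_DOUBLE, 0, comm);

    /* partial product y^i = A^i * x^i, a full m-vector, O(mn/p) */
    double *y_partial = malloc(m * sizeof(double));
    for (int i = 0; i < m; i++) {
        double s = 0.0;
        for (int j = 0; j < cols; j++) s += A_local[i * cols + j] * x_local[j];
        y_partial[i] = s;
    }

    /* sum y^1 + ... + y^p at rank 0, O(m log p) */
    MPI_Reduce(y_partial, y_root, m, MPI_DOUBLE, MPI_SUM, 0, comm);

    free(x_local); free(y_partial);
}
```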
Checkerboard Decomposition
- Process 1 applies a Scatter operation addressed to the l processes of row 1 of the process mesh: time O(l · n/l) = O(n).
- Then each process of row 1 broadcasts its block of x to the k processes in its column: time O(n/l · log₂ k) suffices.
- All processes compute their matrix-vector products in time O(m · n/p).
- The processes in column 1 of the process mesh apply a Reduce operation for their row to sum up the l vectors of length m/k: time O(m/k · log₂ l) is sufficient.
- Process 1 gathers the k − 1 vectors of length m/k in time O(m).

Performance analysis:
- The total running time is bounded by O(m · n/p + n + n/l · log₂ k + m/k · log₂ l + m).
- The total communication time is bounded by O(n + m), provided log₂ k ≤ l and log₂ l ≤ k.
- We obtain constant efficiency if m = Ω(p) and n = Ω(p).
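The checkerboard variant needs one communicator per mesh row and one per mesh column; MPI_Comm_split provides them. A minimal sketch, assuming the mesh is stored row-major (process (i, j) has rank i · l + j) and the divisibility conditions hold; all names are our own:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the checkerboard matrix-vector product on a k x l mesh
   (k * l = p, k | m, l | n). Each process owns an (m/k) x (n/l) block
   A_local. On global rank 0, x_root holds x and y_root receives y. */
void checkerboard_matvec(const double *A_local, int m, int n, int k, int l,
                         double *x_root, double *y_root, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    int i = rank / l, j = rank % l;            /* mesh coordinates */
    int rows = m / k, cols = n / l;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, i, j, &row_comm);     /* the processes of mesh row i */
    MPI_Comm_split(comm, j, i, &col_comm);     /* the processes of mesh column j */

    /* 1. scatter the blocks of x within mesh row 0: O(n) */
    double *x_block = malloc(cols * sizeof(double));
    if (i == 0)
        MPI_Scatter(x_root, cols, MPI_DOUBLE, x_block, cols, MPI_DOUBLE,
                    0, row_comm);

    /* 2. each row-0 process broadcasts its block down its column: O(n/l * log k) */
    MPI_Bcast(x_block, cols, MPI_DOUBLE, 0, col_comm);

    /* 3. local product: O(mn/p) */
    double *y_partial = malloc(rows * sizeof(double));
    for (int r = 0; r < rows; r++) {
        double s = 0.0;
        for (int c = 0; c < cols; c++) s += A_local[r * cols + c] * x_block[c];
        y_partial[r] = s;
    }

    /* 4. sum the partial vectors of each mesh row at column 0: O(m/k * log l) */
    double *y_block = (j == 0) ? malloc(rows * sizeof(double)) : NULL;
    MPI_Reduce(y_partial, y_block, rows, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    /* 5. gather the row blocks of y at process (0,0): O(m) */
    if (j == 0)
        MPI_Gather(y_block, rows, MPI_DOUBLE, y_root, rows, MPI_DOUBLE,
                   0, col_comm);

    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
    free(x_block); free(y_partial); free(y_block);
}
```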
Summary
The checkerboard decomposition has the best performance if m ≈ n. Why?

All three decompositions have the same computation time. Assuming m = n,
- the communication time of the rowwise decomposition is dominated by broadcasting the vector x: time O(n · log₂ p),
- whereas the final Reduce dominates for the columnwise decomposition: time O(m · log₂ p).
- The checkerboard decomposition cuts down on the message length!
Matrix-Matrix Product
Our goal is to compute the n × n product matrix C = A · B for n × n matrices A and B.

To compute C[i, j] = ∑_{k=1}^{n} A[i, k] · B[k, j] sequentially, n multiplications and n − 1 additions are required. Since C has n² entries, we obtain running time Θ(n³).

We discuss four approaches:
- The first algorithm uses the rowwise decomposition.
- The algorithm of Fox and its improvement, the algorithm of Cannon, use the checkerboard decomposition.
- The DNS algorithm assumes a variant of the checkerboard decomposition.
The Rowwise Decomposition
Process i receives the submatrices A_i of A and B_i of B corresponding to the i-th row interval of length n/p.

Further subdivide A_i and B_i into the p square n/p × n/p submatrices A_{i,1}, . . . , A_{i,p} and B_{i,1}, . . . , B_{i,p}. Define C_{i,j} analogously and observe that

C_{i,j} = ∑_{k=1}^{p} A_{i,k} · B_{k,j}

holds. The computation (all process indices are taken cyclically):
- In phase 1 process i computes all products A_{i,i} · B_{i,j} for j = 1, . . . , p in time O(p · (n/p)³) = O(n³/p²), then sends B_i to process i + 1 and receives B_{i−1} from process i − 1 in time O(n²/p).
- In phase 2 process i computes all products A_{i,i−1} · B_{i−1,j}, sends B_{i−1} to process i + 1 and receives B_{i−2} from process i − 1, and so on.

Performance analysis:
- All in all there are p phases. Hence the computing time is bounded by O(n³/p) and the communication time is bounded by O(n²).
- The compute/communicate ratio (n³/p) / n² = n/p is small!
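A minimal MPI sketch of this ring algorithm, assuming p divides n and row-major strips; the multiplication picks out the block A_{i,owner} of the local strip of A that matches the strip of B currently held:

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the rowwise matrix-matrix product. A_strip and B_strip are the
   (n/p) x n row strips of this process; C_strip accumulates the result.
   In each of the p phases we multiply with the currently held strip of B
   and then shift the strips of B around the ring of processes. */
void rowwise_matmul(const double *A_strip, double *B_strip, double *C_strip,
                    int n, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int s = n / p;                                 /* block size */
    int next = (rank + 1) % p, prev = (rank + p - 1) % p;

    memset(C_strip, 0, (size_t)s * n * sizeof(double));

    for (int phase = 0; phase < p; phase++) {
        /* after `phase` shifts we hold B_owner with owner = rank - phase (mod p) */
        int owner = (rank - phase + p) % p;
        /* C_strip += A_{rank,owner} * B_strip: O((n/p)^2 * n) per phase */
        for (int i = 0; i < s; i++)
            for (int k = 0; k < s; k++) {
                double a = A_strip[i * n + owner * s + k];
                for (int j = 0; j < n; j++)
                    C_strip[i * n + j] += a * B_strip[k * n + j];
            }
        /* shift the B strips along the ring: O(n^2/p) per phase */
        MPI_Sendrecv_replace(B_strip, s * n, MPI_DOUBLE, next, 0,
                             prev, 0, comm, MPI_STATUS_IGNORE);
    }
}
```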
The Algorithm of Fox
We again determine the product matrix according to C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}, but now
- the processes are arranged in a √p × √p mesh of processes,
- process (i, j) knows the n/√p × n/√p submatrices A_{i,j} and B_{i,j}.

We have √p phases. In phase k we want process (i, j) to compute A_{i,i+k−1} · B_{i+k−1,j} (all indices are taken cyclically, i.e., modulo √p):
- process (i, i + k − 1) broadcasts A_{i,i+k−1} to all processes in row i,
- process (i, j) computes A_{i,i+k−1} · B_{i+k−1,j},
- receives B_{i+k,j} from (i + 1, j) and sends B_{i+k−1,j} to (i − 1, j).

Performance analysis:
- Per phase: computing time O((n/√p)³) and communication time O(n²/p · log p).
- We have √p phases: computation time O(n³/p), communication time O(n²/√p · log p). The compute/communicate ratio n/(√p · log₂ p) increases.
The Algorithm of Cannon
The setup is as for the algorithm of Fox. In particular, process (i, j) has to determine C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}.

At the very beginning, redistribute the matrices such that process (i, j) holds A_{i,i+j} and B_{i+j,j}.

We again have √p phases. In phase k we want process (i, j) to compute A_{i,i+j+k−1} · B_{i+j+k−1,j} (indices again cyclic):
- process (i, j) computes A_{i,i+j+k−1} · B_{i+j+k−1,j},
- sends A_{i,i+j+k−1} to (i, j − 1) and B_{i+j+k−1,j} to (i − 1, j), and
- receives A_{i,i+j+k} from (i, j + 1) and B_{i+j+k,j} from (i + 1, j).

Performance analysis:
- Per phase: computation time O((n/√p)³), communication time O((n/√p)²).
- Overall: computation time O(n³/p), communication time O(n²/√p), and the compute/communicate ratio n/√p increases again.
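A minimal MPI sketch of Cannon's algorithm using a periodic Cartesian topology, assuming p is a perfect square whose root q divides n; the initial skew realizes the redistribution described above, and all names are our own:

```c
#include <mpi.h>
#include <string.h>
#include <math.h>

/* Sketch of Cannon's algorithm. Each process holds one s x s block of A, B
   and C (s = n/q, q = sqrt(p)) in row-major order. */
void cannon(double *A_blk, double *B_blk, double *C_blk, int n, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int s = n / q, src, dst;

    /* arrange the processes in a periodic q x q mesh (a torus) */
    int dims[2] = { q, q }, periods[2] = { 1, 1 }, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* initial skew: shift row i of A left by i and column j of B up by j,
       so process (i,j) holds A[i, i+j] and B[i+j, j] (indices mod q) */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A_blk, s * s, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B_blk, s * s, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    memset(C_blk, 0, (size_t)s * s * sizeof(double));
    for (int phase = 0; phase < q; phase++) {
        /* local block product: O((n/sqrt(p))^3) per phase */
        for (int i = 0; i < s; i++)
            for (int k = 0; k < s; k++)
                for (int j = 0; j < s; j++)
                    C_blk[i * s + j] += A_blk[i * s + k] * B_blk[k * s + j];

        /* point-to-point shifts: A one step left, B one step up: O((n/sqrt(p))^2) */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A_blk, s * s, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B_blk, s * s, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}
```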
How did we save Communication?
- Rowwise decomposition: in each of the p phases row blocks are exchanged. All in all O(p · n²/p) communication.
- The algorithm of Fox: a broadcast in each of the √p phases, with communication time O(n²/p · log p) per phase. All in all communication time O(n²/√p · log p): merging point-to-point messages into broadcasts is profitable!
- The algorithm of Cannon: after initially rearranging the submatrices, the broadcasts in the algorithm of Fox are replaced by point-to-point messages. All in all communication time O(√p · n²/p).
The DNS Algorithm
p = n³ processes are arranged in an n × n × n mesh of processes. Process (i, j, 1) stores A[i, j] and B[i, j] and has to determine C[i, j].

- We move A[i, k] to the processes (i, ∗, k): (i, k, 1) sends A[i, k] to (i, k, k), which broadcasts A[i, k] to all processes (i, ∗, k).
- Next we move B[k, j] to the processes (∗, j, k): (k, j, 1) sends B[k, j] to (k, j, k), which broadcasts B[k, j] to all processes (∗, j, k).
- Process (i, j, k) computes the product A[i, k] · B[k, j].
- Process (i, j, 1) computes ∑_{k=1}^{n} A[i, k] · B[k, j] with MPI_Reduce.

Performance analysis:
- The replication step takes time O(log₂ n), since the broadcast dominates. The multiplication step runs in constant time and the Reduce operation runs in logarithmic time.
- Hence time O(log₂ n) suffices, but the efficiency Θ(1/log₂ n) is too small.
- We scale down.
Scaling down the number of processors
We work with p processes. Let q = p^{1/3} and imagine that the p processes are arranged in a q × q × q mesh.

Input distribution: process (i, j, 1) receives the n/q × n/q submatrices A_{i,j} and B_{i,j}: the matrices A_{i,j} and B_{i,j} play the role of the entries A[i, j] and B[i, j].

Mimic the algorithm for n³ processes.

Performance analysis:
- The total computing time is O(n³/q³) = O(n³/p), since n/q × n/q matrices have to be multiplied.
- During replication and summing, n/q × n/q matrices are involved, and hence the communication time is bounded by O(n²/q² · log p).
- The compute/communicate ratio is n/(q · log₂ p).

Best performance so far; p should be sufficiently large.
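The scaled-down algorithm maps naturally onto MPI sub-communicators, one per mesh line. The following is a minimal sketch under the stated assumptions (p = q³ processes, q dividing n), writing the text's 1-based process (i, j, 1) as the 0-based (i, j, 0); all names are our own:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the scaled-down DNS algorithm on a q x q x q mesh. Process
   (i,j,0) initially holds the s x s blocks A[i][j], B[i][j] (s = n/q) and
   obtains C[i][j] at the end. Rank layout: rank = i*q*q + j*q + k. */
void dns(double *A_blk, double *B_blk, double *C_blk, int n, int q, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    int i = rank / (q * q), j = (rank / q) % q, k = rank % q;
    int s = n / q, cnt = s * s;

    /* one communicator per mesh line */
    MPI_Comm i_line, j_line, k_line;
    MPI_Comm_split(comm, j * q + k, i, &i_line); /* varies i, fixed (j,k) */
    MPI_Comm_split(comm, i * q + k, j, &j_line); /* varies j, fixed (i,k) */
    MPI_Comm_split(comm, i * q + j, k, &k_line); /* varies k, fixed (i,j) */

    /* replicate A: (i,j,0) sends A[i][j] to (i,j,j); then broadcast over the
       j-line so that every (i,*,k) holds A[i][k] */
    if (k == 0 && j != 0) MPI_Send(A_blk, cnt, MPI_DOUBLE, j, 0, k_line);
    if (k == j && j != 0)
        MPI_Recv(A_blk, cnt, MPI_DOUBLE, 0, 0, k_line, MPI_STATUS_IGNORE);
    MPI_Bcast(A_blk, cnt, MPI_DOUBLE, k, j_line);

    /* replicate B: (i,j,0) sends B[i][j] to (i,j,i); then broadcast over the
       i-line so that every (*,j,k) holds B[k][j] */
    if (k == 0 && i != 0) MPI_Send(B_blk, cnt, MPI_DOUBLE, i, 0, k_line);
    if (k == i && i != 0)
        MPI_Recv(B_blk, cnt, MPI_DOUBLE, 0, 0, k_line, MPI_STATUS_IGNORE);
    MPI_Bcast(B_blk, cnt, MPI_DOUBLE, k, i_line);

    /* local block product, then reduction along the k-line back to (i,j,0) */
    double *tmp = calloc(cnt, sizeof(double));
    for (int a = 0; a < s; a++)
        for (int b = 0; b < s; b++)
            for (int c = 0; c < s; c++)
                tmp[a * s + c] += A_blk[a * s + b] * B_blk[b * s + c];
    MPI_Reduce(tmp, C_blk, cnt, MPI_DOUBLE, MPI_SUM, 0, k_line);

    free(tmp);
    MPI_Comm_free(&i_line); MPI_Comm_free(&j_line); MPI_Comm_free(&k_line);
}
```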
Summary
The checkerboard decomposition is again better than the rowwise decomposition.

Cannon's algorithm replaces a broadcast by a point-to-point message and is therefore faster than the algorithm of Fox.

The DNS algorithm partitions the matrices A and B among q² of the q³ processes.
- Thus each "input process" gets a relatively large chunk.
- However, there are only two (instead of √p) communication steps: namely when replicating and when summing.
- Observe that DNS is better than Cannon only if p is sufficiently large.
Solving Linear Systems
We are given a matrix A and a right-hand side b and would like to solve the linear system A · x = b.

We begin with the easy case of lower triangular matrices A and describe back substitution. Then we discuss efficient parallelizations of Gaussian elimination and continue with iterative methods: the Jacobi relaxation, the Gauss-Seidel algorithm, the conjugate gradient approach and the Newton method. Finally we consider the parallelization of the finite difference method.
Backsubstitution
We have to solve the system

A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i

for i = 1, . . . , n.

A sequential solution:
- First determine x_1 from the first equation A[1, 1] · x_1 = b_1.
- If we already know x_1, . . . , x_{i−1}, then determine x_i from the i-th equation.
- Since an evaluation of the i-th equation requires time O(i), the sequential solution runs in time O(n²).

We consider two input distributions:
- The off-diagonal decomposition of the matrix A: process 1 knows the main diagonal, and process i (for i ≥ 2) knows the (i − 1)-st off-diagonal A[i, 1], A[i + 1, 2], . . . , A[n, n − i + 1].
- The rowwise decomposition.
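The sequential solution in a few lines of C (a sketch; A is stored row-major and is assumed to have nonzero diagonal entries):

```c
#include <stdio.h>

/* Solve A x = b for a lower triangular n x n matrix A, evaluating
   equation i in time O(i), hence O(n^2) overall. */
void backsubstitution(int n, const double *A, const double *b, double *x) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)      /* subtract the already known terms */
            s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];         /* solve equation i for x_i */
    }
}

int main(void) {
    double A[4] = { 2, 0, 1, 3 }, b[2] = { 4, 7 }, x[2];
    backsubstitution(2, A, b, x);
    printf("x = (%g, %g)\n", x[0], x[1]);  /* expect (2, 5/3) */
    return 0;
}
```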
The Off-Diagonal Decomposition I
We use the linear array as the communication pattern.

Process 1 successively determines x_1, . . . , x_n. Once computed, x_i is forwarded through the linear array.

How do we solve the i-th equation A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i?
- Process i computes A[i, 1] · x_1 immediately after receiving x_1 from process i − 1. Then i sends A[i, 1] · x_1 to process i − 1 and x_1 to process i + 1.
- If process i − 1 receives x_2 from process i − 2, it computes the product A[i, 2] · x_2, sends the sum A[i, 1] · x_1 + A[i, 2] · x_2 to process i − 2 and forwards x_2 to process i.
- We communicate according to the principle of "just in time production".
The Off-Diagonal Decomposition II
[Figure: the pipeline plotted over time and processors. The values x_1, x_2, x_3, . . . travel rightward through the linear array while the partial sums A[2,1] · x_1, A[3,1] · x_1 + A[3,2] · x_2, A[4,1] · x_1 + A[4,2] · x_2 + A[4,3] · x_3, . . . travel back toward process 1.]
The Off-Diagonal Decomposition III
Backsubstitution with p processes.
Assign the off-diagonals (A[j, 1], . . . , A[n, n − j + 1]) for j ∈ {(i − 1) · n/p + 1, . . . , i · n/p} to process i.

The computing time: we have p phases with compute time O((n/p)²) per phase. All in all, the compute time is bounded by O(n²/p).

Communication is O(n/p) per phase, and hence O(n) overall.

The running time is bounded by O(n²/p + n). We achieve constant efficiency whenever n = Ω(p).
The Rowwise Decomposition
This time process i determines x_i. Once x_i is determined, process i broadcasts x_i to processes i + 1, . . . , n.

For p processes:
- Each process is responsible for n/p variables, and time O(n) per variable is sufficient. ⇒ The compute time is bounded by O(n²/p).
- There is one broadcast per unknown, and the communication time is bounded by O(n · log₂ p).
- We achieve constant efficiency whenever n = Ω(p · log₂ p).
Gaussian Elimination with Partial Pivoting
Include the right-hand side b as the last column of the matrix A.

If we have already eliminated the nonzeroes below the diagonal in columns 1, . . . , i − 1, then
- use the largest entry A[j, i] for j = i, . . . , n as pivot,
- swap rows i and j, and set row_k = row_k − (A[k, i] / A[i, i]) · row_i for k > i.

Performance analysis for the sequential algorithm. When dealing with row i:
- Determine the largest entry A[j, i] in column i in time O(n).
- The elimination step requires O(n − i + 1) arithmetic operations for each of the n − i rows below row i.
- All in all, O(n + (n − i + 1)²) = O(n²) operations suffice for step i.

The total number of arithmetic operations is bounded by O(n³).
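A compact sequential sketch in C, operating on the augmented matrix [A | b] as described above; names are our own:

```c
#include <math.h>

/* Gaussian elimination with partial pivoting. A is the n x (n+1) augmented
   matrix [A | b] in row-major order; x receives the solution, obtained by
   substitution on the resulting triangular system. */
void gauss(int n, double *A, double *x) {
    int w = n + 1;                            /* row width of [A | b] */
    for (int i = 0; i < n; i++) {
        int piv = i;                          /* largest |A[j][i]| for j >= i */
        for (int j = i + 1; j < n; j++)
            if (fabs(A[j * w + i]) > fabs(A[piv * w + i])) piv = j;
        for (int c = 0; c < w; c++) {         /* swap rows i and piv */
            double t = A[i * w + c];
            A[i * w + c] = A[piv * w + c];
            A[piv * w + c] = t;
        }
        for (int k = i + 1; k < n; k++) {     /* eliminate column i below row i */
            double m = A[k * w + i] / A[i * w + i];
            for (int c = i; c < w; c++) A[k * w + c] -= m * A[i * w + c];
        }
    }
    for (int i = n - 1; i >= 0; i--) {        /* solve the triangular system */
        double s = A[i * w + n];
        for (int j = i + 1; j < n; j++) s -= A[i * w + j] * x[j];
        x[i] = s / A[i * w + i];
    }
}
```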
A parallelization of Gaussian Elimination I
We work with p processes and the rowwise decomposition: each process receives an "interval" of n/p rows. We maintain the sequential structure of pivoting, but parallelize each pivoting step instead.

Assume that we have reached row i.
- To utilize the rowwise decomposition, we look for the largest entry in row i (and not in column i).
- We then have to eliminate all nonzeroes in the pivot column k: the process holding row i has to
  - determine the largest entry A[i, k] in row i,
  - compute the vector m_i of multipliers for the elimination step,
  - and send m_i to the remaining processes.
A parallelization of Gaussian Elimination II
Avoid broadcasting (m_i, k). When dealing with row i − 1:
- After computing m_{i−1}, the process j holding row i interrupts its elimination work for row i − 1,
- immediately recomputes row i and determines (m_i, k) instead,
- sends (m_i, k) to process j + 1, and
- then resumes its elimination work for row i − 1.

We cover communication by computation:
- the expensive broadcast of (m_i, k) is replaced by sending (m_i, k) through the linear array of processes,
- whenever a process receives (m_i, k), it immediately forwards (m_i, k) to its neighbor process.

Performance analysis:
- There is no delay when eliminating row i if the compute time Θ((n/p) · n) for pivoting dominates the maximal communication delay p · n.
- The overall compute time is bounded by O(n · (n/p) · n) = O(n³/p).
- There is no delay due to communication, provided n = Ω(p²).
Iterative Methods
In an iterative method an approximate solution of a linear system A · x = b is successively improved. One starts with an initial "guess" x(0) and replaces x(t) by a presumably better solution x(t + 1).

Assume that the computation of x(t + 1) is based on the matrix-vector product. Then we obtain a fast parallel algorithm and can exploit sparse linear systems.

We describe:
- the Jacobi relaxation and its variants,
- the Newton method to approximately compute the inverse A^{−1}.

In the Jacobi relaxation, x(t + 1) is obtained from x(t) by solving the i-th equation for x_i:

x_i(t + 1) = (b_i − ∑_{j≠i} A[i, j] · x_j(t)) / A[i, i].

Let D be the diagonal matrix with D[i, i] = A[i, i] and set M = D^{−1} · (D − A).
- Another view of the Jacobi iteration: if A is invertible and if x∗ is the unique solution of A · x = b, then x(t + 1) − x∗ = M · (x(t) − x∗).
- Consequently, x(t) − x∗ = M^t · (x(0) − x∗) follows for all t.
- If lim_{t→∞} M^t = 0, then x(t) converges to x∗.

The Jacobi relaxation converges for row diagonally dominant matrices A, i.e., if

|A[i, i]| > ∑_{j≠i} |A[i, j]|

holds for all i.
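A small sequential sketch of the Jacobi sweep x(t) → x(t + 1) (row-major A; names are illustrative):

```c
#include <stdlib.h>

/* Perform `iters` Jacobi sweeps for the n x n system A x = b, starting
   from the guess in x. Convergence is guaranteed for row diagonally
   dominant A. */
void jacobi(int n, const double *A, const double *b, double *x, int iters) {
    double *x_new = malloc(n * sizeof(double));
    for (int t = 0; t < iters; t++) {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++)
                if (j != i) s -= A[i * n + j] * x[j];  /* off-diagonal terms */
            x_new[i] = s / A[i * n + i];               /* solve equation i for x_i */
        }
        for (int i = 0; i < n; i++) x[i] = x_new[i];   /* x(t+1) replaces x(t) */
    }
    free(x_new);
}
```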
Two Extensions
In many practical applications the Jacobi overrelaxation converges faster. For a suitable coefficient γ:

x_i(t + 1) = (1 − γ) · x_i(t) + (γ / A[i, i]) · (b_i − ∑_{j≠i} A[i, j] · x_j(t)).

- The Jacobi relaxation is the special case γ = 1.

The Gauss-Seidel algorithm incorporates already recomputed values of x_j (i.e., it replaces x_j(t) by x_j(t + 1) whenever available). An example is

x_i(t + 1) = (1 / A[i, i]) · (b_i − ∑_{j<i} A[i, j] · x_j(t + 1) − ∑_{j>i} A[i, j] · x_j(t)).

- The Gauss-Seidel method does not look parallelizable!?
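In code, the only difference from the Jacobi sweep is that x is updated in place, so new values are used as soon as they exist; this sequential dependence is exactly what makes the method look hard to parallelize. A sketch:

```c
/* One Gauss-Seidel sweep: updating x in place makes the new values
   x_j(t+1) for j < i immediately available when computing x_i. */
void gauss_seidel_sweep(int n, const double *A, const double *b, double *x) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i) s -= A[i * n + j] * x[j]; /* x[j] is new for j < i, old for j > i */
        x[i] = s / A[i * n + i];
    }
}
```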
The Finite Difference Method

Find a function u : [0, 1]² → ℝ which satisfies the Poisson equation u_xx + u_yy = H and which has prescribed values on the boundary of the unit square [0, 1]².

If u is sufficiently smooth and if h is sufficiently small, then

u_xx(x, y) ≈ (u(x + h, y) − 2u(x, y) + u(x − h, y)) / h².

Approximate u_yy analogously and we get

H(x, y) ≈ (u(x + h, y) + u(x − h, y) + u(x, y + h) + u(x, y − h) − 4u(x, y)) / h².

For N sufficiently large, set

h = 1/N, u_{i,j} = u(i/N, j/N) and H_{i,j} = H(i/N, j/N).
The Linear System I
Choose (x, y) as one of the grid points (i/N, j/N) for 0 < i, j < N, and we get the linear system

−4u_{i,j} + u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} = H_{i,j} / N².

- The system is huge: (N − 1)² equations in (N − 1)² unknowns. (The values of u at the boundary are prescribed.)
- The matrix of the system has (N − 1)⁴ entries, but it is sparse, since any equation has at most five nonzero coefficients.
- To utilize sparsity, we apply iterative methods.
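Exploiting sparsity concretely: one Jacobi sweep for this system touches only the five stencil entries per grid point and never forms the (N − 1)² × (N − 1)² matrix. A sketch (u stored as an (N + 1) × (N + 1) row-major grid; names are our own):

```c
/* One Jacobi sweep specialized to the 5-point stencil. u and u_new are
   (N+1) x (N+1) grids; boundary entries stay fixed, so only interior
   points 0 < i, j < N are updated. */
void stencil_jacobi_sweep(int N, const double *u, double *u_new, const double *H) {
    int w = N + 1;
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            /* solve the (i,j)-equation for u[i][j], neighbors taken from u(t) */
            u_new[i * w + j] =
                (u[(i + 1) * w + j] + u[(i - 1) * w + j] +
                 u[i * w + j + 1] + u[i * w + j - 1]
                 - H[i * w + j] / ((double)N * N)) / 4.0;
}
```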
The Linear System II
We process the system beginning with the lower boundary andworking upwards: