
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, 2004 Pearson Education Inc. All rights reserved.

slides11-1

Numerical Algorithms

Chapter 11


slides11-2

Numerical Algorithms

Topics covered in the textbook:

• Matrix multiplication

• Solving a system of linear equations


slides11-3

Matrices — A Review

An n × m matrix:

      | a_{0,0}     a_{0,1}     …   a_{0,m-1}   |
      | a_{1,0}     a_{1,1}     …   a_{1,m-1}   |
A  =  |    ⋮                           ⋮        |
      | a_{n-2,0}   a_{n-2,1}   …   a_{n-2,m-1} |
      | a_{n-1,0}   a_{n-1,1}   …   a_{n-1,m-1} |

The row index i runs from 0 to n − 1; the column index j runs from 0 to m − 1.


slides11-4

Matrix Addition

Involves adding corresponding elements of each matrix to form the result matrix.

Given the elements of A as a_{i,j} and the elements of B as b_{i,j}, each element of C is computed as

c_{i,j} = a_{i,j} + b_{i,j}      (0 ≤ i < n, 0 ≤ j < m)
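A direct sequential rendering in C (a minimal sketch; the function name and C99 variable-length-array parameters are illustrative, not from the slides):

/* C = A + B for n x m matrices */
void matrix_add(int n, int m, double a[n][m], double b[n][m], double c[n][m])
{
   for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
         c[i][j] = a[i][j] + b[i][j];   /* c_{i,j} = a_{i,j} + b_{i,j} */
}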


slides11-5

Matrix Multiplication

Multiplication of two matrices, A and B, produces the matrix C

whose elements c_{i,j} (0 ≤ i < n, 0 ≤ j < m) are computed as follows:

c_{i,j} = \sum_{k=0}^{l-1} a_{i,k} b_{k,j}

where A is an n × l matrix and B is an l × m matrix.


slides11-6

[Figure: Matrix multiplication, C = A × B — element c_{i,j} is formed by multiplying the elements of row i of A with those of column j of B and summing the results.]


slides11-7

Matrix-Vector Multiplication, c = A × b

[Figure: element c_i is the sum of the products of the elements of row i of A with the elements of the vector b.]

Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n × 1 matrix (a vector). The result is an n × 1 matrix (a vector).
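A sequential C sketch of this special case (assumes a square n × n matrix A; names are illustrative):

/* c = A x b for an n x n matrix A and an n-element vector b */
void mat_vec_mult(int n, double a[n][n], double b[n], double c[n])
{
   for (int i = 0; i < n; i++) {
      c[i] = 0.0;
      for (int k = 0; k < n; k++)
         c[i] = c[i] + a[i][k] * b[k];  /* row i of A times vector b */
   }
}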


slides11-8

Relationship of Matrices to Linear Equations

A system of linear equations can be written in matrix form:

Ax = b

Matrix A holds the a constants (the coefficients), x is the vector of unknowns, and b is the vector of b constants.


slides11-9

Implementing Matrix Multiplication

Sequential Code

Assume throughout that the matrices are square (n × n matrices).

The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
   for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.


slides11-10

Parallel Code

With n processors (and n × n matrices), one can obtain:

• Time complexity of O(n^2) with n processors.
  Each instance of the inner loop is independent and can be done by a separate processor.

• Time complexity of O(n) with n^2 processors.
  One element of A and B assigned to each processor. Cost-optimal, since O(n^3) = n × O(n^2) = n^2 × O(n).

• Time complexity of O(log n) with n^3 processors.
  By parallelizing the inner loop. Not cost-optimal, since O(n^3) ≠ n^3 × O(log n).

O(log n) is the lower bound for parallel matrix multiplication.


slides11-11

Partitioning into Submatrices

Suppose the matrix is divided into s^2 submatrices, each of n/s × n/s elements. Using the notation A_{p,q} for the submatrix in submatrix row p and submatrix column q:

for (p = 0; p < s; p++)
   for (q = 0; q < s; q++) {
      Cp,q = 0;                       /* clear elements of submatrix */
      for (r = 0; r < s; r++)         /* submatrix multiplication and */
         Cp,q = Cp,q + Ap,r * Br,q;   /* add to accumulating submatrix */
   }

The line

Cp,q = Cp,q + Ap,r * Br,q;

means multiply submatrices A_{p,r} and B_{r,q} using matrix multiplication and add the result to submatrix C_{p,q} using matrix addition. Known as block matrix multiplication.
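Spelled out with explicit element loops, one possible C rendering of the block algorithm is the following sketch (assumes square n × n matrices with n divisible by s; all names are illustrative, not from the textbook):

/* Block matrix multiplication: an s x s grid of (n/s) x (n/s) submatrices */
void block_mat_mult(int n, int s, double a[n][n], double b[n][n], double c[n][n])
{
   int size = n / s;                               /* submatrix edge length */
   for (int p = 0; p < s; p++)
      for (int q = 0; q < s; q++) {
         for (int i = p*size; i < (p+1)*size; i++) /* clear C_{p,q} */
            for (int j = q*size; j < (q+1)*size; j++)
               c[i][j] = 0.0;
         for (int r = 0; r < s; r++)               /* C_{p,q} += A_{p,r} * B_{r,q} */
            for (int i = p*size; i < (p+1)*size; i++)
               for (int j = q*size; j < (q+1)*size; j++)
                  for (int k = r*size; k < (r+1)*size; k++)
                     c[i][j] += a[i][k] * b[k][j];
      }
}

In a parallel version, each (p, q) submatrix product can be assigned to a different processor, since the s^2 accumulations are independent.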


slides11-12

Block Matrix Multiplication

× =

Sum

A B C

p

qMultiply results


slides11-13

Submatrix multiplication to obtain C_{0,0}

(a) Matrices: A and B are 4 × 4 matrices, each divided into four 2 × 2 submatrices, e.g.

A_{0,0} = | a_{0,0}  a_{0,1} |     A_{0,1} = | a_{0,2}  a_{0,3} |
          | a_{1,0}  a_{1,1} |               | a_{1,2}  a_{1,3} |

B_{0,0} = | b_{0,0}  b_{0,1} |     B_{1,0} = | b_{2,0}  b_{2,1} |
          | b_{1,0}  b_{1,1} |               | b_{3,0}  b_{3,1} |

(b) Multiplying A_{0,0} × B_{0,0} + A_{0,1} × B_{1,0}:

  | a_{0,0}b_{0,0}+a_{0,1}b_{1,0}   a_{0,0}b_{0,1}+a_{0,1}b_{1,1} |     | a_{0,2}b_{2,0}+a_{0,3}b_{3,0}   a_{0,2}b_{2,1}+a_{0,3}b_{3,1} |
= | a_{1,0}b_{0,0}+a_{1,1}b_{1,0}   a_{1,0}b_{0,1}+a_{1,1}b_{1,1} |  +  | a_{1,2}b_{2,0}+a_{1,3}b_{3,0}   a_{1,2}b_{2,1}+a_{1,3}b_{3,1} |

= | a_{0,0}b_{0,0}+a_{0,1}b_{1,0}+a_{0,2}b_{2,0}+a_{0,3}b_{3,0}   a_{0,0}b_{0,1}+a_{0,1}b_{1,1}+a_{0,2}b_{2,1}+a_{0,3}b_{3,1} |
  | a_{1,0}b_{0,0}+a_{1,1}b_{1,0}+a_{1,2}b_{2,0}+a_{1,3}b_{3,0}   a_{1,0}b_{0,1}+a_{1,1}b_{1,1}+a_{1,2}b_{2,1}+a_{1,3}b_{3,1} |

= C_{0,0}


slides11-14

Direct Implementation

[Figure: processor P_{i,j} holds row i of A (a[i][]) and column j of B (b[][j]) and computes element c[i][j].]

One processor computes each element of C, so n^2 processors would be needed. Each processor needs one row of elements of A and one column of elements of B; some of the same elements are sent to more than one processor. Submatrices can be used instead of individual elements.


slides11-15

Performance Improvement

Using a tree construction, n numbers can be added in log n steps using n processors:

[Figure: for c_{0,0}, processors P0–P3 form the products a_{0,0} × b_{0,0}, a_{0,1} × b_{1,0}, a_{0,2} × b_{2,0}, a_{0,3} × b_{3,0}, which are then added pairwise in a tree.]

Computational time complexity of O(log n) using n^3 processors.
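A sequential C sketch of the pairwise (tree) summation; in the parallel version, all additions within one level run concurrently on separate processors (names illustrative):

/* Tree summation of n numbers in ceil(log2 n) levels; overwrites x */
double tree_sum(int n, double x[])
{
   for (int stride = 1; stride < n; stride *= 2)   /* one tree level per pass */
      for (int i = 0; i + stride < n; i += 2*stride)
         x[i] = x[i] + x[i + stride];              /* independent pairwise adds */
   return x[0];
}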


slides11-16

Recursive Implementation

[Figure: A, B, and C are each divided into four submatrices (A_{pp}, A_{pq}, A_{qp}, A_{qq}, and so on); the eight submatrix products P0–P7 are added pairwise: P0 + P1, P2 + P3, P4 + P5, P6 + P7.]

Apply the same algorithm to each submatrix recursively.

An excellent algorithm for a shared-memory system because of the locality of the operations.


slides11-17

Recursive Algorithm

mat_mult(App, Bpp, s)
{
   if (s == 1)                  /* if submatrix has one element */
      C = A * B;                /* multiply elements */
   else {                       /* continue to make recursive calls */
      s = s/2;                  /* no of elements in each row/column */
      P0 = mat_mult(App, Bpp, s);
      P1 = mat_mult(Apq, Bqp, s);
      P2 = mat_mult(App, Bpq, s);
      P3 = mat_mult(Apq, Bqq, s);
      P4 = mat_mult(Aqp, Bpp, s);
      P5 = mat_mult(Aqq, Bqp, s);
      P6 = mat_mult(Aqp, Bpq, s);
      P7 = mat_mult(Aqq, Bqq, s);
      Cpp = P0 + P1;            /* add submatrix products to */
      Cpq = P2 + P3;            /* form submatrices of final matrix */
      Cqp = P4 + P5;
      Cqq = P6 + P7;
   }
   return (C);                  /* return final matrix */
}


slides11-18

Mesh Implementations

• Cannon’s algorithm

• Fox’s algorithm (not in textbook but similar complexity)

• Systolic array

All involve processors arranged in a mesh, shifting elements of the arrays through the mesh, and accumulating the partial sums at each processor.


slides11-19

Mesh Implementations — Cannon's Algorithm

Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.

1. Initially, processor P_{i,j} has elements a_{i,j} and b_{i,j} (0 ≤ i < n, 0 ≤ j < n).

2. Elements are moved from their initial position to an "aligned" position. The complete ith row of A is shifted i places left, and the complete jth column of B is shifted j places upward. This has the effect of placing element a_{i,j+i} and element b_{i+j,j} in processor P_{i,j}. These elements are a pair of those required in the accumulation of c_{i,j}.

3. Each processor, P_{i,j}, multiplies its elements.

4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This has the effect of bringing together the adjacent elements of A and B, which will also be required in the accumulation.

5. Each processor, P_{i,j}, multiplies the elements brought to it and adds the result to the accumulating sum.

6. Steps 4 and 5 are repeated until the final result is obtained (n − 1 shifts with n rows and n columns of elements).
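A sequential C simulation of these steps, with one array element standing in for each processor, may make the shifting concrete (a sketch; the wraparound is simulated with modular indexing, and the names are illustrative, not the textbook's code):

/* Cannon's algorithm simulated sequentially; a and b are shifted in place */
void cannon(int n, double a[n][n], double b[n][n], double c[n][n])
{
   double t[n];
   for (int i = 0; i < n; i++) {          /* step 2: row i of A left i places */
      for (int j = 0; j < n; j++) t[j] = a[i][(j + i) % n];
      for (int j = 0; j < n; j++) a[i][j] = t[j];
   }
   for (int j = 0; j < n; j++) {          /* step 2: column j of B up j places */
      for (int i = 0; i < n; i++) t[i] = b[(i + j) % n][j];
      for (int i = 0; i < n; i++) b[i][j] = t[i];
   }
   for (int i = 0; i < n; i++)            /* step 3: first products */
      for (int j = 0; j < n; j++)
         c[i][j] = a[i][j] * b[i][j];
   for (int step = 1; step < n; step++) { /* steps 4-6: n - 1 shift rounds */
      for (int i = 0; i < n; i++) {       /* rows of A one place left */
         double first = a[i][0];
         for (int j = 0; j < n - 1; j++) a[i][j] = a[i][j + 1];
         a[i][n - 1] = first;
      }
      for (int j = 0; j < n; j++) {       /* columns of B one place up */
         double top = b[0][j];
         for (int i = 0; i < n - 1; i++) b[i][j] = b[i + 1][j];
         b[n - 1][j] = top;
      }
      for (int i = 0; i < n; i++)         /* step 5: multiply and accumulate */
         for (int j = 0; j < n; j++)
            c[i][j] += a[i][j] * b[i][j];
   }
}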


slides11-20

[Figure: Movement of A and B elements — the A elements shift left along the rows and the B elements shift up along the columns, passing through processor P_{i,j}.]


slides11-21

[Figure: Step 2 — alignment of the elements of A and B. Row i of A is shifted i places left and column j of B is shifted j places upward, placing a_{i,j+i} and b_{i+j,j} in processor P_{i,j}.]


slides11-22

[Figure: Step 4 — one-place shift of the elements of A (left) and B (upward) through processor P_{i,j}.]


slides11-23

Systolic Array

[Figure: a 4 × 4 systolic array computing C = A × B. The elements of each row of A (a_{i,3}, …, a_{i,0}) are pumped in from the left, and the elements of each column of B (b_{3,j}, …, b_{0,j}) are pumped in from the top, with a one-cycle delay between adjacent rows and columns; the cell at position (i, j) accumulates c_{i,j}.]


slides11-24

Matrix-Vector Multiplication

[Figure: a linear systolic array computing c = A × b. The elements of each row of A (a_{i,3}, …, a_{i,0}) are pumped in from the left and the elements of b (b_3, …, b_0) are pumped through the cells; cell i accumulates c_i.]


slides11-25

Solving a System of Linear Equations

a_{n−1,0}x_0 + a_{n−1,1}x_1 + a_{n−1,2}x_2 + … + a_{n−1,n−1}x_{n−1} = b_{n−1}
      ⋮
a_{2,0}x_0 + a_{2,1}x_1 + a_{2,2}x_2 + … + a_{2,n−1}x_{n−1} = b_2
a_{1,0}x_0 + a_{1,1}x_1 + a_{1,2}x_2 + … + a_{1,n−1}x_{n−1} = b_1
a_{0,0}x_0 + a_{0,1}x_1 + a_{0,2}x_2 + … + a_{0,n−1}x_{n−1} = b_0

which, in matrix form, is

Ax = b

The objective is to find values for the unknowns x_0, x_1, …, x_{n−1}, given values for a_{0,0}, a_{0,1}, …, a_{n−1,n−1} and b_0, …, b_{n−1}.


slides11-26

Solving a System of Linear Equations

Dense matrices

Gaussian elimination — parallel time complexity O(n^2).

Sparse matrices

By iteration — depends upon the iteration method and the number of iterations, but typically O(log n):

• Jacobi iteration
• Gauss-Seidel relaxation (not good for parallelization)
• Red-black ordering
• Multigrid


slides11-27

Gaussian Elimination

Converts a general system of linear equations into a triangular system, which can then be solved by back substitution.

Uses the property of linear equations that any row can be replaced by that row added to another row multiplied by a constant.

Starts at the first row and works toward the bottom row. At the ith row, each row j below the ith row is replaced by row j + (row i)(−a_{j,i}/a_{i,i}); the constant used for row j is −a_{j,i}/a_{i,i}. This has the effect of making all the elements in the ith column below the ith row zero, because

a_{j,i} = a_{j,i} + a_{i,i} (−a_{j,i} / a_{i,i}) = 0


slides11-28

[Figure: Gaussian elimination — stepping through the rows, element a_{j,i} in each row j below row i is cleared to zero; the columns to the left of column i are already cleared to zero.]


slides11-29

Partial Pivoting

If a_{i,i} is zero or close to zero, we will not be able to compute the quantity −a_{j,i}/a_{i,i}.

The procedure must be modified into so-called partial pivoting: swap the ith row with the row below it that has the largest absolute element in the ith column, if there is one. (Reordering the equations will not affect the system.)

In the following, we will not consider partial pivoting.
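For reference, the pivot step might be sketched in C as follows (an illustrative fragment, not the textbook's code; fabs() requires <math.h>):

/* Partial pivoting at step i: swap row i with the row at or below it
   having the largest absolute element in column i */
pivot = i;
for (j = i + 1; j < n; j++)
   if (fabs(a[j][i]) > fabs(a[pivot][i]))
      pivot = j;
if (pivot != i) {                    /* reordering does not affect the system */
   for (k = 0; k < n; k++) {
      tmp = a[i][k]; a[i][k] = a[pivot][k]; a[pivot][k] = tmp;
   }
   tmp = b[i]; b[i] = b[pivot]; b[pivot] = tmp;
}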


slides11-30

Sequential Code

Without partial pivoting:

for (i = 0; i < n-1; i++)             /* for each row, except last */
   for (j = i+1; j < n; j++) {        /* step through subsequent rows */
      m = a[j][i] / a[i][i];          /* compute multiplier */
      for (k = i; k < n; k++)         /* last n-i elements of row j */
         a[j][k] = a[j][k] - a[i][k] * m;
      b[j] = b[j] - b[i] * m;         /* modify right side */
   }

The time complexity is O(n^3).
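The elimination leaves an upper-triangular system, which back substitution then solves in O(n^2) time. A minimal sketch in the same style (the slides mention back substitution but do not show it):

/* Back substitution on the triangularized system */
for (i = n-1; i >= 0; i--) {
   x[i] = b[i];
   for (k = i+1; k < n; k++)
      x[i] = x[i] - a[i][k] * x[k];   /* subtract already-known terms */
   x[i] = x[i] / a[i][i];             /* divide by diagonal element */
}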


slides11-31

Parallel Implementation

[Figure: the ith row (n − i + 1 elements, including b[i]) is broadcast; the rows below it are updated, while the columns to the left are already cleared to zero.]


slides11-32

Analysis

Communication

n − 1 broadcasts performed sequentially; the ith broadcast contains n − i + 1 elements.

Time complexity of O(n^2) (see textbook).

Computation

After a row broadcast, each processor P_j beyond the broadcast processor P_i will compute its multiplier and operate upon n − j + 2 elements of its row. Ignoring the computation of the multiplier, there are n − j + 2 multiplications and n − j + 2 subtractions.

Time complexity of O(n^2) (see textbook).

Efficiency will be relatively low because all the processors before the processor holding row i do not participate in the computation again.


slides11-33

[Figure: Pipeline implementation of Gaussian elimination — rows are broadcast through processors P0, P1, P2, …, P_{n−1} in pipelined fashion.]


slides11-34

Strip Partitioning

[Figure: the n rows are divided into contiguous strips of n/p rows — P0 holds rows 0 to n/p − 1, P1 the rows from n/p to 2n/p − 1, and so on.]

Poor processor allocation! Processors do not participate in the computation after their last row is processed.


slides11-35

Cyclic-Striped Partitioning

An alternative that equalizes the processor workload:

[Figure: blocks of rows are allocated to the processors in round-robin (cyclic) fashion — P0 and P1 alternate down the matrix.]


slides11-36

Iterative Methods

The time complexity of the direct method, at O(N^2) with N processors, is significant.

The time complexity of an iterative method depends upon:

• the type of iteration,
• the number of iterations,
• the number of unknowns, and
• the required accuracy,

but it can be less than that of the direct method, especially when each equation involves few unknowns, i.e., a sparse system of linear equations.


slides11-37

Jacobi Iteration

Iteration formula — the ith equation rearranged to have the ith unknown on the left side:

x_i^k = (1 / a_{i,i}) [ b_i − \sum_{j≠i} a_{i,j} x_j^{k−1} ]

The superscript indicates the iteration: x_i^k is the kth iteration of x_i, and x_j^{k−1} is the (k−1)th iteration of x_j.
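A sequential C sketch of one Jacobi sweep (arrays x, xnew, a, and b are assumed; names illustrative). Because every update reads only (k−1)th-iteration values, all n updates are independent and can be computed in parallel:

/* One Jacobi iteration: compute x^k (xnew) from x^(k-1) (x) */
for (i = 0; i < n; i++) {
   sum = 0.0;
   for (j = 0; j < n; j++)
      if (j != i)
         sum = sum + a[i][j] * x[j];   /* sum of a_{i,j} x_j^(k-1), j != i */
   xnew[i] = (b[i] - sum) / a[i][i];   /* x_i^k */
}
/* copy xnew into x and repeat until converged */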


slides11-38

Example of a Sparse System of Linear Equations

Laplace's Equation

∂²f/∂x² + ∂²f/∂y² = 0

Solve for f over the two-dimensional x-y space.

For a computer solution, finite difference methods are appropriate. The two-dimensional solution space is "discretized" into a large number of solution points.


slides11-39

Finite Difference Method

[Figure: the solution space for f(x, y) over the x-y plane, discretized into points with spacing ∆.]


slides11-40

If the distance between points, ∆, is made small enough:

∂²f/∂x² ≈ (1/∆²)[f(x + ∆, y) − 2f(x, y) + f(x − ∆, y)]

∂²f/∂y² ≈ (1/∆²)[f(x, y + ∆) − 2f(x, y) + f(x, y − ∆)]

Substituting into Laplace's equation, we get

(1/∆²)[f(x + ∆, y) + f(x − ∆, y) + f(x, y + ∆) + f(x, y − ∆) − 4f(x, y)] = 0

Rearranging, we get

f(x, y) = [f(x − ∆, y) + f(x, y − ∆) + f(x + ∆, y) + f(x, y + ∆)] / 4

Rewritten as an iterative formula:

f^k(x, y) = [f^{k−1}(x − ∆, y) + f^{k−1}(x, y − ∆) + f^{k−1}(x + ∆, y) + f^{k−1}(x, y + ∆)] / 4

where f^k(x, y) is the kth iteration and f^{k−1}(x, y) is the (k−1)th iteration.
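In C, one Jacobi-style sweep of this update over the interior points of the grid might look like this (a sketch; f holds iteration k−1 with fixed boundary values in its border, g receives iteration k; names illustrative):

/* One iteration of the four-point update for Laplace's equation */
for (i = 1; i <= n; i++)
   for (j = 1; j <= n; j++)
      g[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
/* copy the interior of g back into f and repeat until converged */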


slides11-41

Natural Order

[Figure: the mesh points numbered in natural (row-major) order, x_1 through x_{100} for a 10 × 10 array of points, together with the surrounding boundary points (see text).]


slides11-42

Relationship with a General System of Linear Equations

Using natural ordering, the ith point is computed from the ith equation:

x_i = (x_{i−n} + x_{i−1} + x_{i+1} + x_{i+n}) / 4

or

x_{i−n} + x_{i−1} − 4x_i + x_{i+1} + x_{i+n} = 0

which is a linear equation with five unknowns (except those with boundary points).

In general form, the ith equation becomes

a_{i,i−n}x_{i−n} + a_{i,i−1}x_{i−1} + a_{i,i}x_i + a_{i,i+1}x_{i+1} + a_{i,i+n}x_{i+n} = 0

where a_{i,i} = −4, and a_{i,i−n} = a_{i,i−1} = a_{i,i+1} = a_{i,i+n} = 1.


slides11-43

[Figure: the system written as A x = 0 — each row of the sparse matrix A has −4 on the diagonal (a_{i,i}) and 1s in positions a_{i,i−n}, a_{i,i−1}, a_{i,i+1}, and a_{i,i+n}, plus entries to include boundary values and some zero entries (see text).]

Those equations with a boundary point on the diagonal are unnecessary for the solution.


slides11-44

Gauss-Seidel Relaxation

Uses some newly computed values to compute other values in the same iteration.

[Figure: points are computed in sequential order — each point to be computed uses the values of points already computed in this iteration.]

The basic form is not suitable for parallelization.


slides11-45

Gauss-Seidel Iteration Formula

x_i^k = (1 / a_{i,i}) [ b_i − \sum_{j=1}^{i−1} a_{i,j} x_j^k − \sum_{j=i+1}^{N} a_{i,j} x_j^{k−1} ]

where the superscript indicates the iteration.

With the natural ordering of unknowns, the formula reduces to

x_i^k = (−1 / a_{i,i}) [ a_{i,i−n} x_{i−n}^k + a_{i,i−1} x_{i−1}^k + a_{i,i+1} x_{i+1}^{k−1} + a_{i,i+n} x_{i+n}^{k−1} ]

At the kth iteration, two of the four values (before the ith element) are taken from the kth iteration and two values (after the ith element) are taken from the (k−1)th iteration. We have

f^k(x, y) = [f^k(x − ∆, y) + f^k(x, y − ∆) + f^{k−1}(x + ∆, y) + f^{k−1}(x, y + ∆)] / 4
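For the Laplace grid this becomes an in-place sweep in natural (row-major) order, which is what makes the basic form sequential (a sketch; names illustrative):

/* One Gauss-Seidel sweep: f[i-1][j] and f[i][j-1] already hold kth-iteration
   values; f[i+1][j] and f[i][j+1] still hold (k-1)th-iteration values */
for (i = 1; i <= n; i++)
   for (j = 1; j <= n; j++)
      f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);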


slides11-46

Red-Black Ordering

First, black points are computed. Next, red points are computed. All black points can be computed simultaneously, and all red points can be computed simultaneously.

[Figure: the grid points colored red and black in a checkerboard pattern.]


slides11-47

Red-Black Parallel Code

forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 == 0)          /* compute red points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);

forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 != 0)          /* compute black points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
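The forall construct is pseudocode. One possible rendering with OpenMP (an assumption of this sketch, not something the slides prescribe) parallelizes each phase; the implicit barrier at the end of each parallel loop keeps the two phases separated. Within a phase, no two updated points are neighbors, so the updates are independent:

#pragma omp parallel for private(j)   /* phase 1: all updates independent */
for (i = 1; i < n; i++)
   for (j = 1; j < n; j++)
      if ((i + j) % 2 == 0)           /* compute red points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);

#pragma omp parallel for private(j)   /* phase 2, after implicit barrier */
for (i = 1; i < n; i++)
   for (j = 1; j < n; j++)
      if ((i + j) % 2 != 0)           /* compute black points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);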


slides11-48

Higher-Order Difference Methods

More distant points could be used in the computation. The following update formula:

f^k(x, y) = (1/60) [ 16f^{k−1}(x − ∆, y) + 16f^{k−1}(x, y − ∆) + 16f^{k−1}(x + ∆, y) + 16f^{k−1}(x, y + ∆)
                     − f^{k−1}(x − 2∆, y) − f^{k−1}(x, y − 2∆) − f^{k−1}(x + 2∆, y) − f^{k−1}(x, y + 2∆) ]


slides11-49

Nine-Point Stencil

[Figure: the nine-point stencil used by the update formula above — the point (x, y) together with its four nearest neighbors at distance ∆ and four neighbors at distance 2∆.]


slides11-50

Overrelaxation

Improved convergence can be obtained by adding the factor (1 − ω)x_i^{k−1} to the Jacobi or Gauss-Seidel formulae. The factor ω is the overrelaxation parameter.

Jacobi overrelaxation formula:

x_i^k = (ω / a_{i,i}) [ b_i − \sum_{j≠i} a_{i,j} x_j^{k−1} ] + (1 − ω) x_i^{k−1}

where 0 < ω < 1.

Gauss-Seidel successive overrelaxation:

x_i^k = (ω / a_{i,i}) [ b_i − \sum_{j=1}^{i−1} a_{i,j} x_j^k − \sum_{j=i+1}^{N} a_{i,j} x_j^{k−1} ] + (1 − ω) x_i^{k−1}

where 0 < ω ≤ 2. If ω = 1, we obtain the Gauss-Seidel method.
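Applied to the Laplace grid, successive overrelaxation is a one-line change to the Gauss-Seidel sweep (a sketch; omega and the other names are illustrative):

/* One SOR sweep: blend the Gauss-Seidel update with the old value */
for (i = 1; i <= n; i++)
   for (j = 1; j <= n; j++)
      f[i][j] = omega * 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1])
                + (1.0 - omega) * f[i][j];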


slides11-51

Multigrid Method

First, a coarse grid of points is used. With these points, the iteration process will start to converge quickly.

At some stage, the number of points is increased to include the points of the coarse grid plus extra points between them. The initial values of the extra points are found by interpolation, and the computation continues with this finer grid.

The grid can be made finer and finer as the computation proceeds, or the computation can alternate between fine and coarse grids.

The coarser grids take into account distant effects more quickly and provide a good starting point for the next finer grid.


slides11-52

Multigrid Processor Allocation

[Figure: each processor is assigned one coarsest-grid point together with the surrounding finer-grid points.]


slides11-53

(Semi) Asynchronous Iteration

As noted earlier, synchronizing on every iteration causes significant overhead; it is best to synchronize only after a number of iterations.
