Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, 2004 Pearson Education Inc. All rights reserved.

Numerical Algorithms

Chapter 11


Numerical Algorithms

In the textbook:

• Matrix multiplication

• Solving a system of linear equations


Matrices - A Review

An n × m matrix (rows numbered 0 to n−1, columns 0 to m−1):

    a0,0     a0,1     ...  a0,m-2    a0,m-1
    a1,0     a1,1     ...  a1,m-2    a1,m-1
    ...
    an-2,0   an-2,1   ...  an-2,m-2  an-2,m-1
    an-1,0   an-1,1   ...  an-1,m-2  an-1,m-1


Matrix Addition

Involves adding corresponding elements of each matrix to form the result matrix.

Given the elements of A as ai,j and the elements of B as bi,j, each element of C is computed as

ci,j = ai,j + bi,j

(0 ≤ i < n, 0 ≤ j < m)
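The rule above translates directly into code; a minimal Python sketch (function name is illustrative, not from the slides):

```python
def mat_add(A, B):
    """Element-wise addition: c[i][j] = a[i][j] + b[i][j]."""
    n, m = len(A), len(A[0])
    return [[A[i][j] + B[i][j] for j in range(m)] for i in range(n)]

print(mat_add([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[6, 8], [10, 12]]
```

Every element of C is independent of the others, which is what makes the operation trivially parallel.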


Matrix Multiplication

Multiplication of two matrices, A and B, produces the matrix C

whose elements, ci,j (0 ≤ i < n, 0 ≤ j < m), are computed as follows:

ci,j = Σ (k = 0 to l−1) ai,k bk,j

where A is an n × l matrix and B is an l × m matrix.


Figure: Matrix multiplication, C = A × B - element ci,j is formed by multiplying row i of A by column j of B and summing the results.


Matrix-Vector Multiplication, c = A × b

Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n × 1 matrix (vector). The result is an n × 1 matrix (vector): each ci is the sum of row i of A multiplied by b.


Relationship of Matrices to Linear Equations

A system of linear equations can be written in matrix form:

Ax = b

Matrix A holds the a constants

x is a vector of the unknowns

b is a vector of the b constants.


Implementing Matrix Multiplication

Sequential Code

Assume throughout that the matrices are square (n × n matrices).

The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³). Very easy to parallelize.
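A runnable Python equivalent of the C loop above (a sketch; assumes square matrices stored as lists of lists):

```python
def mat_mult(a, b):
    # triple loop: n^3 multiplications and n^3 additions
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
    return c

print(mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```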


Parallel Code

With n processors (and n × n matrices), we can obtain:

• Time complexity of O(n²) with n processors
  Each instance of the inner loop is independent and can be done by a separate processor.

• Time complexity of O(n) with n² processors
  One element of A and B assigned to each processor. Cost optimal since O(n³) = n × O(n²) = n² × O(n).

• Time complexity of O(log n) with n³ processors
  By parallelizing the inner loop. Not cost-optimal since O(n³) ≠ n³ × O(log n).

O(log n) is the lower bound for parallel matrix multiplication.


Partitioning into Submatrices

Suppose the matrix is divided into s² submatrices. Each submatrix has n/s × n/s elements. Using the notation Ap,q for the submatrix in submatrix row p and submatrix column q:

for (p = 0; p < s; p++)
  for (q = 0; q < s; q++) {
    Cp,q = 0;                    /* clear elements of submatrix */
    for (r = 0; r < s; r++)      /* submatrix multiplication and */
      Cp,q = Cp,q + Ap,r * Br,q; /* add to accum. submatrix */
  }

The line

Cp,q = Cp,q + Ap,r * Br,q;

means multiply submatrices Ap,r and Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition. Known as block matrix multiplication.
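A Python sketch of block matrix multiplication under the same partitioning (the helper names sub and sub_mult_add are illustrative, not from the textbook):

```python
def block_mat_mult(A, B, s):
    """View the n x n matrices as s x s grids of (n/s) x (n/s) submatrices
    and accumulate submatrix products, as in the pseudocode above."""
    n = len(A)
    w = n // s                     # submatrix width

    def sub(M, p, q):              # extract submatrix Mp,q
        return [row[q*w:(q+1)*w] for row in M[p*w:(p+1)*w]]

    def sub_mult_add(X, Y, C):     # C += X * Y (dense submatrix multiply)
        for i in range(w):
            for j in range(w):
                C[i][j] += sum(X[i][k] * Y[k][j] for k in range(w))

    result = [[0] * n for _ in range(n)]
    for p in range(s):
        for q in range(s):
            Cpq = [[0] * w for _ in range(w)]
            for r in range(s):
                sub_mult_add(sub(A, p, r), sub(B, r, q), Cpq)
            for i in range(w):
                for j in range(w):
                    result[p*w + i][q*w + j] = Cpq[i][j]
    return result
```

The outer p, q loop bodies are independent, so each submatrix Cp,q can be assigned to a different processor.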


Block Matrix Multiplication

Figure: Block matrix multiplication - submatrix Cp,q is formed by multiplying the submatrices in row p of A by those in column q of B and summing the results.


Submatrix multiplication to obtain C0,0

(a) Matrices: A and B are 4 × 4, each partitioned into four 2 × 2 submatrices (A0,0, A0,1, A1,0, A1,1 and similarly for B).

(b) Multiplying: C0,0 = A0,0 × B0,0 + A0,1 × B1,0

    A0,0 × B0,0 = | a0,0b0,0+a0,1b1,0   a0,0b0,1+a0,1b1,1 |
                  | a1,0b0,0+a1,1b1,0   a1,0b0,1+a1,1b1,1 |

    A0,1 × B1,0 = | a0,2b2,0+a0,3b3,0   a0,2b2,1+a0,3b3,1 |
                  | a1,2b2,0+a1,3b3,0   a1,2b2,1+a1,3b3,1 |

Adding the two products:

    C0,0 = | a0,0b0,0+a0,1b1,0+a0,2b2,0+a0,3b3,0   a0,0b0,1+a0,1b1,1+a0,2b2,1+a0,3b3,1 |
           | a1,0b0,0+a1,1b1,0+a1,2b2,0+a1,3b3,0   a1,0b0,1+a1,1b1,1+a1,2b2,1+a1,3b3,1 |


Direct Implementation

One processor computes each element of C, so n² processors would be needed. Processor Pi,j needs one row of elements of A (a[i][]) and one column of elements of B (b[][j]) to compute c[i][j]. Some of the same elements are sent to more than one processor. Can use submatrices.


Performance Improvement

Using a tree construction, n numbers can be added in log n steps using n processors. For example, c0,0 = a0,0 × b0,0 + a0,1 × b1,0 + a0,2 × b2,0 + a0,3 × b3,0 can be formed by computing the four products in parallel and combining the partial sums pairwise in a tree.

Computational time complexity of O(log n) using n³ processors.


Recursive Implementation

Divide each matrix into four submatrices (App, Apq, Aqp, Aqq, and similarly for B and C). The eight submatrix products, computed by processes P0 to P7, are added in pairs (P0 + P1, P2 + P3, P4 + P5, P6 + P7) to form Cpp, Cpq, Cqp, and Cqq. Apply the same algorithm on each submatrix recursively.

Excellent algorithm for a shared memory system because of the locality of the operations.


Recursive Algorithm

mat_mult(App, Bpp, s)
{
  if (s == 1)            /* if submatrix has one element */
    C = A * B;           /* multiply elements */
  else {                 /* continue to make recursive calls */
    s = s/2;             /* no of elements in each row/column */
    P0 = mat_mult(App, Bpp, s);
    P1 = mat_mult(Apq, Bqp, s);
    P2 = mat_mult(App, Bpq, s);
    P3 = mat_mult(Apq, Bqq, s);
    P4 = mat_mult(Aqp, Bpp, s);
    P5 = mat_mult(Aqq, Bqp, s);
    P6 = mat_mult(Aqp, Bpq, s);
    P7 = mat_mult(Aqq, Bqq, s);
    Cpp = P0 + P1;       /* add submatrix products to */
    Cpq = P2 + P3;       /* form submatrices of final matrix */
    Cqp = P4 + P5;
    Cqq = P6 + P7;
  }
  return (C);            /* return final matrix */
}
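A runnable sequential Python rendering of this recursive algorithm (a sketch; assumes n is a power of 2 - in a real parallel version the eight recursive calls would run concurrently):

```python
def rec_mat_mult(A, B):
    """Recursive block matrix multiplication of two n x n matrices."""
    s = len(A)
    if s == 1:                       # one element: multiply directly
        return [[A[0][0] * B[0][0]]]
    h = s // 2

    def quad(M):                     # split M into Mpp, Mpq, Mqp, Mqq
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

    def add(X, Y):                   # submatrix addition
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    App, Apq, Aqp, Aqq = quad(A)
    Bpp, Bpq, Bqp, Bqq = quad(B)
    Cpp = add(rec_mat_mult(App, Bpp), rec_mat_mult(Apq, Bqp))
    Cpq = add(rec_mat_mult(App, Bpq), rec_mat_mult(Apq, Bqq))
    Cqp = add(rec_mat_mult(Aqp, Bpp), rec_mat_mult(Aqq, Bqp))
    Cqq = add(rec_mat_mult(Aqp, Bpq), rec_mat_mult(Aqq, Bqq))
    # reassemble the four submatrices into C
    return ([rp + rq for rp, rq in zip(Cpp, Cpq)] +
            [rp + rq for rp, rq in zip(Cqp, Cqq)])

print(rec_mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```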


Mesh Implementations

• Cannon’s algorithm

• Fox’s algorithm (not in textbook but similar complexity)

• Systolic array

All involve processors arranged in a mesh, shifting elements of the arrays through the mesh, and accumulating the partial sums at each processor.


Mesh ImplementationsCannon’s Algorithm

Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.

1. Initially, processor Pi,j has elements ai,j and bi,j (0 ≤ i < n, 0 ≤ j < n).
2. Elements are moved from their initial position to an "aligned" position. The complete ith row of A is shifted i places left and the complete jth column of B is shifted j places upward. This has the effect of placing the element ai,j+i and the element bi+j,j in processor Pi,j. These elements are a pair of those required in the accumulation of ci,j.
3. Each processor, Pi,j, multiplies its elements.
4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This has the effect of bringing together the adjacent elements of A and B, which will also be required in the accumulation.
5. Each processor, Pi,j, multiplies the elements brought to it and adds the result to its accumulating sum.
6. Steps 4 and 5 are repeated until the final result is obtained (n − 1 shifts with n rows and n columns of elements).
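The six steps can be checked with a small sequential simulation in Python, where list rotation stands in for the wraparound shifts (a sketch, not the textbook's code):

```python
def cannon(A, B):
    """Simulate Cannon's algorithm on an n x n 'mesh': a[i][j] and b[i][j]
    hold the elements currently resident in processor Pi,j."""
    n = len(A)
    a = [row[:] for row in A]
    b = [row[:] for row in B]
    c = [[0] * n for _ in range(n)]

    def shift_row_left(M, i, places):
        M[i] = M[i][places % n:] + M[i][:places % n]

    def shift_col_up(M, j, places):
        col = [M[i][j] for i in range(n)]
        col = col[places % n:] + col[:places % n]
        for i in range(n):
            M[i][j] = col[i]

    # step 2: alignment - row i of A left by i places, column j of B up by j
    for i in range(n):
        shift_row_left(a, i, i)
    for j in range(n):
        shift_col_up(b, j, j)

    for _ in range(n):                        # steps 3-6
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]  # every Pi,j multiplies its pair
        for i in range(n):
            shift_row_left(a, i, 1)           # A one place left
        for j in range(n):
            shift_col_up(b, j, 1)             # B one place up
    return c

print(cannon([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```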


Movement of A and B elements

Figure: the A elements move left along row i of the mesh and the B elements move up along column j, passing through processor Pi,j.


Step 2 - Alignment of elements of A and B

Figure: row i of A is shifted i places left and column j of B is shifted j places upward, placing ai,j+i and bi+j,j in processor Pi,j.


Step 4 - One-place shift of elements of A and B

Figure: after each multiply-accumulate, the A elements move one place along the rows and the B elements one place up the columns through processor Pi,j.


Systolic array

Figure: a 4 × 4 systolic array - the elements of each row of A (a0,0 … a3,3) are pumped in from the left with a one-cycle delay between successive rows, the elements of each column of B (b0,0 … b3,3) are pumped in from the top, and each cell accumulates one result element ci,j.


Matrix-Vector Multiplication

Figure: the same pumping action with a single column of cells - the rows of A are pumped in from the left, the vector elements b0 to b3 enter from the top, and the cells accumulate the results c0 to c3.


Solving a System of Linear Equations

a0,0x0 + a0,1x1 + a0,2x2 + … + a0,n−1xn−1 = b0
a1,0x0 + a1,1x1 + a1,2x2 + … + a1,n−1xn−1 = b1
a2,0x0 + a2,1x1 + a2,2x2 + … + a2,n−1xn−1 = b2
.
.
.
an−1,0x0 + an−1,1x1 + an−1,2x2 + … + an−1,n−1xn−1 = bn−1

which, in matrix form, is

Ax = b

The objective is to find values for the unknowns x0, x1, …, xn−1, given values for a0,0, a0,1, …, an−1,n−1 and b0, …, bn−1.


Solving a System of Linear Equations

Dense matrices

Gaussian Elimination - parallel time complexity O(n2)

Sparse matrices

By iteration - depends upon the iteration method and the number of iterations, but typically O(log n)

• Jacobi iteration
• Gauss-Seidel relaxation (not good for parallelization)
• Red-Black ordering
• Multigrid


Gaussian Elimination

Convert the general system of linear equations into a triangular system of equations, which can then be solved by back substitution.

Uses the characteristic of linear equations that any row can be replaced by that row added to another row multiplied by a constant.

Starts at the first row and works toward the bottom row. At the ith row, each row j below the ith row is replaced by row j + (row i)(−aj,i/ai,i). The constant used for row j is −aj,i/ai,i. This has the effect of making all the elements in the ith column below the ith row zero, because

aj,i = aj,i + ai,i (−aj,i/ai,i) = 0


Gaussian elimination

Figure: stepping through row i - the element aj,i in each row j below row i is cleared to zero, and the columns to the left are already cleared to zero.


Partial Pivoting

If ai,i is zero or close to zero, we will not be able to compute the quantity −aj,i/ai,i.

The procedure must be modified into so-called partial pivoting by swapping the ith row with the row below it that has the largest absolute element in the ith column of any of the rows below the ith row, if there is one. (Reordering equations will not affect the system.)

In the following, we will not consider partial pivoting.


Sequential Code

Without partial pivoting:

for (i = 0; i < n-1; i++)            /* for each row, except last */
  for (j = i+1; j < n; j++) {        /* step thro subsequent rows */
    m = a[j][i]/a[i][i];             /* Compute multiplier */
    for (k = i; k < n; k++)          /* last n-i-1 elements of row j */
      a[j][k] = a[j][k] - a[i][k] * m;
    b[j] = b[j] - b[i] * m;          /* modify right side */
  }

The time complexity is O(n³).
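The elimination loop, extended with back substitution into a complete solver, can be sketched in Python (no partial pivoting, as above; function name is illustrative):

```python
def gauss_solve(a, b):
    """Gaussian elimination followed by back substitution for Ax = b."""
    n = len(a)
    a = [row[:] for row in a]          # work on copies
    b = b[:]
    for i in range(n - 1):             # for each row, except last
        for j in range(i + 1, n):      # step through subsequent rows
            m = a[j][i] / a[i][i]      # compute multiplier
            for k in range(i, n):
                a[j][k] -= a[i][k] * m
            b[j] -= b[i] * m           # modify right side
    # back substitution on the resulting triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

# solves 2x + y = 5, x + 3y = 10
print(gauss_solve([[2, 1], [1, 3]], [5, 10]))  # [1.0, 3.0]
```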


Parallel Implementation

Figure: row i (n − i + 1 elements, including b[i]) is broadcast to the rows below it; the columns to the left are already cleared to zero.


Analysis

Communication

n − 1 broadcasts are performed sequentially. The ith broadcast contains n − i + 1 elements. Time complexity of O(n²) (see textbook).

Computation

After the row broadcast, each processor Pj beyond the broadcast processor Pi will compute its multiplier and operate upon n − j + 2 elements of its row. Ignoring the computation of the multiplier, there are n − j + 2 multiplications and n − j + 2 subtractions. Time complexity of O(n²) (see textbook).

Efficiency will be relatively low because all the processors before the processor holding row i do not participate in the computation again.


Pipeline implementation of Gaussian elimination

Figure: rows are passed along a pipeline of processors P0, P1, P2, …, Pn−1 instead of being broadcast.


Strip Partitioning

Figure: rows 0 to n/p − 1 are allocated to P0, rows n/p to 2n/p − 1 to P1, rows 2n/p to 3n/p − 1 to P2, and so on.

Poor processor allocation! Processors do not participate in the computation after their last row is processed.


Cyclic-Striped Partitioning

An alternative which equalizes the processor workload: strips of rows are allocated to the processors in a cyclic fashion, so each processor holds rows from throughout the matrix.


Iterative Methods

The time complexity of the direct method, O(N²) with N processors, is significant.

The time complexity of an iterative method depends upon:

• the type of iteration,
• the number of iterations,
• the number of unknowns, and
• the required accuracy,

but can be less than the direct method, especially for a few unknowns, i.e., a sparse system of linear equations.


Jacobi Iteration

Iteration formula - the ith equation rearranged to have the ith unknown on the left side. The superscript indicates the iteration:

xi^k = (1/ai,i) [ bi − Σ(j ≠ i) ai,j xj^(k−1) ]

where xi^k is the kth iteration of xi and xj^(k−1) is the (k−1)th iteration of xj.
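A Python sketch of the Jacobi iteration (the system and iteration count are illustrative; the matrix chosen is diagonally dominant, so the iteration converges):

```python
def jacobi(a, b, iterations=50):
    """Jacobi iteration for Ax = b: every x_i at iteration k is computed
    only from iteration k-1 values, so all n updates can run in parallel."""
    n = len(a)
    x = [0.0] * n
    for _ in range(iterations):
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x

# 4x + y = 6, x + 3y = 7  -> converges toward x = 1, y = 2
print(jacobi([[4, 1], [1, 3]], [6, 7]))
```

Because no x value from the current iteration is used, the new vector can be computed entirely from the old one - the source of the method's parallelism.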


Example of a Sparse System of Linear Equations

Laplace's Equation

∂²f/∂x² + ∂²f/∂y² = 0

Solve for f over the two-dimensional x-y space.

For a computer solution, finite difference methods are appropriate. The two-dimensional solution space is "discretized" into a large number of solution points.


Finite Difference Method

Figure: the solution space for f(x, y) is discretized into a grid of points with spacing ∆ in both the x and y directions.


If the distance between points, ∆, is made small enough:

∂²f/∂x² ≈ (1/∆²) [ f(x + ∆, y) − 2f(x, y) + f(x − ∆, y) ]

∂²f/∂y² ≈ (1/∆²) [ f(x, y + ∆) − 2f(x, y) + f(x, y − ∆) ]

Substituting into Laplace's equation, we get

(1/∆²) [ f(x + ∆, y) + f(x − ∆, y) + f(x, y + ∆) + f(x, y − ∆) − 4f(x, y) ] = 0

Rearranging, we get

f(x, y) = [ f(x − ∆, y) + f(x, y − ∆) + f(x + ∆, y) + f(x, y + ∆) ] / 4

Rewritten as an iterative formula:

f^k(x, y) = [ f^(k−1)(x − ∆, y) + f^(k−1)(x, y − ∆) + f^(k−1)(x + ∆, y) + f^(k−1)(x, y + ∆) ] / 4

where f^k(x, y) is the kth iteration and f^(k−1)(x, y) is the (k − 1)th iteration.
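The iterative formula is easy to demonstrate on a small grid in Python (a Jacobi-style sweep; grid values are illustrative and boundary points are held fixed):

```python
def laplace_iterate(f, steps):
    """Repeatedly replace each interior point by the average of its four
    neighbours, keeping boundary values fixed."""
    n, m = len(f), len(f[0])
    for _ in range(steps):
        g = [row[:] for row in f]          # all reads use iteration k-1
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                g[i][j] = (f[i-1][j] + f[i+1][j]
                           + f[i][j-1] + f[i][j+1]) / 4
        f = g
    return f

# 3 x 3 grid with boundary fixed at 4: after one sweep the single interior
# point becomes the average of its four neighbours
grid = laplace_iterate([[4, 4, 4], [4, 0, 4], [4, 4, 4]], 1)
print(grid[1][1])  # 4.0
```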


Natural Order

Figure: the solution points are numbered x1 to x100 row by row (natural order) on a 10 × 10 grid, surrounded by boundary points (see text).


Relationship with a General System of Linear Equations

Using natural ordering, the ith point is computed from the ith equation:

xi = (xi−n + xi−1 + xi+1 + xi+n) / 4

or

xi−n + xi−1 − 4xi + xi+1 + xi+n = 0

which is a linear equation with five unknowns (except those with boundary points).

In general form, the ith equation becomes:

ai,i−n xi−n + ai,i−1 xi−1 + ai,i xi + ai,i+1 xi+1 + ai,i+n xi+n = 0

where ai,i = −4, and ai,i−n = ai,i−1 = ai,i+1 = ai,i+n = 1.


Figure: the system in matrix form, A × x = 0 - each row of A has ai,i = −4 on the diagonal and 1s at ai,i−n, ai,i−1, ai,i+1, and ai,i+n (the ith equation), with extra 1s and some zero entries to include the boundary values (see text). Those equations with a boundary point on the diagonal are unnecessary for the solution.


Gauss-Seidel Relaxation

Uses some newly computed values to compute other values in that iteration.

Figure: points are computed in sequential order - each point to be computed uses the values of the points already computed in this iteration. The basic form is not suitable for parallelization.


slides11-45

Gauss-Seidel Iteration Formula

xi^k = (1/ai,i)[ bi − Σj=1..i−1 ai,j xj^k − Σj=i+1..N ai,j xj^(k−1) ]

where the superscript indicates the iteration.

With natural ordering of unknowns, formula reduces to

xi^k = (−1/ai,i)[ ai,i−n xi−n^k + ai,i−1 xi−1^k + ai,i+1 xi+1^(k−1) + ai,i+n xi+n^(k−1) ]

At the kth iteration, two of the four values (before the ith element) taken from the kth iteration and two values (after the ith element) taken from the (k−1)th iteration. We have:

f^k(x, y) = [ f^k(x−∆, y) + f^k(x, y−∆) + f^(k−1)(x+∆, y) + f^(k−1)(x, y+∆) ] / 4
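With natural ordering, the reduced formula amounts to updating the grid in place during a row-major sweep. The following sketch assumes an illustrative grid size and function name:

```c
#define N 6   /* interior points per dimension (illustrative size) */

/* One Gauss-Seidel sweep in natural (row-major) order.  Because the
 * update is in place, f[i-1][j] and f[i][j-1] already hold iteration-k
 * values, while f[i+1][j] and f[i][j+1] still hold iteration k-1 values. */
void gauss_seidel_sweep(double f[N + 2][N + 2])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            f[i][j] = 0.25 * (f[i - 1][j] + f[i][j - 1]
                            + f[i + 1][j] + f[i][j + 1]);
}
```

Reusing values from the current iteration is what makes Gauss-Seidel converge faster than Jacobi, at the cost of the sequential dependence noted on the previous slide.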


slides11-46

Red-Black Ordering

First, red points computed. Next, black points computed. All red points computed simultaneously, and all black points computed simultaneously.

[Figure: checkerboard grid of alternating red and black points.]


slides11-47

Red-Black Parallel Code

forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 == 0)   /* compute red points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);

forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 != 0)   /* compute black points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
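A serial C rendering of the same two phases (grid size and names assumed). A point of one colour reads only points of the other colour, which is why each forall can run its iterations in parallel without interference:

```c
#define N 8   /* interior points per dimension (illustrative size) */

/* Two-phase red-black sweep: phase 0 updates "red" points (i+j even),
 * phase 1 updates "black" points (i+j odd).  Each update reads only
 * opposite-colour neighbours, so updates within a phase are independent
 * and the loop order inside a phase does not affect the result. */
void red_black_sweep(double f[N + 2][N + 2])
{
    for (int colour = 0; colour < 2; colour++)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                if ((i + j) % 2 == colour)
                    f[i][j] = 0.25 * (f[i - 1][j] + f[i][j - 1]
                                    + f[i + 1][j] + f[i][j + 1]);
}
```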


slides11-48

Higher-Order Difference Methods

More distant points could be used in the computation. The following update formula:

f^k(x, y) = (1/60)[ 16f^(k−1)(x−∆, y) + 16f^(k−1)(x, y−∆) + 16f^(k−1)(x+∆, y) + 16f^(k−1)(x, y+∆)
                  − f^(k−1)(x−2∆, y) − f^(k−1)(x, y−2∆) − f^(k−1)(x+2∆, y) − f^(k−1)(x, y+2∆) ]
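A quick sanity check on the weights: 4 × 16 − 4 × 1 = 60, so they sum to 1, and applying the update to a function that is exactly linear in x and y reproduces the function value. A small sketch (the function name is illustrative):

```c
/* Apply the update's weights: 16/60 on the four distance-Delta
 * neighbours (near), -1/60 on the four distance-2*Delta neighbours (far). */
double higher_order_update(const double near[4], const double far[4])
{
    double s = 0.0;
    for (int d = 0; d < 4; d++)
        s += 16.0 * near[d] - far[d];
    return s / 60.0;
}
```

For f(x, y) = x + y at (0.5, 0.5) with ∆ = 0.1, the distance-∆ neighbours are {0.9, 0.9, 1.1, 1.1} and the distance-2∆ neighbours are {0.8, 0.8, 1.2, 1.2}; the update reproduces f(0.5, 0.5) = 1.0 up to rounding.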


slides11-49

Nine-point stencil


slides11-50

Overrelaxation

Improved convergence obtained by adding factor (1 − ω)xi to Jacobi or Gauss-Seidel formulae. Factor ω is the overrelaxation parameter.

Jacobi overrelaxation formula

xi^k = (ω/ai,i)[ bi − Σj≠i ai,j xj^(k−1) ] + (1 − ω)xi^(k−1)

where 0 < ω < 1.

Gauss-Seidel successive overrelaxation

xi^k = (ω/ai,i)[ bi − Σj=1..i−1 ai,j xj^k − Σj=i+1..N ai,j xj^(k−1) ] + (1 − ω)xi^(k−1)

where 0 < ω ≤ 2. If ω = 1, we obtain the Gauss-Seidel method.
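For the five-point stencil, successive overrelaxation amounts to blending the Gauss-Seidel average with the old value. A sketch, where the grid size and the value of ω are assumptions, not from the text:

```c
#define N 6          /* interior points per dimension (illustrative size) */
#define OMEGA 1.5    /* illustrative relaxation parameter, 0 < omega <= 2 */

/* One SOR sweep over the grid:
 * new value = omega * (Gauss-Seidel average) + (1 - omega) * old value. */
void sor_sweep(double f[N + 2][N + 2])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            double gs = 0.25 * (f[i - 1][j] + f[i][j - 1]
                              + f[i + 1][j] + f[i][j + 1]);
            f[i][j] = OMEGA * gs + (1.0 - OMEGA) * f[i][j];
        }
}
```

With OMEGA set to 1.0 this reduces to the plain Gauss-Seidel sweep.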


slides11-51

Multigrid Method

First, a coarse grid of points used. With these points, iteration process will start to converge quickly.

At some stage, number of points increased to include points of the coarse grid and extra points between the points of the coarse grid. Initial values of extra points found by interpolation. Computation continues with this finer grid.

Grid can be made finer and finer as computation proceeds, or computation can alternate between fine and coarse grids.

Coarser grids take into account distant effects more quickly andprovide a good starting point for the next finer grid.
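The coarse-to-fine step can be sketched in one dimension (names assumed): coarse values are copied to the matching fine points, and each new in-between point starts from linear interpolation of its coarse neighbours.

```c
/* Transfer a coarse 1-D grid of nc points to a fine grid of 2*nc - 1
 * points: coarse point j lands at fine point 2*j, and each new
 * in-between point starts from the average of its coarse neighbours. */
void prolongate(int nc, const double coarse[], double fine[])
{
    for (int j = 0; j < nc; j++)
        fine[2 * j] = coarse[j];                              /* inject */
    for (int j = 0; j < nc - 1; j++)
        fine[2 * j + 1] = 0.5 * (coarse[j] + coarse[j + 1]);  /* interpolate */
}
```

For example, coarse values {0, 2, 4} become fine values {0, 1, 2, 3, 4}.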


slides11-52

[Figure: coarsest grid points, finer grid points, and their assignment to processors.]

Multigrid processor allocation


slides11-53

(Semi) Asynchronous Iteration

As noted earlier, synchronizing on every iteration will cause significant overhead; best if one can synchronize only after a number of iterations.