Laplace-Example with MPI and PETSc

Rolf Rabenseifner, [email protected]
University of Stuttgart, High-Performance Computing-Center Stuttgart (HLRS), www.hlrs.de
Höchstleistungsrechenzentrum Stuttgart

Laplace Example

• Compute the steady temperature distribution for given temperatures on a boundary,
• i.e., solve the Laplace partial differential equation (PDE)
    –∆u(x,y) = –[ ∂²u/∂x² + ∂²u/∂y² ] = 0  on Ω ⊂ R²
• with boundary condition u(x,y) = φ(x,y) on ∂Ω
• area Ω = [xmin,xmax] x [ymin,ymax]
• Compare:
  – Chap. [6] A Heat-Transfer Example with MPI
  – Explicit time-step integration of the unsteady heat conduction ∂u/∂t = ∆u
• Can be calculated in the receiving process using the halo_pos (= local index)
  and the corresponding global column minus the start value in the sending process.
• m, n   dimensions of the physical problem
• i, j   index in physics (0..m–1, 0..n–1, without boundary)
• I      global row index in Laplace matrix and vector (0 .. nm–1)
• J      global column index in the Laplace matrix (0 .. nm–1)
  – process-local data: start .. end1–1
• I_loc  local row index in Laplace matrix and vector (and in halo)
• J_loc  local column index in the Laplace matrix
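As an illustration of these index mappings, a minimal sketch in C (not part of the course skeleton; the concrete values of m, n and start are made up for the example):

    #include <stdio.h>

    int main(void)
    {
      int m = 4, n = 4;        /* physical dimensions                      */
      int i = 2, j = 1;        /* physics index (0..m-1, 0..n-1)           */
      int start = 8;           /* assumed: first locally owned global row  */

      int I     = i*n + j;     /* global row index, 0 .. n*m-1             */
      int I_loc = I - start;   /* local row index on the owning process    */

      printf("(i,j)=(%d,%d) -> I=%d, I_loc=%d\n", i, j, I, I_loc);
      return 0;
    }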
• Your working directory: ~/CG/<nr>
• Choose your task: <task>
• Fetch your skeleton: cp ~/CG/skel/cg_<task>.c .
• Add your code, compile, run and test it (correct result? same as the serial result?)
• If your task works:
  – extract your part (from /*== task_ii begin ==*/ to /*== task_ii end ==*/) into cgp<task>.c
• Advanced exercise: Implement the communication-optimized distribution
  – in a copy of your cg_<task>.c
  – compare execution time: 1-dim decomposition / 2-dim load-optimal / 2-dim comm.-optimal
• When all groups have finished, everyone can check the total result with:
  – ls -l ../*/cgp*.c
  – cat ../00/cgp00.c ../*/cgp01.c ../*/cgp02.c ../*/cgp03.c ../*/cgp04.c ../*/cgp05.c
        ../*/cgp12.c ../00/cgp13.c ../00/cgp14.c ../*/cgp15.c ../00/cgp16.c > cg_all.c
  – duplicate parts must be selected by hand (<nr> instead of *)
  – missing parts may also be fetched from ../source/parts/cgp<task>.c
  – Compile and run cg_all.c
• Do not modify any lines outside of your task segment.
• Compile-time options [default]:
  -Dserial            — compile without MPI and without distribution [parallel]
• Run-time options [default]:
  -m <m>              — vertical dimension of physical heat area [4]
  -n <n>              — horizontal dimension … [4]
  -imax <iter_max>    — maximum number of iterations in the CG solver [500]
  -eps <epsilon>      — abort criterion of the solver for the residual vector [1e-6]
  -twodims            — choose 2-dimensional domain decomposition [1-dim]
  -mprocs <m_procs>   — choose number of processes, vertical (-twodims needed)
  -nprocs <n_procs>   — … and horizontal [given by MPI_Dims_create]
  -prtlev 0|1|2|3|4|5 — printing and debug level [1]:
      1 = only || result – exact solution || and partial result matrix
      2 = and residual norm after each iteration
      3 = and result of physical heat matrix
      4 = and all vector and matrix information in 1st iteration
      5 = and in all iterations
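A possible invocation combining some of these run-time options (the launcher, executable name and process count are assumptions, not prescribed by the exercise):

    mpirun -np 4 ./cg_all -m 80 -n 80 -twodims -imax 1000 -eps 1e-6 -prtlev 2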
Initialization of matrix A

 55: /* When using MatCreate(), the matrix format can be specified at runtime.
        Also, the parallel partitioning of the matrix is determined by PETSc at runtime.
        Performance tuning note: For problems of substantial size, preallocation of matrix
        memory is crucial for attaining good performance. Since preallocation is not
        possible via the generic matrix creation routine MatCreate(), we recommend for
        practical problems instead to use the creation routine for a particular matrix
        format, e.g.,
          MatCreateMPIAIJ()  – parallel AIJ (compressed sparse row)
          MatCreateMPIBAIJ() – parallel block AIJ
        See the matrix chapter of the users manual for details. */
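For illustration, a preallocated creation of the parallel AIJ matrix for this 5-point stencil might look like the following sketch. It assumes the PETSc 2.x calling sequence of MatCreateMPIAIJ() and is not part of heat_petsc.c:

    /* Sketch only: each row has at most 5 nonzeros, of which at most 2 can fall
       into off-process column blocks for a contiguous row distribution. */
    MatCreateMPIAIJ(PETSC_COMM_WORLD,
                    PETSC_DECIDE, PETSC_DECIDE,   /* local sizes: let PETSc decide      */
                    m*n, m*n,                     /* global sizes                       */
                    5, PETSC_NULL,                /* d_nz: nonzeros per row, diag block */
                    2, PETSC_NULL,                /* o_nz: nonzeros per row, off-diag   */
                    &A);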
 69: MatCreate(PETSC_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n,&A);
 70: MatSetFromOptions(A);

 73: /* Currently, all PETSc parallel matrix formats are partitioned by contiguous
        chunks of rows across the processors. Determine which rows of the matrix
        are locally owned. */
 77: MatGetOwnershipRange(A,&Istart,&Iend);

 92: for (I=Istart; I<Iend; I++) {
 93:   v = -1.0;  i = I/n;  j = I - i*n;
 94:   if (i>0)   { J = I - n; MatSetValues(A,1,&I,1,&J,&v,INSERT_VALUES); }
 95:   if (i<m-1) { J = I + n; MatSetValues(A,1,&I,1,&J,&v,INSERT_VALUES); }
 96:   if (j>0)   { J = I - 1; MatSetValues(A,1,&I,1,&J,&v,INSERT_VALUES); }
 97:   if (j<n-1) { J = I + 1; MatSetValues(A,1,&I,1,&J,&v,INSERT_VALUES); }
 98:   v = 4.0;     MatSetValues(A,1,&I,1,&I,&v,INSERT_VALUES); }
– We form one vector from scratch and then duplicate it as needed.
– When using VecCreate(), VecSetSizes() and VecSetFromOptions() in this example,
  we specify only the vector's global dimension; the parallel partitioning is
  determined at runtime.
– When solving a linear system, the vectors and matrices MUST be partitioned
  accordingly. PETSc automatically generates appropriately partitioned matrices
  and vectors when MatCreate() and VecCreate() are used with the same communicator.
– The user can alternatively specify the local vector and matrix dimensions when
  more sophisticated partitioning is needed (replacing the PETSC_DECIDE argument
  in the VecSetSizes() statement below).
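The creation statements these comments refer to are not reproduced on the slide; a minimal sketch of the usual PETSc pattern (vector names u, b, x as used later in the example) would be:

    /* Sketch of the creation pattern described above (not copied from the slide) */
    VecCreate(PETSC_COMM_WORLD,&u);     /* exact solution vector                   */
    VecSetSizes(u,PETSC_DECIDE,m*n);    /* only the global dimension is specified  */
    VecSetFromOptions(u);               /* allow run-time selection of vector type */
    VecDuplicate(u,&b);                 /* right-hand side with the same layout    */
    VecDuplicate(u,&x);                 /* solution vector with the same layout    */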
145: PetscOptionsHasName(PETSC_NULL,"-random_exact_sol",&flg);
146: if ( ! flg) {
       VecGetOwnershipRange(b,&Istart,&Iend);
       h = 1.0 / (m+1);
       for (I=Istart; I<Iend; I++) {
         v = 0;  i = I/n;  j = I - i*n;
         if (i==0)   v = v + /* u(-1,j): */ h * 0;
         if (i==m-1) v = v + /* u(m, j): */ h * (m+1);
         if (j==0)   v = v + /* u(i,-1): */ h * (i+1);
         if (j==n-1) v = v + /* u(i, n): */ h * (i+1);
         if (v != 0) VecSetValues(b,1,&I,&v,INSERT_VALUES);
         v = /* u(i,j): */ h * (i+1);  VecSetValues(u,1,&I,&v,INSERT_VALUES);
       }
     }
173: /* - - - - - Create the linear solver and set various options - - - - - */
177: /* Create linear solver context */
179: SLESCreate(PETSC_COMM_WORLD,&sles);

182: /* Set operators. Here the matrix that defines the linear system
        also serves as the preconditioning matrix. */
185: SLESSetOperators(sles,A,A,DIFFERENT_NONZERO_PATTERN);

188: /* Set linear solver defaults for this problem (optional).
        – By extracting the KSP (Krylov subspace method) and PC (preconditioner)
          contexts from the SLES context, we can then directly call any KSP and PC
          routines to set various options.
        – The following two statements are optional; all of these parameters could
          alternatively be specified at runtime via SLESSetFromOptions(). All of
          these defaults can be overridden at runtime, as indicated below. */
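The two optional statements referred to in the comment are not reproduced here; with the SLES interface of PETSc 2.x they typically look like the following sketch (the rtol value is the one quoted on the run-time options slide; the exact statements in heat_petsc.c may differ):

    /* Sketch: extract KSP/PC contexts and set problem-specific defaults */
    SLESGetKSP(sles,&ksp);                              /* Krylov solver context   */
    SLESGetPC(sles,&pc);                                /* preconditioner context  */
    PCSetType(pc,PCJACOBI);                             /* e.g., Jacobi as default */
    KSPSetTolerances(ksp,1.e-2/((m+1)*(n+1)),1.e-50,
                     PETSC_DEFAULT,PETSC_DEFAULT);      /* rtol, atol, dtol, maxit */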
 29: #include "petscda.h"

251: PetscOptionsHasName(PETSC_NULL,"-view_sol_x",&flg);
252: if (flg) {   /* view solution grid in an X window */
253:   PetscScalar *xx;  DA da;
       /* ... (remainder of the X-window viewing code not shown on the slide) */
283: /* Check the error */
285: VecAXPY(&neg_one,u,x);
286: VecNorm(x,NORM_2,&norm);
287: /* Optional: scale the norm:  norm *= sqrt(1.0/((m+1)*(n+1))); */

290: /* Print convergence information. PetscPrintf() produces a single print statement
        from all processes that share a communicator. An alternative is PetscFPrintf(),
        which prints to a file. */
294: PetscPrintf(PETSC_COMM_WORLD,"Norm of error %A iterations %d\n",norm,its);
297: /* Free work space. All PETSc objects should be destroyed when they are no longer needed. */
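The destroy calls themselves are not shown on the slide; with the PETSc 2.x interface they would look like this sketch:

    SLESDestroy(sles);
    VecDestroy(u);  VecDestroy(x);  VecDestroy(b);
    MatDestroy(A);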
305: /* Always call PetscFinalize() before exiting a program. This routine
        – finalizes the PETSc libraries as well as MPI
        – provides summary and diagnostic information if certain runtime
          options are chosen (e.g., -log_summary). */
310: PetscFinalize();
  3: static char help[] = "Solves a linear system in parallel with SLES: Compute steady \n\
  4:   temperature distribution for given temperatures on a boundary.\n\
  5: Input parameters include:\n\
  6:   -random_exact_sol : use a random exact solution vector\n\
  7:   -view_exact_sol   : write exact solution vector to stdout\n\
  8:   -view_sol_serial  : write solution grid to stdout (1 item/line)\n\
  9:   -view_sol         : write solution grid to stdout (as matrix)\n\
 10:   -view_sol_x -draw_pause 3 : view solution x in an X window\n\
 11:   -view_mat_x -draw_pause 3 : view matrix A in an X window\n\
 12:   -m <mesh_x>       : number of mesh points in x-direction\n\
 13:   -n <mesh_y>       : number of mesh points in y-direction\n";
...
 46: PetscInitialize(&argc, &args, (char *)0, help);
-ksp_rtol <rtol>     convergence criterion, set by the program to 1.e-2/((m+1)*(n+1))
-pc_type <type>      e.g., bjacobi (block Jacobi), asm (additive Schwarz)
-sub_pc_type <type>  e.g., jacobi (Jacobi), sor (SOR), ilu (incomplete LU)
-ksp_monitor         prints an estimate of the l2-norm of the residual at each iteration
-sles_view           prints information on the chosen KSP (solver) and PC (preconditioner)
-log_summary         prints statistical data
-options_table       prints all used options
-options_left        prints the options table and unused options
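As an illustration, a run combining several of these options might look as follows (the launcher, process count and executable path are assumptions; -ksp_type is a standard PETSc option not listed above):

    mpirun -np 4 ./heat_petsc -m 10 -n 10 \
        -ksp_type gmres -pc_type bjacobi -sub_pc_type ilu \
        -ksp_monitor -sles_view -options_left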
 1 #! /bin/csh
 2 #
 3 # Sample script: Experimenting with linear solver options.
 4 # Can be used with, e.g., petsc/src/sles/examples/tutorials/ex2.c
 5 # or heat_petsc.c
 6 #
 7 set appl='./heat_petsc'                  # path of binary
 8 set options='-ksp_monitor -sles_view -log_summary -options_table -options_left -m 10 -n 10'
 9 set num='0'
10 foreach np (1 2 4 8)                     # number of processors
11   foreach ksptype (gmres bcgs tfqmr)     # Krylov solver
12     set pctypes_parallel='bjacobi asm'   # parallel preconditioners
13     set pctypes_serial='ilu'             # non-parallel preconditioners
14     if ($np == 1) then
15       set pctype_list="$pctypes_serial $pctypes_parallel"
16     else
17       set pctype_list="$pctypes_parallel"
18     endif
19     foreach pctype ($pctype_list)
20-49    ... (see next slide)
50     end  # for pctype
51   end  # for ksptype
52 end  # for np
The PETSc Makefile System is located in $PETSC_DIR/bmake.
This directory has subdirectories for each supported platform.
If you want to customize your installation, you have to edit the following files:
• $PETSC_DIR/bmake/$PETSC_ARCH/packages
  – locations of all needed packages
• $PETSC_DIR/bmake/$PETSC_ARCH/variables
  – definitions of compilers, linkers, etc.
# $Id: packages,v 1.63 2001/10/10 18:50:03 balay Exp $
#
# This file contains site-specific information. The definitions below
# should be changed to match the locations of libraries at your site.
#
# The following naming convention is used:
#   XXX_LIB     - location of library XXX
#   XXX_INCLUDE - directory for include files needed for library XXX
#
# Location of BLAS and LAPACK.
# See $PETSC_DIR/docs/installation.html for information on retrieving them.
#
BLASLAPACK_LIB = -L/home/petsc/software/blaslapack/linux

# ---------------------------------------------------------------------------
# Locations of OPTIONAL packages. Comment out those you do not have.
# ---------------------------------------------------------------------------
#
# Location of X-windows software
X11_INCLUDE    =
X11_LIB        = -L/usr/X11R6/lib -lX11
PETSC_HAVE_X11 = -DPETSC_HAVE_X11
#
# Location of MPE
# If using MPICH version 1.1.2 or higher use the flag
#   -DPETSC_HAVE_MPE_INITIALIZED_LOGGING
#MPE_INCLUDE = -I/home/petsc/mpich-1.1.1/mpe
#MPE_LIB    = -L/home/petsc/mpich-1.1.1/lib/LINUX/ch_p4
Summary
• Laplace equation: –∆u(x,y) = 0 on Ω ⊂ R², with Ω = [xmin,xmax] x [ymin,ymax]
• Boundary condition: u(x,y) given on ∂Ω
• Discretization: –u(i–1,j) – u(i,j–1) + 4·u(i,j) – u(i,j+1) – u(i+1,j) = 0
  for i=0..m–1, j=0..n–1
• 4 boundaries: i=–1, i=m, j=–1, j=n
• New ordering: (i,j), i=0..m–1, j=0..n–1  →  I = 0..mn–1 (I = i·n + j)
• Matrix equation: Au = b, with A a sparse matrix, b based on u on ∂Ω,
  u the solution on Ω–∂Ω
• Example with n=m=4, with solution & boundary u(x,y) := x
• Linear Equation Solver (SLES) with PETSc
  – MatSetValues(A, 1,&I, 1,&J, &v, INSERT_VALUES);
  – VecSetValues(b, 1,&I, &v, INSERT_VALUES);
  – SLESSetOperators(sles, A, A, DIFFERENT_NONZERO_PATTERN);
  – SLESSolve(sles, b, x, &its);   x is the solution vector, ordered with I = 0..mn–1
  – printing x in ordering (i,j), i=0..m–1, j=0..n–1 (transposed)
• If you want to compile:
  – cp ../source/heat_petsc.c ../source/Makefile ./
  – setenv PETSC_DIR ...   or   export PETSC_DIR=...
  – setenv PETSC_ARCH ...  or   export PETSC_ARCH=...
  – make BOPT=O heat_petsc
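For reference, an application makefile driven by "make BOPT=O heat_petsc" typically follows the pattern below; this is only a sketch assuming the PETSc 2.x bmake conventions, and the include path and library variable may differ for your installation:

    # Sketch of a minimal PETSc 2.x application makefile (assumed conventions)
    CFLAGS  =

    include ${PETSC_DIR}/bmake/common/base

    heat_petsc: heat_petsc.o chkopts
    	-${CLINKER} -o heat_petsc heat_petsc.o ${PETSC_SLES_LIB}
    	${RM} heat_petsc.o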