Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator
Kengo Nakajima ([email protected])
Supercomputing Division, Information Technology Center, The University of Tokyo
Japan Agency for Marine-Earth Science and Technology
April 22, 2008
Preconditioned Iterative Linear Solvers for Unstructured Grids
Supercomputing Division, Information Technology Center, The University of Tokyo
Japan Agency for Marine-Earth Science and Technology
• Optimization Strategy on the Earth Simulator
– BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
– Matrix Assembling
• Summary & Future Works
08-APR22 14
• Parallel FEM platform for solid earth simulation
– parallel I/O, parallel linear solvers, parallel visualization
– solid earth: earthquakes, plate deformation, mantle/core convection, etc.
• Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.
• Strong collaboration between the HPC and natural science (solid earth) communities.
Magnetic Field of the Earth: MHD code
Complicated Plate Model around Japan Islands
Simulation of Earthquake Generation Cycle in Southwestern Japan
TSUNAMI !!
Transportation by Groundwater Flow through Heterogeneous Porous Media
[Figure: snapshots at T=100–500 for h=5.00 and h=1.25]
Results by GeoFEM
Features of FEM applications (1/2)
• Local “element-by-element” operations
– sparse coefficient matrices
do i= 1, N
  jS= index(i-1) + 1
  jE= index(i)
  do j= jS, jE
    in= item(j)
    Y(i)= Y(i) + AMAT(j)*X(in)
  enddo
enddo
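This CRS loop can be sketched in Python (a hedged, 0-based translation; the names `index`, `item`, and `amat` mirror the Fortran arrays above, and the toy matrix is only an illustration):

```python
def spmv_crs(index, item, amat, x):
    """y = A*x for a sparse matrix in CRS format.

    Row i's nonzero values are amat[index[i]:index[i+1]], with column
    numbers item[index[i]:index[i+1]] (0-based version of the Fortran
    loop on this slide)."""
    n = len(index) - 1
    y = [0.0] * n
    for i in range(n):
        for j in range(index[i], index[i + 1]):
            y[i] += amat[j] * x[item[j]]
    return y

# Toy 2x2 example: A = [[2, 1], [0, 3]]
index = [0, 2, 3]
item = [0, 1, 1]
amat = [2.0, 1.0, 3.0]
print(spmv_crs(index, item, amat, [1.0, 1.0]))  # → [3.0, 3.0]
```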
• In parallel computation …
– communication with ONLY neighbors (except “dot products” etc.)
– the amount of message data is relatively small, because only values on the domain boundary are exchanged
– communication (MPI) latency is critical
• Optimization Strategy on the Earth Simulator
– BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
– Matrix Assembling
• Summary & Future Works
• Direct Methods
– Gaussian Elimination / LU Factorization
– compute A⁻¹ directly
– Robust for a wide range of applications
– More expensive than iterative methods (memory, CPU)
– Not suitable for parallel and vector computation due to global operations
• Iterative Methods
– CG, GMRES, BiCGSTAB
– Less expensive than direct methods, especially in memory
– Suitable for parallel and vector computing
– Convergence strongly depends on the problem and boundary conditions (condition number etc.)
• Preconditioning is required
Direct/Iterative Methods for Linear Equations
• Convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.
• A preconditioner M transforms the linear system into one with more favorable spectral properties.
– In “ill-conditioned” problems, the “condition number” (the ratio of max/min eigenvalues, if A is symmetric) is large.
– M transforms the original equation Ax=b into A′x=b′, where A′=M⁻¹A and b′=M⁻¹b.
• ILU (Incomplete LU Factorization) and IC (Incomplete Cholesky Factorization) are well-known preconditioners.
Preconditioning for Iterative Methods
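To illustrate what “more favorable spectral properties” means, here is a minimal numpy sketch (not from the slides; the badly scaled matrix and the diagonal preconditioner are hypothetical). Even the simplest choice, M = D (diagonal/Jacobi scaling), already shrinks the condition number dramatically; IC/ILU pursue the same goal more aggressively.

```python
import numpy as np

# Hypothetical badly scaled symmetric matrix (illustration only).
A = np.array([[1.0,  0.1,   0.0],
              [0.1, 10.0,   0.1],
              [0.0,  0.1, 100.0]])

def cond(B):
    """Condition number as the ratio of extreme singular values."""
    s = np.linalg.svd(B, compute_uv=False)
    return s[0] / s[-1]

M_inv = np.diag(1.0 / np.diag(A))  # simplest preconditioner: M = D
print(cond(A))          # roughly 100: ill-conditioned
print(cond(M_inv @ A))  # close to 1 after diagonal scaling
```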
• Iterative method is the ONLY choice for large-scale parallel computing.
• Preconditioning is important
– general methods, such as ILU(0)/IC(0), cover a wide range of applications
– problem-specific methods
• Hybrid Parallel Programming Model
– OpenMP + MPI
– Re-Ordering for Vector/Parallel Performance
– Comparison with Flat MPI
Flat MPI vs. Hybrid
[Diagram: SMP nodes, each with several PEs sharing one memory]
Hybrid: Hierarchical Structure
Flat-MPI: Each PE → Independent
Local Data Structure: Node-based Partitioning
internal nodes – elements – external nodes
[Figure: a 5×5 node mesh partitioned into four domains PE#0–PE#3, each holding internal nodes plus external (overlapped boundary) nodes]
1 SMP node => 1 domain for Hybrid Programming Model
MPI communication among domains
[Diagram: four SMP nodes (Node-0 – Node-3), each with eight PEs sharing one memory; MPI communication among the nodes/domains]
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
– Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
ILU(0)/IC(0) Factorization
do i= 2, n
  do k= 1, i-1
    if ((i,k) ∈ NonZero(A)) then
      a(i,k)= a(i,k) / a(k,k)
    endif
    do j= k+1, n
      if ((i,j) ∈ NonZero(A)) then
        a(i,j)= a(i,j) - a(i,k)*a(k,j)
      endif
    enddo
  enddo
enddo
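A hedged Python translation of this sweep (0-based; the set `nonzero` plays the role of NonZero(A), and the full-pattern example is only a sanity check):

```python
def ilu0(a, nonzero):
    """In-place incomplete LU: the usual LU recurrence, but updates are
    applied only where (i, j) was nonzero in the original A -- any
    fill-in outside the pattern is simply dropped."""
    n = len(a)
    for i in range(1, n):
        for k in range(i):
            if (i, k) in nonzero:
                a[i][k] /= a[k][k]                    # a_ik := a_ik / a_kk
                for j in range(k + 1, n):
                    if (i, j) in nonzero:
                        a[i][j] -= a[i][k] * a[k][j]  # a_ij := a_ij - a_ik*a_kj
    return a

# With a full nonzero pattern, ILU(0) reduces to exact LU:
A = [[4.0, 3.0], [6.0, 3.0]]
full = {(i, j) for i in range(2) for j in range(2)}
print(ilu0(A, full))  # → [[4.0, 3.0], [1.5, -1.5]]
```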
ILU/IC Preconditioning
M = (L+D)D⁻¹(D+U)
Mz = r ← need to solve this equation
Forward Substitution: (L+D)z₁ = r : z₁ = D⁻¹(r − Lz₁)
remaining system: D⁻¹(D+U)z = z₁
Backward Substitution: (I + D⁻¹U)z = z₁ : z = z₁ − D⁻¹Uz
ILU/IC Preconditioning
do i= 1, N
  WVAL= R(i)
  do j= 1, INL(i)
    WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
  enddo
  Z(i)= WVAL / D(i)
enddo

do i= N, 1, -1
  SW= 0.0d0
  do j= 1, INU(i)
    SW= SW + AU(i,j) * Z(IAU(i,j))
  enddo
  Z(i)= Z(i) - SW / D(i)
enddo

Forward Substitution: (L+D)z = r : z = D⁻¹(r − Lz)
Backward Substitution: (I + D⁻¹U)z_new = z_old : z = z − D⁻¹Uz
Dependency: you need the most recent value of “z” of connected nodes; vectorization/parallelization is difficult.
M = (L+D)D⁻¹(D+U); L, D, U are taken from A
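The two sweeps can be sketched in Python on dense L, D, U parts (an assumed 0-based translation of the Fortran above, not GeoFEM code; note how each z[i] reads already-updated neighbors — exactly the dependency discussed on this slide):

```python
def apply_precond(L, D, U, r):
    """Solve Mz = r with M = (L+D) D^-1 (D+U).

    L, U: strictly lower/upper parts of A (dense lists of lists),
    D: the diagonal of A as a flat list."""
    n = len(r)
    z = [0.0] * n
    # forward substitution: (L+D) z = r -- z[i] uses updated z[j], j < i
    for i in range(n):
        s = r[i]
        for j in range(i):
            s -= L[i][j] * z[j]
        z[i] = s / D[i]
    # backward substitution: (I + D^-1 U) z_new = z -- uses updated z[j], j > i
    for i in range(n - 1, -1, -1):
        s = 0.0
        for j in range(i + 1, n):
            s += U[i][j] * z[j]
        z[i] -= s / D[i]
    return z

# A = [[2, 1], [1, 2]]  =>  M = [[2, 1], [1, 2.5]]; solve M z = [1, 1]
L = [[0.0, 0.0], [1.0, 0.0]]
D = [2.0, 2.0]
U = [[0.0, 1.0], [0.0, 0.0]]
print(apply_precond(L, D, U, [1.0, 1.0]))  # → [0.375, 0.25]
```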
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
– Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-Ordering for Highly Parallel/Vector Performance
– Local operation and no global dependency
– Continuous memory access
– Sufficiently long loops for vectorization
ILU/IC Preconditioning
(forward/backward substitution as on the previous slide: the most recent value of “z” of connected nodes is needed, so vectorization/parallelization is difficult)
Reordering: directly connected nodes do not appear in the RHS.
M = (L+D)D⁻¹(D+U); L, D, U are taken from A
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
– Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-Ordering for Highly Parallel/Vector Performance
– Local operation and no global dependency
– Continuous memory access
– Sufficiently long loops for vectorization
• 3-Way Parallelism for Hybrid Parallel Programming
– Inter-Node: MPI
– Intra-Node: OpenMP
– Individual PE: Vectorization
SMP Parallel
Vector
Re-Ordering Technique for Vector/Parallel Architectures
Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama, NEC)
These processes can be substituted by traditional multi-coloring (MC).
Reordering = Coloring
• COLOR: unit of independent sets.
• Elements grouped in the same “color” are independent of each other, so parallel/vector operation is possible.
• Many colors provide faster convergence, but shorter vector length: Trade-Off !!
Red-Black (2 colors) 4 colors RCM (Reverse CM)
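A minimal greedy-coloring sketch in Python (illustrative only — the ES work uses RCM/CM-RCM, not this greedy scheme): nodes of one color share no edge, so a whole color can be processed in one parallel/vectorizable sweep.

```python
def greedy_coloring(adj):
    """Assign each node the smallest color not used by an
    already-colored neighbor."""
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

# 2x2 grid graph: red-black (2 colors) is enough
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
coloring = greedy_coloring(adj)
print(coloring)  # → {0: 0, 1: 1, 2: 1, 3: 0}
```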
1D-Storage (CRS): memory saved, but short vector length
2D-Storage: long vector length, but many ZEROs
Large Scale Sparse Matrix Storage for Unstructured Grids
Re-Ordering in each Color according to Non-Zero Off-Diagonal Component #
Elements in the same color are independent; therefore intra-hyperplane re-ordering does not affect results.
DJDS: Descending-order Jagged Diagonal Storage
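The “descending order” idea can be sketched in Python (a toy illustration, not the GeoFEM implementation): after sorting the rows of one color by decreasing nonzero count, the j-th jagged diagonal gathers the j-th nonzero of every row that still has one, giving long, uniform vector loops.

```python
def jagged_diagonal_lengths(nnz_per_row):
    """Row lengths after descending sort; the j-th jagged diagonal holds
    one entry for every row with more than j nonzeros."""
    counts = sorted(nnz_per_row, reverse=True)
    return [sum(1 for c in counts if c > j) for j in range(counts[0])]

# Rows with 1, 3, 2, 3 nonzeros -> jagged diagonals of length 4, 3, 2:
print(jagged_diagonal_lengths([1, 3, 2, 3]))  # → [4, 3, 2]
```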
Cyclic DJDS (MC/CM-RCM) Cyclic Re-Ordering for SMP units
• Hybrid Parallel Programming Model on SMP Cluster Architecture with Vector Processors for Unstructured Grids
• Nice parallel performance for both inter- and intra-SMP-node parallelism on the ES: 3.8 TFLOPS (33.8% of peak) for a 2.2G-DOF 3D linear-elastic problem on 176 nodes using the BIC(0)-CG method.
– N. Kushida (student of Prof. Okuda) attained >10 TFLOPS using 512 nodes for a >3G-DOF problem.
• Re-Ordering is really required
Summary (cont.)
• Hybrid vs. Flat MPI
– Flat-MPI is better for a small number of SMP nodes.
– Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
– Flat MPI: Communication, Hybrid: Memory
– depends on application, problem size etc.
– Hybrid is much more sensitive to color numbers than flat MPI due to synchronization overhead of OpenMP.
• In Mat-Vec. operations, difference is not so significant.
• Optimization Strategy on the Earth Simulator
– BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
– Matrix Assembling
• Summary & Future Works
“CUBE” Benchmark
• 3D linear elastic applications on cubes for a wide range of problem sizes.
• Hardware (single CPU)
– Earth Simulator
– AMD Opteron (1.8GHz)
[Figure: cube model with (Nx−1)×(Ny−1)×(Nz−1) elements]
Uz=0 @ z=Zmin
Ux=0 @ x=Xmin
Uy=0 @ y=Ymin
Uniform distributed force in z-direction @ z=Zmin
Time for 3x64³ = 786,432 DOF: sec. (MFLOPS)

               ES (8.0 GFLOPS peak)         Opteron 1.8GHz (3.6 GFLOPS peak)
               DJDS original   DCRS         DJDS original   DCRS
Matrix         34.2 (240)      28.6 (291)   12.4 (663)      10.2 (818)
Solver         21.7 (3246)     360 (171)    271 (260)       225 (275)
Total          55.9            389          283             235
Matrix+Solver
[Charts: computation time (sec.) vs. DOF (41,472–786,432) for DJDS original on the ES and on the Opteron; Matrix and Solver portions shown]
Computation Time vs. Problem Size
[Charts: Matrix, Solver, and Total time (sec.) vs. DOF (41,472–786,432) for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS)]
Matrix assembling/formation part is rather expensive
• This part should also be optimized for vector processors.
• For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier–Stokes flow simulations, matrices must be updated at every nonlinear iteration.
• This part strongly depends on the application/physics; therefore it is very difficult to develop general libraries, such as those for iterative linear solvers.
– It also includes complicated processes which are difficult to vectorize.
Typical Procedure for Calculating Coefficient Matrix in FEM
• Apply Galerkin’s method on each element.
• Integrate over each element to obtain the element matrix.
• Element matrices are accumulated to each node to obtain the global matrix ⇒ global linear equations.
• Matrix assembling/formation is an embarrassingly parallel procedure due to its element-by-element nature.
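The element-by-element accumulation can be sketched in Python (dense global matrix for clarity; `elem_matrix` is a hypothetical callback returning an element matrix):

```python
def assemble_global(n_nodes, elements, elem_matrix):
    """Element-by-element assembly: each element matrix is accumulated
    into the global matrix at the rows/columns of its nodes."""
    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for nodes in elements:
        Ke = elem_matrix(nodes)
        for ie, gi in enumerate(nodes):
            for je, gj in enumerate(nodes):
                A[gi][gj] += Ke[ie][je]
    return A

# Two 2-node "elements" sharing node 1; a dummy all-ones element matrix:
ones = lambda nodes: [[1.0] * len(nodes) for _ in nodes]
A = assemble_global(3, [[0, 1], [1, 2]], ones)
print(A[1][1])  # → 2.0 (node 1 receives contributions from both elements)
```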
Element-by-Element Operations
• Integration over each element ⇒ element matrix
• Element matrices are accumulated to each node ⇒ global matrix
• Linear equations for each node
[Figure: mesh of elements 1–24]
Element-by-Element Operations (cont.)
[Figure: elements 1–24 and nodes 1–35 of the mesh]
Element-by-Element Operations (cont.)
[Figure: nonzero pattern of the assembled global matrix]
Element-by-Element Operations (cont.)
Element matrix for a 4-node element (entries e_ij):
 e11 e12 e13 e14
 e21 e22 e23 e24
 e31 e32 e33 e34
 e41 e42 e43 e44
Element-by-Element Operations (cont.)
[Figure: nodes 1–35 of the mesh]
Element-by-Element Operations (cont.)
Global linear equations A u = f:
 a(1,1)u1 + a(1,2)u2 + … = f1
 a(2,1)u1 + a(2,2)u2 + … = f2
 …
 a(35,34)u34 + a(35,35)u35 = f35
Element-by-Element Operations
[Figure: elements 13 and 14 sharing nodes 16 and 23]
• If you calculate a23,16 and a16,23, you have to consider contributions from both the 13th and 14th elements.
Current Approach
[Figure: elements 13 and 14 sharing nodes 16 and 23]
do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo
Current Approach (cont.)
[Figure: elements 13 and 14 sharing nodes 16 and 23]
Local Node ID for each bi-linear 4-node element:
 1 2
 4 3
Current Approach
[Figure: elements 13 and 14 sharing nodes 16 and 23]
• Nice for cache reuse because of localized operations
• Not suitable for vector processors
– a16,23 and a23,16 might not be calculated properly (write conflicts)
– Short innermost loops
– Many “if-then-else”s
(timing table for 786,432 DOF shown again for reference)
Inside the loop: integration at Gaussian quadrature points
do jpn= 1, 2
  do ipn= 1, 2
    coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
• a16,23 and a23,16 might not be calculated properly.
– Coloring the elements: elements which do not share any nodes are placed in the same color.
[Figure: elements 13 and 14 sharing nodes 16 and 23]
Coloring of Elements
[Figure: mesh before element coloring]
Coloring of Elements
[Figure: mesh after element coloring]
Elements sharing the 16th node are assigned to different colors
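That element-coloring rule can be sketched greedily in Python (illustrative only; the element lists are hypothetical): elements sharing any node are forced into different colors, so element matrices within one color can be accumulated without write conflicts.

```python
def color_elements(elements):
    """Smallest color not used by any already-colored element that
    shares a node with this one."""
    colors = []
    for nodes in elements:
        used = {c for c, other in zip(colors, elements)
                if set(other) & set(nodes)}
        c = 0
        while c in used:
            c += 1
        colors.append(c)
    return colors

# A chain of three 2-node elements; neighbors alternate colors:
print(color_elements([[0, 1], [1, 2], [2, 3]]))  # → [0, 1, 0]
```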
Remedy
• a16,23 and a23,16 might not be calculated properly.
– Coloring the elements: elements which do not share any nodes are placed in the same color.
• Short innermost loops
– loop exchange
[Figure: elements 13 and 14 sharing nodes 16 and 23]
Remedy
• a16,23 and a23,16 might not be calculated properly.
– Coloring the elements: elements which do not share any nodes are placed in the same color.
• Short innermost loops
– loop exchange
• Many “if-then-else”s
– define ELEMENT-to-MATRIX array
[Figure: elements 13 and 14 sharing nodes 16 and 23]
Define ELEMENT-to-MATRIX array
ELEMmat(icel, ie, je)
[Figure: elements 13 and 14 with local node IDs ①–④ (Element ID, Local Node ID) mapped to global nodes]
if kkU= index_U(16-1+k) and item_U(kkU)= 23 then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if kkL= index_L(23-1+k) and item_L(kkL)= 16 then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif
Define ELEMENT-to-MATRIX array
[Figure: elements 13 and 14 with local node IDs ①–④, sharing global nodes 16 and 23; the index-search code as on the previous slide]
“ELEMmat” specifies the relationship between the node pairs of each element and the corresponding addresses in the global coefficient matrix.
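Building such an address array can be sketched in Python for a generic CRS matrix (hypothetical names; the real code splits upper/lower triangles as in the snippet above). The column searches run once, up front, leaving the floating-point assembly loop free of if-then-else.

```python
def build_elemmat(elements, index, item):
    """For every (element, ie, je) pair, precompute the CRS address of
    the global entry a(node_ie, node_je)."""
    elemmat = {}
    for icel, nodes in enumerate(elements):
        for ie, gi in enumerate(nodes):
            for je, gj in enumerate(nodes):
                # search row gi for column gj once, up front
                for k in range(index[gi], index[gi + 1]):
                    if item[k] == gj:
                        elemmat[(icel, ie, je)] = k
                        break
    return elemmat

# 2-node mesh stored as a full 2x2 CRS matrix:
index, item = [0, 2, 4], [0, 1, 0, 1]
em = build_elemmat([[0, 1]], index, item)
print(em[(0, 0, 1)], em[(0, 1, 0)])  # → 1 2
```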
Optimized Procedure

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - define “ELEMmat” array
      enddo
    enddo
  enddo
enddo

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - assemble element matrix
      enddo
    enddo
  enddo
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - accumulate element matrix into global matrix
      enddo
    enddo
  enddo
enddo

Extra storage for:
• ELEMmat array
• element-matrix components for elements in each color
• < 10% increase
Extra computation for:
• ELEMmat
PART I: “Integer” operations for “ELEMmat”. In nonlinear cases, this part needs to be done just once (before the initial iteration), as long as the mesh connectivity does not change.
PART II: “Floating-point” operations for matrix assembling/accumulation. In nonlinear cases, this part is repeated at every nonlinear iteration.
Time for 3x64³ = 786,432 DOF: sec. (MFLOPS)

               ES (8.0 GFLOPS peak)                        Opteron 1.8GHz (3.6 GFLOPS peak)
               DJDS original  DJDS improved  DCRS          DJDS original  DJDS improved  DCRS
Matrix         34.2 (240)     12.5 (643)     28.6 (291)    12.4 (663)     21.2 (381)     10.2 (818)
Solver         21.7 (3246)    21.7 (3246)    360 (171)     271 (260)      271 (260)      225 (275)
Total          55.9           34.2           389           283            292            235
Opteron “DJDS improved” is slower than the original because of the long innermost loops (data locality has been lost).
Matrix+Solver
[Charts: computation time (sec.) vs. DOF for DJDS original vs. improved on the ES and the Opteron; Matrix and Solver portions shown]
Computation Time vs. Problem Size
[Charts: Matrix, Solver, and Total time (sec.) vs. DOF for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS)]
“Matrix” computation time for improved version of DJDS
[Charts: “Integer” vs. “Floating” portions of matrix computation time (sec.) vs. DOF on the ES and the Opteron]
Optimization of “Matrix” assembling/formation on ES
• DJDS has been much improved compared to the original, but it is still slower than the DCRS version on the Opteron.
• The “integer” operation part is slower, but the “floating-point” operation part is much faster than on the Opteron.
• In nonlinear simulations, the “integer” operations are executed only once (just before the initial iteration); therefore, the ES outperforms the Opteron if the number of nonlinear iterations is more than 2.
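The break-even argument amounts to a tiny cost model (the setup/per-iteration split below is purely illustrative, not measured data from the slides): a one-time “integer” setup cost is amortized when the cheap per-iteration “floating-point” assembly is repeated.

```python
def matrix_time(setup_once, per_iteration, n_nonlinear):
    """ELEMmat ("integer") setup runs once; "floating" assembly repeats
    at every nonlinear iteration."""
    return setup_once + per_iteration * n_nonlinear

# Hypothetical split: a vector machine pays 10 s of setup but only 2 s
# per assembly; a scalar machine pays no setup but 10 s per assembly.
vector = [matrix_time(10.0, 2.0, n) for n in (1, 2, 3)]
scalar = [matrix_time(0.0, 10.0, n) for n in (1, 2, 3)]
print(vector)  # → [12.0, 14.0, 16.0]
print(scalar)  # → [10.0, 20.0, 30.0]
```

In this hypothetical split the vector machine loses at one iteration but wins from the second iteration on.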
Suppose a “virtual” mode where …
• the “integer” operation part runs on a scalar processor
• the “floating-point” operation part and the linear solvers run on a vector processor
• Note: the scalar performance of the ES (500MHz) is lower than that of a Pentium III.
Time for 3x64³ = 786,432 DOF: sec. (MFLOPS)

               ES (8.0 GFLOPS peak)                        Opteron 1.8GHz (3.6 GFLOPS peak)
               DJDS virtual   DJDS improved  DCRS          DJDS improved  DCRS
Matrix         1.88 (4431)    12.5 (643)     28.6 (291)    21.2 (381)     10.2 (818)
Solver         21.7 (3246)    21.7 (3246)    360 (171)     271 (260)      225 (275)
Total          23.6           34.2           389           292            235
Summary: Vectorization of FEM Applications
• NOT so easy.
• FEM’s good feature of local operations is not necessarily suitable for vector processors.
– Preconditioned iterative solvers can be vectorized rather more easily, because their target is the “global” matrix.
• Sometimes a major revision of the original code is required.
– Usually: more memory, more lines, additional operations …
• Performance of code optimized for vector processors is not necessarily good on scalar processors (e.g. matrix assembling in FEM).