Recent advances on the solution phase of direct solvers with multiple sparse right-hand sides
P. Amestoy (1), J.-Y. L'Excellent (2), G. Moreau (2)
1. Université de Toulouse, INPT and IRIT; 2. Université de Lyon, Inria and LIP-ENS Lyon
MUMPS User's Days, Montbonnot, June 1-2 2017
Linear systems of equations: Ax = b, where A is sparse.
The solve phase (Ly = b, Ux = y) may be critical.
[Figure: 3D model with axes Depth (km), Dip (km) and Cross (km); colour scale from 3000 to 6000 m/s]
Applications arising from the Helmholtz or Maxwell equations:
name n (million) nrhs nnz/nrhs Tfacto Tsolve
sei70m 2.9 2302 587 1258 1267
sei50m 7.1 2302 486 6289 2985
E1 0.33 8000 9.8 55.2 291
E3 2.8 8000 7.5 1951 5610
Table: Characteristics of matrices and right-hand-sides.
Introduction
Objectives:
• focus on the forward solution phase Ly = b;
• exploit sparsity of right-hand-sides;
• limit the number of operations (∆);
Overview
• Exploitation of sparse right-hand sides
  ◦ Context of study
  ◦ Tree pruning
  ◦ Exploitation of subintervals of columns at each node
• Minimizing the number of operations
  ◦ Permutation of columns
  ◦ Adapted blocking technique
• Conclusion
Context: analysis phase
Ordering: reorder the variables of the matrix A to reduce fill-in and build the elimination tree:
• Nested Dissection ⇒ build tree of separators.
[Figure, three panels: 3D physical domain (cube) with 27 numbered subdomains and separators; separator tree; structure of the L factor]
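As a rough illustration of how nested dissection yields both an ordering and a separator tree, the sketch below applies it to a 1D chain of 15 vertices instead of the 3D cube of the slide (an assumption made for brevity; the function name and data layout are hypothetical). Subdomains are numbered first and the separator last, so the root of the tree receives the highest label.

```python
def nested_dissection(vertices, order, parent):
    """Order a contiguous chain of vertices by nested dissection.

    Vertices are appended to `order` (both halves first, separator last).
    `parent` maps the new label of each separator to the label of the
    separator above it. Returns the new (1-based) label of the separator
    of this chain, or None if the chain is empty.
    """
    if not vertices:
        return None
    mid = len(vertices) // 2          # the middle vertex separates the chain
    left = nested_dissection(vertices[:mid], order, parent)
    right = nested_dissection(vertices[mid + 1:], order, parent)
    order.append(vertices[mid])       # the separator is eliminated last
    label = len(order)
    for child in (left, right):
        if child is not None:
            parent[child] = label
    return label


order, parent = [], {}
root = nested_dissection(list(range(15)), order, parent)
print("elimination order:", order)
print("root label:", root, "children of the root:",
      sorted(c for c, p in parent.items() if p == root))
```

With 15 vertices this produces a complete binary tree whose root is labelled 15 and whose children are labelled 7 and 14, the same shape as a tree with nodes u1 to u15 numbered in postorder.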
Forward solution process
Block operations:
• $y_1 \leftarrow L_{11}^{-1} b_1$
• $b_2 \leftarrow b_2 - L_{21} y_1$
#flops for node u is given by:
$F_u = 2 \times (\text{number of entries in } L_{11} \text{ and } L_{21})$
Total #flops: $\Delta = \sum_{u \in T} F_u$
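The two block operations and the flop count can be sketched for a single node and a single right-hand side as follows. This is a minimal pure-Python illustration, assuming dense storage as lists of rows and counting only the lower triangle of L11; the function names are hypothetical and do not come from the solver.

```python
def forward_node(L11, L21, b1, b2):
    """Apply the two block operations of one node to one right-hand side."""
    n1 = len(L11)
    y1 = [0.0] * n1
    # y1 <- L11^{-1} b1, by forward substitution on the lower triangle
    for i in range(n1):
        s = b1[i] - sum(L11[i][j] * y1[j] for j in range(i))
        y1[i] = s / L11[i][i]
    # b2 <- b2 - L21 y1, the update passed on to the ancestors
    new_b2 = [b2[i] - sum(L21[i][j] * y1[j] for j in range(n1))
              for i in range(len(L21))]
    return y1, new_b2


def node_flops(n1, n2):
    """F_u = 2 * (entries in L11 + entries in L21), with L11 an n1 x n1
    lower triangle and L21 an n2 x n1 rectangular block."""
    return 2 * (n1 * (n1 + 1) // 2 + n2 * n1)


L11 = [[2.0, 0.0],
       [1.0, 4.0]]
L21 = [[1.0, 2.0]]
y1, b2 = forward_node(L11, L21, b1=[2.0, 9.0], b2=[10.0])
print(y1, b2, node_flops(2, 1))
```

In practice these operations are performed with dense BLAS kernels on the frontal blocks; the loops above only make the arithmetic explicit.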
Forward solution process
Forward solve phase processes the tree from bottom to top:
[Figure: right-hand side b indexed by nodes u1 to u15 (×: initial non-zeros, f: filled non-zeros), shown beside the elimination tree with root u15]
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning ($T \to T_p(b)$) to reduce computation:
$\Delta = \sum_{u \in T_p(b)} F_u$
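Tree pruning can be sketched as follows: the pruned tree is taken to be the union of the paths from every node holding an initial non-zero of b up to the root. The 15-node tree below is assumed to be a complete binary tree numbered in postorder (root u15), and the per-node flop counts in FLOPS are made-up illustrative values, not measurements.

```python
# Parent of each node (hypothetical encoding of a 15-node tree, root 15).
PARENT = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15,
          8: 10, 9: 10, 10: 14, 11: 13, 12: 13, 13: 14, 14: 15, 15: None}

# Illustrative flop counts F_u: larger fronts closer to the root.
FLOPS = {u: 2 for u in PARENT}
FLOPS.update({3: 4, 6: 4, 10: 4, 13: 4, 7: 8, 14: 8, 15: 16})


def pruned_tree(active_nodes):
    """Return T_p(b): all nodes on a path from an active node to the root."""
    kept = set()
    for u in active_nodes:
        # Climb until the root or an already visited node is reached.
        while u is not None and u not in kept:
            kept.add(u)
            u = PARENT[u]
    return kept


def delta(nodes):
    """Sum of F_u over the given set of nodes."""
    return sum(FLOPS[u] for u in nodes)


tp = pruned_tree([1, 6])          # b has initial non-zeros in u1 and u6
print(sorted(tp), delta(tp), "instead of", delta(PARENT))
```

With these values only 5 of the 15 nodes are visited, and the count drops from 64 to 34.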
Exposition of padded zeros
When B is a matrix with multiple columns:
• use of BLAS 3 operations for efficiency;
• $T_p(B) = \bigcup_i T_p(B_i)$, where $B_i$ is column $i$ of $B$;
[Figure: structure of B with nrhs = 6 columns, rows grouped by nodes u1 to u15 (×: initial non-zeros, f: filled non-zeros), shown beside the elimination tree with root u15]
But still, extra computations are done ...
$\Delta = n_{rhs} \times \sum_{u \in T_p(B)} F_u$
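The cost of the padded zeros can be made concrete by comparing the block count above with the count obtained when every column is solved on its own pruned tree. The sketch below reuses an assumed 15-node complete binary tree (postorder numbering, root u15) and made-up flop counts; all names are hypothetical.

```python
PARENT = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15,
          8: 10, 9: 10, 10: 14, 11: 13, 12: 13, 13: 14, 14: 15, 15: None}
FLOPS = {u: 2 for u in PARENT}                      # illustrative F_u values
FLOPS.update({3: 4, 6: 4, 10: 4, 13: 4, 7: 8, 14: 8, 15: 16})


def pruned_tree(active_nodes):
    """Nodes on the paths from the active nodes up to the root."""
    kept = set()
    for u in active_nodes:
        while u is not None and u not in kept:
            kept.add(u)
            u = PARENT[u]
    return kept


def delta_block(columns):
    """nrhs * sum of F_u over T_p(B), the union of the column trees."""
    union = set().union(*(pruned_tree(col) for col in columns))
    return len(columns) * sum(FLOPS[u] for u in union)


def delta_columnwise(columns):
    """Sum over the columns of the cost on each column's own pruned tree."""
    return sum(sum(FLOPS[u] for u in pruned_tree(col)) for col in columns)


# Each column of B is described by the nodes holding its initial non-zeros.
B = [{1}, {4}, {9}]
print(delta_block(B), "flops with one block,",
      delta_columnwise(B), "flops column by column")
```

Here the block solve costs 150 against 90 column by column: the difference is work spent on explicit zeros.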
Solutions
What are the possible alternatives?
• Indirections: rebuilding data structures;
• Sequential: solution phase on each column ⇒ optimal (∆ = ∆min) but not efficient;
• Regular blocking: how to build blocks?
  ◦ minimal access to factors (out of core) [Amestoy et al., SISC, 2012];
  ◦ minimal number of operations (in core) [Yamazaki et al., 2013];
• Exploitation of subintervals of columns at each node [Amestoy et al., SISC, 2015].
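One way to read the last alternative is that each node only processes the columns between its first and last active column, so zeros are padded only inside that interval. The sketch below counts operations under this reading, which is an assumption, on an assumed 15-node complete binary tree with made-up flop counts. It also shows why the order of the columns matters.

```python
PARENT = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15,
          8: 10, 9: 10, 10: 14, 11: 13, 12: 13, 13: 14, 14: 15, 15: None}
FLOPS = {u: 2 for u in PARENT}                      # illustrative F_u values
FLOPS.update({3: 4, 6: 4, 10: 4, 13: 4, 7: 8, 14: 8, 15: 16})


def pruned_tree(active_nodes):
    """Nodes on the paths from the active nodes up to the root."""
    kept = set()
    for u in active_nodes:
        while u is not None and u not in kept:
            kept.add(u)
            u = PARENT[u]
    return kept


def delta_subintervals(columns):
    """Count operations when node u works on columns first[u]..last[u]."""
    first, last = {}, {}
    for i, col in enumerate(columns):
        for u in pruned_tree(col):
            first.setdefault(u, i)    # first column in which u is active
            last[u] = i               # last column seen so far
    return sum(FLOPS[u] * (last[u] - first[u] + 1) for u in first)


original = [{1}, {9}, {2}]   # columns touching u1 and u2 are separated
permuted = [{1}, {2}, {9}]   # the same columns, with u1 and u2 adjacent
print(delta_subintervals(original), delta_subintervals(permuted))
```

With these values the original order costs 102 and the permuted order 90, which equals the column-by-column count for this small case. Grouping columns that share tree nodes shrinks the intervals.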
Table: Number of groups created for each strategy with a tolerance such that ∆ < 1.01 × ∆tol.
Conclusion
Achievements:
• implementation of two heuristics (permutation, blocking);
• 90% decrease in flops by exploiting sparsity;
• up to 40% decrease in time for the forward solve w.r.t. the INT strategy and Nested Dissection ordering (sequential).
Perspectives:
• adapt the Flat Tree algorithm to unbalanced trees;
• parallelism and sparsity aspects of Flat Tree permutation;
• extend to more general test cases.
Acknowledgements
• LIP laboratory for access to the machines;
• EMGS and SEISCOPE for providing test cases;
• This work was performed within the frameworks of both the MUMPS consortium and the LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program "Investissements d'Avenir" (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR).