Extreme-Scale Parallel Symmetric Eigensolver for Very Small-Size Matrices Using A Communication-Avoiding for Pivot Vectors

Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Jun'ichi Iwata and Kazuyuki Uchida (Department of Applied Physics, School of Engineering, The University of Tokyo)

Thursday, February 20, Room: Salon A, 10:35-10:55
MS34 Auto-tuning Technologies for Extreme-Scale Solvers - Part I of III
SIAM PP14, Feb. 18-21, 2014, Marriott Portland Downtown Waterfront, Portland, OR, USA
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
RSDFT (Real Space Density Functional Theory)
The Kohn-Sham equation

\left[ -\frac{1}{2}\nabla^2 + v_{\mathrm{ion}}(\mathbf{r}) + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}' + v_{\mathrm{XC}}(\mathbf{r}) \right] \psi_j(\mathbf{r}) = \varepsilon_j \psi_j(\mathbf{r})

is solved as a finite-difference equation.
J.-I. Iwata et al., J. Comp. Phys. 229, 2339 (2010).
[Figure: 10,648-atom cell of Si crystal and its electron density; plot of energy/atom (eV) vs. volume/atom for the 10,648-atom and 21,952-atom cells. Structural properties of Si crystal.]
Requirements of Mathematical Software from RSDFT
• An FFT-free algorithm.
• Computation of all eigenvalues and eigenvectors of a dense real symmetric matrix.
– A standard eigenproblem.
– Executed O(100) times during the SCF (Self-Consistent Field) process.
• Re-orthogonalization of the eigenvectors.
• Due to their computational complexity, the eigensolver and orthogonalization parts become a bottleneck: they require O(N³) computations, while the other parts require O(N²).
• The matrix and eigenvalues are distributed to obtain parallelism for the parts other than the eigensolver.
– Gathering the whole data on one node is difficult, even though the matrix is small.
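The re-orthogonalization step mentioned above can be sketched with modified Gram-Schmidt. This is an illustrative NumPy version (the function name and the unblocked loop structure are my own), not the blocked DGEMM-based kernel that RSDFT actually uses:

```python
import numpy as np

def reorthogonalize(V):
    """Modified Gram-Schmidt re-orthogonalization of the columns of V.
    For an N x N block of eigenvectors this costs O(N^3) flops, which
    is why it appears as a bottleneck alongside the eigensolver."""
    Q = np.array(V, dtype=float)
    for j in range(Q.shape[1]):
        for i in range(j):
            # remove the component of column j along column i
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q
```

The two nested loops over N columns, each touching a length-N vector, give the O(N³) cost; a production version would instead use blocked matrix-matrix products.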
Requirements of Mathematical Software from RSDFT (Cont’d)
• Parts of the application other than the eigensolver are also time-consuming.

Source: Y. Hasegawa et al.: First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer, SC11 (2011).
RSDFT Processes Breakdown

Process                     Execution cost relative to whole time [%]   Order
SCF                         99.6                                        O(N³)
SD                          47.2                                        O(N³)
Subspace Diag.              44.2                                        O(N³)
MatE                        10.0                                        O(N³), DGEMM
Eigensolve                  19.6                                        O(N³)
Rot V                       14.6                                        O(N³)
CG (Conjugate Gradient)     26.0                                        O(N²)
GS (Gram-Schmidt Orth.)     25.8                                        O(N³), DGEMM
Others                       0.6                                        -
The Eigensolve and GS parts will be the bottleneck in large-scale computation, but the other processes also need to be considered.
• The required memory space also needs to be considered.
– Due to the API of the numerical library (e.g., re-distribution of data), the actual problem size is limited to small sizes by the remaining memory space.
Our Assumption
• Target: the eigensolver part in RSDFT.
• Exa-scale computing: the total number of nodes is on the order of 1,000,000 (a million).
• Since the matrix is two-dimensional (2D), if each node holds a local matrix of N=10,000, the global matrix size required on an exa-scale computer reaches the order of 10,000 × sqrt(1,000,000) = 10,000,000 (ten million).
• Since most dense solvers require O(N³) computation, the execution time for a matrix of N=10,000,000 (ten million) is unrealistic for actual applications (in the production-run phase).
Our Assumption (Cont'd)
• We presume that N=1,000 per node is the maximum size; the global size at exa-scale is then on the order of N=1,000,000 (a million).
• The memory used for the matrix per node is then only on the order of 8 MB.
– This is for the eigensolver part only.
• This is just the cache size of current CPUs.
– Next-generation CPUs may have on the order of 100 MB of cache, such as the IBM POWER8 with eDRAM (3D stacked memory) for its L4 cache.
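The sizing argument on these two slides can be checked with a few lines of arithmetic (all numbers are the slides' assumptions, not measurements):

```python
import math

nodes = 1_000_000      # assumed exa-scale node count
n_local = 1_000        # assumed maximum local matrix dimension per node

# with a 2D (sqrt(p) x sqrt(p)) process grid, the global dimension is:
n_global = n_local * math.isqrt(nodes)

# double-precision (8-byte) storage for the local N x N block:
mb_per_node = 8 * n_local**2 / 1e6

print(n_global)      # 1000000: global N on the order of a million
print(mb_per_node)   # 8.0 MB: roughly the cache size of current CPUs
```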
Originalities of Our Eigensolver
1. Non-blocking computation algorithm: the data fits in cache under our exa-scale assumption.
2. Communication-reducing and communication-avoiding algorithm: applied to the tridiagonalization and Householder inverse transformation phases of the symmetric eigensolver, by duplicating the Householder vectors.
3. Hybrid MPI-OpenMP execution: on the full system of a peta-scale supercomputer (the Fujitsu FX10), consisting of 4,800 nodes (76,800 cores).
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
A Classical Householder Algorithm (Standard Eigenproblem Ax = λx)

1. Householder transformation (tridiagonalization), O(n³): the symmetric dense matrix A is reduced to a tridiagonal matrix T = QAQ^T, with Q = H_1 H_2 … H_{n-2}.
2. Bisection, O(n²): all eigenvalues Λ of the tridiagonal matrix T.
3. Inverse iteration, O(n²)~O(n³) (MRRR: O(n²)): all eigenvectors Y of T.
4. Householder inverse transformation, O(n³): all eigenvectors of the dense matrix A: X = QY.
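Step 1 (and the relation A = Q T Q^T used in step 4) can be sketched in NumPy. This is a minimal unblocked reference version for illustration only, not the communication-avoiding parallel implementation the talk describes:

```python
import numpy as np

def householder_tridiagonalize(A):
    """Reduce a real symmetric matrix A to tridiagonal form T via
    successive Householder reflections, accumulating
    Q = H_1 H_2 ... H_{n-2} so that A = Q T Q^T."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue  # column already in tridiagonal form
        v = x.copy()
        # choose the sign that avoids cancellation
        v[0] += np.sign(x[0]) * normx if x[0] != 0 else normx
        v /= np.linalg.norm(v)
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)
        T = H @ T @ H   # H is symmetric and orthogonal
        Q = Q @ H
    return T, Q
```

Eigenvectors Y of T then map back to eigenvectors of A as X = QY, which is step 4 above.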
Whole Parallel Processes on the Eigensolver

[Diagram: A → (tridiagonalization) → T; all elements of T are gathered on every process (T is duplicated); upper and lower limits for the eigenvalues are computed; the eigenvalues Λ are obtained in ascending order (1, 2, 3, 4, …), and the eigenvectors are computed in the order corresponding to the ascending eigenvalues.]
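The bisection step (locating each eigenvalue between computed upper and lower limits) can be illustrated with a Sturm-sequence count on the tridiagonal matrix. This is a minimal sketch using Gershgorin discs for the initial limits; the function names are mine, not from the talk:

```python
import numpy as np

def count_below(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix
    (diagonal d, off-diagonal e) smaller than x, via the Sturm
    sequence / LDL^T pivot recurrence."""
    count, q = 0, d[0] - x
    if q < 0:
        count += 1
    for i in range(1, len(d)):
        if q == 0.0:
            q = 1e-300  # avoid division by zero
        q = d[i] - x - e[i - 1] ** 2 / q
        if q < 0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection between
    Gershgorin bounds, i.e. the 'upper and lower limits' above."""
    d, e = np.asarray(d, float), np.asarray(e, float)
    r = np.zeros(len(d))
    r[:-1] += np.abs(e)
    r[1:] += np.abs(e)
    lo, hi = (d - r).min() - tol, (d + r).max() + tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) >= k + 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Each eigenvalue can be found independently by such a bisection, which is what makes this step straightforward to distribute across processes.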
Execution Time in Pure MPI: ScaLAPACK PDSYEVD vs. Ours

ScaLAPACK (version 1.8) on the Fujitsu FX10; the Fujitsu-optimized BLAS is used. The best block size is chosen for each ScaLAPACK execution from among 1, 8, 16, 32, 64, 128, and 256.

Problem (process grid, cores)      ScaLAPACK [s]   Ours [s]
N=4,800  (8×8, 64 cores)               4.26          1.79
N=9,600  (16×16, 256 cores)           10.96          4.61
N=19,200 (32×32, 1,024 cores)         25.76         15.52
(Time in seconds; lower is better.)

Only a 4x increase in time for a 2x problem size, even with an O(N³) algorithm.
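The speedups implied by the measured times can be computed directly (the times are the values from the chart above):

```python
# (matrix size, cores) -> (ScaLAPACK PDSYEVD time [s], our time [s])
cases = {
    "N=4800, 64 cores":    (4.26, 1.79),
    "N=9600, 256 cores":   (10.96, 4.61),
    "N=19200, 1024 cores": (25.76, 15.52),
}
for name, (t_sca, t_ours) in cases.items():
    print(f"{name}: {t_sca / t_ours:.2f}x faster than ScaLAPACK")
```

Across these sizes our solver is roughly 1.7x to 2.4x faster than ScaLAPACK PDSYEVD.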
Conclusion
• Our eigensolver is effective for very small matrices, utilizing communication-reducing and communication-avoiding techniques:
– by halving the duplicated Householder vectors in the tridiagonalization and Householder inverse transformation phases;
– by reducing communications through multiple sends with 2D splitting of the process grid;
– by packing messages in the Householder inverse transformation part.
• The selection of implementations for the communication processes is the target of AT (auto-tuning).
– The best implementation depends on the process grid, the number of processes, and the block size for data packing.
Conclusion (Cont'd)
• One drawback is the increase of memory space: O(N²/p), where the process grid is p × q.
– Since the memory space for the matrix is within cache size, this increase can be ignored.
• Comparison with new blocking algorithms is future work, e.g., 2-step methods with block Householder tridiagonalization:
• Eigen-K (RIKEN)
• ELPA (Technische Universität München)
• New implementations of PLASMA and MAGMA
Acknowledgements
• The computational resources of the Fujitsu FX10 were awarded by the "Large-scale HPC Challenge" Project, Information Technology Center, The University of Tokyo.
This work has been submitted to Parallel Computing (as of December 2013).