Porting VASP to GPU using OpenACC
Martijn Marsman, Stefan Maintz, Andreas Hehn, Markus Wetzstein, and Georg Kresse
SC19, Denver, 19th Nov. 2019
The Vienna Ab-initio Simulation Package: VASP
Electronic structure from first principles:
𝐻𝜓 = 𝐸𝜓
• Approximations:
• Density Functional Theory (DFT)
• Hartree-Fock/DFT-HF hybrid functionals
• Random Phase Approximation (GW, ACFDT)
• 3500+ licensed academic and industrial groups worldwide.
• 10k+ publications in 2015 (Google Scholar), and rising.
• Developed in the group of Prof. G. Kresse at the University of Vienna.
VASP: Computational Characteristics
VASP does:
• Lots of “smallish” FFTs (e.g. 100⨉100⨉100)
• Matrix-matrix multiplication (DGEMM and ZGEMM)
• Matrix diagonalization: 𝒪(N³) (N ≈ number of electrons)
• All-to-all communication
Using:
• FFTW3 (or the FFTW wrappers to MKL FFTs)
• LAPACK/BLAS3 (MKL, OpenBLAS)
• ScaLAPACK (or ELPA)
• MPI (Open MPI, Intel MPI, …) [+ OpenMP]
VASP is pretty well characterized by the SPECfp2006 benchmark
VASP on GPU
• VASP has organically grown over more than 25 years (450k+ lines of Fortran 77/90/2003/2008/… code)
• Current release: some features were ported with CUDA C (DFT and hybrid functionals)
• Upcoming VASP6 release: re-ported to GPU using OpenACC
• The OpenACC port is already more complete than the CUDA port (Gamma-only version and support for reciprocal-space projectors)
Porting VASP to GPU using OpenACC
• Compiler-directive based: single source, readability, maintainability, …
• cuFFT, cuBLAS, cuSOLVER, CUDA-aware MPI, NCCL
• Some dedicated kernel versions: e.g. batched FFTs, loop re-ordering
• “Manual” deep copies of derived types (nested and/or with pointer members)
• Multiple MPI ranks sharing a GPU (using MPS)
• Combine OpenACC and OpenMP (OpenMP threads driving asynchronous execution queues; see the sketch below)
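A minimal sketch of the last point, assuming a simple band-parallel loop: each OpenMP host thread submits its OpenACC kernels to its own asynchronous queue, so independent work can overlap on the GPU. Array, loop, and size names here are illustrative, not taken from VASP.

program omp_acc_queues
   use omp_lib
   implicit none
   integer, parameter :: n = 1024, nbands = 8
   real(8) :: w(n, nbands)
   integer :: ib, i, iq

   w = 1.0d0
   !$acc enter data copyin(w)

   ! Each OpenMP thread drives its own asynchronous OpenACC queue,
   ! so kernels for independent bands can overlap on the GPU.
   !$omp parallel do private(i, iq)
   do ib = 1, nbands
      iq = omp_get_thread_num() + 1
      !$acc parallel loop async(iq) present(w)
      do i = 1, n
         w(i, ib) = 2.0d0 * w(i, ib)
      end do
   end do
   !$omp end parallel do

   !$acc wait                      ! drain all queues before touching the data
   !$acc exit data copyout(w)
   print *, 'w(1,1) =', w(1,1)
end program omp_acc_queues

With the PGI compilers this kind of combined code would be built with both OpenACC and OpenMP enabled, e.g. pgfortran -acc -mp.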
OpenACC directives
Data directives are designed to be optional:
• Manage data movement
• Initiate parallel execution
• Optimize loop mappings

!$acc data copyin(a,b) copyout(c)
...
!$acc parallel
!$acc loop gang vector
do i = 1, n
   c(i) = a(i) + b(i)
   ...
enddo
!$acc end parallel
...
!$acc end data
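As a usage note: code carrying directives like these is typically compiled with the PGI compilers using flags along the lines of pgfortran -acc -ta=tesla -Minfo=accel example.f90, where -Minfo=accel reports which loops and data regions were actually offloaded (the file name is just a placeholder).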
Nested derived types
• OpenACC + Unified Memory is not an option yet: some aggregates have static members
• OpenACC 2.6 manual deep copy was key (sketched after the diagram below)
• Requires large numbers of directives in some cases (107 lines for COPYIN), but these are well encapsulated
• Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives
• When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all
[Diagram: nesting of the VASP derived types handled by the manual deep copy]
• Derived Type 1: 3 dynamic members, 1 member of Derived Type 2
• Derived Type 2: 21 dynamic members, 1 member of Derived Type 3, 1 member of Derived Type 4
• Derived Type 3: only static members
• Derived Type 4: 8 dynamic members, 4 members of Derived Type 5, 2 members of Derived Type 6
• Derived Type 5: 3 dynamic members
• Derived Type 6: 8 dynamic members
• The COPYIN directives add +12, +48, +26, +8, and +13 lines of code across these types (107 lines in total)
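A minimal sketch of what the OpenACC 2.6 manual deep copy looks like; the type and member names are illustrative, not the actual VASP aggregates. The parent object is copied first, then each dynamic member is copied and attached by its own directive, which is where the extra lines of code per type come from.

program manual_deepcopy
   implicit none
   type inner_t
      real(8), allocatable :: vals(:)
   end type inner_t
   type outer_t
      real(8), allocatable :: grid(:)
      type(inner_t) :: sub
   end type outer_t
   type(outer_t) :: o
   integer :: i

   allocate(o%grid(100), o%sub%vals(100))
   o%grid = 1.0d0
   o%sub%vals = 2.0d0

   ! Manual deep copy: copy the parent object first, then each dynamic
   ! member, so the device copies of the allocatables get attached.
   !$acc enter data copyin(o)
   !$acc enter data copyin(o%grid, o%sub%vals)

   !$acc parallel loop present(o)
   do i = 1, 100
      o%grid(i) = o%grid(i) + o%sub%vals(i)
   end do

   ! Release in reverse order: dynamic members first, then the parent.
   !$acc exit data copyout(o%grid, o%sub%vals)
   !$acc exit data delete(o)
   print *, 'o%grid(1) =', o%grid(1)
end program manual_deepcopy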
VASP on GPU benchmarks
CuC_vdW
• C@Cu surface (Ω ≅ 2800 Å³)
• 96 Cu + 2 C atoms (1064 e−)
• vdW-DFT
• RMM-DIIS
• The OpenACC port outperforms the previous CUDA port …
[Chart: CuC_vdW, speedup vs. CPU]
           VASP 5   VASP 6   VASP 6+
CPU          1.0      1.0      1.0
1 V100       1.7      2.3      2.5
2 V100       2.2      3.3      3.7
4 V100       2.9      4.1      4.7
8 V100       3.3      5.4      6.6
• CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores
CUDA C vs. OpenACC port
• Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:
• Amdahl’s law for the non-GPU-accelerated parts of the code affects both implementations, but blurs the differences
• OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken for the CUDA version
• The OpenACC version uses GPU-aware MPI to help the more communication-heavy parts, like the orthonormalization
• The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older
Can we find a fairer comparison? Let’s look at the RMM-DIIS algorithm …
Iterative diagonalization: RMM-DIIS (EDDRMM)
• The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions
• The CUDA version uses kernel fusion; the OpenACC version uses two refactored kernels
• Minimal amount of MPI communication
• The OpenACC version improves scaling with the number of GPUs
EDDRMM section of silica_IFPEN on V100
[Chart: EDDRMM section (silica_IFPEN), speedup over CPU vs. number of V100 GPUs (1, 2, 4, 8); series: VASP 5.4.4 and dev_OpenACC]
• EDDRMM takes 17% of total runtime
• benefits for expectation values included
• These high speedups are not the sole source of the overall improvement, but they are an important contribution
• OpenACC improves scaling yet again
• MPS always helps this section, but does not pay off in total runtime due to its start-up overhead
CPU: dual-socket Broadwell E5-2698 v4, compiler: Intel 17.0.1. GPU: VASP 5.4.4 compiler: Intel 17.0.1; dev_OpenACC compiler: PGI 18.3 (CUDA 9.1)
Orthonormalization
• GPU-aware MPI benefits from NVLink latency and bandwidth (see the sketch after the table)
• Data remains on the GPU; the CUDA port streamed data for the GEMMs
• Cholesky on the CPU saves a (smaller) memory transfer
• 180 ms (40%) are saved by GPU-aware MPI alone
• 33 ms (7.5%) by the other changes
Section-level comparison for orthonormalization (CUDA C port vs. OpenACC port):
Section                          CUDA C port                OpenACC port
Redistributing wavefunctions     Host-only MPI (185 ms)     GPU-aware MPI (110 ms)
Matrix-matrix multiplications    Streamed data (19 ms)      GPU-local data (15 ms)
Cholesky decomposition           CPU-only (24 ms)           cuSolver (12 ms)
Matrix-matrix multiplications    Default scheme (30 ms)     Better blocking (13 ms)
Redistributing wavefunctions     Host-only MPI (185 ms)     GPU-aware MPI (80 ms)
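A minimal sketch of how OpenACC hands device buffers to a GPU-aware MPI library, as used for the wavefunction redistribution: host_data exposes the device addresses, so the transfer stays GPU-to-GPU (e.g. over NVLink) without staging through the host. A CUDA-aware MPI build is assumed; the buffer size and the reduction used here are illustrative only.

program gpu_aware_mpi
   use mpi
   implicit none
   integer, parameter :: n = 1024
   real(8) :: sendbuf(n), recvbuf(n)
   integer :: ierr, rank

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   sendbuf = real(rank, 8)

   !$acc data copyin(sendbuf) copyout(recvbuf)
   ! Pass device pointers to MPI: the library moves the data directly
   ! between GPUs instead of copying it back to the host first.
   !$acc host_data use_device(sendbuf, recvbuf)
   call MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)
   !$acc end host_data
   !$acc end data

   if (rank == 0) print *, 'recvbuf(1) =', recvbuf(1)
   call MPI_Finalize(ierr)
end program gpu_aware_mpi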
VASP on GPU benchmarks
Si256_VJT_HSE06
• Vacancy in Si (Ω ≅ 5200 Å³)
• 255 Si atoms (1020 e−)
• DFT/HF-hybrid functional
• Conjugate gradient
• Batched FFTs
• Explicit overlay of computation and communication using non-blocking collectives (NCCL)
• CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores
[Chart: Si256_VJT_HSE06, speedup vs. CPU]
           VASP 6   VASP NV
CPU          1.0      1.0
1 V100       4.7      4.7
2 V100       8.8      9.0
4 V100      15.7     15.9
8 V100      28.1     28.7
The OpenACC port: current limitations
• Some bottlenecks must be addressed: computation of the local potential is still done CPU-side.
• Not all features are ported yet: currently we are porting the linear response solvers and the cubic-scaling ACFDT (RPA total energies)
• Some features of VASP, e.g. cubic-scaling RPA, are very (very) memory intensive and involve diagonalization of large complex matrices (> 100k ⨉ 100k): e.g. cusolverMgSyevd
• PGI compilers only
New Release: VASP6
• …
• Cubic-scaling RPA (ACFDT, GW)
• On-the-fly machine learned force-fields
• Electron-Phonon coupling
• MPI+OpenMP
• OpenACC port
• …
• Caveat: the OpenACC port is still regarded as “experimental” at this stage
• Actively gathering feedback (from HPC sites)
• Intensive support effort
https://www.vasp.at/wiki/index.php/Category:VASP6
THE END
Special thanks to Stefan Maintz, Andreas Hehn, and Markus Wetzstein from NVIDIA and PGI!
And to Ani Anciaux-Sedrakian and Thomas Guignon at IFPEN!
And to you for listening!