Porting VASP to GPU using OpenACC
Martijn Marsman, Stefan Maintz, Andreas Hehn, Markus Wetzstein, and Georg Kresse
SC19, Denver, 19th Nov. 2019
The Vienna Ab-initio Simulation Package: VASP
Electronic structure from first principles:
𝐻𝜓 = 𝐸𝜓
• Approximations:
• Density Functional Theory (DFT)
• Hartree-Fock/DFT-HF hybrid functionals
• Random Phase Approximation (GW, ACFDT)
• 3500+ licensed academic and industrial groups worldwide.
• 10k+ publications in 2015 (Google Scholar), and rising.
• Developed in the group of Prof. G. Kresse at the University of Vienna.
VASP: Computational Characteristics
VASP does:
• Lots of “smallish” FFTs (e.g. 100⨉100⨉100)
• Matrix-matrix multiplication (DGEMM and ZGEMM)
• Matrix diagonalization: 𝒪(N³) (N ≈ number of electrons)
• All-to-all communication
Using:
• FFTW3 (or the FFTW wrappers to MKL FFTs)
• LAPACK/BLAS3 (MKL, OpenBLAS)
• ScaLAPACK (or ELPA)
• MPI (Open MPI, Intel MPI, …) [+ OpenMP]
VASP is pretty well characterized by the SPECfp2006 benchmark
VASP on GPU
• VASP has organically grown over more than 25 years (450k+ lines of Fortran 77/90/2003/2008/… code)
• Current release: some features were ported with CUDA C (DFT and hybrid functionals)
• Upcoming VASP6 release: re-ported to GPU using OpenACC
• The OpenACC port is already more complete than the CUDA port (Gamma-only version and support for reciprocal-space projectors)
Porting VASP to GPU using OpenACC
• Compiler-directive based: single source, readability, maintainability, …
• cuFFT, cuBLAS, cuSOLVER, CUDA-aware MPI, NCCL
• Some dedicated kernel versions: e.g. batched FFTs, loop re-ordering
• “Manual” deep copies of derived types (nested and/or with pointer members)
• Multiple MPI ranks sharing a GPU (using MPS)
• Combine OpenACC and OpenMP (OpenMP threads driving asynchronous execution queues; see the sketch below)
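A minimal sketch of the last point, assuming a simple band-parallel loop: each OpenMP host thread submits its OpenACC kernels to its own asynchronous queue, so independent work can overlap on the GPU. Array, loop, and size names here are illustrative, not taken from VASP.

program omp_acc_queues
   use omp_lib
   implicit none
   integer, parameter :: n = 1024, nbands = 8
   real(8) :: w(n, nbands)
   integer :: ib, i, iq

   w = 1.0d0
   !$acc enter data copyin(w)

   ! Each OpenMP thread drives its own asynchronous OpenACC queue,
   ! so kernels for independent bands can overlap on the GPU.
   !$omp parallel do private(i, iq)
   do ib = 1, nbands
      iq = omp_get_thread_num() + 1
      !$acc parallel loop async(iq) present(w)
      do i = 1, n
         w(i, ib) = 2.0d0 * w(i, ib)
      end do
   end do
   !$omp end parallel do

   !$acc wait                      ! drain all queues before touching the data
   !$acc exit data copyout(w)
   print *, 'w(1,1) =', w(1,1)
end program omp_acc_queues

With the PGI compilers this kind of combined code would be built with both OpenACC and OpenMP enabled, e.g. pgfortran -acc -mp.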
OpenACC directives
Data directives are designed to be optional:
• Manage data movement
• Initiate parallel execution
• Optimize loop mappings

!$acc data copyin(a,b) copyout(c)
...
!$acc parallel
!$acc loop gang vector
do i = 1, n
   c(i) = a(i) + b(i)
   ...
enddo
!$acc end parallel
...
!$acc end data
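As a usage note: code carrying directives like these is typically compiled with the PGI compilers using flags along the lines of pgfortran -acc -ta=tesla -Minfo=accel example.f90, where -Minfo=accel reports which loops and data regions were actually offloaded (the file name is just a placeholder).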
Nested derived types
• OpenACC + Unified Memory is not an option yet: some aggregates have static members
• OpenACC 2.6 manual deep copy was key (sketched after the diagram below)
• Requires large numbers of directives in some cases (107 lines for COPYIN), but these are well encapsulated
• Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives
• When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all
[Diagram: nesting of the VASP derived types handled by the manual deep copy]
• Derived Type 1: 3 dynamic members, 1 member of Derived Type 2
• Derived Type 2: 21 dynamic members, 1 member of Derived Type 3, 1 member of Derived Type 4
• Derived Type 3: only static members
• Derived Type 4: 8 dynamic members, 4 members of Derived Type 5, 2 members of Derived Type 6
• Derived Type 5: 3 dynamic members
• Derived Type 6: 8 dynamic members
• The COPYIN directives add +12, +48, +26, +8, and +13 lines of code across these types (107 lines in total)
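A minimal sketch of what the OpenACC 2.6 manual deep copy looks like; the type and member names are illustrative, not the actual VASP aggregates. The parent object is copied first, then each dynamic member is copied and attached by its own directive, which is where the extra lines of code per type come from.

program manual_deepcopy
   implicit none
   type inner_t
      real(8), allocatable :: vals(:)
   end type inner_t
   type outer_t
      real(8), allocatable :: grid(:)
      type(inner_t) :: sub
   end type outer_t
   type(outer_t) :: o
   integer :: i

   allocate(o%grid(100), o%sub%vals(100))
   o%grid = 1.0d0
   o%sub%vals = 2.0d0

   ! Manual deep copy: copy the parent object first, then each dynamic
   ! member, so the device copies of the allocatables get attached.
   !$acc enter data copyin(o)
   !$acc enter data copyin(o%grid, o%sub%vals)

   !$acc parallel loop present(o)
   do i = 1, 100
      o%grid(i) = o%grid(i) + o%sub%vals(i)
   end do

   ! Release in reverse order: dynamic members first, then the parent.
   !$acc exit data copyout(o%grid, o%sub%vals)
   !$acc exit data delete(o)
   print *, 'o%grid(1) =', o%grid(1)
end program manual_deepcopy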
VASP on GPU benchmarks
CuC_vdW
• C@Cu surface (Ω ≅ 2800 Å³)
• 96 Cu + 2 C atoms (1064 e−)
• vdW-DFT
• RMM-DIIS
• The OpenACC port outperforms the previous CUDA port …
[Chart: CuC_vdW, speedup vs. CPU]
           VASP 5   VASP 6   VASP 6+
CPU          1.0      1.0      1.0
1 V100       1.7      2.3      2.5
2 V100       2.2      3.3      3.7
4 V100       2.9      4.1      4.7
8 V100       3.3      5.4      6.6
• CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores
CUDA C vs. OpenACC port
• Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:
• Amdahl’s law for the non-GPU-accelerated parts of the code affects both implementations, but blurs the differences
• OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken for the CUDA version
• The OpenACC version uses GPU-aware MPI to help the more communication-heavy parts, like the orthonormalization
• The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older
Can we find a fairer comparison? Let’s look at the RMM-DIIS algorithm …
Iterative diagonalization: RMM-DIIS (EDDRMM)
• The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions
• The CUDA version uses kernel fusion; the OpenACC version uses two refactored kernels
• Minimal amount of MPI communication
• The OpenACC version improves scaling with the number of GPUs
EDDRMM section of silica_IFPEN on V100
[Chart: EDDRMM section (silica_IFPEN), speedup over CPU vs. number of V100 GPUs (1, 2, 4, 8); series: VASP 5.4.4 and dev_OpenACC]
• EDDRMM takes 17% of total runtime
• benefits for expectation values included
• These high speedups are not the sole source of the overall improvement, but they are an important contribution
• OpenACC improves scaling yet again
• MPS always helps this section, but does not pay off in total runtime due to its start-up overhead
CPU: dual-socket Broadwell E5-2698 v4, compiler: Intel 17.0.1. GPU: VASP 5.4.4 compiler: Intel 17.0.1; dev_OpenACC compiler: PGI 18.3 (CUDA 9.1)
Orthonormalization
• GPU-aware MPI benefits from NVLink latency and bandwidth (see the sketch after the table)
• Data remains on the GPU; the CUDA port streamed data for the GEMMs
• Cholesky on the CPU saves a (smaller) memory transfer
• 180 ms (40%) are saved by GPU-aware MPI alone
• 33 ms (7.5%) by the other changes
Section-level comparison for orthonormalization (CUDA C port vs. OpenACC port):
Section                          CUDA C port                OpenACC port
Redistributing wavefunctions     Host-only MPI (185 ms)     GPU-aware MPI (110 ms)
Matrix-matrix multiplications    Streamed data (19 ms)      GPU-local data (15 ms)
Cholesky decomposition           CPU-only (24 ms)           cuSolver (12 ms)
Matrix-matrix multiplications    Default scheme (30 ms)     Better blocking (13 ms)
Redistributing wavefunctions     Host-only MPI (185 ms)     GPU-aware MPI (80 ms)
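A minimal sketch of how OpenACC hands device buffers to a GPU-aware MPI library, as used for the wavefunction redistribution: host_data exposes the device addresses, so the transfer stays GPU-to-GPU (e.g. over NVLink) without staging through the host. A CUDA-aware MPI build is assumed; the buffer size and the reduction used here are illustrative only.

program gpu_aware_mpi
   use mpi
   implicit none
   integer, parameter :: n = 1024
   real(8) :: sendbuf(n), recvbuf(n)
   integer :: ierr, rank

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   sendbuf = real(rank, 8)

   !$acc data copyin(sendbuf) copyout(recvbuf)
   ! Pass device pointers to MPI: the library moves the data directly
   ! between GPUs instead of copying it back to the host first.
   !$acc host_data use_device(sendbuf, recvbuf)
   call MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)
   !$acc end host_data
   !$acc end data

   if (rank == 0) print *, 'recvbuf(1) =', recvbuf(1)
   call MPI_Finalize(ierr)
end program gpu_aware_mpi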
VASP on GPU benchmarks
Si256_VJT_HSE06
• Vacancy in Si (Ω ≅ 5200 Å³)
• 255 Si atoms (1020 e−)
• DFT/HF-hybrid functional
• Conjugate gradient
• Batched FFTs
• Explicit overlay of computation and communication using non-blocking collectives (NCCL)
• CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores
[Chart: Si256_VJT_HSE06, speedup vs. CPU]
           VASP 6   VASP NV
CPU          1.0      1.0
1 V100       4.7      4.7
2 V100       8.8      9.0
4 V100      15.7     15.9
8 V100      28.1     28.7
The OpenACC port: current limitations
• Some bottlenecks must be addressed: computation of the local potential is still done CPU-side.
• Not all features are ported yet: currently we are porting the linear response solvers and the cubic-scaling ACFDT (RPA total energies)
• Some features of VASP, e.g. cubic-scaling RPA, are very (very) memory intensive and involve diagonalization of large complex matrices (> 100k ⨉ 100k): e.g. cusolverMgSyevd
• PGI compilers only
New Release: VASP6
• …
• Cubic-scaling RPA (ACFDT, GW)
• On-the-fly machine learned force-fields
• Electron-Phonon coupling
• MPI+OpenMP
• OpenACC port
• …
• Caveat: the OpenACC port is still regarded as “experimental” at this stage
• Actively gathering feedback (from HPC sites)
• Intensive support effort
https://www.vasp.at/wiki/index.php/Category:VASP6
THE END
Special thanks to Stefan Maintz, Andreas Hehn, and Markus Wetzstein from NVIDIA and PGI!
And to Ani Anciaux-Sedrakian and Thomas Guignon at IFPEN!
And to you for listening!