VASP: A Case Study for Accelerating Plane Wave DFT Codes · on-demand.gputechconf.com/...vasp-accelerating-plane-wave-dft-codes.pdf

VASP: A CASE STUDY FOR ACCELERATING PLANE WAVE DFT CODES

Presenters: Sarah Tariq and Przemyslaw Tredak

Authors: Jeroen Bedorf, Przemyslaw Tredak, Dusan Stosic, Arash Ashari, Paul Springer, Darko Stosic, Sarah Tariq, Paul Fleurat-Lessard and Anciaux Sedrakian (Ens-lyon, IFPEN), Maxwell Hutchinson (University of Chicago) and Michael Widom (CMU)

GPU VASP COLLABORATION

Project scope: minimization algorithms to calculate the electronic ground state

— Blocked Davidson (ALGO = NORMAL & FAST)

— RMM-DIIS (ALGO = VERYFAST & FAST)

Earlier work

— Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)

— VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)

— Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)

VASP OVERVIEW

Atomic scale materials modeling from first principles

Simulate atoms (mostly solids/surfaces)

Liquids, crystals, magnetism, semiconductor/insulators, surfaces, catalysts

Solve many-body Schrödinger equation

Density Functional Theory (DFT): Kohn-Sham equations

Optionally add exact-exchange using Hybrid Hartree Fock functionals (HF)

THEORY

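Self-consistent Kohn-Sham system

— Self-consistency loop until convergence

— Compute Kohn-Sham potential $v_{KS}(\mathbf{r})$

— Solve Kohn-Sham eigenproblem

— Obtain electronic density $n(\mathbf{r})$

Kohn-Sham eigenproblem

— Diagonalize Hamiltonian matrix $H_{KS}$

— Problem: often $H_{KS}$ is very big

— Solution: Iterative matrix diagonalization schemes

— Blocked Davidson, RMM-DIIS

— Find the lowest few eigenstates $\varphi_i$ of $H_{KS}$

[Flowchart: start from an initial density $n_0(\mathbf{r})$; compute $v_{KS}(\mathbf{r})$; solve $H_{KS}\,\varphi_i(\mathbf{r}) = E_i\,\varphi_i(\mathbf{r})$; form $n(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2$; stop? If yes, end; if no, repeat the loop.]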

SIMILARITIES IN PW DFT CODES

Rely heavily on math libraries BLAS and FFT

— Easily offloaded using cuBLAS and cuFFT

Don’t need to write a lot of specialized routines

— Focus is on keeping GPU busy, and reducing communication instead of optimizing kernels

TARGET WORKLOADS

Silica

— 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24)

— RMM-DIIS (ALGO = VERYFAST)

NiAl-MD

— Liquid metal molecular dynamics sample of Nickel-based superalloy

— 500 atoms, 9 chemical species

— Blocked Davidson (ALGO = NORMAL)

VERSION AND HARDWARE

The GPU port is on VASP version 5.2.12

Code accelerated includes RMM-DIIS and Blocked Davidson routines and also exact-exchange work from CMU

We have run the code on Fermi and Kepler boards

The code has been tested for functional correctness on more than 25 benchmarks

We present performance results on 2 benchmarks at the end of this presentation

OPTIMIZATION DETAILS

RUNTIME DISTRIBUTION FOR SILICA

[Figure: bar chart of time in seconds (axis 0 to 3500) for 1 K40 GPU + 1 IvyBridge core. Labels: Optimized GPU port, original GPU port, CPU, Memcopy, Gemm, FFT, Other.]

OUTLINE

Reduce communication

Port more work to the GPU

Optimize for small benchmarks

Batch work

Improve MPI scaling

REDUCE COMMUNICATION

PCIe Bus

— K40: 288 GB/s theoretical peak memory bandwidth on chip

— PCIe Gen3: 16 GB/s theoretical peak per direction
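To put the two peak figures side by side, the sketch below computes their ratio and an idealized transfer time. It is only arithmetic on the numbers quoted above. The function names are hypothetical, 1 GB is taken as 1e9 bytes, and real transfers fall short of theoretical peak.

```cpp
#include <cassert>
#include <cstddef>

// Idealized time in milliseconds to move `bytes` at `gb_per_s`
// (1 GB = 1e9 bytes). Ignores latency and protocol overhead.
double transfer_ms(std::size_t bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return static_cast<double>(bytes) / (gb_per_s * 1e9) * 1e3;
}

// How many times faster device memory is than the PCIe link at peak.
double bandwidth_ratio(double device_gb_per_s, double pcie_gb_per_s) {
    assert(pcie_gb_per_s > 0.0);
    return device_gb_per_s / pcie_gb_per_s;
}
```

With the slide's numbers, `bandwidth_ratio(288.0, 16.0)` is 18: at peak, one byte crossing the PCIe bus costs as much time as 18 bytes moved within GPU memory, which is why the following slides concentrate on avoiding transfers.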

REDUCE COMMUNICATION – EDDRM AND EDDIAG

Overlap transfers with compute by passing stream index into pipeline of FFT subroutines

[Figure, before: FFT and Memcopy on the default stream over time, with unnecessary idle time between them.]

[Figure, after: FFT and Memcopy spread over Stream 1, Stream 2 and Stream 3 over time.]

Much better GPU utilization – 40% speedup in EDDRM and 144% in EDDIAG!
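A rough way to see why streams help is to compare a serialized schedule with an idealized two-stage pipeline. The sketch below is a timing model only, not the VASP code: the function names are hypothetical, and it assumes every chunk has the same copy and FFT time and that copies can run concurrently with kernels.

```cpp
#include <algorithm>
#include <cassert>

// Everything on one stream: each chunk's copy and FFT run back to back.
double serial_time(int chunks, double copy_ms, double fft_ms) {
    assert(chunks >= 0);
    return chunks * (copy_ms + fft_ms);
}

// Idealized two-stage pipeline: while one chunk is transformed, the next
// chunk is copied, so only the slower stage limits throughput.
double overlapped_time(int chunks, double copy_ms, double fft_ms) {
    assert(chunks >= 0);
    if (chunks == 0) return 0.0;
    return copy_ms + fft_ms + (chunks - 1) * std::max(copy_ms, fft_ms);
}
```

In this model the benefit is largest when copy and compute times are similar, and it disappears when one stage dominates completely.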

REDUCE COMMUNICATION – EDDIAG

[Figure: profiler timelines, before and after.]

REDUCE COMMUNICATION – FORCE AND STRESS

[Figure sequence: timelines with FFT, Memcopy (HtoD, DtoH), CPU and GPU rows over time, one per step below.]

— Starting point: downstream CPU work after each FFT forces HtoD and DtoH copies. Memory copies taking more time than the kernel!

— Port downstream CPU work to GPU

— Remove unnecessary memory copies

— When possible, initialize data on the GPU

— Use streams to overlap computation and transfers

Result: 117 ms reduced to 14 ms, an 8.3x speedup over the original GPU version

REDUCE COMMUNICATION – HIGH LEVEL RMM-DIIS PORT

Typical drop-in replacement may not work well for small CPU functions

— Porting one small function in a chain of CPU functions adds an HtoD and a DtoH copy around it: slowdown!

Porting more functions and keeping data on the GPU reduces communication and improves results!

[Figure: chain of functions, first all on the CPU, then with one function on the GPU wrapped in HtoD and DtoH copies, then with consecutive functions on the GPU.]

High level RMM-DIIS port – 18% improvement!

BATCH AND STREAM WORK

GPU is massively parallel

— Need to launch sufficient work to saturate it

A single call to a zgemm of (50x50) * (50x50) only launches 2 blocks, which fit on one SM

— Not sufficient to fully utilize the GPU!

Can launch multiple independent pieces of work simultaneously

Serial calls:

for(int i=0;i<N;i++)
    cublasZgemm();

STREAMED:

for(int i=0;i<N;i++){
    cublasSetStream();
    cublasZgemm();
}

— Improved: GPU utilization

— Not improved: kernel launch overhead

BATCHED:

cublasZgemmBatched();

— A single zgemmBatched launch covers all the GEMMs

[Figure: GPU utilization (0 to 100) over time for the serial, streamed and batched variants.]

BATCH WORK – INVERSE REAL-SPACE PROJECTION

Padding with 0 required to have same sizes of all gemms

[Figure: data blocks of different sizes, the smaller ones padded with zeros to a common size.]
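The padding idea can be checked on the host. The sketch below copies a matrix into a larger zero-filled one and uses a plain triple-loop product to confirm that the meaningful block of the result is unchanged. It is an illustration under simplifying assumptions: real (not complex) values, row-major storage (cuBLAS itself is column-major), and hypothetical helper names `pad_with_zeros` and `matmul`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a rows x cols row-major matrix into the top-left corner of a
// zero-filled padded_rows x padded_cols matrix.
std::vector<double> pad_with_zeros(const std::vector<double>& a,
                                   std::size_t rows, std::size_t cols,
                                   std::size_t padded_rows,
                                   std::size_t padded_cols) {
    assert(a.size() == rows * cols);
    assert(padded_rows >= rows && padded_cols >= cols);
    std::vector<double> out(padded_rows * padded_cols, 0.0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r * padded_cols + c] = a[r * cols + c];
    return out;
}

// Plain row-major product C (m x n) = A (m x k) * B (k x n), used only to
// check that padding leaves the meaningful block of the result unchanged.
std::vector<double> matmul(const std::vector<double>& a,
                           const std::vector<double>& b,
                           std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<double> c(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}
```

The zero rows and columns contribute nothing to the products, so the cost of padding is wasted arithmetic and memory, traded for the ability to issue all GEMMs with one uniform size.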

BATCH WORK - RPROMU

Problem: How to easily batch it?

for i in 1..N
    for j in 1..M
        kernel<<<B,T,0,stream(i)>>>(…i,j);

[Figure: code and the resulting timeline, before and after.]

Use more grid dimensions and extract i and j from blockIdx.y and blockIdx.z

dim3 blocks(B,M,N);
kernel<<<blocks,T>>>(…);
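The index mapping behind the single launch can be modeled on the host. In the sketch below, `BlockIdx` is a stand-in for CUDA's built-in `blockIdx`, and the assignment of `j` to the y dimension and `i` to the z dimension is an assumption read off `dim3 blocks(B,M,N)`. The helper names are hypothetical, and indices are zero-based, unlike the 1-based loops on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for CUDA's built-in blockIdx, so the index logic
// can be exercised without a GPU.
struct BlockIdx { int x, y, z; };

// Mirrors what the kernel body would do with dim3 blocks(B, M, N):
// y spans the M values of j, z spans the N values of i.
void recover_loop_indices(const BlockIdx& b, int& i, int& j) {
    j = b.y;  // 0 .. M-1
    i = b.z;  // 0 .. N-1
}

// Enumerate the grid as a single launch would and count how many blocks
// land on each (i, j) pair; every pair should receive exactly B blocks.
std::vector<int> count_blocks_per_pair(int B, int M, int N) {
    assert(B > 0 && M > 0 && N > 0);
    std::vector<int> hits(static_cast<std::size_t>(M) * N, 0);
    for (int z = 0; z < N; ++z)
        for (int y = 0; y < M; ++y)
            for (int x = 0; x < B; ++x) {
                int i = 0, j = 0;
                recover_loop_indices(BlockIdx{x, y, z}, i, j);
                ++hits[static_cast<std::size_t>(i) * M + j];
            }
    return hits;
}
```

One launch then replaces N*M launches. Note that CUDA limits the y and z grid dimensions to 65535, so very large M or N would need a different mapping.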

STREAM WORK: GRAM-SCHMIDT ORTHONORMALIZATION (ORTHCH), MULTI BASIS MATRIX MATRIX MULTIPLY (LINCOM)

[Figure: profiler timelines, original and new.]

Running on K20X with 14 SMs

— Kernel launches 12 blocks

— Because of register usage can run 3 blocks per SM

— Theoretically can run 14*3 = 42 blocks

Use streams to launch multiple independent Zgemms and fill all the SMs
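The occupancy arithmetic on this slide can be written out as a small sketch. The function names are hypothetical, and the model assumes all concurrent kernels launch the same number of blocks and that nothing other than the per-SM block limit constrains residency.

```cpp
#include <cassert>

// Blocks that can be resident on the whole GPU at once.
int resident_block_slots(int sms, int blocks_per_sm) {
    assert(sms > 0 && blocks_per_sm > 0);
    return sms * blocks_per_sm;
}

// Number of equally sized kernels that must run concurrently to occupy
// every slot (ceiling division).
int kernels_to_fill(int slots, int blocks_per_kernel) {
    assert(slots >= 0 && blocks_per_kernel > 0);
    return (slots + blocks_per_kernel - 1) / blocks_per_kernel;
}
```

With 14 SMs and 3 blocks per SM there are 42 slots, so a single 12-block Zgemm occupies fewer than a third of them, and under this model about four concurrent Zgemms are needed to fill the GPU.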

MODIFY PARAMETERS TO IMPROVE BATCH SIZES

N = 2*NSIM

Increasing NSIM is an easy way to improve the performance without changing the numerical accuracy of the results

REDUCE ALLOCATION/DEALLOCATION ON GPU

Allocation / Deallocation on GPU is expensive, same as CPU

— Try to allocate once and use many times, even for temporary data

Allocations also cause expensive synchronization with the host, which introduces gaps in the GPU utilization

Allocations and deallocations may be tracked using CUDA API Trace functionality of CUDA Visual Profiler

[Figure: timelines with GPU, HtoD, DtoH, Allocate and Deallocate rows over time, annotated 1.4 ms when allocating on every call and 0.3 ms when the buffer is reused.]

Before:

cudaMalloc(…);
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
cudaFree(…);

After (allocate only when the existing buffer is too small; the per-call cudaFree is unnecessary):

if(size < size_needed)
    cudaMalloc(…);
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
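The size check can be wrapped in a small helper that reallocates only when a request outgrows the current buffer. The sketch below is a host-side illustration, not the VASP implementation: `ReusableBuffer` is a hypothetical name, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Scratch buffer that is reallocated only when a request exceeds its
// current capacity. std::malloc/std::free stand in for cudaMalloc/cudaFree.
class ReusableBuffer {
public:
    ReusableBuffer() = default;
    ReusableBuffer(const ReusableBuffer&) = delete;
    ReusableBuffer& operator=(const ReusableBuffer&) = delete;
    ~ReusableBuffer() { std::free(data_); }

    // Return storage for at least `size_needed` bytes.
    void* get(std::size_t size_needed) {
        if (size_ < size_needed) {  // the slide's size check
            std::free(data_);
            data_ = std::malloc(size_needed);
            assert(data_ != nullptr);
            size_ = size_needed;
            ++allocations_;
        }
        return data_;
    }

    // Number of real allocations performed so far.
    int allocations() const { return allocations_; }

private:
    void* data_ = nullptr;
    std::size_t size_ = 0;
    int allocations_ = 0;
};
```

Repeated calls with the same or a smaller size reuse the existing storage, so the allocation cost is paid once instead of on every call. The previous contents are not preserved when the buffer grows, which is acceptable for temporary data.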

REDUCE ALLOCATION/DEALLOCATION ON GPU - ECCP

[Figure: profiler timeline.]

REDUCE ALLOCATION/DEALLOCATION ON GPU – FORCE AND STRESS

Before: cuFFT plan create and cuFFT plan destroy appear in the timeline

Now: no plan create or destroy

REDUCE CPU WORK

PORT ADDITIONAL WORK TO THE GPU

Setup precond – 9.3x speedup

— Change from executing many times on the CPU in the new bands loop to executing only once on the GPU after the new bands loop

Potlok

— CPU: 2% of runtime

— Initial GPU: 7% of runtime

— GPU after optimizing other parts: 15% of runtime

— GPU after porting GGA (~50% of Potlok) to GPU: 6% of runtime

REMOVE UNNECESSARY CPU WORK

Example: Daxpy and Dscal in EDDRM

[Figure: three-step diagram of an FFT between K space (135K elements) and real space (1,143K elements), in which DSCAL and DAXPY operations on 1,143K elements are marked with an x and then removed.]

1.24x speedup for EDDRM routine

USING MORE CPU CORES

[Figure: runtime breakdown for SILICA on 1 K40 + 1 Ivy Bridge core: CPU 436, Memcopy 68, Gemm 120, FFT 288, Other 165. The CPU share is labeled "Left over CPU work".]

[Figure: Performance improvement with using multiple CPU cores. x-axis: cores per GPU (1, 2, 3, 4, 6); y-axis: speedup vs. 1 GPU + 1 core (0 to 3); series: 1 GPU, 2 GPUs, 4 GPUs.]

USE MULTI PROCESS SERVICE (MPS)

Performance issues with running multiple MPI ranks per GPU

— Increased MPI communication

— Each rank running in its own context on the GPU

Use the MPS functionality introduced in CUDA 5.5 to have multiple MPI ranks run on the same GPU at the same time

— Allows kernels from multiple MPI ranks to run at the same time on the GPU

USING MULTIPLE CPU CORES PER GPU

[Figure: zgemm timelines for 1 GPU + 1 core and for 1 GPU + 2 cores. With two cores, the zgemms of Context 1 (MPI rank 1) run in Time 1 and those of Context 2 (MPI rank 2) run in Time 2, separated by a context switch.]

[Figure: Performance improvement with using multiple CPU cores. x-axis: cores per GPU (1, 2, 3, 4, 6); y-axis: speedup vs. 1 core (0.8 to 3.3); series: 1 GPU, 1 GPU + MPS, 2 GPU, 2 GPU + MPS, 4 GPU, 4 GPU + MPS; annotations: 14%, 13%, 11%.]

OPTIMIZATION FOR SMALL BENCHMARKS

SMALL BENCHMARK - PROBLEMS

Launch latency, memory copies and bookkeeping relatively large part of time

Small kernels don’t saturate GPU, wasting resources

SMALL BENCHMARK - SOLUTION

Group independent parts together

Merge independent calls into one kernel

Group independent iterations together

SMALL BENCHMARK – EXAMPLE I3 LOOP

BEFORE, for each sim in nsim:

— Setup kernel arguments

— Launch Daxpy kernel

— Launch Reduction kernel

— Copy results to CPU

— Process results

AFTER:

— Setup kernel arguments, for each sim in nsim (CPU work in parallel)

— Launch Daxpy kernel

— Launch Reduction kernel

— Copy results to CPU

— Process results, for each sim in nsim (CPU work in parallel)

RESULTS FOR I3 LOOP

3.75x improvement for Pdo

— Small benchmark with only 87 ions

1.3x improvement for SILICA

SCALING

MPI SCALING

Number of GPUs    EDDIAG [seconds, scaling]    EDDRM [seconds, scaling]    ORTHCH [seconds, scaling]
1 GPU             4.2s, 100%                   6.7s, 100%                  1.5s, 100%
2 GPUs            2.8s, 75%                    3.4s, 99%                   1.5s, 50%
4 GPUs            2.7s, 39%                    1.8s, 95%                   2.4s, 15%
8 GPUs            1.9s, 27%                    0.9s, 93%                   1.4s, 13%

Compute intensive routine (EDDRM): good scaling

MPI intensive routines (EDDIAG, ORTHCH): bad scaling
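The scaling percentages appear consistent with the usual parallel efficiency, the single-GPU time divided by the number of GPUs times the multi-GPU time. The sketch below states that formula; it is an assumption about how the column was computed, the function name is hypothetical, and because the tabulated times are rounded, recomputed values can differ from the table by a point or two.

```cpp
#include <cassert>

// Parallel efficiency relative to a single GPU: t1 / (n * tn).
// 1.0 means perfect scaling; 1/n means no speedup at all.
double scaling_efficiency(double t1_seconds, double tn_seconds, int gpus) {
    assert(tn_seconds > 0.0 && gpus > 0);
    return t1_seconds / (gpus * tn_seconds);
}
```

For EDDIAG on 2 GPUs, `scaling_efficiency(4.2, 2.8, 2)` is about 0.75, matching the 75% in the table. For ORTHCH on 2 GPUs the time does not change, so the efficiency is exactly 0.5.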

OVERLAPPING MPI AND GPU WORK

Reordered such that MPI overlaps with computation

[Figure, before: GPU compute, Memcopy and MPI run one after another over time, with GPU work on the default stream.]

[Figure, after: GPU compute and Memcopy on Stream 1 and Stream 2 overlap with MPI over time.]

Hide MPI communication and memory copies.

3x improvement in Striploop in EDDIAG

PRE-ALLOCATING MEMORY IN ONE CONTIGUOUS CHUNK

VASP allocates hundreds of small buffers at the start of the RMM-DIIS iterations.

— Memory allocations require locks and syncs and can therefore be relatively expensive.

— This cost increases with multiple GPUs

Instead:

— Do a single large memory allocation

— Divide the large memory buffer over the hundreds of small buffers

— Memory allocation phase over 100x faster.
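A common way to implement this is a bump-style arena: allocate once, then hand out aligned slices. The sketch below is a host-side illustration rather than the VASP code. `Arena` and `carve` are hypothetical names, `std::malloc` stands in for the single large `cudaMalloc`, and the default 256-byte alignment is an assumption that applies to offsets within the arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// One large allocation carved into many small buffers. std::malloc stands
// in for a single cudaMalloc; sub-buffers are never freed individually.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : base_(static_cast<char*>(std::malloc(capacity))),
          capacity_(capacity) {
        assert(base_ != nullptr);
    }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;
    ~Arena() { std::free(base_); }

    // Hand out `bytes` bytes at an offset rounded up to `alignment`
    // (a power of two). Returns nullptr when the arena is exhausted.
    void* carve(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::size_t start = (used_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        used_ = start + bytes;
        return base_ + start;
    }

    // Bytes consumed so far, including alignment gaps.
    std::size_t used() const { return used_; }

private:
    char* base_;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

Each `carve` is a few integer operations with no locks or device synchronization, which is consistent with the large reduction in allocation time reported above. The trade-off is that the total size must be known or bounded up front.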

[Figure: profiler timelines, before and after.]

USING GPU DIRECT

[Figure: data path between two GPUs. Without GPU Direct: GPU, CPU, NIC, NIC, CPU, GPU. With GPU Direct: GPU, NIC, NIC, GPU.]

Use CUDA Aware MPI

— As simple as calling MPI_Send, MPI_Recv with pointers to the GPU data

Performance improvements

Number of GPUs    Time ORTHCH – without    Time ORTHCH – with    % improvement
2 GPUs            1.32s                    0.99s                 33%
4 GPUs            0.87s                    0.63s                 37%

RESULTS

RESULTS SILICA (RMM-DIIS) – VASP 5.2.2

• all results measured on K40 and dual socket sandy bridge with 8 cores per socket running at 2.9GHz

[Figure: speedup vs. single CPU socket (y-axis, 0 to 10) against number of CPU sockets (x-axis, 0 to 10). Series: CPU only (8 cores/CPU); 1 GPU : 1 CPU ratio (2-6 cores/GPU); 2 GPU : 1 CPU ratio (1-2 cores/GPU). Annotated speedups: 2.5x, 2.4x, 2.3x, 2.9x, 2.9x, 3.7x, 3.6x.]

1 node with two GPUs is faster than 10 CPU sockets (5 nodes)

RESULTS NIAL-MD (BLOCKED DAVIDSON), VASP 5.2.2

• all results measured on K40 and dual socket sandy bridge with 8 cores per socket running at 2.9GHz

• Running with more cores per GPU runs out of memory

[Figure: speedup vs. single CPU socket (y-axis, 0 to 10) against x-axis ticks 0 to 8. Legend entry recovered: 2 GPU : 1 CPU ratio (1 core/GPU). Annotated speedups: 4x, 6.9x, 4.8x, 4.9x, 3.5x, 3.4x.]

1 node with one GPU is faster than 8 CPU sockets (4 nodes)
