Page 1
VASP: A CASE STUDY FOR ACCELERATING PLANE WAVE DFT CODES
Presenters: Sarah Tariq and Przemyslaw Tredak
Authors: Jeroen Bedorf, Przemyslaw Tredak, Dusan Stosic, Arash Ashari, Paul Springer, Darko Stosic, Sarah Tariq, Paul Fleurat-Lessard and Anciaux-Sedrakian (ENS Lyon, IFPEN), Maxwell Hutchinson (University of Chicago) and Michael Widom (CMU)
Page 2
GPU VASP COLLABORATION
Project scope: minimization algorithms to calculate the electronic ground state
— Blocked Davidson (ALGO = NORMAL & FAST)
— RMM-DIIS (ALGO = VERYFAST & FAST)
Earlier work
— Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)
— VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)
— Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)
Page 3
VASP OVERVIEW
Atomic scale materials modeling from first principles
Simulate atoms (mostly solids/surfaces)
Liquids, crystals, magnetism, semiconductor/insulators, surfaces, catalysts
Solve many-body Schrödinger equation
Density Functional Theory (DFT): Kohn-Sham equations
Optionally add exact-exchange using Hybrid Hartree Fock functionals (HF)
Page 4
THEORY
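Self-consistent Kohn-Sham system
— Self-consistency loop until convergence
— Compute Kohn-Sham potential $v_{KS}(\mathbf{r})$
— Solve Kohn-Sham eigenproblem
— Obtain electronic density $n(\mathbf{r})$
Kohn-Sham eigenproblem
— Diagonalize Hamiltonian matrix $H_{KS}$
— Problem: $H_{KS}$ is often very large
— Solution: iterative matrix diagonalization schemes
— Blocked Davidson, RMM-DIIS
— Find the lowest few eigenstates $\varphi_i$ of $H_{KS}$
Self-consistency loop (flowchart)
— Start from an initial density $n_0(\mathbf{r})$
— Compute $v_{KS}(\mathbf{r})$
— Solve $H_{KS}\,\varphi_i(\mathbf{r}) = E_i\,\varphi_i(\mathbf{r})$
— Update $n(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2$
— Stop? If yes, end; if no, recompute $v_{KS}(\mathbf{r})$ and repeat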
Page 5
SIMILARITIES IN PW DFT CODES
Rely heavily on math libraries BLAS and FFT
— Easily offloaded using cuBLAS and cuFFT
Don’t need to write a lot of specialized routines
— Focus is on keeping GPU busy, and reducing communication instead of optimizing kernels
Page 6
TARGET WORKLOADS
Silica
— 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24)
— RMM-DIIS (ALGO = VERYFAST)
NiAl-MD
— Liquid metal molecular dynamics sample of a nickel-based superalloy
— 500 atoms, 9 chemical species
— Blocked Davidson (ALGO = NORMAL)
Page 7
VERSION AND HARDWARE
The GPU port is on VASP version 5.2.12
Code accelerated includes RMM-DIIS and Blocked Davidson routines and also exact-exchange work from CMU
We have run the code on Fermi and Kepler boards
The code has been tested for functional correctness on more than 25 benchmarks
We present performance results on 2 benchmarks at the end of this presentation
Page 8
OPTIMIZATION DETAILS
Page 9
RUNTIME DISTRIBUTION FOR SILICA
[Figure: bar chart of runtime in seconds (axis 0 to 3500) on 1 K40 GPU + 1 Ivy Bridge core, comparing the original GPU port with the optimized GPU port; segments: CPU, Memcopy, Gemm, FFT, Other]
Page 10
OUTLINE
Reduce communication
Port more work to the GPU
Optimize for small benchmarks
Batch work
Improve MPI scaling
Page 11
REDUCE COMMUNICATION
Page 12
REDUCE COMMUNICATION
PCIe bus
— K40: 288 GB/s theoretical peak on-chip memory bandwidth
— PCIe Gen3: 16 GB/s theoretical peak per direction
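To put the two peak figures side by side, here is a quick ratio using only the numbers on this slide. Both are theoretical peaks, so the ratio achieved in practice will differ.

```latex
\frac{B_{\text{K40 memory}}}{B_{\text{PCIe Gen3}}}
  = \frac{288\ \text{GB/s}}{16\ \text{GB/s}} = 18
```

Data that has to cross the bus therefore moves roughly 18 times more slowly, at best, than data that stays in GPU memory, which is why the following slides concentrate on avoiding or hiding transfers.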
Page 13
REDUCE COMMUNICATION – EDDRM AND EDDIAG
Overlap transfers with compute by passing stream index into pipeline of FFT subroutines
[Figure: timeline of FFT and Memcopy operations on the default stream, annotated "Unnecessary idle time"]
Page 14
REDUCE COMMUNICATION – EDDRM AND EDDIAG
Overlap transfers with compute by passing stream index into pipeline of FFT subroutines
Much better GPU utilization: 40% speedup in EDDRM and 144% in EDDIAG!
[Figure: timeline of FFT and Memcopy operations spread over Stream 1, Stream 2 and Stream 3]
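As a rough way to see why streams help, consider an idealized two-phase model. This is an assumption for illustration, not a measurement from VASP: the work is split into $n$ equal chunks, each needing a transfer time $t_c$ and an FFT time $t_k$, with one copy engine and one compute engine available.

```latex
T_{\text{default stream}} = n\,(t_c + t_k),
\qquad
T_{\text{streams}} \approx t_c + (n-1)\,\max(t_c, t_k) + t_k
```

For large $n$ the overlapped time approaches $n\,\max(t_c, t_k)$, so the shorter of the two phases is almost entirely hidden. Pipelines with more phases (for example separate HtoD and DtoH copies) follow the same pattern with more terms.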
Page 15
REDUCE COMMUNICATION – EDDIAG
[Figure: EDDIAG timelines, before and after]
Page 16
REDUCE COMMUNICATION – FORCE AND STRESS
Starting point: FFTs run on the GPU, downstream work runs on the CPU, and the data moves back and forth with HtoD and DtoH copies
Memory copies taking more time than the kernel!
Steps, applied in turn:
— Port downstream CPU work to GPU
— Remove unnecessary memory copies
— When possible, initialize data on the GPU
— Use streams to overlap computation and transfers
[Figure: timeline with FFT, Memcopy, CPU and GPU rows, redrawn after each step; the DtoH copies disappear first, then the HtoD copies]
Page 25
REDUCE COMMUNICATION – FORCE AND STRESS
[Figure: timelines before and after the changes]
117 ms before, 14 ms after: 8.3x speedup over the original GPU version
Page 26
REDUCE COMMUNICATION – HIGH LEVEL RMM-DIIS PORT
Typical drop-in replacement may not work well for small CPU functions
— Moving a single small function to the GPU wraps it in an HtoD and a DtoH copy: slowdown!
Porting more functions and keeping data on the GPU reduces communication and improves results!
High level RMM-DIIS port: 18% improvement!
[Figure: a sequence of three functions, first all on the CPU, then with one on the GPU between HtoD and DtoH copies, finally all three on the GPU]
Page 30
BATCH AND STREAM WORK
Page 31
BATCH WORK AND STREAM WORK
GPU is massively parallel
— Need to launch sufficient work to saturate it
A single call to a zgemm of (50x50) * (50x50) only launches 2 blocks, which fit on one SM
— Not sufficient to fully utilize the GPU!
Can launch multiple independent pieces of work simultaneously
Page 32
BATCH WORK AND STREAM WORK
Sequential
for(int i=0;i<N;i++)
    cublasZgemm();
STREAMED
for(int i=0;i<N;i++){
    cublasSetStream();
    cublasZgemm();
}
BATCHED
cublasZgemmBatched();
[Figure: timelines of four zgemm kernels (streamed) and one zgemmBatched kernel (batched), with kernel launch overhead marked and the annotations "Improved" and "Not improved"]
Page 33
BATCH WORK AND STREAM WORK
for(int i=0;i<N;i++)
cublasZgemm();
[Figure: GPU utilization (axis 0 to 100) versus time for GEMM calls issued one after another]
Page 34
BATCH WORK AND STREAM WORK
STREAMED
for(int i=0;i<N;i++){
    cublasSetStream();
    cublasZgemm();
}
[Figure: GPU utilization (axis 0 to 100) versus time for streamed GEMM calls, with kernel launch overhead marked and the annotations "Improved" and "Not improved"]
Page 35
BATCH WORK AND STREAM WORK
BATCHED
cublasZgemmBatched();
[Figure: GPU utilization (axis 0 to 100) versus time for a single batched GEMM]
Page 36
BATCH WORK – INVERSE REAL-SPACE PROJECTION
Padding with 0 is required so that all gemms have the same size
[Figure: data blocks of different sizes, padded with zeros to a common size]
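The padding idea can be checked on the host with a small sketch. It assumes column-major storage and uses a naive triple loop as a stand-in for one gemm of the batch. `Matrix`, `pad_with_zeros` and `multiply` are hypothetical helpers for illustration, not VASP or cuBLAS routines.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Column-major matrix whose leading dimension equals its row count.
struct Matrix {
    std::size_t rows, cols;
    std::vector<cplx> data;  // rows * cols entries
    cplx& at(std::size_t r, std::size_t c) { return data[c * rows + r]; }
    const cplx& at(std::size_t r, std::size_t c) const { return data[c * rows + r]; }
};

// Copy `m` into the top-left corner of a zero-filled rows x cols matrix.
Matrix pad_with_zeros(const Matrix& m, std::size_t rows, std::size_t cols) {
    assert(rows >= m.rows && cols >= m.cols);
    Matrix p{rows, cols, std::vector<cplx>(rows * cols, cplx(0.0, 0.0))};
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            p.at(r, c) = m.at(r, c);
    return p;
}

// Reference product C = A * B (host stand-in for one gemm in the batch).
Matrix multiply(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);
    Matrix c{a.rows, b.cols, std::vector<cplx>(a.rows * b.cols, cplx(0.0, 0.0))};
    for (std::size_t j = 0; j < b.cols; ++j)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t i = 0; i < a.rows; ++i)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
    return c;
}
```

Multiplying the padded matrices gives the original product in the top-left corner and zeros elsewhere, so every problem in the batch can use the same dimensions. The price is the extra arithmetic and memory spent on the zeros.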
Page 37
BATCH WORK - RPROMU
Problem: How to easily batch it?
Code before
for i in 1..N
  for j in 1..M
    kernel<<<B,T,0,stream(i)>>>(…i,j);
Use more grid dimensions and extract i and j from blockIdx.y and blockIdx.z
Code after
dim3 blocks(B,M,N);
kernel<<<blocks,T>>>(…);
[Figure: resulting timelines for both versions]
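The index mapping behind the single launch can be emulated on the host. In this sketch, `Dim3`, `loop_indices` and `visit_counts` are hypothetical stand-ins for illustration. It assumes the `dim3 blocks(B,M,N)` layout shown above, so the y dimension spans the inner loop (j) and the z dimension spans the outer loop (i).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Host-side stand-in for CUDA's dim3 / blockIdx, for illustration only.
struct Dim3 { unsigned x, y, z; };

// What the kernel would do first: recover the 1-based loop indices (i, j)
// from 0-based block coordinates, given a grid of dim3 blocks(B, M, N).
std::pair<int, int> loop_indices(const Dim3& blockIdx) {
    int j = static_cast<int>(blockIdx.y) + 1;  // y spans M, the inner loop
    int i = static_cast<int>(blockIdx.z) + 1;  // z spans N, the outer loop
    return {i, j};
}

// Emulate one launch of a (B, M, N) grid and count how many blocks
// work on each (i, j) pair. Row 0 and column 0 stay unused.
std::vector<std::vector<int>> visit_counts(unsigned B, unsigned M, unsigned N) {
    std::vector<std::vector<int>> counts(N + 1, std::vector<int>(M + 1, 0));
    for (unsigned z = 0; z < N; ++z)
        for (unsigned y = 0; y < M; ++y)
            for (unsigned x = 0; x < B; ++x) {
                std::pair<int, int> ij = loop_indices(Dim3{x, y, z});
                ++counts[ij.first][ij.second];
            }
    return counts;
}
```

Every (i, j) pair from the original double loop receives exactly B blocks, as it did with N*M separate launches. Note that CUDA limits the y and z grid dimensions to 65535 each, so very large M or N would still need to be split.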
Page 40
STREAM WORK: GRAM-SCHMIDT ORTHONORMALIZATION (ORTHCH), MULTI BASIS MATRIX MATRIX MULTIPLY (LINCOM)
Running on K20X with 14 SMs
— Kernel launches 12 blocks
— Because of register usage, can run 3 blocks per SM
— Theoretically can run 14*3 = 42 blocks
Use streams to launch multiple independent Zgemms and fill all the SMs
[Figure: timelines, original and new]
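The numbers on this slide give a back-of-the-envelope estimate. It counts resident block slots only and ignores every other limit on concurrency, so treat it as a sketch.

```latex
\text{slots} = 14 \times 3 = 42,
\qquad
\frac{12}{42} \approx 29\%,
\qquad
\left\lfloor \frac{42}{12} \right\rfloor = 3
```

A single 12-block kernel occupies under a third of the available block slots. Three such kernels running concurrently in separate streams would occupy 36 of the 42.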
Page 41
MODIFY PARAMETERS TO IMPROVE BATCH SIZES
N = 2*NSIM
Increasing NSIM is an easy way to improve the performance without changing the numerical accuracy of the results
Page 42
REDUCE ALLOCATION / DEALLOCATION ON GPU
Page 43
REDUCE ALLOCATION/DEALLOCATION ON GPU
Allocation / Deallocation on GPU is expensive, same as CPU
— Try to allocate once and use many times, even for temporary data
Allocations also cause expensive synchronization with the host, which introduces gaps in GPU utilization
Allocations and deallocations may be tracked using CUDA API Trace functionality of CUDA Visual Profiler
Page 44
REDUCE ALLOCATION/DEALLOCATION ON GPU
cudaMalloc(…);
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
cudaFree(…);
[Figure: timeline with Allocate, HtoD, GPU, DtoH and Deallocate segments]
Page 45
REDUCE ALLOCATION/DEALLOCATION ON GPU
Before
cudaMalloc(…);
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
cudaFree(…);
After
if(size < size_needed) {
    cudaFree(…);
    cudaMalloc(…);
}
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
[Figure: timeline with HtoD, GPU and DtoH segments only]
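The "allocate once, reuse many times" pattern can be sketched as a grow-only scratch buffer. To keep the sketch runnable without a GPU, `std::malloc` and `std::free` stand in for `cudaMalloc` and `cudaFree`. `ScratchBuffer` is a hypothetical helper, not part of VASP or the CUDA runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Grow-only scratch buffer. std::malloc/std::free stand in for
// cudaMalloc/cudaFree so the sketch runs on the host.
class ScratchBuffer {
public:
    ScratchBuffer() = default;
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;
    ~ScratchBuffer() { std::free(ptr_); }

    // Return a buffer of at least `size_needed` bytes, reallocating
    // only when the current buffer is too small.
    void* get(std::size_t size_needed) {
        if (size_ < size_needed) {
            std::free(ptr_);
            ptr_ = std::malloc(size_needed);
            size_ = ptr_ ? size_needed : 0;
            ++allocations_;
        }
        return ptr_;
    }

    // Number of real allocations performed so far.
    std::size_t allocations() const { return allocations_; }

private:
    void* ptr_ = nullptr;
    std::size_t size_ = 0;
    std::size_t allocations_ = 0;
};
```

Calling `get` in a loop with similar sizes allocates on the first iteration only. Later iterations reuse the same pointer, which removes the allocation and deallocation gaps from the timeline.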
Page 46
REDUCE ALLOCATION/DEALLOCATION ON GPU - ECCP
[Figure: ECCP timelines, 1.4 ms and 0.3 ms, with part of the longer timeline marked "Unnecessary"]
Page 47
REDUCE ALLOCATION/DEALLOCATION ON GPU – FORCE AND STRESS
[Figure: timeline with cuFFT plan create and cuFFT plan destroy segments]
Now: no plan create or destroy
Page 49
PORT ADDITIONAL WORK TO THE GPU
Setup precond: 9.3x speedup
— Change from executing many times on the CPU in the new bands loop to executing only once on the GPU after the new bands loop
Potlok
— CPU: 2% of runtime
— Initial GPU: 7% of runtime
— GPU after optimizing other parts: 15% of runtime
— GPU after porting GGA (~50% of Potlok) to GPU: 6% of runtime
Page 50
REMOVE UNNECESSARY CPU WORK
Example: Daxpy and Dscal in EDDRM
[Figure: K space (135K elements) and real space (1,143K elements) connected by an FFT, with DSCAL and DAXPY steps; one DSCAL and one DAXPY are crossed out and then removed]
1.24x speedup for EDDRM routine
Page 53
USING MORE CPU CORES
SILICA, 1 K40 + 1 Ivy Bridge core
[Figure: runtime breakdown: CPU 436, Memcopy 68, Gemm 120, FFT 288, Other 165; the CPU share is labeled "Left over CPU work"]
Page 54
USING MORE CPU CORES
[Figure: Performance improvement with using multiple CPU cores. Speedup vs. 1 GPU + 1 core (axis 0 to 3) against cores per GPU (1, 2, 3, 4, 6); series: 1 GPU, 2 GPUs, 4 GPUs]
Page 55
USE MULTI PROCESS SERVICE (MPS)
Performance issues with running multiple MPI ranks per GPU
— Increased MPI communication
— Each rank running in its own context on the GPU
Use the MPS functionality introduced in CUDA 5.5 to have multiple MPI ranks run on the same GPU at the same time
— Allows kernels from multiple MPI ranks to run at the same time on the GPU
Page 56
USING MULTIPLE CPU CORES PER GPU
[Figure: zgemm timelines for 1 GPU + 1 core and for 1 GPU + 2 cores. With 2 cores, Context 1 (MPI rank 1) runs in Time 1 and Context 2 (MPI rank 2) runs in Time 2, separated by a context switch]
Page 57
USING MULTIPLE CPU CORES PER GPU
[Figure: Performance improvement with using multiple CPU cores. Speedup vs. 1 core (axis 0.8 to 3.3) against cores per GPU (1, 2, 3, 4, 6); series: 1 GPU, 1 GPU + MPS, 2 GPU, 2 GPU + MPS, 4 GPU, 4 GPU + MPS; annotations: 14%, 13%, 11%]
Page 58
OPTIMIZATION FOR SMALL BENCHMARKS
Page 59
SMALL BENCHMARK - PROBLEMS
Launch latency, memory copies and bookkeeping are a relatively large part of the runtime
Small kernels don’t saturate GPU, wasting resources
Page 60
SMALL BENCHMARK - SOLUTION
Group independent parts together
Merge independent calls into one kernel
Group independent iterations together
Page 61
SMALL BENCHMARK – EXAMPLE I3 LOOP
BEFORE
— For each sim in nsim: setup kernel arguments, launch Daxpy kernel, launch Reduction kernel, copy results to CPU, process results
AFTER
— For each sim in nsim (CPU work in parallel): setup kernel arguments
— Launch Daxpy kernel
— Launch Reduction kernel
— Copy results to CPU
— For each sim in nsim (CPU work in parallel): process results
Page 62
RESULTS FOR I3 LOOP
3.75x improvement for Pdo
— Small benchmark with only 87 ions
1.3x improvement for SILICA
Page 64
MPI SCALING
Number of GPUs   EDDIAG [seconds, scaling]   EDDRM [seconds, scaling]   ORTHCH [seconds, scaling]
1 GPU            4.2s, 100%                  6.7s, 100%                 1.5s, 100%
2 GPUs           2.8s, 75%                   3.4s, 99%                  1.5s, 50%
4 GPUs           2.7s, 39%                   1.8s, 95%                  2.4s, 15%
8 GPUs           1.9s, 27%                   0.9s, 93%                  1.4s, 13%
Compute intensive routine (EDDRM): good scaling
MPI intensive routines (EDDIAG, ORTHCH): bad scaling
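The scaling percentages appear consistent with the usual definition of parallel efficiency. That definition is an assumption here, since the slide does not state how the column was computed. With $T_N$ the time on $N$ GPUs:

```latex
E(N) = \frac{T_1}{N\,T_N},
\qquad
E_{\text{EDDIAG}}(2) = \frac{4.2}{2 \times 2.8} = 0.75,
\qquad
E_{\text{ORTHCH}}(2) = \frac{1.5}{2 \times 1.5} = 0.50
```

Because the listed times are rounded to one decimal, recomputing other entries from them can differ from the table by a percentage point or two.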
Page 65
OVERLAPPING MPI AND GPU WORK
Reordered such that MPI overlaps with computation
[Figure: timeline with GPU compute, Memcopy and MPI on the default stream]
Page 66
OVERLAPPING MPI AND GPU WORK
Reordered such that MPI overlaps with computation
Hide MPI communication and memory copies.
3x improvement in Striploop in EDDIAG
[Figure: timeline with GPU compute, Memcopy and MPI spread over Stream 1 and Stream 2]
Page 67
PRE-ALLOCATING MEMORY IN ONE CONTIGUOUS CHUNK
VASP allocates hundreds of small buffers at the start of the RMM-DIIS iterations.
— Memory allocations require locks and syncs and can therefore be relatively expensive.
— This cost increases with multiple GPUs
Instead:
— Do a single large memory allocation
— Divide the large memory buffer over the hundreds of small buffers
— Memory allocation phase over 100x faster.
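The "one large allocation, many small buffers" idea amounts to a bump allocator. The sketch below runs on the host: `std::malloc` stands in for the single large `cudaMalloc`, and `Arena` and `carve` are hypothetical names for illustration. The default 256-byte alignment is an assumption chosen to mirror the alignment `cudaMalloc` guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Bump allocator over one large block. std::malloc stands in for a
// single cudaMalloc so the sketch runs on the host.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : base_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(base_ ? capacity : 0) {}
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;
    ~Arena() { std::free(base_); }

    // Hand out `bytes` bytes at an offset that is a multiple of
    // `alignment` (a power of two), or nullptr if the arena is exhausted.
    // Offsets are aligned relative to the start of the block.
    void* carve(std::size_t bytes, std::size_t alignment = 256) {
        std::size_t start = (offset_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        offset_ = start + bytes;
        return base_ + start;
    }

    // Make the whole block available again without freeing it.
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

Each `carve` call is just pointer arithmetic, so handing out hundreds of small buffers costs one real allocation in total. The trade-off is that individual buffers cannot be freed on their own, only the arena as a whole.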
Page 68
USING GPU DIRECT
[Figure: data path between two GPUs. BEFORE: GPU, CPU, NIC, NIC, CPU, GPU. AFTER: GPU, NIC, NIC, GPU]
Page 69
USING GPU DIRECT
Use CUDA Aware MPI
— As simple as calling MPI_Send, MPI_Recv with pointers to the GPU data
Performance improvements
Number of GPUs   Time ORTHCH without   Time ORTHCH with   % improvement
2 GPUs           1.32s                 0.99s              33%
4 GPUs           0.87s                 0.63s              37%
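The improvement column appears to be the relative speedup of the run with GPU Direct over the run without it. This reading is an assumption, since the slide does not define the column.

```latex
\text{improvement} = \frac{T_{\text{without}}}{T_{\text{with}}} - 1,
\qquad
\frac{1.32}{0.99} - 1 \approx 0.33
```

Applying the same formula to the rounded 4 GPU times gives about 0.38, close to the listed 37%.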
Page 71
RESULTS SILICA (RMM-DIIS) – VASP 5.2.2
• All results measured on K40 and dual socket Sandy Bridge with 8 cores per socket running at 2.9GHz
[Figure: speedup vs. single CPU socket (axis 0 to 10) against number of CPU sockets (axis 0 to 10); series: 2 GPU : 1 CPU ratio (1-2 cores/GPU), CPU only (8 cores/CPU), 1 GPU : 1 CPU ratio (2-6 cores/GPU); data labels: 2.5x, 2.4x, 2.3x, 2.9x, 2.9x, 3.7x, 3.6x]
Page 72
RESULTS SILICA (RMM-DIIS) – VASP 5.2.2
• All results measured on K40 and dual socket Sandy Bridge with 8 cores per socket running at 2.9GHz
1 node with two GPUs is faster than 10 CPU sockets (5 nodes)
[Figure: speedup vs. single CPU socket (axis 0 to 10) against number of CPU sockets (axis 0 to 10); series: 2 GPU : 1 CPU ratio (1-2 cores/GPU), CPU only (8 cores/CPU), 1 GPU : 1 CPU ratio (2-6 cores/GPU)]
Page 73
RESULTS NIAL-MD (BLOCKED DAVIDSON), VASP 5.2.2
• All results measured on K40 and dual socket Sandy Bridge with 8 cores per socket running at 2.9GHz
• Running with more cores per GPU runs out of memory
[Figure: speedup vs. single CPU socket (axis 0 to 10) against number of CPU sockets (axis 0 to 8); series: 2 GPU : 1 CPU ratio (1 core/GPU), CPU only (8 cores/CPU), 1 GPU : 1 CPU ratio (1 core/GPU); data labels: 4x, 6.9x, 4.8x, 4.9x, 3.5x, 3.4x]
Page 74
RESULTS NIAL-MD (BLOCKED DAVIDSON), VASP 5.2.2
• All results measured on K40 and dual socket Sandy Bridge with 8 cores per socket running at 2.9GHz
• Running with more cores per GPU runs out of memory
1 node with one GPU is faster than 8 CPU sockets (4 nodes)
[Figure: speedup vs. single CPU socket (axis 0 to 10) against number of CPU sockets (axis 0 to 8); series: 2 GPU : 1 CPU ratio (1 core/GPU), CPU only (8 cores/CPU), 1 GPU : 1 CPU ratio (1 core/GPU)]