-
Operated by Los Alamos National Security, LLC for the U.S.
Department of Energy’s NNSA
U N C L A S S I F I E D
Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE
S. Mniszewski*, M. Cawkwell, A. Niklasson
GPU Technology Conference
San Jose, California
March 18-21, 2013
*[email protected]
-
Background: Quantum Molecular Dynamics
■ In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom
■ Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena
■ Forces and energies are calculated from an interatomic potential
■ Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions
-
Quantum-based Interatomic Potentials
■ Electronic structure of atoms and molecules is modeled explicitly
■ Most accurate and reliable descriptions of interatomic bonding
■ Their prohibitive computational cost has prevented widespread use – better algorithms and GPU architectures are important paths forward
E = 2 Tr[ρH]                         (energy)

f_i = −2 Tr[ρ ∂H/∂R_i]               (force on atom i)

[Figure: schematic of time per MD time step vs. number of atoms, comparing quantum MD with improved algorithms and GPUs against empirical pair potentials.]
■ Hamiltonian matrix H
■ The density matrix ρ is computed self-consistently from H
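As a small NumPy check of the energy expression (a sketch, not LATTE code: at zero temperature ρ projects onto the occupied eigenvectors of H, the factor of 2 accounts for spin degeneracy, and the matrix and occupation count here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
H = (A + A.T) / 2                     # symmetric test Hamiltonian

# Zero-temperature density matrix: occupy the 3 lowest orbitals.
w, V = np.linalg.eigh(H)
occ = V[:, :3]
rho = occ @ occ.T

# Band energy E = 2 Tr[rho H] equals twice the sum of occupied eigenvalues.
E = 2 * np.trace(rho @ H)
assert np.isclose(E, 2 * w[:3].sum())
```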
-
The Density Matrix Computation
■ Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
  • Due to their complexity
  • Difficulty in extracting thread-level parallelism
  • Difficulty of avoiding branching within warps
■ New approach in LATTE
  • The density matrix is computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
  • Based on a series of generalized matrix-matrix multiplications
  • Only one matrix-matrix multiplication is required per iteration
  • Maps very well to GPUs
-
The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity

[Figure: the projection polynomials f(x) = x² and f(x) = 2x − x² on the interval [0, 1], together with the composed map f8(f7(…f1(x)…)), which approaches a step function.]
Recursive Fermi operator expansion:

ρ = θ[µI − H] = lim_{i→∞} f_i[f_{i−1}[… f_0[X_0]…]]

X_0 = (εmax·I − H) / (εmax − εmin)

f_i[X_i] = X_i²         if 2 Tr[X_i] ≥ Ne
         = 2X_i − X_i²  if 2 Tr[X_i] < Ne
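A toy scalar illustration (not LATTE code): applying the two projection polynomials repeatedly drives an "eigenvalue" in (0, 1) toward 1 or 0, which is how the expansion turns X_0 into a step function of H. In the actual algorithm the branch is chosen globally by the trace test, not per eigenvalue:

```python
def f_up(x):
    """2x - x^2: pushes an eigenvalue toward 1 (occupied)."""
    return 2 * x - x * x

def f_down(x):
    """x^2: pushes an eigenvalue toward 0 (unoccupied)."""
    return x * x

# Two illustrative eigenvalues, one above and one below 0.5:
x_occ, x_unocc = 0.8, 0.2
for _ in range(8):
    x_occ, x_unocc = f_up(x_occ), f_down(x_unocc)
print(x_occ, x_unocc)   # -> approximately 1.0 and 0.0
```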
-
The GPU Implementation
■ Part of the LATTE codebase
  • Employs a semi-empirical tight-binding model of interatomic bonding based on formalisms derived from density functional theory
  • The density matrix build is by far the slowest step in the calculation
  • CPU version in Fortran 90
■ Hardware/Software Architecture
  • Keeneland* cluster at the National Institute for Computational Sciences
  • CPU: 2 Intel hex-core Xeon CPUs per node, Intel Fortran Compiler, MKL
  • GPU: 3 Nvidia M2090 GPUs (previously M2070 GPUs)
  • CUDA 4.2, CUBLAS, and a thread block size of 1024
■ Use of CUDA Features on GPUs
  • Unified Virtual Addressing
  • Peer-to-peer memory access/copy
  • Streams: sequences of commands
  • Single-thread access to all GPUs
*J.S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis,
S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, and S.
Yalamanchili, “Keeneland: Bringing heterogeneous GPU computing to
the computational science community,” IEEE Computing in Science and
Engineering, 13(5):90-5, 2011,
http://dx.doi.org/10.1109/MCSE.2011.83.
-
SP2 Algorithm Using the Hybrid CPU/GPU Approach

  Estimate εmax and εmin
  X = (εmax·I − H) / (εmax − εmin)
  TraceX = Tr[X]                         /* trace kernel on GPU */
  until converged do
    Xtmp = X
    Xtmp = X² − Xtmp                     /* CUBLAS xGEMM */
    TraceXtmp = Tr[Xtmp]                 /* trace kernel on GPU */
    if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne|
      X = X + Xtmp
      TraceX = TraceX + TraceXtmp
    else
      X = X − Xtmp
      TraceX = TraceX − TraceXtmp
  end until
  ρ = X
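This pseudocode can be sketched serially in NumPy. This is an illustrative stand-in, not LATTE's Fortran/CUDA implementation; the Gershgorin bound estimate and the trace-based convergence test are assumptions of the sketch:

```python
import numpy as np

def sp2_density_matrix(H, nelec, tol=1e-12, maxiter=100):
    """Serial NumPy sketch of the SP2 density-matrix build.

    H     : symmetric Hamiltonian matrix
    nelec : number of electrons; the converged rho satisfies 2 Tr[rho] = nelec
    """
    # Estimate spectral bounds with Gershgorin circles (no diagonalization).
    radii = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))
    emin = np.min(np.diag(H) - radii)
    emax = np.max(np.diag(H) + radii)

    # X0 = (emax*I - H) / (emax - emin): spectrum mapped, reversed, into [0, 1].
    n = H.shape[0]
    X = (emax * np.eye(n) - H) / (emax - emin)
    trX = np.trace(X)

    for _ in range(maxiter):
        Xtmp = X @ X - X                     # the one GEMM per iteration
        trXtmp = np.trace(Xtmp)
        # Pick the projection (X^2 or 2X - X^2) that moves 2 Tr[X] toward nelec.
        if abs(2*trX - 2*trXtmp - nelec) > abs(2*trX + 2*trXtmp - nelec):
            X, trX = X + Xtmp, trX + trXtmp  # X <- X^2
        else:
            X, trX = X - Xtmp, trX - trXtmp  # X <- 2X - X^2
        if abs(trXtmp) < tol:                # Tr[X^2 - X] -> 0 at idempotency
            break
    return X
```

For a Hamiltonian with a gap at the chemical potential, the result matches the density matrix built by diagonalization.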
-
SP2 Algorithm Using the Full GPU Approach
  Estimate εmax and εmin
  X = (εmax·I − H) / (εmax − εmin)
  TraceX = Tr[X]                         /* trace kernel on GPU */
  until converged do
    Xtmp = X
    Xtmp = X² − Xtmp                     /* CUBLAS xGEMM */
    TraceXtmp = Tr[Xtmp]                 /* trace kernel on GPU */
    if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne|
      X = X + Xtmp                       /* CUBLAS xAXPY */
      TraceX = TraceX + TraceXtmp
    else
      X = X − Xtmp                       /* CUBLAS xAXPY */
      TraceX = TraceX − TraceXtmp
  end until
  ρ = X
-
CUBLAS Matrix Multiplication Performance (Nvidia M2070)
Song, F., Tomov, S., Dongarra, J. "Enabling and Scaling Matrix
Computations on Heterogeneous Multi-Core and Multi-GPU Systems,"
26th ACM International Conference on Supercomputing (ICS 2012),
ACM, San Servolo Island, Venice, Italy, June, 2012.
-
Array Padding for Performance (M x M)
[Figure: average time to execute the CUBLAS xGEMM generalized matrix-matrix multiplication for M × M matrices, M = 3500–4000. (a) DGEMM, (b) SGEMM. The broken and solid lines correspond to computations performed with and without the padding of the arrays, respectively.]
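The padding itself is a simple copy into a larger zero-filled array. A minimal NumPy sketch (the alignment granularity of 16 is illustrative, not taken from LATTE):

```python
import numpy as np

def pad_for_gemm(A, multiple=16):
    """Zero-pad a square matrix so its dimension is a multiple of `multiple`.

    GEMM kernels often run fastest when the matrix dimension aligns with the
    tile/warp size, so the dimension M is rounded up before multiplying.
    """
    m = A.shape[0]
    m_pad = -(-m // multiple) * multiple   # ceiling division, then scale
    P = np.zeros((m_pad, m_pad), dtype=A.dtype)
    P[:m, :m] = A
    return P

# Padding with zeros does not change the product's top-left block:
A = np.random.rand(1000, 1000)
C_ref = A @ A
C_pad = pad_for_gemm(A) @ pad_for_gemm(A)
assert np.allclose(C_ref, C_pad[:1000, :1000])
```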
-
Performance Analysis: Density Matrix Calculation (Nvidia M2070) – Liquid Methane (10–1000 molecules)
[Figure: time per ρ calculation (s) vs. number of orbitals, in double precision, comparing diagonalization with SP2 on the CPU (algorithm 1), the hybrid CPU/GPU (algorithm 2), and the full GPU (algorithm 3). Shown on log-log axes (100–1000 orbitals) and linear axes (0–8000 orbitals).]
-
Error Analysis in Density Matrices – Idempotency Measure

[Figure: idempotency error ||ρ² − ρ||₂, in the range 10⁻¹⁶–10⁻¹³, for diagonalization and for the SP2 CPU (algorithm 1), CPU/GPU (algorithm 2), and GPU (algorithm 3) implementations.]
The SP2 algorithm has errors that are independent of system size
whereas traditional diagonalization yields errors that increase
with the number of atoms.
An exact density matrix is idempotent: ρ² = ρ

Error = ||ρ² − ρ||₂
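This measure is easy to compute with NumPy (a sketch: the 2-norm here is the spectral norm, and the diagonalization-built ρ for a random symmetric H is an illustrative test case):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
H = (A + A.T) / 2                     # symmetric test Hamiltonian

# Build rho by diagonalization: occupy the lowest half of the orbitals.
w, V = np.linalg.eigh(H)
occ = V[:, :100]
rho = occ @ occ.T

# Idempotency error ||rho^2 - rho||_2 (spectral norm).
err = np.linalg.norm(rho @ rho - rho, 2)
print(err)   # near machine precision in double precision
```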
-
Multi-GPU Generalized Matrix-Matrix Multiplication
■ Using multiple streams for sub-block matrix-matrix multiplications, additions, and matrix traces
■ Efficient reassembly of the blocked matrix via native functionality of CUBLAS DGEMM/SGEMM
[Figure: block decomposition of the matrix-matrix product across GPU 0 … GPU N−1, each GPU computing one sub-block of the result.]
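The decomposition can be emulated serially with NumPy (a sketch: each row block stands in for the work one GPU/stream would do, and the even split is an assumption of the example, not LATTE's scheme):

```python
import numpy as np

def blocked_matmul(A, B, n_gpus=3):
    """Serial stand-in for the multi-GPU GEMM: split A into row blocks,
    compute each block's product independently (one block per GPU/stream),
    then reassemble the full result."""
    blocks = np.array_split(np.arange(A.shape[0]), n_gpus)
    C = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for rows in blocks:               # each iteration = one GPU's xGEMM
        C[rows, :] = A[rows, :] @ B
    return C

A = np.random.rand(300, 300)
B = np.random.rand(300, 300)
assert np.allclose(blocked_matmul(A, B), A @ B)
```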
-
Using Streams for SP2 Sub-block Matrix Operations

■ Matrix blocks are assembled in GPU 0 and redistributed
■ Xtmp = X² − X (xGEMM)
■ Tr[Xtmp]
■ X ± Xtmp (xAXPY)

[Figure: the sub-block operations proceed concurrently, e.g. on GPU 0 – Stream 0 and GPU 1 – Stream 1.]
-
Performance Analysis: Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10–1250 molecules)
[Figure: time per density matrix build (s) vs. matrix dimension (0–10000), comparing SP2 on 1, 2, and 3 GPUs with SP2 on the CPU and with diagonalization.]
-
Performance Analysis for 1-3 GPUs
[Figure: time per density matrix build (s) vs. matrix dimension (log-log, 1000–10000) for 1, 2, and 3 GPUs, each with measured and optimal scaling curves.]
-
Summary
■ GPUs can be effectively used for the density matrix computation in quantum mechanical models
■ The recursive SP2 algorithm is well suited to the GPU architecture
■ Transfer of arrays between the CPU and GPU is a minor performance contribution
■ Array padding is important for performance
■ The GPU version of the SP2 algorithm provides comparable or better accuracy than traditional diagonalization
■ Massive speed-ups with respect to traditional algorithms have been seen with no loss of accuracy
-
Related Publications
1. Sanville EJ, Bock N, Coe J, Mniszewski SM, Niklasson AMN, Cawkwell MJ, 2010, LATTE. Los Alamos National Laboratory (LA-CC-10-004), http://savannah.nongnu.org/projects/latte.
2. Cawkwell MJ, Sanville EJ, Mniszewski SM, Niklasson AMN, 2012, Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units. J. Chem. Theory Comput., Vol. 8, Issue 11, pp. 4094-4101, http://pubs.acs.org/doi/full/10.1021/ct300442w.
3. Mniszewski SM, Cawkwell MJ, Niklasson AMN, 2013, Quantum-based Dynamics on Graphics Processing Units. 2013 Associate Directorate for Theory, Simulation, and Computation (ADTSC) Highlights (in press).