First principles modeling with Octopus: massive parallelization towards petaflop computing and more

A. Castro, J. Alberdi and A. Rubio

Dec 28, 2015
Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization
Theoretical Spectroscopy

Electronic excitations:
- Optical absorption
- Electron energy loss
- Inelastic X-ray scattering
- Photoemission
- Inverse photoemission
- …

Goal: a first-principles (i.e., electronic-structure based) theoretical description of the various spectroscopies ("theoretical beamlines").

Role: interpretation of (complex) experimental findings.

[Figure: theoretical atomistic structures, and the corresponding TEM images.]

The European Theoretical Spectroscopy Facility (ETSF):
- Networking
- Integration of tools (formalism, software)
- Maintenance of tools
- Support, service, training

The octopus code is a member of a family of free-software codes developed, to a large extent, within the ETSF:
- abinit
- octopus
- dp
Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization
The octopus code

Targets:
- Optical absorption spectra of molecules, clusters, nanostructures, and solids.
- Response to lasers (non-perturbative response to high-intensity fields).
- Dichroic spectra, and other mixed (electric-magnetic) responses.
- Adiabatic and non-adiabatic molecular dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions).
- Quantum Optimal Control Theory for molecular processes.
Physical approximations and techniques:
- Density-Functional Theory and Time-Dependent Density-Functional Theory to describe the electronic structure.
  - Comprehensive set of functionals through the libxc library.
- Mixed quantum-classical systems.
- Both real-time and frequency-domain response ("Casida" and "Sternheimer" formulations).
Numerics:
- Basic representation: real-space grid.
- Usually regular and rectangular, occasionally curvilinear.
- Plane waves for some procedures (especially for periodic systems).
- Atomic orbitals for some procedures.
The derivative at a point is a sum over neighboring points. The coefficients c_ij depend on the set of points used: the stencil. More points give more precision. This is a semi-local operation.
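The stencil idea can be sketched in a few lines. This is an illustrative 1D version with periodic boundaries (Octopus itself applies 3D stencils on its real-space grid; the function and names here are ours, not the code's):

```python
import numpy as np

# Central finite-difference coefficients for the second derivative.
# A wider stencil (more neighbour points) gives a higher-order,
# more precise approximation: "more points -> more precision".
STENCILS = {
    1: [1.0, -2.0, 1.0],                # 3-point stencil, O(h^2)
    2: [-1/12, 4/3, -5/2, 4/3, -1/12],  # 5-point stencil, O(h^4)
}

def laplacian_1d(f, h, half_width=2):
    """Apply a finite-difference Laplacian to f on a periodic grid of
    spacing h. Each output value is a weighted sum over neighbouring
    points: a semi-local operation."""
    coeffs = STENCILS[half_width]
    out = np.zeros_like(f)
    for shift, c in zip(range(-half_width, half_width + 1), coeffs):
        out += c * np.roll(f, -shift)  # contributes c * f[i + shift]
    return out / h**2
```

For f(x) = sin(x) the result approaches -sin(x), and the error shrinks as the stencil widens, which is exactly the precision/locality trade-off described above.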
The key equations:
- Ground-state DFT: the Kohn-Sham equations.
- Time-dependent DFT: the time-dependent Kohn-Sham equations.
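The equations themselves appeared as images on the slide; their standard forms (in Hartree atomic units) are:

```latex
% Ground-state Kohn-Sham equations
\left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r})
      + v_{\mathrm{H}}[n](\mathbf{r}) + v_{\mathrm{xc}}[n](\mathbf{r})\right]
  \varphi_i(\mathbf{r}) = \varepsilon_i\,\varphi_i(\mathbf{r})

% Time-dependent Kohn-Sham equations
i\,\frac{\partial}{\partial t}\varphi_i(\mathbf{r},t) =
  \left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r},t)
        + v_{\mathrm{H}}[n](\mathbf{r},t) + v_{\mathrm{xc}}[n](\mathbf{r},t)\right]
  \varphi_i(\mathbf{r},t)

% with the density built from the occupied orbitals
n(\mathbf{r},t) = \sum_i^{\mathrm{occ}} |\varphi_i(\mathbf{r},t)|^2
```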
Key numerical operations:
- Linear systems with sparse matrices.
- Eigenvalue systems with sparse matrices.
- Non-linear eigenvalue systems.
- Propagation of "Schrödinger-like" equations.

The dimension can go up to 10 million points, and the storage needs up to 10 GB.
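To illustrate the last operation: one common way to propagate a Schrödinger-like equation is to approximate the exponential propagator exp(-iHΔt) by a truncated Taylor series, one of several expansions used in real-time TDDFT codes. A minimal dense-matrix sketch (in practice H is a sparse finite-difference operator applied matrix-free):

```python
import numpy as np

def propagate_taylor(H, psi, dt, order=4):
    """Approximate psi(t+dt) = exp(-i H dt) psi with a truncated Taylor
    expansion. H is a dense Hermitian NumPy array here for clarity only;
    a real code never builds the Hamiltonian matrix explicitly."""
    result = psi.astype(complex).copy()
    term = psi.astype(complex).copy()
    for n in range(1, order + 1):
        term = (-1j * dt / n) * (H @ term)  # accumulates (-i H dt)^n / n! psi
        result = result + term
    return result
```

Small time steps keep the truncation error, of order Δt^(order+1), under control.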
Use of libraries:
- BLAS, LAPACK
- GNU GSL mathematical library
- FFTW
- NetCDF
- ETSF input/output library
- libxc exchange and correlation library
- Other optional libraries
www.tddft.org/programs/octopus/
Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization
Objective

- Reach petaflop computing with a scientific code.
- Simulate the light absorption step of photosynthesis in chlorophyll.
Multi-level parallelization

- MPI level: Kohn-Sham states, real-space domains.
- In-node level: OpenMP threads and vectorization (CPU), OpenCL tasks (GPU).
Target systems: massive numbers of execution units
- Multi-core processors with vector FPUs
- IBM Blue Gene architecture
- Graphics processing units
High Level Parallelization
MPI parallelization
Parallelization by states/orbitals
- Assign each processor a group of states.
- Time propagation is independent for each state.
- Little communication is required.
- Limited by the number of states in the system.
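A minimal sketch of how such a block distribution of states over processes might look (the function name and layout are illustrative, not Octopus's actual routine):

```python
def distribute_states(n_states, n_procs):
    """Assign each process a contiguous block of Kohn-Sham state
    indices, spreading any remainder over the first ranks."""
    base, rem = divmod(n_states, n_procs)
    groups, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < rem else 0)  # first `rem` ranks get one extra
        groups.append(list(range(start, start + size)))
        start += size
    return groups
```

Each rank then propagates only its own block, which is why the scheme needs so little communication, and also why it cannot use more ranks than there are states.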
Domain parallelization
- Assign each processor a set of grid points.
- Partitioning libraries: Zoltan or METIS.
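For illustration, here is a 1D analogue of the domain decomposition (real codes partition the 3D grid with Zoltan or METIS; this helper is made up): each domain is padded with ghost points holding neighbour data, which must be exchanged before a finite-difference stencil can be applied at the domain boundaries.

```python
import numpy as np

def split_domain(n_points, n_procs, ghost=1):
    """Partition a 1D grid of n_points into contiguous domains, each
    extended by `ghost` points on either side. The ghost regions are
    what makes the stencil's semi-local operation need communication."""
    edges = np.linspace(0, n_points, n_procs + 1, dtype=int)
    return [(max(0, int(edges[r]) - ghost),
             min(n_points, int(edges[r + 1]) + ghost))
            for r in range(n_procs)]
```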
Main operations in domain parallelization

Low-level parallelization and vectorization: OpenMP and GPU
Two approaches
OpenMP:
- Thread programming based on compiler directives.
- In-node parallelization.
- Little memory overhead compared to MPI.
- Scaling limited by memory bandwidth.
- Multithreaded BLAS and LAPACK.

OpenCL:
- Hundreds of execution units.
- High memory bandwidth, but with long latency.
- Behaves like a vector processor (vector length > 16).
- Separate memory: data must be copied from/to main memory.
Supercomputers
- Corvo cluster: x86_64
- VARGAS (at IDRIS): Power6, 67 teraflops
- MareNostrum: PowerPC 970, 94 teraflops
- Jugene: 1 petaflop
Test Results
Laplacian operator
Comparison of the performance of the finite-difference Laplacian operator:
- The CPU uses 4 threads.
- The GPU is 4 times faster.
- Cache effects are visible.
Time propagation

Comparison of the performance of a time propagation:
- Fullerene molecule.
- The GPU is 3 times faster.
- Limited by memory copies and non-GPU code.
Multi-level parallelization
- Chlorophyll molecule: 650 atoms.
- Jugene Blue Gene/P.
- Sustained throughput: > 6.5 teraflops.
- Peak throughput: 55 teraflops.

Scaling
Scaling (II)
Comparison of two atomic systems on Jugene.
Target system

- Jugene, all nodes: 294,912 processor cores = 73,728 nodes; maximum theoretical performance of 1 petaflop (1002 teraflops).
- 5879-atom chlorophyll system: the complete molecule of spinach chlorophyll.
Test systems

- Smaller molecules: 180, 441, 650, and 1365 atoms.
- Partitions of two machines: Jugene and Corvo.
Profiling

- Profiled within the code itself.
- Profiled with the Paraver tool (www.bsc.es/paraver).

Trace structure: one TD iteration contains several "inner" iterations (non-blocking Ireceive/Isend/Iwait communication), followed by the Poisson solver (2x Alltoall, Allgather, Allgather, Scatter).
Improvements

Memory improvements in the ground state (GS):
- Split the memory among the nodes.
- Use of ScaLAPACK.

Improvements in the Poisson solver for time-dependent (TD) runs:
- Pipelined execution: run the Poisson solver while the propagation continues with an approximation.
- Use of new algorithms such as FFM.
- Use of parallel FFTs.
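To illustrate why FFTs are attractive here, below is the serial core of an FFT-based Poisson solve for the Hartree potential on a periodic 1D grid, in atomic units. This is a sketch only: the real solver works in 3D, may be parallelized over the FFT, and for finite systems needs a free-space (non-periodic) correction.

```python
import numpy as np

def hartree_fft(rho, box_length):
    """Solve nabla^2 v = -4 pi rho on a periodic grid via FFT.
    In reciprocal space the equation becomes algebraic:
    v(k) = 4 pi rho(k) / k^2. The k = 0 mode is dropped,
    which assumes a charge-neutral cell."""
    n = rho.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=box_length / n)  # angular wavenumbers
    rho_k = np.fft.fft(rho)
    v_k = np.zeros_like(rho_k)
    mask = k != 0
    v_k[mask] = 4 * np.pi * rho_k[mask] / k[mask] ** 2
    return np.fft.ifft(v_k).real
```

For rho(x) = cos(x) on [0, 2π) the exact solution v(x) = 4π cos(x) is recovered to machine precision, since the spectral solve is exact for band-limited data.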
Conclusions
- The Kohn-Sham scheme is inherently parallel.
- This can be exploited for parallelization and vectorization.
- It is well suited to current and future computer architectures.
- Theoretical improvements enable the modeling of large systems.