Wavelet-Based DFT Calculations on Massively Parallel Hybrid Architectures
Luigi Genovese
L_Sim – CEA Grenoble
February 7, 2012
Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese
A basis for nanosciences: the BigDFT project
STREP European project: BigDFT (2005-2008)
Four partners, 15 contributors: CEA-INAC Grenoble (T. Deutsch), U. Basel (S. Goedecker), U. Louvain-la-Neuve (X. Gonze), U. Kiel (R. Schneider)
Aim: to develop an ab initio DFT code based on Daubechies wavelets, to be integrated in ABINIT.
BigDFT 1.0 → January 2008
In this presentation
Present HPC scenario
GPU exploitation in BigDFT runs
Considerations of interest for electronic structure calculations
Outline
1 Parallel computing and architectures
  From past to present: software
  HPC nowadays
  Memory bottleneck
2 (DFT) Developer point of view
  Present situation
  Optimization
3 User viewpoint
  Frequent mistakes
  Performance evaluation
  Material accelerators: evaluating GPU gain
  Practical cases
4 Conclusion and messages
Moore’s law
40 years of improvements: transistor counts double every two years. . .
. . . but how?
Power is the limiting factor (around 100 W nowadays)
Power ∝ Frequency³
* Clock rate is limited
* Multiple slower devices are preferable to one superfast device
* More performance with less power → a software problem?
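The argument above can be made concrete with a toy model. The cubic power law and the 100 W budget come from the slide; the prefactors and the perfect-scaling assumption are illustrative only:

```python
# Toy model: dynamic power of a core scales roughly as P ∝ f^3 (the
# slide's assumption), while its throughput scales linearly with f.
def power(f, p0=100.0):
    """Power draw of one core at relative frequency f (f = 1.0 -> p0 watts)."""
    return p0 * f**3

def throughput(n_cores, f):
    """Aggregate throughput, assuming ideal parallel scaling (optimistic)."""
    return n_cores * f

# One core at full clock vs. two cores at half clock:
one_fast = (throughput(1, 1.0), 1 * power(1.0))   # throughput 1.0, 100.0 W
two_slow = (throughput(2, 0.5), 2 * power(0.5))   # throughput 1.0,  25.0 W
print(one_fast, two_slow)
```

Under these assumptions the two slow cores deliver the same throughput at a quarter of the power, which is exactly why the burden shifts to software: the application must actually exploit the extra cores.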
✗ CPU efficiency is poor (the calculation is too fast)
✗ Amdahl's law is unfavourable (5× speedup at most)
✓ GPU speedup is almost independent of the system size
✓ Users' keywords: robustness and reliability
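The 5× ceiling quoted above is a direct consequence of Amdahl's law. A minimal sketch, assuming a serial fraction of 20% (an illustrative value chosen to reproduce a 5× asymptotic speedup, not a figure from the slide):

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: overall speedup on n workers when only
    `parallel_fraction` of the runtime is accelerated."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)

# With 20% serial work the speedup saturates at 1 / 0.2 = 5x, no matter
# how many accelerators attack the parallel part:
for n in (1, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.80, n), 2))
```

This is why a GPU port that accelerates only the convolution kernels pays off far less than its raw kernel speedup suggests: the remaining serial fraction dominates.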
Hybrid and Heterogeneous runs with OpenCL
NVidia S2070 and ATI HD 6970, each connected to a Nehalem workstation
BigDFT may run on both
Sample BigDFT run: graphene, 4 C atoms, 52 k-points
No. of flop: 8.053 · 10¹²

MPI        1      1      4      1      4      8
GPU        no     NV     NV     ATI    ATI    NV + ATI
Time (s)   6020   300    160    347    197    109
Speedup    1      20.07  37.62  17.35  30.55  55.23
GFlop/s    1.34   26.84  50.33  23.2   40.87  73.87
Next step: handling of load (un)balancing
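The speedup and GFlop/s rows above follow directly from the fixed flop count and the wall times; a small cross-check (all numbers copied from the table, configuration labels are mine):

```python
# Cross-check of the benchmark table: the total work is fixed at
# 8.053e12 flop, so speedup = t_ref / t and rate = flop / t.
FLOP = 8.053e12
times = {"1 MPI, CPU": 6020, "1 MPI, NV": 300, "4 MPI, NV": 160,
         "1 MPI, ATI": 347, "4 MPI, ATI": 197, "8 MPI, NV+ATI": 109}

t_ref = times["1 MPI, CPU"]
for cfg, t in times.items():
    speedup = t_ref / t
    gflops = FLOP / t / 1e9
    print(f"{cfg:>14}: {speedup:6.2f}x  {gflops:6.2f} GFlop/s")
```

Running this reproduces the table's last two rows to within rounding, which confirms that the quoted speedups are plain wall-time ratios against the single-CPU run.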
A rapidly evolving situation
Architecture evolutions
  Manycore era (multilevel parallelisation)
  Memory traffic as the limiting factor
Software evolutions
  Superposition of parallelization layers
  Optimization issues: maintainability vs. robustness
Users' ability
  Architecture dimensioning: adapt the runs to the system
  Performance evaluation approach
And it is not getting simpler:
  New architectures (GPU, MIC, BG/Q, . . . )
  New development paradigms (MPI, OpenMP, OpenCL, . . . )
  HPC codes must follow (HPC projects, user how-tos, . . . ) . . . and so must algorithms
O(N) approach (vs. the traditional O(N³))
[Plot: CPU time T_CPU vs. system size N; the O(N) and O(N³) curves meet at the crossover point]
Formalism advantage: use the locality of wavelets
  Localization regions
  Better flexibility
  Different schemes to localise wavefunctions
Where are we?
  A prototype version validated since 2007
  New localisation schemes already tested with the cubic paradigm
* Underlying infrastructure ready, soon in production
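The crossover in the plot above can be sketched with a toy cost model: the O(N³) method is cheaper per atom for small systems, while the O(N) method carries a larger prefactor but wins beyond some system size. The prefactors here are purely illustrative, not BigDFT timings:

```python
import math

# Toy cost model behind the crossover plot. c3 and c1 are made-up
# prefactors; real values depend on the code and the system.
def t_cubic(n, c3=1e-4):
    """Cost of the traditional O(N^3) method for N orbitals/atoms."""
    return c3 * n**3

def t_linear(n, c1=1.0):
    """Cost of the O(N) localised method (larger prefactor)."""
    return c1 * n

# Crossover where c3 * N^3 == c1 * N, i.e. N* = sqrt(c1 / c3):
c3, c1 = 1e-4, 1.0
n_star = math.sqrt(c1 / c3)
print(n_star)  # 100.0 with these illustrative prefactors
```

Below N* the cubic method is faster; above it the linear method takes over, which is why the crossover point, not the asymptotic scaling alone, decides whether an O(N) implementation pays off in production.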
A look into the near future: science with HPC DFT codes
Enhancing BigDFT functionalities
  PAW formalism: should further reduce the computational overhead
  O(N) approach, production code: possible thanks to wavelet localisation and orthogonality
  New parallelisation scheme suitable for very large platforms
  Further refine the formalism for quantum chemistry: systematic basis-set extension for accurate treatment
The Mars mission: is Petaflop performance possible?
  Multilevel parallelization → one order of magnitude
  Bigger systems, heavier methods → (more than) one order of magnitude more
General considerations
What is desirable? (Does it open new directions?)
  Performance should lead to scientific improvements
Optimisation effort
  Know the code's behaviour and features
  Careful performance study of the complete algorithm
  Identify critical sections and make them modular: fundamental for maintainability and architecture evolution
  Optimisation cost: consider the end-user's running conditions; robustness is more important than peak performance
Performance evaluation know-how
  No general rule of thumb: what does "high performance" mean?
  A multi-criterion evaluation process
Multi-level parallelisation must always be used
  Your code will no longer get faster through hardware alone