High-Performance Numerical Components and Common Interfaces
Lois Curfman McInnes
Mathematics and Computer Science Division
Argonne National Laboratory
Joint ORNL/Indiana University Workshop on Computational Frameworks for Fusion
Oak Ridge, TN
June 7-8, 2005
L.C. McInnes, IU/ORNL Workshop on Computational Frameworks for Fusion, 6/8/2005 2
Outline
• Motivation
  – Complex, multiphysics, multiscale nonlinear applications
  – Distributed, multilevel memory hierarchies
• Parallel Components for PDEs and Optimization
  – Two-phased approach
• Some Challenges
  – Domain-specific common interfaces
  – Dynamic adaptivity
• Concluding Remarks
Motivating Scientific Applications
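[Diagram: scientific applications (astrophysics, molecular structures, aerodynamics, fusion) surrounded by the software capabilities they draw on: physics, meshes, discretization, algebraic solvers, optimization, derivative computation, adaptive solution, data redistribution, parallel I/O, diagnostics, steering, and visualization.]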
Challenges
• Community Perspective
  – Life-cycle costs of applications are increasing
    • Require the combined use of software developed by different groups
    • Difficult to leverage expert knowledge and advances in subfields
    • Difficult to obtain portable performance
• Individual Scientist Perspective
  – Too much energy focused on too many details
    • Little time to think about modeling, physics, mathematics
    • Fear of bad performance without custom code
    • Even when code reuse is possible, it is far too difficult
• Our Perspective
  – How to manage complexity?
    • Numerical software tools that work together
    • New algorithms (e.g., interactive/dynamic techniques, algorithm composition)
    • Multimodel, multiphysics simulations
What are the algorithmic needs of our target applications?
• Large-scale, nonlinear PDE-based simulations
  – Multirate, multiscale, multicomponent
  – Rich variety of time scales and strong nonlinearities
  – Can run on 10,000+ processors, where systems have increasingly deep memory hierarchies
  – Require hundreds of thousands of nonlinear solves (time integration)
• Need
  – Fully or semi-implicit solvers
  – Multi-level algorithms
  – Support for adaptivity
  – Support for user-defined customizations (e.g., physics-informed preconditioners, transfer operators, and smoothers)
Software for Nonlinear PDEs and Related Optimization Problems
• Goal: For problems arising from PDEs, support the general solution of $F(u) = 0$
  User provides:
  – Code to evaluate $F(u)$
  – Code to evaluate Jacobian of $F(u)$ (optional)
    • or use sparse finite difference (FD) approximation
    • or use automatic differentiation (AD)
      – AD support via collaboration with P. Hovland and B. Norris (see http://www.mcs.anl.gov/autodiff)
• Goal: Solve related optimization problems, generally $\min f(u),\ u_l < u < u_u,\ c_l < c(u) < c_u$
  Simple example: unconstrained minimization, $\min f(u)$
  User provides:
  – Code to evaluate $f(u)$
  – Code to evaluate gradient and Hessian of $f(u)$ (optional)
    • or use sparse FD or AD
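As a rough illustration of the finite difference option for the Jacobian, the sketch below approximates the Jacobian of a user-supplied F(u) column by column with forward differences. It is a dense, pure-Python simplification and not PETSc's sparse implementation; the names `fd_jacobian` and `residual`, the step size, and the two-equation test problem are assumptions made only for this example.

```python
from typing import Callable, List

Vector = List[float]


def fd_jacobian(F: Callable[[Vector], Vector], u: Vector, h: float = 1e-7) -> List[Vector]:
    """Approximate the Jacobian of F at u by forward differences, one column per unknown."""
    f0 = F(u)
    n = len(u)
    J = [[0.0] * n for _ in range(len(f0))]
    for j in range(n):
        # Scale the step with the size of the unknown to limit round-off error.
        step = h * max(1.0, abs(u[j]))
        perturbed = list(u)
        perturbed[j] += step
        fj = F(perturbed)
        for i in range(len(f0)):
            J[i][j] = (fj[i] - f0[i]) / step
    return J


def residual(u: Vector) -> Vector:
    """Hypothetical two-equation residual F(u), used only for illustration."""
    x, y = u
    return [x * x + y - 3.0, x + y * y - 5.0]


# The exact Jacobian at (1, 2) is [[2, 1], [1, 4]]; the approximation should be close.
J = fd_jacobian(residual, [1.0, 2.0])
```

A sparse variant would perturb several structurally independent columns at once, which is what makes the FD option affordable for PDE problems; that refinement is omitted here.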
Interface Issues
• How to hide complexity, yet allow customization and access to a range of algorithmic options?
• How to achieve portable performance?
• How to interface among external tools?
  – Including multiple libraries developed by different groups that provide similar functionality (e.g., linear algebra software)
• Criteria for evaluation of success
  – Efficiency (both per-node performance and scalability)
  – Usability
  – Extensibility
Two-Phased Approach to Numerical Components
• Phase 1
  – Develop parallel, object-oriented numerical libraries
    • OO techniques are effective for development with a moderate-sized team
    • Provide foundation of algorithms, data structures, implementations
• Phase 2
  – Develop CCA-compliant component interfaces
    • Leverage existing code
    • Provide a more effective means for managing interactions among code developed by different groups
Parallel Numerical Libraries: PETSc and TAO
• PETSc: Portable, Extensible Toolkit for Scientific Computation
  – S. Balay, K. Buschelman, B. Gropp, D. Kaushik, M. Knepley, L. C. McInnes, B. Smith, H. Zhang
  – http://www.mcs.anl.gov/petsc
  – Targets the parallel solution of large-scale PDE-based applications
  – Begun in 1991; over 13,000 downloads since 1995
• TAO: Toolkit for Advanced Optimization
  – S. Benson, L. C. McInnes, J. Moré, J. Sarich
  – http://www.mcs.anl.gov/tao
  – Targets the solution of large-scale optimization problems
  – Begun in 1997 as part of DOE ACTS Toolkit
• Approach
  – Freely available and supported research toolkits
    • Hyperlinked docs, many examples, usable from Fortran 77/90, C, and C++
  – Portable to any parallel system supporting MPI, including
    • Tightly coupled systems
      – Cray T3E, SGI Origin, IBM SP, HP 9000, Sun Enterprise
    • Loosely coupled systems, e.g., networks of workstations
      – Compaq, HP, IBM, SGI, Sun, PCs running Linux or Windows
  – Distributed memory ‘shared nothing’ approach; encapsulate message-passing details in objects such as matrices, vectors, index sets
PETSc Numerical Libraries
• Nonlinear Solvers: Newton-based Methods (Line Search, Trust Region), Others
• Time Steppers: Euler, Backward Euler, Pseudo Time Stepping, Others
• Krylov Subspace Methods: GMRES, CG, CGS, Bi-CG-STAB, TFQMR, Richardson, Chebyshev, Others
• Preconditioners: Additive Schwarz, Block Jacobi, Jacobi, ILU, ICC, LU (Sequential only), Others
• Matrices: Compressed Sparse Row (AIJ), Blocked Compressed Sparse Row (BAIJ), Block Diagonal (BDIAG), Dense, Matrix-free, Others
• Distributed Arrays
• Index Sets: Indices, Block Indices, Stride, Others
• Vectors
TAO Solvers
• Unconstrained Minimization
  – Newton-based Methods (Line Search, Trust Region)
  – Limited Memory Variable Metric (LMVM) Method
  – Conjugate Gradient Methods (Fletcher-Reeves, Polak-Ribière, Polak-Ribière-Plus)
  – Others
• Bound Constrained Optimization: Newton Trust Region, GPCG, Interior Point, LMVM, KT, Others
• Nonlinear Least Squares: Levenberg-Marquardt, Gauss-Newton, LMVM, Levenberg-Marquardt with Bound Constraints, LMVM with Bound Constraints, Others
• Complementarity: Semi-smooth Methods, Others
TAO interfaces to external libraries for parallel vectors, matrices, and linear solvers
• PETSc (initial interface)
• Global Arrays (PNNL – thanks to M. Kumar and J. Nieplocha)
• Etc.
Newton-Krylov Methods
• Newton:
  – Solve: $F'(u_{l-1})\,\delta u_l = -F(u_{l-1})$
  – Update: $u_l = u_{l-1} + \delta u_l$
• Krylov: Projection methods for solving linear systems, $Ax = b$, using the Krylov subspace $K_j = \mathrm{span}(r_0, A r_0, A^2 r_0, \ldots, A^{j-1} r_0)$
  – Require $A$ only in the form of matrix-vector products
  – Popular methods: CG, GMRES, TFQMR, BiCGStab, etc.
• Preconditioning: In practice, typically needed
  – Transform $Ax = b$ into an equivalent form, $B^{-1}Ax = B^{-1}b$ or $(AB^{-1})(Bx) = b$, where the inverse action of $B$ approximates that of $A$, but at a smaller cost
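The Newton and Krylov pieces above can be combined in a small matrix-free sketch: each Newton step solves the linearized system with a Krylov method that sees the Jacobian only through matrix-vector products, here approximated by differencing F. This is an illustrative pure-Python example, not PETSc code. It assumes a hypothetical 1D model problem whose Jacobian is symmetric positive definite, so plain CG can stand in for the Krylov solver (GMRES or similar would be needed in general), and it uses no preconditioner. All function names and tolerances are assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def conjugate_gradient(matvec: Callable[[Vector], Vector], b: Vector,
                       tol: float = 1e-6, max_iter: int = 200) -> Vector:
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the zero initial guess
    p = list(r)
    rr = dot(r, r)
    b_norm = norm(b)
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol * b_norm:
            break
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def newton_krylov(F: Callable[[Vector], Vector], u0: Vector,
                  tol: float = 1e-8, max_newton: int = 20) -> Vector:
    """Newton iteration: solve F'(u) du = -F(u) matrix-free, then update u."""
    u = list(u0)
    for _ in range(max_newton):
        Fu = F(u)
        if norm(Fu) <= tol:
            break

        def jacobian_times(v: Vector, u: Vector = u, Fu: Vector = Fu) -> Vector:
            # F'(u) v is approximated by (F(u + eps v) - F(u)) / eps.
            eps = 1e-7 * (1.0 + norm(u)) / max(norm(v), 1e-30)
            Fp = F([ui + eps * vi for ui, vi in zip(u, v)])
            return [(a - b) / eps for a, b in zip(Fp, Fu)]

        du = conjugate_gradient(jacobian_times, [-f for f in Fu])
        u = [ui + di for ui, di in zip(u, du)]
    return u


def residual(u: Vector) -> Vector:
    """Hypothetical model: -u'' + u^3 = 1 on a uniform grid with zero boundary values."""
    n = len(u)
    h = 1.0 / (n + 1)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h) + u[i] ** 3 - 1.0)
    return out


solution = newton_krylov(residual, [0.0] * 16)
final_residual = norm(residual(solution))
```

The solver never forms the Jacobian, which is the property the slide highlights: only `jacobian_times` would change if an analytic or AD-generated product were available.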
An Application Perspective: Solve F(u) = 0
[Diagram: the application driver calls the PETSc nonlinear solvers, which call back into application code for function evaluation and Jacobian evaluation (the Jacobian may instead come from a finite difference approximation or automatic differentiation code); application code also handles application initialization and post-processing.]
PETSc code beneath the nonlinear solvers:
• Krylov Solvers: GMRES, TFQMR, BCGS, CGS, BCG, Others…
• Preconditioners: ASM, ILU, B-Jacobi, SSOR, Multigrid, Others…
• Matrices: AIJ, B-AIJ, Diagonal, Dense, Matrix-free, Others…
• Vectors: Sequential, Parallel, Others…
Aerodynamics Example
• Developers: D. Kaushik (Argonne), D. Keyes (Columbia Univ.), W. Gropp, B. Smith (Argonne), W.K. Anderson (NASA); based on a legacy NASA code, FUN3D, developed by Anderson
• Background: The Euler equations describe the conservation of mass, momentum, and energy in an inviscid fluid; here we study the flow of air over an ONERA M6 wing.
• Model: Fully implicit steady-state 3D incompressible Euler model using a tetrahedral mesh
• Solvers: Newton-Krylov-Schwarz method with pseudo-transient continuation
• Won Gordon Bell prize at SC99
• See www.mcs.anl.gov/petsc-fun3d
Performance
ONERA M6 wing test case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (each with dual Pentium Pro 333 MHz processors)
Scientific Applications
• PETSc and TAO solvers have been used successfully in many scientific applications
  – Aerodynamics, acoustics, biomechanics, chemistry, fusion, electromagnetics, micromagnetics, materials science, multiphase flow, nanotechnology, reactive transport, etc.
  – See http://www.mcs.anl.gov/petsc/petsc-as/publications and http://www.mcs.anl.gov/tao/impact
  – Scale to low 1000s of processors
• PETSc usage in fusion applications includes:
  – The SEL macroscopic modeling code, A. H. Glasser and X. Z. Tang, Computer Physics Communications, 164, 237-243, 2004.
  – A finite element Poisson solver for gyrokinetic particle simulations, Y. Nishimura, Z. Lin, J. Lewandowski, and S. Ethier, submitted to J. Comput. Phys., 2004.
  – Global gyrokinetic particle-in-cell simulations with trapped electrons, J. L. V. Lewandowski, Y. Nishimura, W. W. Lee, Z. Lin, and S. Ethier, Sherwood Fusion Theory Conference, Missoula, MT, 2004.
  – Electromagnetic gyrokinetic simulation with a fluid-kinetic hybrid electron model, Y. Nishimura, Z. Lin, L. Chen, J. Lewandowski, S. Ethier, and W. Wang, Sherwood Fusion Theory Conference, Missoula, MT, 2004.
  – Numerical studies of a steady state axisymmetric co-axial helicity injection plasma, X. Z. Tang and A. H. Boozer, Physics of Plasmas, 11, 171-185, 2004.
  – Inclusion of electromagnetic effects into gyrokinetic particle simulations, Y. Nishimura, Z. Lin, L. Chen, and W. Wang, American Physical Society 45th Annual Meeting of the Division of Plasma Physics, Albuquerque, New Mexico, October 2003.
  – Resistive magnetohydrodynamics simulation of fusion plasmas, X. Z. Tang, G. Y. Fu, S. C. Jardin, L. L. Lowe, W. Park, and H. R. Strauss, Princeton Plasma Physics Laboratory, PPPL-3532, presented at the 10th Society for Industrial and Applied Mathematics (SIAM) Conference on Parallel Processing for Scientific Computing, Portsmouth, Virginia, March 12-14, 2001.
CCA Overview
• CCA evolved from DOE2000 as a grass roots effort
  – Recognized benefit of component based software engineering (CBSE) to high-performance scientific computing
  – Bridle the burgeoning hardware/software complexity!
  – See: www.cca-forum.org
• CBSE needed to be specially crafted for HPC
  – Supporting parallelism and performance requirements
  – Supporting scientific languages (e.g. Fortran 90), legacy codes
• With SciDAC support, CCA has:
  – Demonstrated effectiveness of component-oriented approach
  – Advanced scientific research across several key domains
  – Grown a diverse community of users
  – See: www.cca-forum.org/ccttss
CCA Compliance in TAO
Paradigm shift; both TAO and the application become components
• Each is required to provide a default constructor and to implement the CCA component interface
  – The interface contains one method, “setServices”, which registers ports
• All interactions between components use ports
  – Application provides a “go” port and uses “taoSolver” port
  – TAO provides a “taoSolver” port
• There is no “main” routine
Ref: J. Sarich, A Programmer's Guide for Providing CCA Component Interfaces to the Toolkit for Advanced Optimization, Argonne technical report ANL/MCS-TM-279, December 2004.
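The provides/uses port pattern described on this slide can be mimicked in a few lines. The sketch below is a toy stand-in written for illustration only: the `Services` class, its method names, and the gradient-descent "solver" are assumptions, not the CCA specification or TAO's actual component code. It shows the shape of the idea: each component has a default constructor and a `setServices` method, all interaction goes through named ports, and a framework (here a short loop) rather than a `main` routine in either component does the wiring.

```python
from typing import Callable, Dict, List


class Services:
    """Hypothetical stand-in for the framework object handed to setServices."""

    def __init__(self, registry: Dict[str, object]) -> None:
        self._registry = registry       # port name -> object providing that port
        self.uses: List[str] = []       # port names the component declared it uses

    def add_provides_port(self, name: str, port: object) -> None:
        self._registry[name] = port

    def register_uses_port(self, name: str) -> None:
        self.uses.append(name)

    def get_port(self, name: str) -> object:
        return self._registry[name]


class TaoSolverPort:
    """Toy 'taoSolver' port: minimizes a 1D function by fixed-step gradient descent."""

    def minimize(self, gradient: Callable[[float], float], x0: float,
                 step: float = 0.1, iterations: int = 200) -> float:
        x = x0
        for _ in range(iterations):
            x -= step * gradient(x)
        return x


class OptimizerComponent:
    def __init__(self) -> None:         # default constructor
        self.services = None

    def setServices(self, services: Services) -> None:
        self.services = services
        services.add_provides_port("taoSolver", TaoSolverPort())


class ApplicationComponent:
    def __init__(self) -> None:         # default constructor
        self.services = None

    def setServices(self, services: Services) -> None:
        self.services = services
        services.register_uses_port("taoSolver")
        services.add_provides_port("go", self)

    def go(self) -> float:
        # Minimize f(x) = (x - 3)^2 through whichever component provides "taoSolver".
        solver = self.services.get_port("taoSolver")
        return solver.minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)


# The framework instantiates and connects components, then invokes the "go" port.
registry: Dict[str, object] = {}
for component in (OptimizerComponent(), ApplicationComponent()):
    component.setServices(Services(registry))
result = registry["go"].go()
```

Because the application only names the port it uses, a different optimizer component could be registered under "taoSolver" without touching `ApplicationComponent`.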
Negligible CCA Overhead in TAO
Optimization Components
• No CCA overhead within components
• Small overhead between components
• Small overhead for language interoperability
• No CCA overhead on parallel computing
• Be aware of costs & design with them in mind
  – Small costs, easily amortized
Figure caption: Maximum 0.2% overhead for CCA vs native C++ code for parallel molecular dynamics up to 170 CPUs.
Figure caption: Aggregate time for linear solver component in unconstrained minimization problem.
Ref: B. Norris et al., Parallel Components for PDEs and Optimization: Some Issues and Experiences, Parallel Computing, 28 (12), 2002, pp. 1811-1831.
CCA Application: Optimization in Quantum Chemistry
• Collaboration of ANL, PNNL, and SNL researchers, working with their own packages, integrated using CCA:
– TAO (ANL) Limited Memory Variable Metric (LMVM) algorithm
– PETSc (ANL) and Global Arrays (PNNL) for linear algebra
– MPQC (SNL) and NWChem (PNNL) chemistry packages
• Significant improvements over “traditional” BFGS optimizers built into packages
• Interoperability at linear algebra and chemistry package levels
Ref: J. P. Kenny et al., Component-Based Integration of Chemistry and Optimization Software, J. Computational Chemistry, 25(14):1717-1725, 2004.
[Figure: bar chart of the number of energy and gradient evaluations (vertical axis, 0 to 90) for glycine, isoprene, phosphoserine, acetylsalicylic acid, and cholesterol, with four series: NWChem/native, MPQC/native, NWChem/TAO, MPQC/TAO.]
Comparison of native BFGS and TAO LMVM optimization algorithms used with the MPQC and NWChem computational chemistry packages. Function evaluations in this domain are very expensive, so reducing optimization steps is very important.
The Importance of Interfaces
• The CCA Forum participants do not pretend to be experts in all phases of computation, but rather just to be developing a standard way to exchange component capabilities.
• Medium of exchange: interfaces
  – Components interact only through explicitly defined interfaces
  – Quality (generality, completeness) of interfaces varies widely
  – Higher quality interfaces…
    • Require general agreement among groups or communities
    • Are more easily used in front of multiple implementations
    • Are more easily (re)used by many applications
    • Facilitate experimentation with new algorithms, implementations, etc.
A challenge to the community: Common interfaces are central
• Need experts in various areas to define sets of domain-specific common interfaces
  – Scientific application domains, meshes, discretization, (non)linear solvers, optimization, data analysis, visualization, etc.
• Caveat: Developing common interfaces is difficult!
  – Technical challenges
    • Tradeoffs in broad functionality vs. maintaining good performance
  – Social challenges
    • Agreement among diverse individuals with different priorities
    • Few academic rewards for software
• The CCA is actively developing or promoting the development of common domain-specific interfaces, including
  – Distributed array descriptor
  – Molecular geometry optimization
  – MxN parallel data redistribution
  – Adaptive mesh refinement (w/ APDEC SciDAC Center)
  – Mesh and discretization interfaces (lead: TSTT SciDAC Center)
  – Linear and nonlinear solver interfaces (lead: TOPS SciDAC Center)
This means you!
Interface Definition Efforts
• Collaborations with math SciDAC centers focus on unified interfaces to numerous existing and new libraries
  – Users can swap libraries without having to change their code
  – New libraries are more easily integrated into applications
• Some info on TOPS and TSTT interfaces:
  – Parallel PDE-Based Simulations Using the Common Component Architecture, Lois Curfman McInnes et al., Argonne National Laboratory preprint ANL/MCS-P1179-0704, 2004 (available via www.mcs.anl.gov/cca), to appear in Are Magnus Bruaset, Petter Bjorstad, and Aslak Tveito, editors, Numerical Solution of PDEs on Parallel Computers, Springer-Verlag.
[Diagram: without common interfaces, the application calls each linear solver library (SuperLU, PETSc, Hypre, Sparskit, others) directly; with them, the application calls the TOPS solver interfaces, which front the same solvers.]
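The claim that users can swap libraries without changing their code can be illustrated with a small sketch. The `LinearSolver` abstract class and the two toy implementations below are assumptions made for this example; they are not the TOPS interfaces, and the real libraries named on the slide are not called. The point is only that `application` is written once against the interface and runs unchanged with either solver.

```python
from abc import ABC, abstractmethod
from typing import List

Matrix = List[List[float]]
Vector = List[float]


class LinearSolver(ABC):
    """Hypothetical common interface: every solver exposes the same solve(A, b)."""

    @abstractmethod
    def solve(self, A: Matrix, b: Vector) -> Vector:
        ...


class GaussianElimination(LinearSolver):
    """Direct solver: elimination with partial pivoting, then back substitution."""

    def solve(self, A: Matrix, b: Vector) -> Vector:
        n = len(b)
        M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented copy
        for k in range(n):
            pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
            M[k], M[pivot] = M[pivot], M[k]
            for i in range(k + 1, n):
                factor = M[i][k] / M[k][k]
                for j in range(k, n + 1):
                    M[i][j] -= factor * M[k][j]
        x = [0.0] * n
        for i in reversed(range(n)):
            s = sum(M[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (M[i][n] - s) / M[i][i]
        return x


class JacobiIteration(LinearSolver):
    """Iterative solver: a fixed number of Jacobi sweeps."""

    def __init__(self, iterations: int = 200) -> None:
        self.iterations = iterations

    def solve(self, A: Matrix, b: Vector) -> Vector:
        n = len(b)
        x = [0.0] * n
        for _ in range(self.iterations):
            x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        return x


def application(solver: LinearSolver) -> Vector:
    """Application code depends only on the interface, not on a specific library."""
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]
    return solver.solve(A, b)


direct = application(GaussianElimination())
iterative = application(JacobiIteration())
```

Both calls should return a vector close to (1, 1, 1); only the argument passed to `application` differs.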
TOPS’ Linear Solver Interface
• Goals
  – Simplicity: small number of distinct concepts
  – Generality
  – Programming language independence (via SIDL)
  – High performance
  – Extensibility: infrastructure for defining/implementing ‘conceptual’ solver interfaces
• Progenitors include
  – FEI (finite element interface) / C++
    • developed at SNL
  – ESI (equation solver interface) / C++
    • multi-lab effort
  – Various TOPS software packages
• Current drafts available via
  – Bitkeeper repository: bk://tops.bkbits.net:8080/tops-solver-interface
  – Snapshot: http://www.mcs.anl.gov/scidac-tops/tops-solver-interface
• Who: B. Smith (ANL), R. Falgout (LLNL), various TOPS investigators
Object Model Concepts
• Solver (is an) Operator
• Vector – represents field data
  – (has one or more) View – provides access to the data
  – (has a) Layout – how data is laid out across processes
View allows users to access values in the “language of the application”
• Handles any data communication transparently
• Same idea as conceptual interfaces within hypre (LLNL)
[Diagram: Conceptual (Linear System) Interfaces, c/o Rob Falgout, LLNL]
• Data Layout: structured, composite, block-struc, unstruc, CSR
• Linear Solvers: GMG, ...; FAC, ...; Hybrid, ...; AMGe, ...; ILU, ...
Views differ primarily in the way they “set” and “get” data
• Classical Linear Algebra View – Indices are scalars that represent locations in $\mathbb{R}^n$
    array<double> getValues(array<int> indices);
• Structured Mesh View – Indices are 3D triples that describe “boxes of data” (think 3D Fortran arrays)
    array<double> getValues(<int,3> ilower, <int,3> iupper);
• Views / Layouts
  – classical linear algebra access
  – single structured mesh
  – finite element interface
  – semi-structured meshes (structured mesh “parts” with additional arbitrary connections)
  – etc…
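To make the two `getValues` signatures concrete, the sketch below puts a classical linear algebra view and a structured mesh view in front of the same stored data. It is a single-process illustration with no data communication; the class names, the x-fastest storage order, and the inclusive box bounds are assumptions for this example rather than details of the TOPS drafts.

```python
from typing import Iterator, List, Sequence, Tuple

Triple = Tuple[int, int, int]


class GridVector:
    """Field data on an nx-by-ny-by-nz grid, stored as one flat list (x index fastest)."""

    def __init__(self, shape: Triple) -> None:
        self.shape = shape
        nx, ny, nz = shape
        self.data: List[float] = [0.0] * (nx * ny * nz)


class LinearAlgebraView:
    """Classical view: indices are scalar positions in the flat vector."""

    def __init__(self, vector: GridVector) -> None:
        self.vector = vector

    def get_values(self, indices: Sequence[int]) -> List[float]:
        return [self.vector.data[i] for i in indices]


class StructuredMeshView:
    """Structured view: a box is named by its lower and upper (i, j, k) corners, inclusive."""

    def __init__(self, vector: GridVector) -> None:
        self.vector = vector

    def _box(self, ilower: Triple, iupper: Triple) -> Iterator[int]:
        # Translate every (i, j, k) in the box to its flat storage position.
        nx, ny, _ = self.vector.shape
        for k in range(ilower[2], iupper[2] + 1):
            for j in range(ilower[1], iupper[1] + 1):
                for i in range(ilower[0], iupper[0] + 1):
                    yield i + nx * (j + ny * k)

    def get_values(self, ilower: Triple, iupper: Triple) -> List[float]:
        return [self.vector.data[p] for p in self._box(ilower, iupper)]

    def set_values(self, ilower: Triple, iupper: Triple, values: Sequence[float]) -> None:
        for p, value in zip(self._box(ilower, iupper), values):
            self.vector.data[p] = value


field = GridVector((3, 2, 2))
mesh_view = StructuredMeshView(field)
algebra_view = LinearAlgebraView(field)

# Write a 2-by-2 box through the mesh view, then read the same entries by scalar index.
mesh_view.set_values((1, 0, 0), (2, 1, 0), [1.0, 2.0, 3.0, 4.0])
same_entries = algebra_view.get_values([1, 2, 4, 5])
```

The application chooses whichever view matches its own indexing, while the storage layout stays hidden behind the vector.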
What’s Coming in TOPS Solvers
• Greater interface standardization
• Greater solver interoperability
• Better integration upwards w/ meshing and discretization systems
• Better integration downwards w/ performance monitoring and engineering systems
• Better algorithms!
c/o David Keyes, TOPS PI (see more TOPS info at www.tops-scidac.org)
Anticipated Impact of Common TOPS Solver Interfaces on Fusion
• Easier for fusion scientists to explore different algorithms and solvers developed by different groups, such as these MHD/TOPS collaborations (for which interfaces were done manually for new algorithms callable across the Ax=b interface)
  – M3D
    • replacement of additive Schwarz (ASM) preconditioner with algebraic multigrid (AMG) in hypre (LLNL)
    • achieved mesh-independent convergence rate
    • 4-5× improvement in execution time
  – NIMROD
    • replacement of diagonally scaled Krylov solver with a supernodal parallel sparse direct solver in SuperLU (LBNL)
    • 2D tests run 100× faster; 3D production runs are 4-5× faster
[Figure: plot comparing ASM-GMRES and AMG-FGMRES; vertical axis 0 to 700, horizontal axis values 3, 12, 27, 48, 75; axis labels not recoverable.]
Dynamic Adaptivity
• Next generation applications will need to adapt to changing computational conditions
  – Changes in physics/models/algorithms in long-running simulations, different resource needs and performance characteristics
• CBSE enables component substitution at runtime, based on changing application characteristics and available resources
[Diagram: an application driver (with application monitoring) uses a Newton-Krylov solver, which solves $f'(u)\,\delta u = -f(u)$ through a linear solver proxy (with component monitoring); the proxy dispatches to linear solver A, B, or C from the component substitution set, guided by analysis, optimization, replacement, and substitution decision services.]
Computational Quality of Service (CQoS)
Approach: Automatic selection and configuration of components to suit a particular computational purpose, involves research in:
• Metadata and metrics
• Performance evaluation and monitoring
• Automated application assembly and reconfiguration
• Adaptive polyalgorithmic solvers
[Diagram: within the component framework, application component(s) call through an abstract interface to a component proxy, which selects among provider components A, B, and C; adaptive strategy components, runtime monitoring, and database access to runtime and historical databases inform the selection.]
Ref: P. Hovland, K. Keahey, L. McInnes, B. Norris, L. Diachin, and P. Raghavan, A Quality-of-Service Architecture for High-Performance Numerical Components, Proceedings of the Workshop on QoS in Component-Based Software Engineering, Toulouse, France, June 20, 2003.
Ref: B. Norris, J. Ray, R. Armstrong, L. McInnes, D. Bernholdt, W. Elwasif, A. Malony, and S. Shende, Computational Quality of Service for Scientific Components, Proceedings of the International Symposium on Component-Based Software Engineering (CBSE7), Edinburgh, Scotland, 2004.
Ref: B. Norris and I. Veljkovic, Performance Monitoring and Analysis Components in Adaptive PDE-Based Simulations, Argonne preprint ANL/MCS-P1221-0105, January 2005.
Concluding Remarks
• High-performance numerical components can be effectively built using a 2-phased process
  – Object-oriented numerical libraries developed by different teams at different institutions
  – Light-weight component layers
• Domain-specific common interfaces that are defined by various computational science communities are critical for
  – Achieving the promise of ‘plug-and-play’ component interoperability
  – Addressing issues in dynamic component interactions (reconfiguring and recomposing)
• These capabilities are becoming increasingly important for multi-physics, multi-scale computational science applications (e.g., fusion simulations)