Computational Chemistry at Daresbury 16-22 November 2002
Computational Science and Engineering Department Daresbury Laboratory

Computational Chemistry at Daresbury Laboratory

Quantum Chemistry Group
Martyn F. Guest, Paul Sherwood, and Huub J.J. van Dam
http://www.dl.ac.uk/CFS
http://www.cse.clrc.ac.uk/Activity/QUASI

Molecular Simulation Group
Bill Smith and Maurice Leslie
http://www.dl.ac.uk/TCSC/Software/DL_POLY

Benchmark                                                                            Atoms   Time steps
2. Metallic Al with Sutton-Chen potential                                              256         8000
3. Valinomycin in 1223 water molecules                                                3837          100
4. Dynamic shell model water with 1024 sites                                           768         1000
5. Dynamic shell model MgCl2 with 1280 sites                                           768         1000
6. Model membrane, 2 membrane chains, 202 solute and 2746 solvent molecules           3148         1000

The GAMESS-UK Benchmark
Performance relative to the Compaq Alpha ES45/68-1000
[Figure: bar chart, horizontal axis 0 to 150, annotated "3.6 minutes". Systems compared: IBM RS/6000 44P-270; IBM 690 Turbo/POWER4 1.3 GHz; HP PA-9000/J6000-552; HP PA-9000/J6700-750; SGI Origin3800/R14k-500; SGI O2 R12k/270; SUN Fire 6800 / 900 Cu; SUN Blade 1000 / 750; Compaq Alpha ES45/1000; Compaq Alpha ES40/833; Intel Tiger Itanium 2/1000; HP RX2600 Itanium 2/1000; IBM IA64 Itanium 800/4MB; HP RX4610 Itanium 733/2MB; Pentium 4 / 2000; AMD MP1800+ / 1533; Pentium III / 1000.]

The DL_POLY Benchmark
Performance relative to the Compaq Alpha ES45/68-1000
[Figure: bar chart, horizontal axis 0 to 160. Systems compared: IBM RS/6000 44P-270; IBM 690 Turbo/POWER4 1.3 GHz; HP PA-9000/J6000-552; HP PA-9000/J6700-750; Cray T3E/1200; SGI Origin3800/R14k-500; SGI O2 R12k/270; SUN Fire 6800 / 900 Cu; SUN Blade 1000 / 750; Compaq Alpha ES45/1000; Compaq Alpha ES40/833; Intel Tiger Itanium 2/1000; HP RX2600 Itanium 2/1000; IBM IA64 Itanium 800/4MB; HP RX4610 Itanium 733/2MB; Pentium 4 / 2000; AMD MP1800+ / 1533; Pentium III / 1000.]

High-End Systems Evaluated
• Cray T3E/1200E
– 816 processor system at Manchester (CSAR service)
– 600 MHz EV56 Alpha processor with 256 MB memory
• IBM SP/WH2-375 and SP/Regatta-H
– 32 CPU system at DL, 4-way Winterhawk2 SMP “thin nodes” with 2 GB memory, 375 MHz Power3-II processors with 8 MB L2 cache
– IBM Regatta-H (32-way node, 1.3 GHz POWER4 CPUs) at Montpellier
– IBM SP/Regatta-H (8-way LPAR’d nodes, 1.3 GHz) at ORNL
• SGI Origin 3800
– SARA (1000 CPUs): NUMAlink with R14k/500 & R12k/400 CPUs
• Cray Supercluster at Eagan
– Linux Alpha Cluster (96 x API CS20s: dual 833 MHz EV67 CPUs, Myrinet)

Commodity Systems (CSx): Prototype / Evaluation Hardware
System   Location    CPUs   Configuration
CS1      Daresbury   32     Pentium III / 450 MHz + FE (EPSRC)
CS2      Daresbury   64     24 x dual UP2000/EV67-667, QSNet Alpha/LINUX cluster, 8 x dual CS20/EV67-833 (“loki”)
CS5      Daresbury   16     8 x dual Pentium III/933, SCALI
www.cse.clrc.ac.uk/Activity/DisCo

Performance Metrics: 1999-2001
Attempted to quantify delivered performance from the commodity-based systems against current MPP (CSAR Cray T3E/1200E) and ASCI-style SMP-node platforms (e.g. SGI Origin 3800), i.e.
T (32-nodes Cray T3E/1200E) / T (32 CPUs) CSx
– [ T 32-node T3E / T 32-node CS1 Pentium III/450 + FE ]
– T 32-node T3E / T 32-node CS6 Pentium III/800 + FE
– T 32-node T3E / T 32-CPU CS2 Alpha Linux Cluster + Quadrics
T (32-CPUs AlphaServer SC ES45/1000) / T (32 CPUs) CSx
– T 32-CPU AlphaServer ES45 / T 32-CPU CS9 Pentium 4 Xeon / 2000 + Myrinet 2k
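
The time ratios listed above are simple to evaluate once elapsed times are available. The following is a minimal sketch, assuming the metric is reported as a percentage of the reference system's performance; the function name and the example timings are hypothetical and are not taken from the slides.

```python
def relative_performance(t_reference: float, t_csx: float) -> float:
    """Return T(reference) / T(CSx) as a percentage.

    A value above 100 means the commodity system finished the job
    faster than the reference system on the same number of CPUs.
    """
    if t_reference <= 0 or t_csx <= 0:
        raise ValueError("elapsed times must be positive")
    return 100.0 * t_reference / t_csx


# Hypothetical elapsed times in seconds, for illustration only.
example = relative_performance(t_reference=500.0, t_csx=400.0)
print(f"CSx delivers {example:.0f}% of the reference performance")
```

Keeping the reference time in the numerator means that a faster commodity system scores higher, which matches the "% of 32-node Cray T3E/1200E" wording used in the comparisons.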

Beowulf Comparisons with the T3E & O3800/R14k-500
CSx - Pentium III + FE: % of 32-node Cray T3E/1200E

Capabilities (direct, semi-direct and conventional):
• RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st and 2nd derivatives.
• DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic 1st and 2nd derivatives.
• CASSCF; analytic 1st and numerical 2nd derivatives.
• Semi-direct and RI-based MP2 calculations for RHF and UHF wave functions using up to 3,000 basis functions; analytic 1st derivatives and numerical 2nd derivatives.
• Coupled cluster, CCSD and CCSD(T) using up to 3,000 basis functions; numerical 1st and 2nd derivatives of the CC energy.
• Classical molecular dynamics and free energy simulations with the forces obtainable from a variety of sources

Single, shared data structure
Physically distributed data
• Shared-memory-like model
– Fast local access
– NUMA aware and easy to use
– MIMD and data-parallel modes
– Inter-operates with MPI, …
• BLAS and linear algebra interface
• Ported to major parallel machines
– IBM, Cray, SGI, clusters,...
• Originated in an HPCC project
• Used by 5 major chemistry codes,
[Figure legend: PDSYEV, PDSYEVD and PeIGS 2.1, each with 1 CPU / node and 2 CPUs / node]

Case Studies - Zeolite Fragments
Si8O7H18      347/832
Si8O25H18     617/1444
Si26O37H36    1199/2818
Si28O67H30    1687/3928
• DFT calculations with Coulomb fitting
– Basis (Godbout et al.): DZVP - O, Si; DZVP2 - H
– Fitting basis: DGAUSS-A1 - O, Si; DGAUSS-A2 - H
• NWChem & GAMESS-UK
– Both codes use an auxiliary fitting basis for the Coulomb energy, with 3-centre 2-electron integrals held in core.

Pyridine in Zeolite ZSM-5
• 3-centre 2e-integrals = 3.79 x 10^12
• Schwarz screening = 2.81 x 10^11
• % 3c 2e-ints. in core = 1.66%
Wall time (10 SCF iterations) on 256 CPUs = 11,960 seconds (60% efficiency)
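
The integral counts quoted for this case can be turned into a rough storage estimate. The sketch below is only back-of-envelope arithmetic under two assumptions that the slide does not state: that each integral occupies 8 bytes, and that the in-core percentage is measured against the screened integral count.

```python
# Figures quoted on the slide.
total_3c_integrals = 3.79e12      # 3-centre 2-electron integrals before screening
screened_3c_integrals = 2.81e11   # integrals surviving Schwarz screening
in_core_fraction = 0.0166         # "% 3c 2e-ints. in core"
n_cpus = 256

# Fraction of the integrals that survive screening.
surviving_fraction = screened_3c_integrals / total_3c_integrals

# ASSUMPTIONS (not stated on the slide): 8-byte reals, and the in-core
# percentage is taken relative to the screened integral count.
BYTES_PER_INTEGRAL = 8
in_core_integrals = in_core_fraction * screened_3c_integrals
total_bytes = in_core_integrals * BYTES_PER_INTEGRAL
bytes_per_cpu = total_bytes / n_cpus

print(f"surviving after screening: {surviving_fraction:.1%}")
print(f"in-core storage: {total_bytes / 1e9:.1f} GB total, "
      f"{bytes_per_cpu / 1e6:.0f} MB per CPU")
```

If the percentage were instead measured against the unscreened count, the storage figure would scale up by the inverse of the surviving fraction, so the result should be read as an order-of-magnitude guide only.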

Parallel Implementations of GAMESS-UK
Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL)
• SCF and DFT
– Replicated data, but …
– GA Tools for caching of I/O for restart and checkpoint files
– Storage of 2-centre 2-e integrals in DFT Jfit
– Linear algebra (via PeIGS, DIIS/MMOs, inversion of 2c-2e matrix)
• SCF second derivatives
– Distribution of <vvoo> and <vovo> integrals via GAs
• MP2 gradients
– Distribution of <vvoo> and <vovo> integrals via GAs

GAMESS-UK SCF Performance
Cray T3E/1200E, High-end and Commodity-based Systems
Cyclosporin: (3-21G basis, 1000 GTOs)
[Figure: elapsed time (seconds) against number of CPUs. Annotations: T3E(128) = 436; 95%, 135%.]
Impact of serial linear algebra: T IBM-SP(16) = 2656 [1289]

DFT BLYP Gradient
Cray T3E/1200, High-end and Commodity-based Systems
Geometry optimisation of polymerisation catalyst Cl(C3H5O).Pd[(P(CMe3)2)2.C6H4]
Basis 3-21G* (446 GTOs): 10 energy + gradient evaluations
[Figure: speed-up and elapsed time (seconds) against number of CPUs, with a linear reference. Annotation: 69%, 114%.]

Auxiliary Basis Coulomb Fit (I)
The approach is based on the expansion of the charge density in an auxiliary basis of Gaussian functions:
\[ \rho(\mathbf{r}) = \sum_{pq} D_{pq}\, p(\mathbf{r})\, q(\mathbf{r}) \approx \sum_{u} \sum_{pq} D_{pq}\, C^{pq}_{u}\, u(\mathbf{r}) = \sum_{u} d_{u}\, u(\mathbf{r}) \]
As suggested by Dunlap, a variational choice of the fitting coefficients C can be obtained as follows:
\[ C^{pq} = V^{-1} b^{pq} \]
where V is the matrix of 2-centre 2-electron repulsion integrals in the charge density basis and b are the 3-centre electron repulsion integrals between the wavefunction basis set and the charge density basis.
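
The fitting step described on this slide amounts to solving a linear system with the 2-centre matrix V. Below is a small NumPy sketch of that idea, assuming random stand-in data in place of real integrals; the array names and toy sizes are illustrative and do not come from GAMESS-UK or NWChem. It shows that fitting every product pq and then contracting with the density matrix D gives the same fitted-density coefficients d as contracting first and solving once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ao, n_aux = 4, 6          # toy sizes, for illustration only

# Stand-ins for real integrals (ASSUMPTION: random data, not chemistry).
A = rng.standard_normal((n_aux, n_aux))
V = A @ A.T + n_aux * np.eye(n_aux)           # symmetric positive definite, like (u|v)
b = rng.standard_normal((n_ao, n_ao, n_aux))  # b[p, q, u] plays the role of (pq|u)
b = 0.5 * (b + b.transpose(1, 0, 2))          # symmetric in p, q
D = rng.standard_normal((n_ao, n_ao))
D = 0.5 * (D + D.T)                           # symmetric density matrix

# Route 1: fit every product pq, C^{pq} = V^{-1} b^{pq}, then contract with D.
C = np.linalg.solve(V, b.reshape(-1, n_aux).T).T.reshape(n_ao, n_ao, n_aux)
d_per_pair = np.einsum("pq,pqu->u", D, C)

# Route 2: contract b with D first, then solve a single linear system.
d_contracted = np.linalg.solve(V, np.einsum("pq,pqu->u", D, b))

# Fitted Coulomb matrix: J_pq ~ sum_u (pq|u) d_u.
J_fit = np.einsum("pqu,u->pq", b, d_contracted)
```

Route 2 is the cheaper ordering because it needs one solve per SCF iteration rather than one per basis-function pair; whether the production codes organise the work this way is not stated in the slides.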

Auxiliary Basis Coulomb Fit (II)
The number of 3-centre integrals is significantly smaller than the number of 4-centre integrals used in the conventional Coulomb evaluation, but for large molecules additional screening is required.
We make use of the Schwarz inequality
\[ (pq|u) \le \sqrt{(pq|pq)}\, \sqrt{(u|u)} \]
where p and q are AO basis functions and u are the fitting functions. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored.
Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core.
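
The screening test can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions, not the production algorithm: a random positive semi-definite matrix stands in for the Coulomb metric over AO pairs pq and fitting functions u, screening is done per function rather than per shell, and the threshold value is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_aux = 30, 12     # toy numbers of AO pairs pq and fitting functions u

# ASSUMPTION: a random positive semi-definite matrix stands in for the Coulomb
# metric over the combined set {pq} + {u}; real integrals would replace it.
decay = np.exp(-rng.uniform(0.0, 12.0, size=n_pairs + n_aux))  # wide range of magnitudes
A = rng.standard_normal((n_pairs + n_aux, 8)) * decay[:, None]
G = A @ A.T

three_centre = G[:n_pairs, n_pairs:]        # plays the role of (pq|u)
q_pair = np.sqrt(np.diag(G)[:n_pairs])      # sqrt((pq|pq))
q_aux = np.sqrt(np.diag(G)[n_pairs:])       # sqrt((u|u))

# Schwarz upper bound for every (pq|u), and the screening decision.
bound = np.outer(q_pair, q_aux)
threshold = 1e-6                            # hypothetical cutoff
keep = bound >= threshold

print(f"kept {keep.sum()} of {keep.size} integrals")
```

Only the diagonal quantities are needed to decide which integrals to skip, so the bound can be tabulated once and reused in every SCF iteration.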

GAMESS-UK: DFT HCTH on Valinomycin.
Impact of Coulomb Fitting: Cray T3E/1200, Cray SuperCluster/833, Compaq AlphaServer SC/667 and SGI Origin R14k/500

GAMESS-UK: DFT HCTH on Valinomycin. Speedups for both Explicit and Coulomb Fitting.
[Figure: two panels, J(explicit) and J(fit), plotting speedup against number of CPUs (16 to 128) with a linear reference. Systems: Cray T3E/1200E; CS2 QSNet Alpha Cluster/667; Compaq AlphaServer SC/667; Cray SuperCluster/833; SGI Origin 3800/R14k-500; SGI Origin 3800/R12k-400. Annotated values: 112, 90.6, 100.1, 65.4, 105+.]

Memory-driven Approaches - SCF and DFT
HF and DFT energy and gradient calculation for NiSiMe2CH2CH2C6F13: basis Ahlrichs DZ (1485 GTOs)
SGI Origin 3800/R14k-500
Integrals are written directly to memory, rather than to disk, and are not recalculated.
[Figure: elapsed time (seconds, 0 to 5000) against number of CPUs (28, 60, 124) for HF direct SCF, HF in-core, DFT direct SCF and DFT in-core.]

MP2 Gradient Algorithms
Serial
• Conventional
– integrals written to disk
– read back, transformed, written out, resorted etc.
– heavy I/O demands
• Direct/Semi-direct (Frisch, Head-Gordon & Pople; Haase and Ahlrichs)
– replace all/some I/O with batched integral recomputation
Parallel
• Poor I/O-to-compute performance of MPPs
– direct approach
• Current MPPs have large global memories
– Store subset of MO integrals
– reduce number of integral recomputations
– increase communication overhead
• Subset includes VOVO, VVOO, VOOO
• VVVO-class too large to store
– compute VVVO-terms in separate step
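
The reason the VVVO class is singled out becomes clear from a simple count of integrals per class. The sketch below assumes 8-byte values, ignores permutational symmetry, and uses a hypothetical number of occupied and virtual orbitals chosen only for illustration.

```python
def class_sizes(n_occ: int, n_virt: int, bytes_per_value: int = 8) -> dict:
    """Storage in bytes for each MO integral class, ignoring permutational symmetry."""
    o, v = n_occ, n_virt
    counts = {
        "VOVO": v * o * v * o,
        "VVOO": v * v * o * o,
        "VOOO": v * o * o * o,
        "VVVO": v * v * v * o,
    }
    return {name: n * bytes_per_value for name, n in counts.items()}


# Hypothetical problem size, for illustration only.
sizes = class_sizes(n_occ=50, n_virt=450)
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 1e9:.2f} GB")
```

The VVVO class is larger than VOVO by the ratio of virtual to occupied orbitals, which is why it is the natural candidate for recomputation in a separate step while the smaller classes are held in global memory.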

Performance of MP2 Gradient Module
Cray T3E/1200, High-end and Commodity-based Systems

These codes have similar functionality, power and problems. CASTEP is the flagship code of UKCP and hence subsequent discussions will focus on this.
Local Gaussian Basis Set Codes:
• CRYSTAL
This code presents a different set of problems when considering performance on HPC(x).
SIESTA and CONQUEST:
• O(n) scaling codes which will be extremely attractive to users.
• Both are currently development rather than production codes.

CRYSTAL - 2000
• Distributed data implementation
• Benchmark: an acid centre in zeolite-Y (faujasite)
– Single point energy
– 145 atoms / cell, no symmetry / 8 k-points
– 2208 basis functions (6-21G*)

CASTEP 4.2 - Parallel Benchmark II.
TiN: a TiN 32 atom slab, 8 k-points, single point energy calculation with Mulliken analysis,

V2: Replicated Data
[Figure: speed-up against number of CPUs (0 to 64) with a linear reference, for the following benchmarks.]
9. Model membrane/Valinomycin (MTS, 18,886)
7. Gramicidin in water (SHAKE, 12,390)
6. K/valinomycin in water (SHAKE, AMBER, 3,838)
1. Metallic Al (19,652 atoms, Sutton Chen)
3. Transferrin in water (neutral groups + SHAKE, 27,593)
2. Peptide in water (neutral groups + SHAKE, 3993)

Bench 7. Gramicidin in water; rigid bonds and SHAKE, 12,390 atoms, 500 time steps
[Figure: measured time (seconds) against number of CPUs. Annotations: T3E(128) = 166; 41%, 64%.]

Migration from Replicated to Distributed data
DL_POLY-3: Domain Decomposition
• Distribute atoms, forces across the nodes
– More memory efficient, can address much larger cases (10^5-10^7)
• SHAKE and short-range forces require only neighbour communication
– communications scale linearly with number of nodes
• Coulombic energy remains global
– strategy depends on problem and includes Fourier transform of the smoothed charge density (reciprocal space grid typically 64x64x64 - 128x128x128)
[Figure: domains labelled A, B, C, D]
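
Distributing atoms across nodes starts from a mapping between an atom's position and the domain that owns it. The following serial sketch shows one way to build that mapping for a cubic periodic box; the function name, box size and 2x2x2 grid are hypothetical choices for illustration and are not taken from the DL_POLY source.

```python
import numpy as np


def assign_domains(positions, box_length, grid):
    """Map each atom to the index of the domain of a cubic periodic box that owns it."""
    grid = np.asarray(grid)
    cell = box_length / grid                     # domain edge lengths
    wrapped = np.mod(positions, box_length)      # periodic wrap into the box
    # Clamp guards against a wrapped coordinate landing exactly on the upper edge.
    idx = np.minimum((wrapped // cell).astype(int), grid - 1)
    # Flatten the (ix, iy, iz) domain coordinates into a single rank number.
    return np.ravel_multi_index(tuple(idx.T), tuple(grid))


rng = np.random.default_rng(2)
box = 40.0                                       # hypothetical box edge
atoms = rng.uniform(0.0, box, size=(10_000, 3))
owner = assign_domains(atoms, box, grid=(2, 2, 2))
counts = np.bincount(owner, minlength=8)
print("atoms per domain:", counts)
```

With a mapping like this, each node stores only its own atoms plus a halo of neighbours within the cutoff, which is what makes the short-range work depend on neighbour communication only.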

Migration from Replicated to Distributed data
DL_POLY-3: Coulomb Energy Evaluation
• Conventional routines (e.g. fftw) assume plane or column distributions
• A global transpose of the data is required to complete the 3D FFT, and additional costs are incurred re-organising the data from the natural block domain decomposition.
• An alternative FFT algorithm has been designed to reduce communication costs.
– the 3D FFT is performed as a series of 1D FFTs, each involving communications only between blocks in a given column
– More data is transferred, but in far fewer messages
– Rather than all-to-all, the communications are column-wise only
[Figure: plane and block data distributions]
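
The statement that a 3D FFT can be carried out as a series of 1D FFTs can be checked on a single processor. The sketch below uses NumPy on a small random grid; it is a serial illustration only and does not model the block distribution or the column-wise communication of the parallel algorithm. The function name is hypothetical.

```python
import numpy as np


def fft3d_by_axes(grid):
    """3D FFT built from 1D FFTs applied along x, then y, then z."""
    out = np.asarray(grid, dtype=complex)
    for axis in range(3):
        out = np.fft.fft(out, axis=axis)
    return out


rng = np.random.default_rng(3)
density = rng.standard_normal((8, 8, 8))   # toy charge-density grid
result = fft3d_by_axes(density)

# The axis-by-axis result agrees with the library's direct 3D transform.
print(np.allclose(result, np.fft.fftn(density)))
```

Because each pass touches only one axis, a block-distributed implementation needs data exchange only among the blocks that share a column along that axis, which is the property the alternative algorithm exploits.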

Migration from Replicated to Distributed data
DL_POLY-3: Coulomb Energy Performance
[Figure: plot with both axes running 0 to 256 and a linear reference line, for AlphaServer ES45/1000, CS9 P4/2000 + Myrinet 2k and SGI Origin 3800/R14k-500. Annotated values: 171, 186.]
• DLPOLY_2.11: 27,000 ions, 500 time steps, cutoff = 24 Å
• DLPOLY_3: 27,000 ions, 500 time steps, cutoff = 12 Å
• DLPOLY_3: 216,000 ions, 200 time steps, cutoff = 12 Å

Migration from Replicated to Distributed data
DL_POLY-3: Macromolecular Simulations
[Figure: plot with both axes running 0 to 256 and a linear reference line, for AlphaServer SC ES45/1000, CS9 P4/2000 + Myrinet 2k and SGI Origin 3800/R14k-500. Annotated values: 174.7, 208.4.]
• DLPOLY_3: 792,960 ions, 50 time steps

Parallel CHARMM Benchmark: LAM MPI vs. MPICH
Benchmark MD calculation of carboxy myoglobin (MbCO) with 3830 water molecules:
[Figure: measured time (seconds, 0 to 750) against number of CPUs (4, 8, 16, 32) on CS6 PIII/800 + FE, comparing LAM 6.3-b2 and MPICH-1.2.1. Annotated values: 198.9, 440.1.]

QM/MM Thrombin Benchmark
QM/MM calculation on thrombin pocket:
QM region (69 Atoms) is modelled by SCF 6-31G calculation (401 GTOs), total system size for the classical calculation is 16,659 atoms. The MM calculation was performed using all non-bonded interactions (i.e. without a pairlist), and the QM/MM interaction was cut off at 15 angstroms using a neutral group scheme (leading to the inclusion in the QM calculation of 1507 classical centres).
[Figure: QM time and MM time (vertical axis 0 to 2000) for 1, 2, 4, 8, 16, 32, 64, 128 and 256 processors.]
Scaling for Thrombin + Inhibitor (NAPAP) + Water
[Figure: speedup (Cray T3E) against number of nodes (0 to 128).]

QM/MM Applications
Triosephosphate isomerase (TIM)
• Central reaction in glycolysis, catalytic interconversion of DHAP to GAP
• Demonstration case within QUASI (partners UZH and BASF)
• QM region 35 atoms (DFT BLYP)
– include residues with possible proton donor/acceptor roles
– GAMESS-UK, MNDO, TURBOMOLE
• MM region (4,180 atoms + 2 link)
– CHARMM force-field, implemented in CHARMM, DL_POLY
[Figure: measured time (seconds, 0 to 1600) against number of CPUs (8, 16, 32, 64) for CS7 AMD K7/1000 + SCI, CS9 P4/2000 + Myrinet 2k, SGI Origin3800/R14k-500 and AlphaServer SC ES45/1000. Annotations: T(128) (O3800/R14k-500) = 181 secs; 89%, 139%.]

Commodity Comparisons with High-end Systems:
Compaq AlphaServer SC ES45/1000 and the SGI O3800/R14k-500