NWChem Performance Benchmark and Profiling July 2009
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• For more information, please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
NWChem
• NWChem is a computational chemistry package
– NWChem has been developed by the Molecular Sciences Software group of the Environmental
Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL)
• NWChem provides many methods to compute the properties of molecular and periodic systems
– Using standard quantum mechanical descriptions of the electronic wavefunction or density
• NWChem has the capability to perform classical molecular dynamics and free energy simulations
– These approaches may be combined to perform mixed quantum-mechanics and molecular-mechanics
simulations
Objectives
• The presented research was performed to provide best practices for:
– NWChem performance benchmarking
– Performance comparison with different MPI libraries
– Interconnect performance comparisons
– Understanding NWChem communication patterns
– Power-efficient simulations
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: HP-MPI 2.3, Open MPI 1.3.2, and MVAPICH 1.1
• Application: NWChem 5.1.1
• Math Library: AMD Core Math Library (ACML)
• Benchmark Workload
– MP2 gradient calculation of the (H2O)7 water cluster (H2O7)
• ifort compiler flags: -i8 -align -w -g -vec-report1 -O3 -prefetch -unroll -tpp7 -ip
Mellanox InfiniBand Solutions
• Industry Standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and transport offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation including storage
[Chart: InfiniBand Delivers the Lowest Latency – the InfiniBand performance gap over Ethernet and Fibre Channel is increasing; InfiniBand rates shown: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X)]
Quad-Core AMD Opteron™ Processor
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512K L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor
[Diagram: Quad-Core AMD Opteron™ processor with Direct Connect Architecture – dual-channel registered DDR2 memory, three 8 GB/s HyperTransport links to PCI-E® bridges and an I/O hub (USB, PCI)]
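As a worked example of the peak floating-point figures above (the 2.6 GHz clock of the Opteron 2382 is AMD's published spec, not stated in this deck): peak double-precision rate per node is 2 sockets × 4 cores × 4 FLOP/clk × 2.6 GHz ≈ 83.2 GFLOP/s.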
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable architectures for high performance and productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements
– Integrated, tested and validated architectures
• Workload Modeling
– Optimized system size, configuration and workloads
– Test-bed benchmarks
– ISV application characterization
– Best practices & usage analysis
NWChem Benchmark Results
• Input Dataset - H2O7
• ACML provides higher performance and scalability versus the default BLAS library
[Chart: NWChem Benchmark Result (H2O7 MP2) – wall time (seconds, 0-300) vs. number of nodes (4, 8, 16, 24) for MVAPICH, HP-MPI, and Open MPI, each with and without ACML; InfiniBand DDR; lower is better]
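The MP2 kernels that dominate this workload spend most of their time in dense matrix multiplies, so the linked BLAS DGEMM largely determines how much ACML helps. Below is a minimal, hypothetical C sketch (not part of the benchmark) for timing DGEMM against whichever BLAS is linked, e.g. -lacml versus the reference -lblas; note that NWChem here was built with -i8 (64-bit integers), so the ILP64 ACML variant would apply there, while this sketch uses default 32-bit integers for simplicity.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fortran BLAS symbol; both ACML and the reference BLAS export it. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void) {
    const int n = 2000;                       /* matrix dimension (arbitrary) */
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    const double one = 1.0, zero = 0.0;
    clock_t t0 = clock();                     /* CPU time; adequate for a single thread */
    dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* a square matrix multiply costs 2*n^3 floating-point operations */
    printf("DGEMM %d: %.2f s, %.2f GFLOP/s\n", n, sec,
           2.0 * n * n * (double)n / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}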
NWChem Benchmark Results
• Input Dataset - H2O7
• InfiniBand enables better performance and scalability
– Up to 136% higher productivity versus Gigabit Ethernet
[Chart: NWChem Benchmark Result (H2O7 MP2) – performance (jobs/day, 0-1800) vs. number of nodes (4, 8, 16, 24) for Gigabit Ethernet and InfiniBand DDR; Open MPI + ACML; higher is better; 136% callout]
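For reference (a derivation, not stated on the slide): jobs/day follows from wall time as 86,400 s/day ÷ wall time (s), so a hypothetical 60 s H2O7 run would correspond to 1,440 jobs/day.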
NWChem Profiling – MPI Functions
• Most frequently used MPI functions
– MPI_Get_count, MPI_Recv, and MPI_Send
[Chart: NWChem Benchmark Profiling (Siosi7) – number of messages (thousands, 0-40,000) per MPI function (MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Get_count, MPI_Recv, MPI_Send) at 8, 16, and 24 nodes]
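The pairing of heavy MPI_Get_count usage with MPI_Recv suggests the common probe-then-receive idiom, in which the receiver sizes its buffer from the matched message before receiving. A minimal illustrative C sketch of that idiom follows (an assumption about the pattern, not NWChem's actual code, which communicates mainly through the Global Arrays/ARMCI layer):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double payload[4096] = {0};               /* e.g. a 32KB data block */
        MPI_Send(payload, 4096, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        int count;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &st);     /* match the message without receiving it */
        MPI_Get_count(&st, MPI_DOUBLE, &count);   /* learn its size to allocate the buffer  */
        double *buf = malloc(count * sizeof(double));
        MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles\n", count);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}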
NWChem Profiling – Timing per MPI Function
• MPI_Recv and MPI_Barrier show high communication overhead
[Chart: NWChem Benchmark Profiling (Siosi7) – total overhead (seconds, 0-500,000) per MPI function (MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Get_count, MPI_Recv, MPI_Send) at 8, 16, and 24 nodes]
NWChem Profiling – Messages Transferred
• Most data-related MPI messages are 8KB-256KB in size
• Number of messages increases with cluster size
[Chart: NWChem Benchmark Profiling (Siosi7) – number of messages (thousands, 0-50,000) by message size bucket (<128B, <1KB, <8KB, <256KB, <1MB, >1MB) at 8, 16, and 24 nodes]
NWChem Profiling Summary
• NWChem was profiled to identify its communication patterns
• Most frequently used message sizes
– 8KB-256KB messages for data-related communications
– The number of messages increases with system size
– Message sizes remain roughly constant as the system grows
• Interconnect effects on NWChem performance
– Interconnect throughput significantly influences NWChem performance
– The need for higher throughput increases with system size
Power Cost Savings
• Dell's economical integration of AMD CPUs and Mellanox InfiniBand saves up to $7,000 in power per year for the 24-node cluster
– Versus using Gigabit Ethernet as the connectivity solution
• As cluster size increases, more power can be saved
Power cost assumed at $0.20 per kWh. For more information - http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
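Worked out from the slide's own numbers: $7,000/yr ÷ $0.20/kWh = 35,000 kWh/yr, which over 8,760 hours amounts to roughly a 4.0 kW lower average draw across the 24-node cluster.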
Conclusions
• ACML enables higher NWChem performance
– Faster than the default BLAS library compiled with GCC
• HP-MPI and Open MPI show better performance than MVAPICH
• NWChem relies on interconnect throughput
– Most transferred messages are 8KB-256KB in size
– The number of messages scales up as the number of processes increases
• InfiniBand enables the highest NWChem performance and scalability
– Up to 136% higher productivity versus GigE
– The performance gain increases with system size
• Higher performance expected with more nodes
• A balanced system enables high productivity
– Optimal job placement can maximize NWChem simulations
Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.