Designing Next Generation Clusters: Evaluation of InfiniBand
DDR/QDR on Intel Computing Platforms
Hari Subramoni, Matthew Koop and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
HotI '09
• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work
• Commodity clusters are becoming more popular for High Performance Computing (HPC) systems
• Modern clusters are being designed with multi-core processors
• Communication characteristics
– Message distribution
– Mapping of processes onto cores/nodes
• Block vs. cyclic (see the sketch below)
• Traditionally, intra-node communication performance has been better than inter-node communication performance
– Most applications use `block' distributions
• Are such practices still valid on modern clusters?
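As an aside, a minimal C sketch of the two distributions (our own illustration; the 4-node, 8-cores-per-node layout is an arbitrary assumption, not the testbed configuration):

```c
#include <stdio.h>

/* Illustrative only: which node a given MPI rank lands on under
 * block vs. cyclic mapping. */
static int node_of_rank_block(int rank, int cores_per_node) { return rank / cores_per_node; }
static int node_of_rank_cyclic(int rank, int nnodes)        { return rank % nnodes; }

int main(void)
{
    const int nranks = 32, nnodes = 4, cores_per_node = 8;   /* hypothetical layout */

    for (int r = 0; r < nranks; r++)
        printf("rank %2d: block -> node %d, cyclic -> node %d\n",
               r, node_of_rank_block(r, cores_per_node), node_of_rank_cyclic(r, nnodes));

    /* Block mapping keeps neighboring ranks on the same node (intra-node traffic);
     * cyclic mapping spreads neighboring ranks across nodes (inter-node traffic). */
    return 0;
}
```

Under block mapping most nearest-neighbor communication stays intra-node, which is why the traditional preference for block distributions rests on intra-node communication being the faster path.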
• First true quad-core processor with L3 cache shared across the socket (Intel Nehalem)
• InfiniBand
– Multiple virtual lanes
– Quality of Service (QoS) support
• Double Data Rate (DDR) with 20 Gbps bandwidth has been available for some time
• Quad Data Rate (QDR) with 40 Gbps bandwidth has recently become available
• Has an impact on inter-node communication performance
• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work
• What is the intra-node and inter-node communication performance of Nehalem-based clusters with InfiniBand DDR and QDR?
• How does this communication performance compare with previous-generation Intel processors (Clovertown and Harpertown) with similar InfiniBand DDR and QDR?
• With rapid advances in processor and networking technologies, is the relative performance between intra-node and inter-node communication changing?
• How can such changes be characterized?
• Can such characterization be used to analyze application performance across different systems?
• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work
• Absolute performance of intra-node and inter-node communication
– Different combinations of Intel processor platforms and InfiniBand (DDR and QDR)
• Characterization of relative performance between intra-node and inter-node communication
– Use such characterization to analyze application-level performance
• Applications have different communication characteristics
– Latency sensitive
– Bandwidth (uni-directional) sensitive
– Bandwidth (bi-directional) sensitive
• Introduce a set of metrics: Communication Balance Ratio (CBR)
– CBR-Latency = Latency_Intra / Latency_Inter
– CBR-Bandwidth = Bandwidth_Intra / Bandwidth_Inter
– CBR-Bi-BW = Bi-BW_Intra / Bi-BW_Inter
– CBR-Multi-BW = Multi-BW_Intra / Multi-BW_Inter
• CBR-x = 1 => the cluster is balanced with respect to metric x
• Applications sensitive to metric x can be mapped anywhere in the cluster without any significant impact on overall performance (see the sketch below)
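A minimal C sketch of how these ratios could be computed from measured intra-node and inter-node results (the struct name and all numeric inputs are illustrative placeholders, not measurements from this study):

```c
#include <stdio.h>

/* Illustrative only: measured values for one message size
 * (latency in us, bandwidths in MBps). */
struct comm_perf {
    double latency, bw, bibw, multibw;
};

/* CBR-x = x_intra / x_inter; CBR-x == 1 means the cluster is balanced
 * with respect to metric x. */
static double cbr(double intra, double inter) { return intra / inter; }

int main(void)
{
    /* Placeholder numbers, not results from the paper. */
    struct comm_perf intra = { 0.50, 7000.0, 6500.0, 25000.0 };
    struct comm_perf inter = { 1.50, 3000.0, 5000.0, 16000.0 };

    printf("CBR-Latency  = %.2f\n", cbr(intra.latency,  inter.latency));
    printf("CBR-BW       = %.2f\n", cbr(intra.bw,       inter.bw));
    printf("CBR-Bi-BW    = %.2f\n", cbr(intra.bibw,     inter.bibw));
    printf("CBR-Multi-BW = %.2f\n", cbr(intra.multibw,  inter.multibw));
    return 0;
}
```

A ratio well below 1 indicates the intra-node path is the faster one for that metric, so process placement matters; a ratio near 1 indicates placement should have little effect.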
• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work
• Three different compute platforms
– Intel Nehalem
– Intel Harpertown
– Intel Clovertown
• Two different InfiniBand Host Channel Adapters
– Dual-port ConnectX DDR adapter
– Dual-port ConnectX QDR adapter
• Two different InfiniBand switches
– Flextronics 144-port DDR switch
– Mellanox 24-port QDR switch
• Five different platform-interconnect combinations
– NH-QDR – Intel Nehalem machines using ConnectX QDR HCAs
– NH-DDR – Intel Nehalem machines using ConnectX DDR HCAs
– HT-QDR – Intel Harpertown machines using ConnectX QDR HCAs
– HT-DDR – Intel Harpertown machines using ConnectX DDR HCAs
– CT-DDR – Intel Clovertown machines using ConnectX DDR HCAs
• OpenFabrics Enterprise Distribution (OFED) 1.4.1 drivers
• Red Hat Enterprise Linux 4U4
• MPI stack used – MVAPICH2-1.2p1
• High Performance MPI Library for IB and 10GE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 960 organizations in 51 countries
– More than 32,000 downloads from the OSU site directly
– Empowering many TOP500 clusters
• 8th-ranked 62,976-core cluster (Ranger) at TACC
– Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
– Also supports the uDAPL device to work with any network supporting uDAPL
• NAS Parallel Benchmarks (NPB)
– Version 3.3
– http://www.nas.nasa.gov/
• Absolute performance (see the sketch below)
– Inter-node latency and bandwidth
– Intra-node latency and bandwidth
– Collective All-to-all
– HPCC
– NAS
• Communication Balance Ratio
– CBR-Latency
– CBR-Bandwidth (uni-directional)
– CBR-Bandwidth (bi-directional)
– CBR-Bandwidth (multi-pair)
• Impact of CBR on application performance
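Latency numbers of this kind are commonly obtained with a two-rank MPI ping-pong loop; the sketch below is our own minimal illustration of that pattern (it is not the benchmark code used for these results), run once with both ranks on one node and once across two nodes:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS    1000
#define MSG_SIZE 8     /* small-message case; vary to sweep message sizes */

int main(int argc, char **argv)
{
    int rank, size;
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }   /* needs exactly two active ranks */

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)   /* one-way latency = round-trip time / 2 */
        printf("avg one-way latency: %.2f us\n", elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```

Placing the two ranks on the same node exercises the shared-memory path, while placing them on different nodes exercises the InfiniBand DDR/QDR path; the ratio of the two gives CBR-Latency.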
[Charts: inter-node MPI latency vs. message size (small-message and large-message ranges); y-axis: Latency (us); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR. Labeled small-message latencies: 1.05, 1.55, 1.63 and 1.76 us.]
• Harpertown systems deliver the best small-message latency
• Up to 10% improvement in large-message latency for NH-QDR over HT-QDR
[Charts: inter-node uni-directional and bi-directional bandwidth vs. message size; y-axes: Bandwidth (MBps) and Bidirectional Bandwidth (MBps); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR. Labeled values include 3029, 2575, 1943 and 1556 MBps (uni-directional) and 5236, 5042, 3870, 3743 and 3011 MBps (bi-directional).]
• Nehalem systems offer a peak uni-directional bandwidth of 3029 MBps and bi-directional bandwidth of 5236 MBps
• NH-QDR gives up to 18% improvement in uni-directional bandwidth over HT-QDR
[Charts: intra-node MPI latency vs. message size (small-message and large-message ranges); y-axis: Latency (us); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR. Labeled small-message latencies: 0.35 and 0.57 us.]
• Intra-socket small-message latency of 0.35 us
• Nehalem systems give up to 40% improvement in intra-node latency for various message sizes
[Charts: intra-node uni-directional and bi-directional bandwidth vs. message size; y-axes: Bandwidth (MBps) and Bidirectional Bandwidth (MBps); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR. Labeled values include 7474, 3282 and 2208 MBps (uni-directional) and 6826, 2779 and 1738 MBps (bi-directional).]
• Intra-Socket bandwidth (7474 MBps) and bidirectional bandwidth (6826 MBps) show the high memory bandwidth of Nehalem systems
• Drop in performance at large message sizes due to cache collisions
[Charts: multi-pair bandwidth vs. message size with same send/recv buffers (left) and different send/recv buffers (right); y-axis: Bandwidth (MBps); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR. Labeled values include 26954, 21898, 16382, 9843, 4609 and 1168 MBps.]
• Different send/recv buffers are used to negate the caching effect (see the sketch below)
• Nehalem systems show superior memory bandwidth with different send/recv buffers
• A 43% to 55% improvement by using a QDR HCA over a DDR HCA
• Harpertown numbers not shown due to the unavailability of a larger number of nodes
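One hedged way to realize the "different send/recv buffers" methodology is to rotate through a pool of distinct buffers so successive transfers touch cold memory; the C sketch below is our own illustration (the pool size and names are assumptions, not the actual benchmark implementation):

```c
#include <stdlib.h>
#include <string.h>

#define POOL 64   /* number of distinct buffer pairs to cycle through (assumption) */

int main(void)
{
    size_t msg_size = 1 << 20;           /* 1 MB messages */
    char *send_pool[POOL], *recv_pool[POOL];

    /* Allocate a pool of distinct buffers up front. */
    for (int i = 0; i < POOL; i++) {
        send_pool[i] = malloc(msg_size);
        recv_pool[i] = malloc(msg_size);
        memset(send_pool[i], i, msg_size);
    }

    /* Each iteration uses a different buffer pair, so repeated transfers
     * cannot be served from data already resident in the cache. */
    for (int iter = 0; iter < 1000; iter++) {
        char *sbuf = send_pool[iter % POOL];
        char *rbuf = recv_pool[iter % POOL];
        /* ... post the MPI send/recv for this iteration using sbuf/rbuf ... */
        (void)sbuf; (void)rbuf;
    }

    for (int i = 0; i < POOL; i++) { free(send_pool[i]); free(recv_pool[i]); }
    return 0;
}
```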
[Chart: HPCC Minimum Ping Pong Bandwidth (MBps); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, Baseline]
• Baseline numbers are taken on CT-DDR
• NH-DDR shows a 13% improvement in performance over the Harpertown and Clovertown systems
• NH-QDR shows a 38% improvement in performance over NH-DDR systems
[Charts: HPCC Naturally Ordered Ring Bandwidth and Randomly Ordered Ring Bandwidth (MBps); series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, Baseline]
• Up to 190% improvement in Naturally Ordered Ring bandwidth for NH-QDR
• Up to 130% improvement in Randomly Ordered Ring bandwidth for NH-QDR
[Charts: NAS Parallel Benchmarks (CG, FT, IS, LU, MG) normalized execution time, Class B and Class C with 32 processes; y-axis: Normalized Time; series: NH-DDR, NH-QDR]
• Numbers normalized to NH-DDR
• NH-QDR shows clear benefits over NH-DDR for multiple applications
[Chart: Communication Balance Ratio (L_intra/L_inter) vs. message size; series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR; reference line at CBR = 1 (Balanced)]
• Useful for latency-bound applications
• Harpertown more balanced for applications using small to medium sized messages
• HT-DDR more balanced for applications using large messages, followed by NH-QDR
[Chart: Communication Balance Ratio (BW_intra/BW_inter) vs. message size; series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR; reference line at CBR = 1 (Balanced)]
• Useful for bandwidth-bound applications
• Nehalem systems more balanced for applications using small to medium sized messages
• Harpertown systems more balanced for applications using large messages
[Chart: Communication Balance Ratio (Bi_BW_intra/Bi_BW_inter) vs. message size; series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR; reference line at CBR = 1 (Balanced)]
• Useful for applications with frequent bidirectional communication patterns
• Nehalem systems balanced for applications using small to medium sized messages in a bidirectional communication pattern
• NH-QDR balanced for all message sizes
[Chart: Communication Balance Ratio (MP_BW_intra/MP_BW_inter) vs. message size; series: HT-QDR, HT-DDR, NH-QDR, NH-DDR, CT-DDR; reference line at CBR = 1 (Balanced)]
• Useful for communication-intensive applications
• NH-QDR balanced for applications using mainly small to medium sized messages
• Harpertown balanced for applications using mainly large messages
[Chart: NAS Parallel Benchmarks (CG, EP, FT, LU, MG) normalized execution time; series: NH-QDR-Block, NH-QDR-Cyclic, NH-DDR-Block, NH-DDR-Cyclic]
• NH-QDR is more balanced than NH-DDR, especially for medium to large messages
• Process mapping should therefore have less impact with NH-QDR than with NH-DDR for applications using medium to large messages
• We compare NPB performance for block and cyclic process mapping
• Numbers normalized to NH-DDR-Cyclic
• NH-QDR has very similar performance for both block and cyclic mapping for multiple applications
• CG & FT use a lot of large messages, hence show a difference
• MG is not communication intensive
• LU uses small messages, where CBR for NH-QDR and NH-DDR is similar
• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work
• Studied the absolute communication performance of various Intel computing platforms with InfiniBand DDR and QDR
• Proposed a set of metrics related to the Communication Balance Ratio (CBR)
• Evaluated these metrics for various computing platforms with InfiniBand DDR and QDR
• Nehalem systems with InfiniBand QDR give the best absolute performance for latency and bandwidth in most cases
• Nehalem-based systems alter the CBR metrics
• Nehalem systems with InfiniBand QDR interconnects also offer the best communication balance in most cases
• Plan to perform larger-scale evaluations and study the impact of these systems on the performance of end applications