Designing Next Generation Clusters: Evaluation of ...mvapich.cse.ohio-state.edu/static/media/publications/slide/subramon... · Designing Next Generation Clusters: Evaluation of InfiniBand

Designing Next Generation Clusters: Evaluation of InfiniBand

DDR/QDR on Intel Computing Platforms

Hari Subramoni, Matthew Koop and Hari Subramoni, Matthew Koop and Dhabaleswar. K. Panda

Computer Science & Engineering DepartmentThe Ohio State University

HotI '09

• Introduction• Introduction• Problem Statement• Approach• Approach• Performance Evaluation and Results• Performance Evaluation and Results• Conclusions and Future Work

HotI '09

• Commodity clusters are becoming more popular for High Performance Computing (HPC) SystemsPerformance Computing (HPC) Systems

• Modern clusters are being designed with multi-core processorsprocessors

• Introduces multi-level communication• Intra-node (intra-socket, inter-socket)• Inter-node• Inter-node

HotI '09

Chip ChipSocketNode

Chip ChipSocketNode Compute

ClusterCore

Core

Chip

Core

Core

Chip

Core

Core

Chip

Core

Core

Chip Cluster

Core

Chip

Core

ChipSocket

Core

Chip

Core

ChipSocket

Network Fabric

Core CoreCore Core

Compute Cluster

• Interconnect Speed• Network Performance• Network Topology

• Cache Hierarchies• Memory Architecture• Inter Processor Connections

Cluster

HotI '09

• Network Topology• Network Congestion• MPI Library Design

• Memory controllers• MPI Library Design

• Communication characteristics• Communication characteristics• Message distribution• Mapping of processes into cores/nodes• Mapping of processes into cores/nodes

– Block vs. cyclic

• Traditionally, intra-node communication performance has • Traditionally, intra-node communication performance has been better than inter-node communication performance– Most applications use `block’ distributions

• Are such practices still valid on modern clusters?• Are such practices still valid on modern clusters?

HotI '09

• First true quad core processor with L3 cache sharing Socketcache sharing

• 45 nm manufacturing process• Uses QuickPath Interconnect

Technology

NehalemCore 0

32 KB

NehalemCore 3

32 KB

Socket

• HyperThreading allows execution of multiple threads per core in a seamless manner

• Turbo boost technology allows

32 KBL1D Cache

256 KB L2 Cache

32 KBL1D Cache

256 KB L2 Cache

• Turbo boost technology allows automatic over clocking of processors

• Integrated memory controller supporting multiple memory channels

L2 Cache

8MB L3 Cache

L2 Cache

supporting multiple memory channels gives very high memory bandwidth

• Has impact on Intra-node Communication Performance

DDR3 MemoryController

QuickPathInterconnect

HotI '09

• An industry standard for low latency, high bandwidth, System Area Networksbandwidth, System Area Networks

• Multiple features– Two communication types

• Channel Semantics • Channel Semantics • Memory Semantics (RDMA mechanism)

– Multiple virtual lanes– Quality of Service (QoS) support– Quality of Service (QoS) support

• Double Data Rate (DDR) with 20 Gbps bandwidth has been therebeen there

• Quad Data Rate (QDR) with 40 Gbps bandwidth is available recently

• Has impact on Inter-node communication performance• Has impact on Inter-node communication performanceHotI '09


HotI '09

• What are the intra-node and inter-node communication performance of Nehalem-based clusters with InfiniBand performance of Nehalem-based clusters with InfiniBand DDR and QDR?

• How do these communication performance compare with • How do these communication performance compare with previous generation Intel processors (Clovertown and Harpertown) with similar InfiniBand DDR and QDR?

• With rapid advances in processor and networking • With rapid advances in processor and networking technologies, are the relative performance between intra-node and inter-node changing?

• How such changes can be characterized?• Can such characterization be used to analyze

application performance across different systems? application performance across different systems? HotI '09


HotI '09

• Absolute Performance of Intra-node and Inter-node communicationcommunication– Different combinations of Intel processor platforms and

InfiniBand (DDR and QDR)

• Characterization of Relative Performance between Intra-• Characterization of Relative Performance between Intra-node and Inter-node communication – Use such characterization to analyze application-level

performance

HotI '09

• Applications have different communication characteristics– Latency sensitive– Latency sensitive– Bandwidth (uni-directional) sensitive– Bandwidth (bi-directional) sensitive

• Introduce a set of metrics Communication Balance Ratio (CBR) • Introduce a set of metrics Communication Balance Ratio (CBR) – CBR-Latency = Latency_Intra / Latency_Inter– CBR-Bandwidth = Bandwidth_Intra / Bandwidth_Inter– CBR-Bi-BW = Bi-BW_Intra / Bi_BW_Inter– CBR-Bi-BW = Bi-BW_Intra / Bi_BW_Inter– CBR-Multi-BW = Multi-BW_Intra / Multi-BW_Inter

• CBR-x=1 => Cluster is Balanced wrt metric x• Applications sensitive to metric x can be mapped anywhere in the

cluster without any significant impact on overall performance

HotI '09


HotI '09

• Three different compute platforms– Intel Clovertown– Intel Clovertown

• Intel Xeon E5345 Dual quad-core processors operating at 2.33 GHz• 6GB RAM, 4MB cache• PCIe 1.1 interface

– Intel Harpertown• Dual quad-core processors operating at 2.83 GHz • 8GB RAM, 6MB cache• PCIe 2.0 interface

– Intel Nehalem– Intel Nehalem• Intel Xeon E5530 Dual quad-core processors operating at 2.40 GHz• 12GB RAM, 8MB cache• PCIe 2.0 interface• PCIe 2.0 interface

HotI '09

• Two different InfiniBand Host Channel Adapters– Dual port ConnectX DDR adapter– Dual port ConnectX QDR adapter

• Two different InfiniBand Switches– Flextronics 144 port DDR switch– Mellanox 24 port QDR switch

• Five different platform-interconnect combinations– NH-QDR – Intel Nehalem machines using ConnectX QDR HCA’s– NH-DDR – Intel Nehalem machines using ConnectX DDR HCA’s– NH-DDR – Intel Nehalem machines using ConnectX DDR HCA’s– HT-QDR – Intel Harpertown machines using ConnectX QDR HCA’s– HT-DDR – Intel Harpertown machines using ConnectX DDR HCA’s– CT-DDR – Intel Clovertown machines using ConnectX DDR HCA’s– CT-DDR – Intel Clovertown machines using ConnectX DDR HCA’s

• Open Fabrics Enterprise Distribution (OFED) 1.4.1 drivers• Red Hat Enterprise Linux 4U4• MPI Stack used – MVAPICH2-1.2p1• MPI Stack used – MVAPICH2-1.2p1

HotI '09

• High Performance MPI Library for IB and 10GE• High Performance MPI Library for IB and 10GE– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)

– Used by more than 960 organizations in 51 countries– Used by more than 960 organizations in 51 countries

– More than 32,000 downloads from OSU site directly

– Empowering many TOP500 clusters

• 8th ranked 62,976-core cluster (Ranger) at TACC

– Available with software stacks of many IB, 10GE and server vendors

including Open Fabrics Enterprise Distribution (OFED)including Open Fabrics Enterprise Distribution (OFED)

– Also supports uDAPL device to work with any network supporting

uDAPL

– http://mvapich.cse.ohio-state.edu/– http://mvapich.cse.ohio-state.edu/

HotI '09

• OSU Microbenchmarks (OMB)– Version 3.1.1– Version 3.1.1– http://mvapich.cse.ohio-state.edu/benchmarks/

• Intel Collective Microbenchmarks (IMB)• Intel Collective Microbenchmarks (IMB)– Version 3.2– http://software.intel.com/en-us/articles/intel-mpi-benchmarks/

• HPC Challenge Benchmark (HPCC)• HPC Challenge Benchmark (HPCC)– Version 1.3.1– http://icl.cs.utk.edu/hpcc/– http://icl.cs.utk.edu/hpcc/

• NAS Parallel Benchmarks (NPB)– Version 3.3– http://www.nas.nasa.gov/– http://www.nas.nasa.gov/

HotI '09

• Absolute Performance– Inter-node latency and bandwidth– Inter-node latency and bandwidth– Intra-node latency and bandwidth– Collective All-to-all – HPCC– NAS

• Communication Balance Ratio• Communication Balance Ratio– CBR-Latency– CBR-Bandwidth (uni-directional)– CBR-Bandwidth (bi-directional)– CBR-Bandwidth (bi-directional)– CBR-Bandwidth (multi-pair)

• Impact of CBR on Application Performance• Impact of CBR on Application Performance

HotI '09

44.5

14001600

22.5

33.5

4

Lat

ency

(u

s)

600800

100012001400

Lat

ency

(u

s)

1.551.63

1.76

00.5

11.5

2

Lat

ency

(u

s)

0200400600

Lat

ency

(u

s)

1.05

1.05

0

Message Size (Bytes)

HT-QDR HT-DDR NH-QDR


HT-QDR HT-DDR NH-QDRHT-QDR HT-DDR NH-QDR

NH-DDR CT-DDR


NH-DDR CT-DDR

• Harpertown systems deliver best small message latency• Up to 10% improvement in large message latency for NH-QDR over HT-QDR• Up to 10% improvement in large message latency for NH-QDR over HT-QDR

HotI '09

2500

3000

3500

Ban

dw

idth

(M

Bp

s)

3029

25755000

6000

7000

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)

5236

1000

1500

2000

2500

Ban

dw

idth

(M

Bp

s)

1943

15562000

3000

4000

5000

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)

19433870

3011

5042

3743

0

500

1000

Ban

dw

idth

(M

Bp

s)

0

1000

2000

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)



NH-DDR CT-DDR

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)



NH-DDR CT-DDRNH-DDR CT-DDR NH-DDR CT-DDR

• Nehalem systems offer a peak uni-directional bandwidth of 3029 MBps and bi-directional bandwidth of 5236 MBps

• NH-QDR gives up to 18% improvement in uni-directional bandwidth over HT-QDR

HotI '09

• NH-QDR gives up to 18% improvement in uni-directional bandwidth over HT-QDR

1

1.2

Lat

ency

(u

s)

2000

2500

Lat

ency

(u

s)

0.4

0.6

0.8

Lat

ency

(u

s)

1000

1500

Lat

ency

(u

s)

0.570.57

0.57

0

0.2

0.4

Lat

ency

(u

s)

0

500

Lat

ency

(u

s)

0.35

0.35




HT-QDR HT-DDR NH-QDRHT-QDR HT-DDR NH-QDR

NH-DDR CT-DDR


NH-DDR CT-DDR

• Intra-Socket small message latency of 0.35 us• Nehalem systems give up to 40% improvement in Intra-Node latency for

HotI '09

various message sizes

700080009000

10000

Ban

dw

idth

(M

Bp

s) 7474

700080009000

10000

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)

6826

30004000500060007000

Ban

dw

idth

(M

Bp

s)

3282

3282

7474

30004000500060007000

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)

27792779

6826

0100020003000

Ban

dw

idth

(M

Bp

s)

2208

0100020003000

Bid

irec

tio

nal

Ban

dw

idth

(M

Bp

s)

1738


HT-QDR HT-DDR NH-QDRB

idir

ecti

on

al B

and

wid

th (

MB

ps)




• Intra-Socket bandwidth (7474 MBps) and bidirectional bandwidth (6826 MBps) show the high memory bandwidth of Nehalem systems

• Drop in performance at large message size due to cache collisions• Drop in performance at large message size due to cache collisions

HotI '09

30000

Ban

dw

idth

(M

Bp

s)

Same Send/Recv Buffers

269542695425000

Ban

dw

idth

(M

Bp

s)

Different Send/Recv Buffers

10000

15000

20000

25000

Ban

dw

idth

(M

Bp

s)

16382

2695426954

1638210000

15000

20000

Ban

dw

idth

(M

Bp

s)

9843

4609

9843

0

5000

10000

Ban

dw

idth

(M

Bp

s)

21898

16382

0

5000

10000

Ban

dw

idth

(M

Bp

s)

4609

1168

4609



NH-DDR CT-DDR




• Different send/recv buffers are used to negate the caching effect• Nehalem systems show superior memory bandwidth with different send/recv

buffers

HotI '09

buffers

300

350

800000900000

1000000328.4

150

200

250

300

Lat

ency

(u

s)

400000500000600000700000800000

Lat

ency

(u

s)

160.2

50

100

150

Lat

ency

(u

s)

0100000200000300000400000

Lat

ency

(u

s)

72.1

0


0

Message Size (Bytes)Message Size (Bytes)NH-QDR NH-DDR CT-DDR

Message Size (Bytes)NH-QDR NH-DDR CT-DDR

• A 43% to 55% improvement by using QDR HCA over a DDR HCA • Harpertown numbers not shown due to unavailability of more number of nodes• Harpertown numbers not shown due to unavailability of more number of nodes

HotI '09

3000 • Baseline numbers are

1500

2000

2500

Ban

dw

idth

(M

Bp

s)

• Baseline numbers are taken on CT-DDR

• NH-DDR shows a 13%improvement in

500

1000

1500

Ban

dw

idth

(M

Bp

s)

improvement in performance over Harpertown and Clovertown systems

0

500

Minimum Ping Pong Bandwidth

Ban

dw

idth

(M

Bp

s)

• NH-QDR shows a 38% improvement in performance over NH-DDR

HT-QDR HT-DDR NH-QDRNH-DDR Baseline

performance over NH-DDR systems

HotI '09

700800

Ban

dw

idth

(M

Bp

s) 700800

Ban

dw

idth

(M

Bp

s)

300400500600700

Ban

dw

idth

(M

Bp

s)

300400500600700

Ban

dw

idth

(M

Bp

s)

0100200300

Ban

dw

idth

(M

Bp

s)

0100200300

Ban

dw

idth

(M

Bp

s)

0

Naturally Ordered Ring Bandwidth


Randomly Ordered Ring Bandwidth


NH-DDR Baseline NH-DDR Baseline

• Up to 190% improvement in Naturally Ordered Ring bandwidth for NH-QDR

• Up to 130% improvement in Randomly Ordered Ring bandwidth for NH-QDR

HotI '09

Ordered Ring bandwidth for NH-QDR Ordered Ring bandwidth for NH-QDR

1.2

Class B – 32 processes1.2

Class C – 32 processes

0.60.8

11.2

No

rmal

ized

Tim

e

0.60.8

11.2

No

rmal

ized

Tim

e

00.20.40.6

No

rmal

ized

Tim

e

00.20.40.6

No

rmal

ized

Tim

e

0

CG FT IS LU MG

No

rmal

ized

Tim

e

NAS Benchmarks

0

CG FT IS LU MG

No

rmal

ized

Tim

eNAS Benchmarks

NH-DDR NH-QDR NH-DDR NH-QDR

• Numbers normalized to NH-DDR• NH-QDR shows clear benefits over NH-DDR for multiple applications

HotI '09

• NH-QDR shows clear benefits over NH-DDR for multiple applications

2

2.5

Co

mm

un

icat

ion

Bal

ance

Rat

io

HT-QDR

1

1.5

Co

mm

un

icat

ion

Bal

ance

Rat

io

(L_i

ntr

a/L

_in

ter)

HT-QDRHT-DDRNH-QDRNH-DDRCT-DDR

0

0.5

Co

mm

un

icat

ion

Bal

ance

Rat

io

(L_i

ntr

a/L

_in

ter)

CT-DDRBalanced

0

Co

mm

un

icat

ion

Bal

ance

Rat

io


• Useful for Latency bound applications• Useful for Latency bound applications• Harpertown more balanced for applications using small to medium sized messages

• HT-DDR more balanced for applications using large messages followed • HT-DDR more balanced for applications using large messages followed by NH-QDR

HotI '09

2.5

3

3.5

Co

mm

un

icat

ion

Bal

ance

Rat

io

(BW

_in

tra/

BW

_in

ter)

HT-QDRHT-DDR

1.5

2

2.5

Co

mm

un

icat

ion

Bal

ance

Rat

io

(BW

_in

tra/

BW

_in

ter)

HT-DDRNH-QDRNH-DDRCT-DDR

0

0.5

1

Co

mm

un

icat

ion

Bal

ance

Rat

io

(BW

_in

tra/

BW

_in

ter)

Balanced

0

Co

mm

un

icat

ion

Bal

ance

Rat

io


• Useful for Bandwidth bound applications• Nehalem systems more balanced for applications using small to medium sized messages

• Harpertown systems more balanced for applications using large

HotI '09

• Harpertown systems more balanced for applications using large messages

1.41.61.8

2

Co

mm

un

icat

ion

Bal

ance

Rat

io

(Bi_

BW

_in

tra/

Bi_

BW

_in

ter) HT-QDR

HT-DDR

0.60.8

11.21.4

Co

mm

un

icat

ion

Bal

ance

Rat

io

(Bi_

BW

_in

tra/

Bi_

BW

_in

ter)

HT-DDRNH-QDRNH-DDRCT-DDRBalanced

00.20.40.6

Co

mm

un

icat

ion

Bal

ance

Rat

io

(Bi_

BW

_in

tra/

Bi_

BW

_in

ter)

Balanced

Co

mm

un

icat

ion

Bal

ance

Rat

io

(Bi_

BW

_in

tra/

Bi_

BW

_in

ter)


• Useful for Applications using frequent bidirectional communication • Useful for Applications using frequent bidirectional communication pattern

• Nehalem systems balanced for applications using small to medium sized messages in bidirectional communication pattern

HotI '09

medium sized messages in bidirectional communication pattern• NH-QDR balanced for all message sizes

12

14

16C

om

mu

nic

atio

n B

alan

ce R

atio

(M

P_B

W_i

ntr

a/M

P_B

W_i

nte

r)

HT-QDRHT-DDR

6

8

10

12

Co

mm

un

icat

ion

Bal

ance

Rat

io

(MP

_BW

_in

tra/

MP

_BW

_in

ter)

HT-DDRNH-QDRNH-DDRCT-DDR

0

2

4

6

Co

mm

un

icat

ion

Bal

ance

Rat

io

(MP

_BW

_in

tra/

MP

_BW

_in

ter)

CT-DDRBalanced

0

Co

mm

un

icat

ion

Bal

ance

Rat

io

(MP

_BW

_in

tra/

MP

_BW

_in

ter)


• Useful for Communication intensive applications • NH-QDR balanced for applications using mainly small to medium sized messages

• Harpertown balanced for applications using mainly large messages

HotI '09

• Harpertown balanced for applications using mainly large messages

1

1.2

No

rmal

ized

Tim

e • NH-QDR is more balanced than NH-DDR especially for medium to large

0.6

0.8

1

No

rmal

ized

Tim

e

DDR especially for medium to large messages

• Process mapping should have less impact with NH-QDR than NH-DDR for applications using medium to large

0

0.2

0.4

No

rmal

ized

Tim

e

applications using medium to large messages

• We compare NPB performance for block and cyclic process mapping

• Numbers normalized to NH-DDR-CyclicCG EP FT LU MG

NAS Benchmarks

• Numbers normalized to NH-DDR-Cyclic• NH-QDR has very similar performance

for both block and cyclic mapping for multiple applications

• CG & FT uses a lot of large messages, NH-QDR-Block NH-QDR-Cyclic

NH-DDR-Block NH-DDR-Cyclic

• CG & FT uses a lot of large messages, hence show difference

• MG is not communication intensive• LU uses small messages where CBR

for NH-QDR and NH-DDR is similar

HotI '09

for NH-QDR and NH-DDR is similar


HotI '09

• Studied absolute communication performance of various Intel computing platforms with InfiniBand DDR and QDRcomputing platforms with InfiniBand DDR and QDR

• Proposed a set of metrics related to Communication Balance Ratio (CBR)

• Evaluated these metrics for various computing platforms and • Evaluated these metrics for various computing platforms and InfiniBand DDR and QDR

• Nehalem systems with InfiniBand QDR give the best absolute • Nehalem systems with InfiniBand QDR give the best absolute performance for latency and bandwidth in most cases

• Nehalem based systems alter the CBR metrics• Nehalem systems with InfiniBand QDR interconnects also • Nehalem systems with InfiniBand QDR interconnects also

offer best communication balance in most cases• Plan to perform larger scale evaluations and study impact of

these systems on the performance of end applicationsthese systems on the performance of end applicationsHotI '09

{subramon, koop, panda}@cse.ohio-state.edu{subramon, koop, panda}@cse.ohio-state.edu

Network-Based Computing Laboratoryhttp://mvapich.cse.ohio-state.edu/http://mvapich.cse.ohio-state.edu/

HotI '09

Designing Next Generation Clusters: Evaluation of ...mvapich.cse.ohio-state.edu/static/media/publications/slide/subramon... · Designing Next Generation Clusters: Evaluation of InfiniBand

Documents