Page 1: MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status

Dhabaleswar K. (DK) Panda and Khaled Hamidouche

The Ohio State University

E-mail: {panda, hamidouc}@cse.ohio-state.edu

https://mvapich.cse.ohio-state.edu/

Presentation at IXPUG Meeting, July 2014


Page 2: MVAPICH2 and MVAPICH2-MIC: Latest Status

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: Number of Clusters and Percentage of Clusters in the Top500 over time (x-axis: Timeline; left y-axis: Percentage of Clusters, 0-100; right y-axis: Number of Clusters, 0-500)]

2 OSU - IXPUG'14

Page 3: MVAPICH2 and MVAPICH2-MIC: Latest Status

Large-scale InfiniBand Installations

• 223 IB Clusters (44.3%) in the June 2014 Top500 list (http://www.top500.org)

• Installations in the Top 50 (25 systems):

– 519,640 cores (Stampede) at TACC (7th)
– 62,640 cores (HPC2) in Italy (11th)
– 147,456 cores (SuperMUC) in Germany (12th)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (13th)
– 194,616 cores (Cascade) at PNNL (15th)
– 110,400 cores (Pangea) at France/Total (16th)
– 96,192 cores (Pleiades) at NASA/Ames (21st)
– 73,584 cores (Spirit) at USA/Air Force (24th)
– 77,184 cores (Curie thin nodes) at France/CEA (26th)
– 65,320 cores (iDataPlex DX360M4) at Germany/Max-Planck (27th)
– 120,640 cores (Nebulae) at China/NSCS (28th)
– 72,288 cores (Yellowstone) at NCAR (29th)
– 70,560 cores (Helios) at Japan/IFERC (30th)
– 138,368 cores (Tera-100) at France/CEA (35th)
– 222,072 cores (QUARTETTO) in Japan (37th)
– 53,504 cores (PRIMERGY) in Australia (38th)
– 77,520 cores (Conte) at Purdue University (39th)
– 44,520 cores (Spruce A) at AWE in UK (40th)
– 48,896 cores (MareNostrum) at Spain/BSC (41st)

and many more!

3 OSU - IXPUG'14

Page 4: MVAPICH2 and MVAPICH2-MIC: Latest Status

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2012

– Support for GPGPUs and MIC

– Used by more than 2,150 organizations in 72 countries

– More than 218,000 downloads from OSU site directly

– Empowering many TOP500 clusters

• 7th ranked 519,640-core cluster (Stampede) at TACC

• 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology

• 23rd ranked 96,192-core cluster (Pleiades) at NASA

– Available with software stacks of many IB, HSE, and server vendors including

Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Partner in the U.S. NSF-TACC Stampede System

MVAPICH2/MVAPICH2-X Software

4 OSU - IXPUG'14

Page 5: MVAPICH2 and MVAPICH2-MIC: Latest Status

• Released on 06/20/14

• Major Features and Enhancements

– Based on MPICH-3.1

– MPI-3 RMA Support

– Support for Non-blocking collectives

– MPI-T support

– CMA support is default for intra-node communication

– Optimization of collectives with CMA support

– Large message transfer support

– Reduced memory footprint

– Improved Job start-up time

– Optimization of collectives and tuning for multiple platforms

– Updated hwloc to version 1.9

• MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models

– Based on MVAPICH2 2.0 GA including MPI-3 features

– Compliant with UPC 2.18.0 and OpenSHMEM v1.0f

5

MVAPICH2 2.0 GA and MVAPICH2-X 2.0 GA

OSU - IXPUG'14
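As a quick illustration of the non-blocking collectives and MPI-3 support listed above, the following minimal sketch overlaps an MPI_Iallreduce with independent local work. This is generic MPI-3 code, not MVAPICH2-specific; any MPI-3 library, including MVAPICH2 2.0, should run it.

/* Minimal MPI-3 non-blocking collective sketch (illustrative only).
 * Compile: mpicc iallreduce.c -o iallreduce ; run: mpirun -np 4 ./iallreduce */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double) rank, global = 0.0, work = 0.0;
    MPI_Request req;

    /* Start the reduction without blocking ... */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... overlap it with independent local computation ... */
    for (int i = 0; i < 1000000; i++)
        work += 1e-6 * i;

    /* ... then complete the collective. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum of ranks = %.0f (local work = %.2f)\n", global, work);

    MPI_Finalize();
    return 0;
}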

Page 6: MVAPICH2 and MVAPICH2-MIC: Latest Status

• MPI libraries run out of the box (or with minor changes) on the Xeon Phi

• Critical to optimize the runtimes for better performance

– Tune existing designs

– Designs using lower level features offered by MPSS

– Designs to address system-level limitations

• Initial version of MVAPICH2-MIC (based on the MVAPICH2 2.0a release) has been available on Stampede since Oct ‘13

– Supports all modes of usage – host-only, offload, coprocessor-only and symmetric

– Improved shared memory communication channel

– SCIF-based designs for improved communication within MIC and between MICs and Hosts

– Proxy-based design to work around bandwidth limitations on Sandy Bridge platform

• Enhanced version based on the MVAPICH2 2.0 GA release is under development and will be available soon

MVAPICH2-MIC: Optimized MPI Library for Xeon Phi Clusters

6 OSU - IXPUG'14

Page 7: MVAPICH2 and MVAPICH2-MIC: Latest Status

MPI Applications on MIC Clusters

[Figure: MPI execution modes on nodes with Xeon (multi-core centric) and Xeon Phi (many-core centric): Host-only (MPI processes on the host only), Offload / reverse offload (MPI processes on the host, computation offloaded to the Xeon Phi), Symmetric (MPI processes on both host and Xeon Phi), and Coprocessor-only (MPI processes on the Xeon Phi only)]

• MPI (+X) continues to be the predominant programming model in HPC
• Flexibility in launching MPI jobs on clusters with Xeon Phi

7 OSU - IXPUG'14

Page 8: MVAPICH2 and MVAPICH2-MIC: Latest Status

[Figure: MPI execution modes on Xeon / Xeon Phi (see slide 7); this slide focuses on the Offload mode]

• Offload mode: a low-overhead way to extract performance from the Xeon Phi (a minimal sketch follows below)

MPI Applications on MIC Clusters

8 OSU - IXPUG'14
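In offload mode each MPI rank lives on the host and ships compute kernels to the coprocessor. The sketch below assumes the Intel compiler's #pragma offload (LEO) syntax for moving the kernel and its data to the Xeon Phi; it is generic offload-mode code, not part of MVAPICH2 itself.

/* Sketch of the offload mode: MPI ranks run on the host and offload a
 * kernel to the coprocessor (requires the Intel compiler, e.g. build
 * with mpicc wrapping icc and OpenMP enabled). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        a[i] = rank + i * 1e-6;

    /* Offload the compute-heavy loop to the Xeon Phi; the in/out
     * clauses move a[] and b[] across PCIe. */
#pragma offload target(mic:0) in(a:length(N)) out(b:length(N))
    {
#pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = a[i] * a[i];
    }

    /* MPI communication still happens only from the host. */
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < N; i++)
        local_sum += b[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}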

Page 9: MVAPICH2 and MVAPICH2-MIC: Latest Status

MPI Data Movement in Offload Mode

[Figure: Two nodes, each with CPU0 and CPU1 connected over QPI, a MIC attached over PCIe, and an IB link between the nodes; MPI processes run on the host CPUs only. Communication paths: 1. Intra-Socket, 2. Inter-Socket, 3. Inter-Node]

Supported by the standard MVAPICH2 library (latest release: MVAPICH2 2.0 GA)

• MPI communication only from the host

9 OSU - IXPUG'14

Page 10: MVAPICH2 and MVAPICH2-MIC: Latest Status

One-way Latency: MPI over IB with MVAPICH

[Charts: Small Message Latency and Large Message Latency (Latency in us vs. Message Size in bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; reported small-message latencies: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09, and 1.12 us]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

10 OSU - IXPUG'14
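The latency curves above are measured with the OSU micro-benchmarks (osu_latency). As a rough, simplified sketch of how such a ping-pong latency test is structured (illustrative only, not the actual benchmark code):

/* Simplified ping-pong latency test in the spirit of osu_latency.
 * Run with exactly 2 ranks: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000
#define MSG_SIZE 8          /* message size in bytes */

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, elapsed / ITERS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}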

Page 11: MVAPICH2 and MVAPICH2-MIC: Latest Status

Bandwidth: MPI over IB with MVAPICH

[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth (MBytes/sec vs. Message Size in bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; reported peak unidirectional bandwidths: 3280, 3385, 1917, 1706, 6343, 12485, and 12810 MBytes/sec; peak bidirectional bandwidths: 3341, 3704, 4407, 11643, 6521, 21025, and 24727 MBytes/sec]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

11 OSU - IXPUG'14

Page 12: MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.0, Intel Ivy-bridge

[Charts: Latency (us) vs. Message Size (0 bytes to 1 KB): 0.18 us intra-socket, 0.45 us inter-socket; Bandwidth (MB/s) vs. Message Size for intra-socket and inter-socket with CMA, Shmem, and LiMIC, peaking at 14,250 MB/s and 13,749 MB/s]

12 OSU - IXPUG'14
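CMA here refers to Linux Cross Memory Attach, which lets one process copy data directly out of a peer process's address space with a single kernel-mediated copy (LiMIC provides a similar kernel-module-based path). Below is a minimal sketch of the underlying system call, independent of how MVAPICH2 uses it internally; the helper name cma_read and the way the peer PID and address are obtained are illustrative only.

/* Sketch of Linux Cross Memory Attach (CMA): a single-copy read from a
 * peer process's address space. remote_pid and remote_addr would be
 * exchanged out of band (e.g., over shared memory) in a real MPI library. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

ssize_t cma_read(pid_t remote_pid, void *remote_addr,
                 void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* One kernel-mediated copy: no intermediate shared-memory staging buffer. */
    return process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
}

As noted in the 2.0 GA feature list above, MVAPICH2 enables CMA by default for intra-node communication and uses it in optimized collectives.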

Page 13: MVAPICH2 and MVAPICH2-MIC: Latest Status

MPI-3 RMA Get/Put with Flush Performance

Latest MVAPICH2 2.0, Intel Sandy-bridge with Connect-IB (single-port)

[Charts: Inter-node Get/Put latency (us) vs. Message Size, 2.04 us and 1.56 us for small messages; Intra-socket Get/Put latency, 0.08 us for small messages; Intra-socket Get/Put bandwidth, peaking near 14926 and 15364 MBytes/sec; Inter-node Get/Put bandwidth, peaking near 6881 and 6876 MBytes/sec]

13 OSU - IXPUG'14
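The Get/Put-with-flush pattern measured above corresponds to MPI-3 passive-target RMA. A minimal sketch of that pattern (generic MPI-3 code, not the benchmark itself):

/* MPI-3 passive-target RMA: Put followed by a flush for remote completion. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 0.5, *win_buf;
    MPI_Win win;

    /* Each rank exposes one double through an RMA window. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0.0;
    MPI_Barrier(MPI_COMM_WORLD);   /* windows initialized everywhere */

    int target = (rank + 1) % nprocs;

    /* Passive-target epoch: lock, Put, flush for completion, unlock. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(&local, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);    /* the Put is now complete at the target */
    MPI_Win_unlock(target, win);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d received %.1f\n", rank, *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}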

Page 14: MVAPICH2 and MVAPICH2-MIC: Latest Status

HPCG Benchmark with Offload Mode

• Full subscription of a node:

• 1 MPI process + 10 OpenMP threads on the CPU

• 3 MPI processes + 6 OpenMP threads on CPU + 240 OpenMP threads offload on MIC

• Input data size = 128^3

• Binding using MV2_CPU_MAPPING

• 1 node MVAPICH2 achieves 24 GFlops

• 1024 nodes MVAPICH2 achieves 18.2 TFlops

14 OSU - IXPUG'14

[Charts: HPCG performance (GFlops) vs. number of nodes (1-32 and 64-1024) with MVAPICH2-Offload]

Page 15: MVAPICH2 and MVAPICH2-MIC: Latest Status

[Figure: MPI execution modes on Xeon / Xeon Phi (see slide 7); this slide focuses on intra-node Coprocessor-only usage]

• Xeon Phi as a many-core node

MPI Applications on Xeon Phi Clusters - IntraNode

15 OSU - IXPUG'14

Page 16: MVAPICH2 and MVAPICH2-MIC: Latest Status

MPI Data Movement in Coprocessor-Only Mode

[Figure: Node 0 with CPU0 and CPU1 (QPI) and a Xeon Phi attached over PCIe; the Xeon Phi has many cores with per-core L2 caches and GDDR memory; MPI processes run on the Xeon Phi cores. Communication path: Intra-MIC]

16 OSU - IXPUG'14

Page 17: MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2-MIC Channels for Intra-MIC Communication

• MVAPICH2 provides a hybrid of two

channels

– CH3-SHM

• Derived from the shared-memory communication design used on the host

• Tuned for the Xeon Phi architecture

– CH3-SCIF

• SCIF is a lower level API provided by

MPSS

• Provides user control of the DMA

• Takes advantage of SCIF to improve

performance

[Figure: Two MPI processes on the Xeon Phi, each running MVAPICH2 with a CH3-SHM channel (over POSIX shared-memory calls) and a CH3-SCIF channel (over SCIF)]

S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and D. K. Panda - Efficient Intra-node Communication on Intel-MIC Clusters - International Symposium on Cluster, Cloud and Grid Computing (CCGrid '13), May 2013.

S. Potluri, K. Tomko, D. Bureddy and D. K. Panda - Intra-MIC MPI Communication using MVAPICH2: Early Experience - TACC-Intel Highly-Parallel Computing Symposium (TI-HPCS), April 2012 - Best Student Paper.

17 OSU - IXPUG'14
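SCIF exposes a sockets-like connect/send/recv interface between the host and the Xeon Phi, plus registered-memory operations (scif_writeto/scif_readfrom) that give user-level control over DMA, which is what the CH3-SCIF channel exploits. The rough sketch below is based on the MPSS SCIF interface as best I recall it; the port number, the target node, and the error handling are placeholders, so treat it as an illustration rather than reference code.

/* Rough SCIF sketch (MPSS <scif.h>): connect to a listening peer
 * endpoint and exchange a small message. Port and node are placeholders. */
#include <scif.h>
#include <stdio.h>

int main(void)
{
    scif_epd_t epd = scif_open();                 /* create an endpoint   */
    struct scif_portID peer = { .node = 0,        /* node 0 = the host    */
                                .port = 2050 };   /* placeholder port     */
    char msg[64] = "hello over SCIF";

    if (epd == SCIF_OPEN_FAILED) {
        perror("scif_open");
        return 1;
    }

    /* Connect to a listening endpoint on the peer (host or another MIC). */
    if (scif_connect(epd, &peer) < 0) {
        perror("scif_connect");
        return 1;
    }

    /* Two-sided, blocking message exchange; registered windows plus
     * scif_writeto()/scif_readfrom() would be used for DMA transfers. */
    scif_send(epd, msg, sizeof(msg), SCIF_SEND_BLOCK);
    scif_recv(epd, msg, sizeof(msg), SCIF_RECV_BLOCK);

    scif_close(epd);
    return 0;
}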

Page 18: MVAPICH2 and MVAPICH2-MIC: Latest Status

Intra-MIC - Point-to-Point Communication

osu_latency (small), osu_latency (medium), osu_bw, osu_bibw

[Charts: Latency (usec) and Bandwidth (MB/s) vs. Message Size, comparing MV2-MIC-2.0GA with MV2-MIC-2.0a; improvements of 66% and 30% highlighted]

18 OSU - IXPUG'14

Page 19: MVAPICH2 and MVAPICH2-MIC: Latest Status

Intra-MIC - Collective Communication – AlltoAll

8 processes – osu_alltoall (small) and 16 processes – osu_alltoall (small)

[Charts: Latency (usec) vs. Message Size (1 byte to 4 KB), comparing MV2-MIC-2.0GA with MV2-MIC-2.0a; up to 70% improvement highlighted]

• 8 processes: 40% and 19% improvement for 4-byte and 4 KB messages
• 16 processes: 38% and 71% improvement for 64-byte and 1 KB messages

19 OSU - IXPUG'14
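osu_alltoall measures the latency of MPI_Alltoall. For reference, a minimal MPI_Alltoall usage sketch (generic MPI code, not the benchmark):

/* Minimal MPI_Alltoall example: every rank sends one int to every other rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendbuf = malloc(nprocs * sizeof(int));
    int *recvbuf = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++)
        sendbuf[i] = rank * 100 + i;   /* element i goes to rank i */

    /* Each rank contributes one int per destination and receives
     * one int from every source. */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received from rank 0: %d\n", rank, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}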

Page 20: MVAPICH2 and MVAPICH2-MIC: Latest Status

Intra-MIC – HPCG Benchmark

• 4 MPI processes with 60 OpenMP threads per process
• Binding using the MV2_MIC_MAPPING environment variable

[Chart: HPCG performance (GFlops) for input sizes 32 and 64, comparing MV2-MIC-2.0GA with MV2-MIC-2.0a; 8% improvement]

20 OSU - IXPUG'14

Page 21: MVAPICH2 and MVAPICH2-MIC: Latest Status

[Figure: MPI execution modes on Xeon / Xeon Phi (see slide 7); this slide focuses on inter-node Coprocessor-only usage]

MPI Applications on MIC Clusters - InterNode

21 OSU - IXPUG'14

Page 22: MVAPICH2 and MVAPICH2-MIC: Latest Status

MPI Data Movement in Coprocessor-Only Mode

[Figure: Node 0 and Node 1, each with CPU0 and CPU1 (QPI) and a MIC over PCIe, connected by IB; MPI processes run on the MICs. Communication path: 1. MIC-RemoteMIC]

22 OSU - IXPUG'14

Page 23: MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 Channels for MIC-RemoteMIC Communication

• CH3-IB channel

– High-performance InfiniBand channel in MVAPICH2

– Implemented using IB Verbs - DirectIB

– Currently does not support advanced features like hardware multicast, etc.

– Limited by P2P Bandwidth offered on Sandy Bridge platform

[Figure: MPI processes on the Xeon Phis of two nodes, each running MVAPICH2 with the OFA-IB-CH3 channel over IB Verbs (mlx4_0), communicating directly through the IB HCAs]

23 OSU - IXPUG'14
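DirectIB means the MPI processes on the Xeon Phi talk to the HCA through the standard IB Verbs interface. As a rough illustration of what that layer looks like (device open plus memory registration only; the full connection setup and work-request posting are omitted, and this is not MVAPICH2 internal code):

/* Rough IB Verbs sketch: open the first HCA and register a buffer for RDMA.
 * Compile with: gcc verbs_sketch.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    /* Open the first device (e.g. mlx4_0) and allocate a protection domain. */
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the HCA can DMA directly into/out of it. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("registered %zu bytes on %s, lkey=0x%x\n",
           len, ibv_get_device_name(dev_list[0]), mr->lkey);

    /* A real transfer would also create a CQ and QP, exchange QP numbers
     * and rkeys out of band, and post work requests (ibv_post_send/recv). */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    free(buf);
    return 0;
}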

Page 24: MVAPICH2 and MVAPICH2-MIC: Latest Status

Host Proxy-based Designs in MVAPICH2-MIC

• The DirectIB channel is limited by P2P read bandwidth
• MVAPICH2-MIC uses a hybrid DirectIB + host proxy-based approach to work around this

Measured on SNB E5-2670:
– P2P Read / IB Read from Xeon Phi: 962.86 MB/s
– P2P Write / IB Write to Xeon Phi: 5280 MB/s
– IB Read from Host: 6977 MB/s
– Xeon Phi-to-Host: 6296 MB/s

24 OSU - IXPUG'14

S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda - MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters - Int'l Conference on Supercomputing (SC '13), November 2013.

Page 25: MVAPICH2 and MVAPICH2-MIC: Latest Status

MIC-RemoteMIC Point-to-Point Communication (Active Proxy)

osu_latency (small and large), osu_bw, osu_bibw

[Charts: Latency (usec) and Bandwidth (MB/sec) vs. Message Size, comparing MV2-MIC-2.0GA with and without the host proxy; peak bandwidths of 5681 MB/sec (osu_bw) and 8290 MB/sec (osu_bibw) with the proxy]

25 OSU - IXPUG'14

Page 26: MVAPICH2 and MVAPICH2-MIC: Latest Status

InterNode - Coprocessor-only Mode - HPCG

26 OSU - IXPUG'14

[Chart: HPCG performance (GFlops) vs. number of MICs (1, 2, 4, 8) with MV2-MIC-2.0GA]

• Full subscription of a MIC (host is not involved): 3 MPI processes + 240 OpenMP threads on the MIC
• Input size = 96^3
• Binding using MV2_MIC_MAPPING
• 1 node: MVAPICH2-MIC achieves 15 GFlops
• 8 nodes: MVAPICH2-MIC achieves 105 GFlops
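The HPCG runs above use a few MPI ranks per MIC, each driving many OpenMP threads. Below is a generic skeleton of that hybrid style (not HPCG itself); the thread count would come from OMP_NUM_THREADS and the rank placement from MVAPICH2's mapping variables mentioned above.

/* Generic MPI + OpenMP hybrid skeleton in the style used for the HPCG
 * runs above (a few ranks per MIC, many OpenMP threads per rank).
 * Compile: mpicc -fopenmp hybrid.c ; thread count via OMP_NUM_THREADS. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;

    /* Many-threaded compute phase on the coprocessor cores ... */
#pragma omp parallel for reduction(+ : local)
    for (long i = 0; i < 10000000L; i++)
        local += 1.0 / (i + 1 + rank);

    /* ... followed by MPI communication from the main thread. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d threads per rank, global = %f\n",
               omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}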

Page 27: MVAPICH2 and MVAPICH2-MIC: Latest Status

[Figure: MPI execution modes on Xeon / Xeon Phi (see slide 7); this slide focuses on the inter-node Symmetric mode]

MPI Applications on MIC Clusters - InterNode

27 OSU - IXPUG'14

Page 28: MVAPICH2 and MVAPICH2-MIC: Latest Status

MPI Data Movement in Symmetric Mode - InterNode

[Figure: Node 0 and Node 1, each with CPU0 and CPU1 (QPI), a MIC over PCIe, and IB between nodes; MPI processes run on both the hosts and the MICs. Communication paths: 1. MIC-RemoteMIC, 2. MIC-RemoteHost (closer to HCA), 3. MIC-RemoteHost (farther from HCA)]

• Uses OFA-IB-CH3 channel backed by host-based proxy

28 OSU - IXPUG'14

Page 29: MVAPICH2 and MVAPICH2-MIC: Latest Status

MIC-RemoteHost Point-to-Point Communication (Active Proxy)

osu_latency (small and large), osu_bw, osu_bibw

[Charts: Latency (usec) and Bandwidth (MB/s) vs. Message Size, comparing MV2-MIC-2.0GA with and without the host proxy; peak bandwidths of 5701 MB/s (osu_bw) and 8742 MB/s (osu_bibw) with the proxy]

29 OSU - IXPUG'14

Page 30: MVAPICH2 and MVAPICH2-MIC: Latest Status

Inter-Node Symmetric Mode – Passive Proxy

osu_alltoall

Better

Better

osu_allgather

64 MPI processes on 2 nodes (16H+16M per node)

• Fully-subscribed mode (16 processes on the Host)

• Passive proxy design outperforms the active proxy (default in MV2-MIC-2.0a)

30 OSU - IXPUG'14

[Charts: osu_alltoall and osu_allgather latency (usec) vs. Message Size, comparing MV2-MIC-2.0GA (passive proxy) with MV2-MIC-2.0a (active proxy); 1.5X and 1.8X improvements highlighted]

Page 31: MVAPICH2 and MVAPICH2-MIC: Latest Status

Inter-Node Symmetric Mode – Redesigning Collectives

[Charts: Latency (us) vs. Message Size for Allgather on 32 nodes (16H+16M and 8H+8M) and Alltoall on 16 nodes (8H+8M), comparing MV2-MIC-2.0GA with MV2-MIC-2.0a; improvements of 3X, 2X, and 2.5X highlighted]

31 OSU - IXPUG'14

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

Page 32: MVAPICH2 and MVAPICH2-MIC: Latest Status

InterNode – Symmetric Mode - HPCG

• Full subscription of a node:
  • 2 MPI processes + 16 OpenMP threads on the CPU
  • 3 MPI processes + 240 OpenMP threads on the MIC
• Input size = 96^3
• Binding using both MV2_CPU_MAPPING and MV2_MIC_MAPPING to explore different threading levels for the host and the Xeon Phi
• 1 node: MVAPICH2-MIC achieves 23.5 GFlops
• 16 nodes: MVAPICH2-MIC achieves 340 GFlops

32 OSU - IXPUG'14

[Chart: HPCG performance (GFlops) vs. number of nodes (1, 2, 4, 8, 16) with MV2-MIC-2.0GA]

Page 33: MVAPICH2 and MVAPICH2-MIC: Latest Status

On-Going and Future Work

• Redesigning of other collective operations

• Optimizing MPI3-RMA operations for MIC

• Automatic proxy selection using architecture topology detection

• Evaluating application performance and scaling

• Extending designs to KNL-based self-hosted nodes

33 OSU - IXPUG'14

Page 34: MVAPICH2 and MVAPICH2-MIC: Latest Status

Conclusion

• MVAPICH2-MIC provides a high-performance and scalable MPI library

for InfiniBand clusters using Xeon Phi

• The initial version takes advantage of SCIF to improve intra-MIC and intra-node MIC-Host communication

• Host-Proxy based designs help work around P2P bandwidth

bottlenecks

• Provides initial support for heterogeneity-aware collectives

• Enhanced version of MVAPICH2-MIC (based on 2.0 GA) will be

available soon

34 OSU - IXPUG'14

Page 35: MVAPICH2 and MVAPICH2-MIC: Latest Status

• August 25-27, 2014; Columbus, Ohio, USA

• Keynote Talks, Invited Talks, Contributed Presentations

• Tutorial on MVAPICH2 and MVAPICH2-X optimization and tuning

• Speakers (Confirmed so far):

– Keynote: Dan Stanzione (TACC)

– Keynote: Darren Kerbyson (PNNL)

– Sadaf Alam (CSCS, Switzerland)

– Onur Celebioglu (Dell)

– Jens Glaser (Univ. of Michigan)

– Jeff Hammond (Intel)

– Adam Moody (LLNL)

– David Race (Cray)

– Davide Rossetti (NVIDIA)

– Gilad Shainer (Mellanox)

– Sayantan Sur (Intel)

– Mahidhar Tatineni (SDSC)

– Jerome Vienne (TACC)

• More details at: http://mug.mvapich.cse.ohio-state.edu

35

Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting

OSU - IXPUG'14