Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at Brookhaven National Laboratory, October 2014
Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication (see the sketch below)
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays
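To make these abstractions concrete, here is a minimal OpenSHMEM sketch (not from the slides) showing a symmetric variable, a lightweight one-sided put, and a barrier. It assumes the OpenSHMEM 1.2 API (shmem_init/shmem_finalize); older codes use start_pes(0) instead.

```c
/* A minimal sketch of the PGAS features listed above: symmetric (shared)
   data, a lightweight one-sided put, and irregular remote access with no
   matching receive on the target. */
#include <shmem.h>
#include <stdio.h>

long shared_val = 0;   /* symmetric: exists at the same address on every PE */

int main(void)
{
    shmem_init();                      /* older OpenSHMEM: start_pes(0) */
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* One-sided: write into a neighbor's address space; the target PE
       does not post a receive or otherwise participate. */
    shmem_long_p(&shared_val, (long) me, (me + 1) % npes);

    shmem_barrier_all();   /* ensure all puts are complete and visible */
    printf("PE %d received %ld\n", me, shared_val);

    shmem_finalize();
    return 0;
}
```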
MPI+PGAS for Exascale Architectures and Applications

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared-memory programming models
• Can co-exist with OpenMP for offloading computation
• Applications can have kernels with different communication patterns
  – Can benefit from different models
  – Re-writing complete applications can be a huge effort
  – Port critical kernels to the desired model instead
Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of the distributed computing model
  – Best of the shared memory computing model
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"

* J. Dongarra, P. Beckman, et al., The International Exascale Software Roadmap, International Journal of High Performance Computer Applications, Vol. 25, No. 1, 2011, ISSN 1094-3420
[Figure: An HPC application composed of sub-kernels (Kernel 1 ... Kernel N), all initially written in MPI; selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS while the rest remain in MPI. A code sketch follows.]
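As an illustration of the kernel-level mixing in the figure, here is a hedged hybrid MPI+OpenSHMEM sketch; it assumes a unified runtime (such as MVAPICH2-X) that allows both models to be initialized in the same job, which is the scenario the slides describe.

```c
/* A minimal hybrid MPI+OpenSHMEM sketch: the irregular update uses a
   lightweight one-sided put; the regular bulk step uses an MPI
   collective. Assumes a unified MPI+PGAS runtime. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                       /* older codes: start_pes(0) */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric heap buffer, addressable by one-sided ops from any PE */
    long *inbox = shmem_malloc(sizeof(long));   /* older codes: shmalloc */
    *inbox = 0;
    shmem_barrier_all();

    /* Kernel with irregular communication: one-sided put, no receive */
    long val = rank + 1;
    shmem_long_put(inbox, &val, 1, (rank + 1) % size);
    shmem_barrier_all();

    /* Kernel with regular communication: MPI collective over all ranks */
    long sum = 0;
    MPI_Allreduce(inbox, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of delivered values = %ld\n", sum);

    shmem_free(inbox);                  /* older codes: shfree */
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```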
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: the middleware stack, with co-design opportunities and challenges across the layers for performance, scalability, and fault-resilience:
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Communication library or runtime for programming models: point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance
• Networking technologies (InfiniBand, 40/100GigE, Aries, BlueGene); multi/many-core architectures; accelerators (NVIDIA and MIC)]
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and accelerators
• Scalable collective communication
• A new version based on MVAPICH2 2.0 is being worked out and will be available in a few weeks
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
[Figure: Latency (µs) vs. message size for MV2-MIC vs. MV2-MIC-Opt: 32-node Allgather (16H + 16M) small messages (1 B-1 KB), 32-node Allgather (8H + 8M) large messages (8 KB-1 MB), and 32-node Alltoall (8H + 8M) large messages (4 KB-512 KB); plus P3DFFT execution time (communication vs. computation) on 32 nodes (8H + 8M), problem size 2K x 2K x 1K. Annotated improvements for MV2-MIC-Opt over MV2-MIC: 76%, 58%, and 55%.]
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with unified runtime
• Virtualization
Hybrid MPI+OpenSHMEM Graph500 Design Execution Time
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013
[Figure: Weak scalability: billions of traversed edges per second (TEPS) vs. scale (26-29). Strong scalability: TEPS vs. number of processes (1,024-8,192). Execution time (s) vs. number of processes (4K, 8K, 16K), with 7.6X and 13X improvements for the hybrid design. Series: MPI-Simple, MPI-CSC, MPI-CSR, Hybrid (MPI+OpenSHMEM).]

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
Hybrid MPI+OpenSHMEM Sort Application: Execution Time and Strong Scalability

• Performance of the hybrid (MPI+OpenSHMEM) sort application
• Execution time, 4 TB input at 4,096 cores: MPI 2408 seconds, Hybrid 1172 seconds
  – 51% improvement over the MPI-based design
• Strong scalability (configuration: constant input size of 500 GB), at 4,096 cores: MPI 0.16 TB/min, Hybrid 0.36 TB/min
  – 55% improvement over the MPI-based design
[Figure: Strong scalability: sort rate (TB/min) vs. number of processes (512-4,096), MPI vs. Hybrid (55% improvement). Execution time (seconds) vs. input data and number of processes (500GB-512, 1TB-1K, 2TB-2K, 4TB-4K), MPI vs. Hybrid (51% improvement).]

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. K. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS '14, Oct 2014
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with unified runtime
• Virtualization
Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Job migration
  – Compaction
• Not very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support is available with Mellanox InfiniBand adapters
• Initial designs of MVAPICH2 with SR-IOV support (see the device-enumeration sketch below)
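A useful property underpinning these designs (an observation, not a detail from the slides) is that an SR-IOV virtual function appears inside the guest as an ordinary verbs device, so a verbs-based MPI library can run unmodified. A minimal libibverbs check from within a VM:

```c
/* A minimal sketch: enumerate the InfiniBand devices visible inside a
   guest VM. With SR-IOV, the virtual function of the Mellanox HCA shows
   up here like any native verbs device. Build with -libverbs. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(list[i]));

    ibv_free_device_list(list);
    return 0;
}
```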
Intra-node Inter-VM Point-to-Point Latency and Bandwidth
• 1 VM per Core
• MVAPICH2-SR-IOV-IB brings only 3-7% (latency) and 3-8% (bandwidth) overheads compared to MVAPICH2 over native InfiniBand verbs (MVAPICH2-Native-IB)
[Figure: Latency (µs) and bandwidth (MB/s) vs. message size (1 byte-1 MB), MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
Performance Evaluations with NAS and Graph500
• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
• MVAPICH2-SR-IOV-IB brings 3-7% and 3-9% overheads for the NAS benchmarks and Graph500, respectively, compared to MVAPICH2-Native-IB
[Figure: NAS benchmarks: execution time (s) for MG-B-64, CG-B-64, EP-B-64, LU-B-64, BT-B-64. Graph500: execution time (ms) for problem sizes (20,10), (20,16), (22,16), (24,16). MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
Performance Evaluation with LAMMPS
• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
• MVAPICH2-SR-IOV-IB brings 7% and 9% overheads for LJ and CHAIN in LAMMPS, respectively, compared to MVAPICH2-Native-IB
[Figure: LAMMPS execution time (s) for LJ-64-scaled and CHAIN-64-scaled, MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
J. Zhang, X. Lu, J. Jose, R. Shi and Dhabaleswar K. (DK) Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, Euro-Par 2014, August 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li and Dhabaleswar K. (DK) Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC '14, Dec. 2014
MVAPICH2/MVAPICH2-X: Plans for Exascale

• Performance and memory scalability toward 900K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Enhanced optimization for GPGPU and coprocessor support
  – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.5 and beyond
  – Support for Intel MIC (Knights Landing)
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and accelerators
• High-performance virtualization support
Two Major Categories of Applications

• Scientific computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/enterprise/commercial computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
• Applications can run on a single site or across sites over WAN
Introduction to Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
• Commonly accepted 3 V's of Big Data: Volume, Velocity, Variety
  (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)
Can High-Performance Interconnects Benefit Big Data Middleware?

• Most current Big Data middleware uses Ethernet infrastructure with sockets
• Concerns for performance and scalability
• Usage of high-performance networks is beginning to draw interest from many companies
  – What are the challenges?
  – Where do the bottlenecks lie?
  – Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
  – Can HPC clusters with high-performance networks be used for Big Data middleware?
• Initial focus: Hadoop, HBase, Spark and Memcached
Overview of Presentation

• Big Data processing
  – RDMA-based designs for Apache Hadoop
    • Case studies with HDFS, RPC and MapReduce
    • RDMA-based MapReduce on HPC clusters with Lustre
  – RDMA-based design for Apache Spark
  – HiBD project and releases
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Diagram: the Big Data software stack:
• Applications
• Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached): upper-level changes?
• Programming models (sockets): other protocols?
• Communication and I/O library: point-to-point communication, threaded models and synchronization, virtualization, QoS, fault-tolerance, I/O and file systems, benchmarks, RDMA protocol
• Networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs); commodity computing system architectures (multi- and many-core architectures and accelerators); storage technologies (HDD and SSD)]
Design Overview of HDFS with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with a communication library written in native code (see the sketch below)
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support

[Diagram: Applications → HDFS (write and other operations). Writes go through the OSU design: Java Native Interface (JNI) → verbs → RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..). Other operations use the Java socket interface → 1/10 GigE, IPoIB network.]
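To illustrate the JNI bridging idea, here is a hedged sketch. The Java class, the native method name, and rdma_lib_write() are hypothetical placeholders for illustration, not the OSU implementation; only the JNI calls themselves are standard.

```c
/* A sketch of the JNI bridge: a Java-side HDFS output stream declares a
   native method, and this C stub hands the buffer to a native RDMA
   communication library. rdma_lib_write() stands in for whatever
   verbs-based library sits underneath. */
#include <jni.h>

/* Placeholder for the native RDMA communication library (hypothetical). */
extern int rdma_lib_write(const void *buf, long len, int block_id);

/* Matches a hypothetical Java declaration:
   package org.apache.hadoop.hdfs;  class RDMAOutputStream {
       private native int rdmaWrite(byte[] buf, long len, int blockId); } */
JNIEXPORT jint JNICALL
Java_org_apache_hadoop_hdfs_RDMAOutputStream_rdmaWrite(
        JNIEnv *env, jobject self, jbyteArray buf, jlong len, jint blockId)
{
    /* Pin (or copy) the Java heap array so native code can read it. */
    jbyte *data = (*env)->GetByteArrayElements(env, buf, NULL);
    if (data == NULL) return -1;            /* OutOfMemoryError pending */

    int rc = rdma_lib_write(data, (long) len, (int) blockId);

    /* JNI_ABORT: the buffer was only read, so nothing is copied back. */
    (*env)->ReleaseByteArrayElements(env, buf, data, JNI_ABORT);
    return (jint) rc;
}
```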
Communication Times in HDFS
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
[Figure: Communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); OSU-IB reduces communication time by 30%.]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes (1K cores), single HDD per node
  – 64% improvement in throughput over IPoIB (FDR) for 256 GB file size
  – 37% improvement in latency over IPoIB (FDR) for 256 GB file size
[Figure: Aggregated throughput (MBps) vs. file size (64, 128, 256 GB), increased by 64%; execution time (s) vs. file size, reduced by 37%; IPoIB (FDR) vs. OSU-IB (FDR).]
Design Overview of MapReduce with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with a communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient overlapping of phases: map, shuffle, and merge; shuffle, merge, and reduce
  – On-demand connection setup (see the sketch below)
  – InfiniBand/RoCE support

[Diagram: Applications → MapReduce (Job Tracker, Task Tracker, Map, Reduce). The OSU design goes through the Java Native Interface (JNI) → verbs → RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..); the traditional path uses the Java socket interface → 1/10 GigE, IPoIB network.]
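The on-demand connection setup idea can be illustrated with librdmacm: rather than pre-connecting all task pairs at startup, a reliable connection to a peer is established only when the first shuffle transfer to that peer occurs, then cached. A simplified sketch under that assumption (names, addresses, and error handling are illustrative):

```c
/* A sketch of on-demand connection setup with librdmacm. Build with
   -lrdmacm -libverbs. */
#include <rdma/rdma_cma.h>
#include <string.h>

/* Connect lazily to host:port; returns a connected cm_id or NULL. */
static struct rdma_cm_id *connect_on_demand(const char *host, const char *port)
{
    struct rdma_addrinfo hints, *res = NULL;
    struct rdma_cm_id *id = NULL;
    struct ibv_qp_init_attr attr;

    memset(&hints, 0, sizeof hints);
    hints.ai_port_space = RDMA_PS_TCP;          /* RC connection */
    if (rdma_getaddrinfo((char *) host, (char *) port, &hints, &res))
        return NULL;

    memset(&attr, 0, sizeof attr);
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 16;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.qp_type = IBV_QPT_RC;

    /* Creates the cm_id and its QP in one step (synchronous mode). */
    if (rdma_create_ep(&id, res, NULL, &attr)) {
        rdma_freeaddrinfo(res);
        return NULL;
    }
    rdma_freeaddrinfo(res);

    if (rdma_connect(id, NULL)) {
        rdma_destroy_ep(id);
        return NULL;
    }
    return id;  /* cache it; later transfers to this peer reuse the QP */
}

int main(void)
{
    /* Example peer address; a real runtime would look this up lazily. */
    struct rdma_cm_id *peer = connect_on_demand("10.0.0.2", "18515");
    if (!peer)
        return 1;
    /* ... post sends / RDMA writes on peer->qp here ... */
    rdma_disconnect(peer);
    rdma_destroy_ep(peer);
    return 0;
}
```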
Advanced Overlapping among different phases
• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment

[Diagram: task timelines for the default architecture, enhanced overlapping, and advanced overlapping.]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.
Performance Evaluation of Sort and TeraSort

• For 240 GB Sort on 64 nodes (512 cores): 40% improvement over IPoIB (QDR) with HDD used for HDFS
[Figure: Execution time for data sizes of 60 GB, 120 GB and 240 GB.]
Case Study: Performance Improvement of RDMA-MapReduce over Lustre on TACC-Stampede

• With the local disk used as the intermediate data directory:
  – For 160 GB Sort on 16 nodes: 35% improvement over IPoIB (FDR)
• With Lustre used as the intermediate data directory:
  – For 320 GB Sort on 32 nodes: 33% improvement over IPoIB (FDR)

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014
[Figure: Job execution time (s) vs. data size (80, 120, 160 GB), and for 80 GB, 160 GB and 320 GB on clusters of 8, 16 and 32 nodes; IPoIB (FDR) vs. OSU-IB (FDR).]

• Can more optimizations be achieved by leveraging more features of Lustre?
Overview of Presentation

• Big Data processing
  – RDMA-based designs for Apache Hadoop
    • Case studies with HDFS, RPC and MapReduce
    • RDMA-based MapReduce on HPC clusters with Lustre
  – RDMA-based design for Apache Spark
  – HiBD project and releases
Design Overview of Spark with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with a communication library written in native code
• Design features
  – RDMA-based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking and out-of-order data transfer
  – Off-JVM-heap buffer management (see the sketch below)
  – InfiniBand/RoCE support

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
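The off-JVM-heap buffer idea can be sketched with standard JNI direct-buffer calls. The Java class and rdma_register_buffer() below are hypothetical placeholders; GetDirectBufferAddress and GetDirectBufferCapacity, however, are the standard JNI mechanism for reaching memory that lives outside the garbage-collected heap.

```c
/* A sketch of off-JVM-heap buffer management: Java code allocates a
   direct ByteBuffer, and the native layer obtains a stable pointer to
   it so RDMA can move the data with zero copies and no interference
   from the JVM garbage collector. */
#include <jni.h>

/* Placeholder for registering the buffer with the RDMA library. */
extern int rdma_register_buffer(void *addr, long len);

JNIEXPORT jint JNICALL
Java_org_apache_spark_shuffle_RdmaShuffle_registerDirectBuffer(
        JNIEnv *env, jobject self, jobject direct_buf)
{
    /* Valid only for ByteBuffer.allocateDirect() buffers: the memory
       lives outside the JVM heap, so the address never moves under GC. */
    void *addr = (*env)->GetDirectBufferAddress(env, direct_buf);
    jlong  len = (*env)->GetDirectBufferCapacity(env, direct_buf);
    if (addr == NULL || len < 0) return -1;   /* not a direct buffer */

    return (jint) rdma_register_buffer(addr, (long) len);
}
```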
Preliminary Results of Spark-RDMA Design - GroupBy
[Figure: GroupBy time (s) vs. data size: 4-10 GB on a cluster with 4 HDD nodes (GroupBy with 32 cores), and 8-20 GB on a cluster with 8 HDD nodes (GroupBy with 64 cores); 10GigE vs. IPoIB vs. RDMA.]

• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
  – 18% improvement over IPoIB (QDR) for 10 GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
  – 20% improvement over IPoIB (QDR) for 20 GB data size
• Big Data processing
  – RDMA-based designs for Apache Hadoop
    • Case studies with HDFS, RPC and MapReduce
    • RDMA-based MapReduce on HPC clusters with Lustre

Future Plans of OSU High Performance Big Data (HiBD) Project

• Upcoming releases of RDMA-enhanced packages will support
  – Hadoop 2.x MapReduce & RPC
  – Spark
  – HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
  – HDFS
  – MapReduce
  – RPC
• Advanced designs with upper-level changes and optimizations
  – e.g., MEM-HDFS
Two Major Categories of Applications

• Scientific computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/enterprise/commercial computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
• Applications can run on a single site or across sites over WAN
Communication Options in Grid

• Multiple options exist to perform data transfer on the Grid
• The Globus-XIO framework currently does not support IB natively
• We create the Globus-XIO ADTS driver and add native IB support to GridFTP

[Diagram: High-performance computing applications → GridFTP → Globus XIO framework → IB verbs, IPoIB, RoCE, or TCP/IP → 10 GigE network and Obsidian routers.]
Globus-XIO Framework with ADTS Driver
[Diagram: User → Globus XIO interface → Globus XIO drivers #1 ... #n, including the Globus-XIO ADTS driver, which comprises a data transport interface, data connection management, persistent session management, buffer & file management, flow control, a zero-copy channel, and memory registration, layered over the file system and InfiniBand/RoCE or 10GigE/iWARP links across modern WAN interconnects.]
H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda, High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand, Int'l Symposium on Cluster Computing and the Grid (CCGrid), May 2010
P. Lai, H. Subramoni, S. Narravula, A. Mamidala and D. K. Panda, Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand, Intl Conference on Parallel Processing (ICPP '09), Sept. 2009.
Performance Comparison of ADTS & UDT Drivers
• The ADTS-based implementation is able to saturate the link bandwidth
[Figure: In-memory data transfer performance of the ADTS and UDT drivers for different buffer sizes (2, 8, 32, 64 MB): bandwidth (MBps) vs. network delay (µs, 0-1000); the ADTS driver reaches the 1500 MBps range while the UDT driver stays near 100 MBps. Also: disk-based FTP Get bandwidth (MBps) for the CCSM and Ultra-Viz target applications, ADTS vs. IPoIB.]
• Community Climate System Model (CCSM)
  – Part of the Earth System Grid project
  – Transfers 160 TB in chunks of 256 MB
  – Network latency: 30 ms
• Ultra-Scale Visualization (Ultra-Viz)
  – Transfers files of size 2.6 GB
  – Network latency: 80 ms
• The ADTS driver outperforms the UDT driver (IPoIB) by more than 100%
Concluding Remarks

• InfiniBand with the RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing greater usage
• As the HPC community moves to exascale, new solutions are needed in the MPI and hybrid MPI+PGAS stacks for supporting GPUs and accelerators
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X, and their performance benefits
• New solutions are also needed to re-design software libraries for Big Data environments to take advantage of RDMA
• Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems