Accelerating HPC, Big Data and Deep Learning on OpenPOWER Platforms
Talk at OpenPOWER Academic Discussion Group Workshop 2019
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Follow us on https://twitter.com/mvapich
High-End Computing (HEC): PetaFlop to ExaFlop
• Expected to have an ExaFlop system in 2020-2021!
[Chart annotations: 100 PFlops in 2017; 149 PFlops in 2018; 1 EFlops in 2020-2021?]
Presentation Overview
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Increasing Usage of HPC, Big Data and Deep Learning
[Diagram: Big Data (Hadoop, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, BigDL, etc.), and HPC (MPI, RDMA, Lustre, etc.) as overlapping domains]
• Convergence of HPC, Big Data, and Deep Learning!
• Increasing need to run these applications on the Cloud!!
Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
[Diagram: Spark, Hadoop, and Deep Learning jobs mapped onto shared physical compute infrastructure]
Presentation Overview
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Overview of the MVAPICH2 Project
• High-Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (SC ‘02)
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (June ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the TACC Frontera system
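As a quick sanity check that a job is actually running over MVAPICH2, a small mpi4py script can print the library version. This is a minimal sketch: it assumes mpi4py was built against the MVAPICH2 installation, and the hostfile path is illustrative.

```python
# check_mpi.py -- minimal sketch; assumes mpi4py was compiled against MVAPICH2
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.rank} of {comm.size} on {MPI.Get_processor_name()}")

if comm.rank == 0:
    # MPI_Get_library_version() reports the underlying MPI implementation;
    # an MVAPICH2 version string confirms the build picked it up correctly.
    print(MPI.Get_library_version())
```

Launched with MVAPICH2's mpirun_rsh launcher, for example `mpirun_rsh -np 4 -hostfile hosts python check_mpi.py`, rank 0 should report an MVAPICH2 version string.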
• Optimized MPI All-Reduce Design in MVAPICH2
– Up to 2X performance improvement over OpenMPI for inter-node runs (Spectrum MPI did not run for more than 2 processes)
[Figure: MPI_Allreduce latency (us) vs. message size (256K to 2M) on 4 nodes with 20 processes per node, comparing MVAPICH2-2.3, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains for the XPMEM design range from 11% to 42%, up to 2X]
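For context, a latency sweep like the one plotted above can be approximated in a few lines of mpi4py. This is a rough sketch over the same message sizes, not the actual osu_allreduce test from the OSU Micro-Benchmarks suite:

```python
# allreduce_sweep.py -- rough Allreduce latency sweep (not the OSU benchmark)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
ITERS = 100

for nbytes in (256 << 10, 512 << 10, 1 << 20, 2 << 20):
    sendbuf = np.ones(nbytes // 4, dtype=np.float32)  # 4-byte elements
    recvbuf = np.empty_like(sendbuf)
    comm.Barrier()                       # start all ranks together
    t0 = MPI.Wtime()
    for _ in range(ITERS):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    elapsed_us = (MPI.Wtime() - t0) / ITERS * 1e6
    if comm.rank == 0:
        print(f"{nbytes >> 10:5d} KB: {elapsed_us:10.1f} us")
```

Run with the figure's configuration (4 nodes, 20 processes per node) via, for example, `mpirun_rsh -np 80 -hostfile hosts python allreduce_sweep.py`.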
MiniAMR Performance using Optimized XPMEM-based Collectives
• MiniAMR application execution time comparing MVAPICH2-2.3rc1 and the optimized All-Reduce design
– Up to 45% improvement over MVAPICH2-2.3rc1 in the mesh-refinement time of MiniAMR for weak-scaling runs
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Figure: Normalized execution time vs. number of GPUs, comparing Default, Callback-based, and Event-based designs on the CSCS GPU cluster (16 to 96 GPUs) and the Wilkes GPU cluster (4 to 32 GPUs)]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Ongoing collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the Cosmo application
Deep Learning: New Challenges for MPI Runtimes
– NCCL2 and CUDA-Aware MPI deliver scale-out performance, but for small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both scale-up and scale-out performance?
– Efficient Overlap of Computation and Communication
– Efficient Large-Message Communication (Reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?
[Figure: scale-up vs. scale-out performance positioning of cuDNN, MKL-DNN, NCCL2, MPI, gRPC, and Hadoop, with the desired co-design achieving both]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
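One concrete form of the overlap challenge above is starting each layer's gradient reduction as soon as it is ready rather than after the whole backward pass. A minimal sketch with mpi4py's non-blocking Allreduce follows; the layer sizes are hypothetical and this is not MVAPICH2-GDR-specific code:

```python
# overlap_sketch.py -- overlap gradient Allreduce with remaining computation
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Hypothetical per-layer gradient buffers (sizes are illustrative).
grads = [np.random.rand(n).astype(np.float32) for n in (4096, 16384, 65536)]
outs = [np.empty_like(g) for g in grads]

# Start each layer's reduction immediately; the reductions progress while
# the rest of the backward pass would still be computing.
reqs = [comm.Iallreduce(g, o, op=MPI.SUM) for g, o in zip(grads, outs)]
# ... compute gradients of the remaining layers here ...
MPI.Request.Waitall(reqs)
```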
Convergent Software Stacks for HPC, Big Data and Deep Learning
[Diagram: Big Data (Hadoop, Spark, HBase, Memcached, etc.) and Deep Learning (Caffe, TensorFlow, BigDL, etc.) stacks converging over HPC (MPI, RDMA, Lustre, etc.)]
High-Performance Deep Learning
• CPU-based Deep Learning – using MVAPICH2-X
• GPU-based Deep Learning – using MVAPICH2-GDR
ResNet-50 using various DL benchmarks on Frontera
• Observed 260K images per second for ResNet-50 on 2,048 nodes
• Scaled MVAPICH2-X on 2,048 nodes of Frontera for distributed training using TensorFlow
• ResNet-50 can be trained in 7 minutes on 2,048 nodes (114,688 cores)
*Jain et al., “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (in conjunction with SC ’19).
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce
• Up to 11% better performance on the RI2 cluster (16 GPUs)
• Near-ideal, 98% scaling efficiency
[Figure: MVAPICH2-GDR 2.3 (MPI-Opt) is up to 11% faster than MVAPICH2 2.3 (basic CUDA support)]
A. A. Awan et al., Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation, CCGrid '19, https://arxiv.org/abs/1810.11112
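The Horovod pattern evaluated here is compact enough to sketch. The following minimal TensorFlow/Keras example uses synthetic data; it assumes Horovod was built against a CUDA-aware MPI such as MVAPICH2-GDR, and the model choice and sizes are illustrative:

```python
# hvd_train.py -- minimal Horovod/Keras sketch over CUDA-aware MPI
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initialize Horovod on top of the underlying MPI runtime

# Pin each MPI rank to one local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Scale the learning rate by the number of ranks, then wrap the optimizer;
# the wrapper issues the gradient Allreduce (MPI_Allreduce underneath).
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt)

# Synthetic batch, just to make the sketch runnable.
x = np.random.rand(32, 224, 224, 3).astype('float32')
y = tf.keras.utils.to_categorical(np.random.randint(1000, size=32), 1000)

# Broadcast rank 0's initial weights so all ranks start identically.
model.fit(x, y, batch_size=8, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with one MPI rank per GPU, every training step ends in a gradient Allreduce, which is exactly the operation the MPI-Opt design accelerates.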
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory setup that performs all I/O operations in memory to obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre installation available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst-buffer system to reduce the bandwidth bottleneck of shared file system access. The burst-buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also supports running MapReduce jobs on top of Lustre alone, in two modes: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre). A minimal job sketch follows this list.
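Since RDMA for Apache Hadoop 2.x stays API-compliant with stock Apache Hadoop (see the next slide), existing jobs run unchanged in any of these modes. As a minimal sketch, a Hadoop Streaming word count in Python could look as follows; file names and submission paths are illustrative:

```python
#!/usr/bin/env python3
# mapper.py -- emits (word, 1) pairs for a Hadoop Streaming word count
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums counts per word; Streaming delivers keys sorted, so
# all counts for a given word arrive contiguously
import sys

current, total = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(n)
if current is not None:
    print(f"{current}\t{total}")
```

Submitted with the stock streaming jar, for example `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`, the job's shuffle and HDFS traffic then flow through the RDMA-enhanced paths of whichever mode is deployed.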
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER, Singularity, and Docker
• Current release: 1.3.5
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
Performance of RDMA-Hadoop on OpenPOWER: TestDFSIO Throughput
• For the TestDFSIO throughput experiment, the RDMA-IB design in HHH mode shows an improvement of 1.57x-2.06x over IPoIB (100Gbps).
• In HHH-M mode, the improvement goes up to 2.18x-2.26x over IPoIB (100Gbps).
[Figure: Total throughput (MBps) vs. data size (10-30 GB), IPoIB (100Gbps) vs. RDMA-IB (100Gbps), in HHH mode (up to 2.06x) and HHH-M mode (up to 2.26x)]
Presenter notes (experimental testbed): Each node in the OpenPOWER cluster has 20 cores (each core has 8 SMT threads), POWER8 8335-GTA processors at 3491 MHz, and 256GB RAM. The nodes are equipped with Mellanox ConnectX-4 EDR HCAs. The operating system is Red Hat Enterprise Linux Server release 7.2. The experiments use 5 DataNodes. The HDFS block size is 256MB. Each NodeManager is configured to run 12 concurrent containers with a minimum of 4GB memory per container. The NameNode runs on a different node of the Hadoop cluster. For TestDFSIO, the number of concurrent writers is 64.
Performance of RDMA-Hadoop on OpenPOWER: Sort Execution Time
• The RDMA-IB design in HHH mode reduces the job execution time of Sort by up to 41% compared to IPoIB (100Gbps).
• The HHH-M design reduces the execution time by up to 55%.
[Figure: Execution time (sec) vs. data size (10-30 GB), IPoIB (100Gbps) vs. RDMA-IB (100Gbps), HHH mode (41%) and HHH-M mode (55%)]
Presenter notes: Same experimental testbed as above; these experiments use 5 DataNodes with a total of 60 maps.
Performance of RDMA-Hadoop on OpenPOWER: TeraSort Execution Time
• The RDMA-IB design in HHH mode reduces the job execution time of TeraSort by up to 12% compared to IPoIB (100Gbps).
• In HHH-M mode, the execution time of TeraSort is reduced by up to 21% compared to IPoIB (100Gbps).
[Figure: Execution time (sec) vs. data size (10-30 GB), IPoIB (100Gbps) vs. RDMA-IB (100Gbps), HHH mode (12%) and HHH-M mode (21%)]
Presenter notes: Same experimental testbed as above (5 DataNodes, 60 maps).
Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
Performance of RDMA-Spark on OpenPOWER: GroupBy and SortBy
• GroupBy: the RDMA design outperforms IPoIB by up to 11%
• SortBy: the RDMA design outperforms IPoIB by up to 18% (a PySpark sketch of these patterns follows below)
[Figure: Execution time (sec) vs. data size (32-64 GB), IPoIB vs. RDMA, GroupBy (11%) and SortBy (18%)]
Presenter notes: Same experimental testbed as above (OpenPOWER Cluster 1, 5 DataNodes, 60 maps).
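The GroupBy and SortBy workloads above correspond to ordinary RDD operations. The minimal PySpark sketch below triggers the same shuffle patterns (key counts and data sizes are illustrative) and runs unchanged on RDMA-Spark, since the acceleration lives in the shuffle layer:

```python
# shuffle_sketch.py -- GroupBy/SortBy shuffle patterns in PySpark
import random
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-sketch")

# (key, value) pairs with random keys, so both operations force a shuffle.
pairs = sc.parallelize(range(1_000_000)) \
          .map(lambda i: (random.randrange(10_000), i))

grouped = pairs.groupByKey()   # GroupBy: all-to-all shuffle by key
ordered = pairs.sortByKey()    # SortBy: range-partitioned shuffle

print(grouped.count(), ordered.count())
sc.stop()
```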
Performance of RDMA-Spark on OpenPOWER: TeraSort and Sort
• TeraSort: the RDMA design outperforms IPoIB by up to 35%
• Sort: the RDMA design outperforms IPoIB by up to 25%
[Figure: Execution time (sec) vs. data size (60-100 GB), IPoIB vs. RDMA, TeraSort (35%) and Sort (25%)]
Presenter notes: Same experimental testbed as above.
Presentation Overview
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
X-ScaleAI Package
• High-performance and scalable solutions for deep learning
– Fully exploiting HPC resources using our X-ScaleHPC package
• "Out-of-the-box" optimal performance on OpenPOWER (POWER9) + GPU platforms such as the #1 Summit system
• What's in the X-ScaleAI package?
– Fine-tuned CUDA-Aware MPI library
– Google TensorFlow framework built for OpenPOWER systems
– Distributed training using Horovod on top of TensorFlow
– Simple installation and execution in one command!
Concluding Remarks
• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• OpenPOWER, InfiniBand, and NVIDIA GPGPUs are emerging technologies for such systems
• Presented a set of solutions from OSU to enable HPC, Big Data and Deep Learning through a convergent software architecture for OpenPOWER platforms
• X-ScaleSolutions is an ISV in the OpenPOWER consortium providing commercial support, optimizations, tuning, and training for the OSU solutions
• OpenPOWER users are encouraged to take advantage of these solutions to extract the highest performance and scalability for their applications on OpenPOWER platforms
• Presentations at the OSU and X-Scale booth (#2094)
– Members of the MVAPICH, HiBD, and HiDL teams
– External speakers
• Presentations at SC main program (Tutorials, Workshops, BoFs, Posters, and Doctoral Showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/