Page 1: Efficient and Scalable Communication Middleware for Emerging Dense-GPU Clusters

Ching-Hsiang Chu

Advisor: Dhabaleswar K. Panda

Network Based Computing Lab, Department of Computer Science and Engineering

The Ohio State University, Columbus, OH

Page 2: Outline

• Introduction

• Problem Statement

• Detailed Description and Results

• Broader Impact on the HPC Community

• Expected Contributions

Page 3: Trends in Modern HPC Architecture: Heterogeneous

• Multi-core/many-core technologies

• High Performance Interconnects

• High Performance Storage and Compute devices

• Variety of programming models (MPI, PGAS, MPI+X)

[Figure: heterogeneous HPC node architecture]
• Accelerators/coprocessors: high compute density, high performance per watt
• High-performance interconnects (InfiniBand, Omni-Path, EFA): <1 μs latency, 100 Gbps+ bandwidth
• Multi-/many-core processors
• Node-local storage: SSD, NVMe-SSD, NVRAM

Dense-GPU systems in the TOP500: #1 Summit (27,648 GPUs), #2 Sierra (17,280 GPUs), #8 ABCI (4,352 GPUs), #10 Lassen (2,664 GPUs), #22 DGX SuperPOD (1,536 GPUs)

Page 4: Trends in Modern Large-scale Dense-GPU Systems

• Scale-up (up to 150 GB/s)

– PCIe, NVLink/NVSwitch

– Infinity Fabric, Gen-Z, CXL

• Scale-out (up to 25 GB/s)

– InfiniBand, Omni-Path, Ethernet

– Cray Slingshot

Page 5: GPU-enabled HPC Applications

• Various scientific applications have been ported to GPUs

– Reportedly significant speedups over the CPU versions

– High-resolution/precision results

Derek Leinweber, CSSM, University of Adelaide

[Figure: Lattice Quantum Chromodynamics (23x faster than CPU), weather simulation (2.8x faster than CPU), wave-propagation simulation (25x faster than CPU)]

Fuhrer O, Osuna C, Lapillonne X, Gysi T, Bianco M, Schulthess T. Towards GPU-accelerated operational weather forecasting. GTC 2013.

https://geodynamics.org/cig/software/specfem3d_globe/

Page 6: GPU-enabled Emerging Deep Learning Applications

• Easy-to-use and high-performance frameworks

• Wide range of applications
– Image classification
– Speech recognition
– Self-driving cars
– Healthcare
– Climate analytics

Kurth T, Treichler S, Romero J, Mudigonda M, Luehr N, Phillips E, Mahesh A, Matheson M, Deslippe J, Fatica M, Houston M. Exascale deep learning for climate analytics. SC 2018. (Gordon Bell Prize)

999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)

Page 7: GPU-Aware (CUDA-Aware) Communication Middleware

• MPI-based generic communication middleware (inside MVAPICH2)

– Supports and optimizes various communication patterns

– Overlaps data movement from GPU with RDMA transfers

• DL-specific communication middleware

– Ring-based collective operations

– Optimized for DL workloads on GPU systems

Page 8: Broad Challenge

Can we design a generic GPU-enabled communication middleware to fully exploit GPU resources and interconnects for traditional HPC and emerging ML/DL applications?

[Radar chart comparing a naive user, an advanced user, and the ideal design along four axes: performance, resource utilization, overlap, and productivity; farther from the center is better]

Page 9: Outline

• Introduction

• Problem Statement

• Detailed Description and Results

• Broader Impact on the HPC Community

• Expected Contributions

Page 10: Problem Statement

• What kind of hardware capabilities can be leveraged to fully exploit the modern interconnects deployed in GPU clusters?

• Which communication patterns can benefit from the proposed GPU-enabled communication middleware?

• How can GPU resources such as high-bandwidth memory (HBM) and massive numbers of streaming multiprocessors (SMs) be leveraged to accelerate communication?

• What are the design considerations for a GPU-enabled communication middleware to efficiently utilize these hardware features?

• What kind of performance benefits can be expected with the proposed GPU-enabled communication middleware?

• How can traditional HPC and DL applications take advantage of the proposed communication middleware without application-level modifications?

Page 11: Research Framework

[Figure: research framework, from applications down to hardware]
• Applications and benchmarks
– Traditional HPC applications: MILC, COSMO, AWP-ODC, HOOMD-blue
– Machine/deep learning: TensorFlow, CNTK
– Benchmarks: OMB, streaming workloads
• Programming models: Message Passing Interface (MPI)
• Compute models: CUDA, OpenACC
• GPU-enabled communication middleware
– Scalable streaming broadcast
– GPU-enabled cooperative and link-efficient reduction operations
– Efficient non-contiguous data processing: scheduling and transfer
• Hardware capabilities: hardware multicast, inter-GPU direct load/store, GPUDirect RDMA, GDRCOPY
• Modern HPC hardware
– Interconnects: InfiniBand, NVLink/NVSwitch, PCIe
– Multi-core processors: OpenPOWER, Xeon
– Accelerators: GPU

Page 12: Outline

• Introduction

• Problem Statement

• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations

• Broader Impact on the HPC Community

• Expected Contributions

Page 13: Motivating Example #1 – The Need for Scalable Broadcast

• Streaming & deep learning applications

– Large-scale broadcast operations

– High computation-communication overlap

– No application-level modification

[Figure: a data source streams data (online/offline) to a data distributor / parameter server, which performs data streaming-like broadcast operations to many GPU worker nodes, each with a CPU and multiple GPUs]

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019

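From the application's point of view, the whole design stays behind MPI_Bcast. A minimal sketch of such a streaming-style broadcast, assuming a CUDA-aware MPI such as MVAPICH2-GDR, with buffer size and iteration count purely illustrative:

    /* Minimal sketch: repeated broadcast of a GPU-resident buffer with a
     * CUDA-aware MPI (e.g., MVAPICH2-GDR). Passing a device pointer to
     * MPI_Bcast is all the application does; the multicast/GDR path is
     * inside the library. Sizes and iteration count are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 4 * 1024 * 1024;   /* one 4 MB data block */
        void *d_buf;
        cudaMalloc(&d_buf, nbytes);

        /* rank 0 acts as the data distributor, streaming blocks to all
         * worker GPUs; workers consume d_buf after each broadcast */
        for (int iter = 0; iter < 100; iter++)
            MPI_Bcast(d_buf, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }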

Page 14: Proposed Efficient and Scalable Solution

• Streaming data through the host
– Fine-tuned chunked data
➢ Three-stage pipeline
– Leverages IB scatter-gather and GDR features
➢ Frees up PCIe resources for applications

[Figure: the source calls MPI_Bcast(d_out,…) and each destination calls MPI_Bcast(d_in,…). The source stages chunks in an intermediate buffer (im_buf), combines header and payload with IB gather, the IB switch multicasts to all destinations, and an IB scatter (GDR write) lands the data directly in each destination GPU's d_in. Three-stage pipeline: 1. data preparation, 2. IB gather, 3. IB hardware multicast.]

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019


Page 15: Performance Evaluation

• Evaluated on RI2 GPU nodes

[Charts: (left) streaming-workload throughput (GB/s), peak and streaming, on 4, 8, and 16 GPU nodes, comparing Knomial-GDR, Ring-GDR-Pipeline, and the proposed TA-Zcpy-MCAST-GDR-Pipeline; the proposed design achieves up to 6.9X the throughput of Knomial-GDR and 1.2X that of Ring-GDR-Pipeline. (right) CA-CNTK* image-classification speedup for AlexNet, VGG, and ResNet-50 on 8 and 16 GPU nodes.]

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019

*D. S. Banerjee, K. Hamidouche and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," CloudCom 2016.

Page 16: Performance Model Validation and Prediction

• Based on the architecture of the RI2 cluster

[Charts: (left) measured vs. model-estimated broadcast latency for the K-nomial-based, Ring-based, and MCAST-GDR-Opt designs on 2 to 16 broadcast sources; the model is within 10% error. (right) model-based prediction of latency for 2 to 2,048 broadcast sources.]

Model parameters: M = 2 MB; C = 512 KB; U = 4 KB; B_H ≈ 100 Gbps; B_PCIe = 8 Gbps; t_o(n) ≈ (1/α) ln(n), 15 ≤ α ≤ 20

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019

Page 17: Outline

• Introduction

• Problem Statement

• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations

• Ongoing Work and Future Research Directions

• Broader Impact on the HPC Community

• Expected Contributions

Page 18: Motivating Example #2 – Non-contiguous Data Transfer

• Wide use of MPI derived datatypes for non-contiguous data transfer

– Requires low-latency, high-overlap processing

M. Martinasso, G. Kwasniewski, S. R. Alam, T. C. Schulthess, and T. Hoefler. “A PCIe congestion-aware performance model for densely populated accelerator servers.” SC 2016.

Weather Simulation: COSMO model

Mike Clark. “GPU Computing with QUDA,” Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf

Quantum Chromodynamics: MILC with QUDA
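As a concrete instance of the pattern motivating this work, the sketch below builds an MPI derived datatype for one non-contiguous column of a 2D halo exchange; the field dimensions and function name are illustrative, and with a GPU-aware MPI the buffers may live in GPU memory:

    /* Minimal sketch: halo exchange of a strided column described by an
     * MPI derived datatype. The pack/unpack work for the non-contiguous
     * layout happens inside the MPI library. */
    #include <mpi.h>

    void exchange_column(double *field, int nx, int ny,
                         int peer, MPI_Comm comm) {
        MPI_Datatype column;
        /* ny blocks of 1 double, strided by the row length nx:
         * one non-contiguous column of an nx-by-ny field */
        MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        MPI_Request reqs[2];
        /* send our boundary column; receive the peer's into our halo */
        MPI_Isend(&field[nx - 2], 1, column, peer, 0, comm, &reqs[0]);
        MPI_Irecv(&field[nx - 1], 1, column, peer, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        MPI_Type_free(&column);
    }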

Page 19: Existing GPU-enabled MPI Datatype Processing

[Timeline diagram, existing vs. proposed design: in the existing design, each MPI_Isend launches its packing kernel on a stream and the CPU busy-waits in a wait-for-kernel (WFK) phase before starting the send, so kernels and sends serialize and computing resources on both CPU and GPU are wasted. In the proposed design, the kernels for successive Isends run concurrently on separate streams and each send starts as soon as its kernel completes, so the proposed design finishes well before the existing one.]

Common scenario (A, B, C, D contain non-contiguous MPI datatypes):

MPI_Isend(A, .., Datatype, …)
MPI_Isend(B, .., Datatype, …)
MPI_Isend(C, .., Datatype, …)
MPI_Isend(D, .., Datatype, …)
…
MPI_Waitall(…);

Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, " IEEE IPDPS 2016.

Page 20: Proposed Event-based Design – Low Latency

[Diagram: each MPI_Isend launches its packing kernel (pack_kernel1/2/3<<< >>>) on its own stream and records a CUDA event (cudaEventRecord) behind it. During MPI_Waitall, the progress engine queries the events, starts the send on the HCA as soon as the corresponding kernel completes, and then marks the request complete.]
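The key mechanism is the non-blocking CUDA event query. A minimal sketch of how a progress engine could use it, where pack_kernel and all other names are illustrative rather than the MVAPICH2 internals:

    /* Sketch of the event-based idea: record an event behind the packing
     * kernel, then poll it from the MPI progress engine so the send can
     * start as soon as packing finishes, without blocking the CPU. */
    #include <cuda_runtime.h>

    __global__ void pack_kernel(const double *src, double *dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[2 * i];       /* gather a strided region */
    }

    void post_pack(const double *d_src, double *d_pack, int n,
                   cudaStream_t stream, cudaEvent_t done) {
        pack_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_src, d_pack, n);
        cudaEventRecord(done, stream);        /* marks kernel completion */
    }

    int pack_is_done(cudaEvent_t done) {
        /* non-blocking query, as issued from the progress loop */
        return cudaEventQuery(done) == cudaSuccess;
    }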

Page 21: Proposed Callback-based Design – High Overlap

[Diagram: each MPI_Isend launches its packing kernel on its own stream and enqueues a callback (addCallback) behind it. A helper thread fires each callback when its kernel completes and starts the send on the HCA, so the main CPU thread stays free for computation until MPI_Waitall, when the requests complete.]
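A minimal sketch of the callback mechanism using the CUDA runtime's cudaStreamAddCallback; the request structure and names are illustrative, not the MVAPICH2 internals:

    /* Sketch of the callback-based idea: the callback runs on a CUDA
     * helper thread once the packing kernel on the stream finishes and
     * flags the request, freeing the main thread for computation. */
    #include <cuda_runtime.h>

    typedef struct { volatile int ready; } send_request;

    static void CUDART_CB on_pack_done(cudaStream_t stream,
                                       cudaError_t status, void *data) {
        /* CUDA calls are not allowed inside a stream callback; just
         * mark the request so the progress engine can start the send */
        ((send_request *)data)->ready = 1;
    }

    void arm_callback(cudaStream_t stream, send_request *req) {
        req->ready = 0;
        /* enqueued behind the packing kernel already on this stream */
        cudaStreamAddCallback(stream, on_pack_done, req, 0);
    }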

Page 22: Application-level (COSMO Halo Exchange) Evaluation

[Charts: normalized halo-exchange execution time relative to the default design on the CSCS GPU cluster (16 to 96 GPUs) and the Wilkes GPU cluster (4 to 32 GPUs); the callback-based and event-based designs improve execution time by up to 2X on CSCS and 1.6X on Wilkes. Evaluated pattern:]

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, " IEEE IPDPS 2016.

Page 23: Outline

• Introduction

• Problem Statement

• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations

• Ongoing Work and Future Research Directions

• Broader Impact on the HPC Community

• Expected Contributions

Page 24: Proposed Zero-copy (Packing-free) Datatype Transfer

• Exploiting the load-store capability of modern interconnects

– Eliminates extra data copies and expensive packing/unpacking processing

[Diagram, existing packing scheme vs. proposed packing-free scheme: in the existing scheme, non-contiguous data is packed in source GPU memory, copied over PCIe/NVLink into system/HCA memory, and copied again over PCIe/NVLink into destination GPU memory; in the proposed scheme, the source GPU uses direct load-store over PCIe/NVLink to move the data straight into destination GPU memory.]

Ching-Hsiang Chu et al., “High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems”, HiPC 2019.
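A minimal sketch of the underlying load-store path, assuming two GPUs with peer-to-peer access; the simple strided layout stands in for a real derived datatype:

    /* Sketch: after peer access is enabled, a copy kernel on the source
     * GPU stores strided elements directly into the destination GPU's
     * memory over NVLink/PCIe, with no packing buffer or staging copy. */
    #include <cuda_runtime.h>

    __global__ void strided_put(const double *src, double *peer_dst,
                                int count, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            peer_dst[i * stride] = src[i * stride];  /* direct peer store */
    }

    void enable_peer_path(int src_dev, int dst_dev) {
        cudaSetDevice(src_dev);
        /* lets kernels on src_dev dereference pointers on dst_dev */
        cudaDeviceEnablePeerAccess(dst_dev, 0);
    }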

Page 25: Performance Evaluation

• Zero-copy (packing-free) transfer for GPUs with peer-to-peer direct access over PCIe/NVLink

[Charts: (left) GPU-based DDTBench mimicking the MILC communication kernel on an NVIDIA DGX-2, comparing OpenMPI 4.0.0, MVAPICH2-GDR 2.3.1, and the proposed design across problem sizes [6, 8,8,8,8] through [6, 16,16,16,16]; speedup improves by up to 15X. (right) communication kernel of the COSMO model (https://github.com/cosunae/HaloExchangeBenchmarks) on a Cray CS-Storm with 16 to 64 GPUs, comparing MVAPICH2-GDR 2.3.1 and the proposed design; execution time improves by up to 3.4X.]

Ching-Hsiang Chu et al., “High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems”, HiPC 2019.

Page 26: Outline

• Introduction

• Problem Statement

• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations

• Ongoing Work and Future Research Directions

• Broader Impact on the HPC Community

• Expected Contributions

Page 27: Motivating Example #3 – Reduction Operations for DL Training

• Can GPU resources help improve compute-intensive communication?
– E.g., MPI_Reduce, MPI_Allreduce, MPI_Scan
– Emerging distributed deep learning training
• Exchange and update weights
– Requires fast, high-bandwidth solutions

https://www.oreilly.com/ideas/distributed-tensorflow

Ben-Nun T, Hoefler T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. arXiv preprint arXiv:1802.09941, 2018.

Page 28: How to Leverage GPUs for MPI Reduction Operations?

• Existing designs
1. Explicitly copy the data from GPU to host memory
2. Host-to-host communication to remote processes
3. Perform the computation on the CPU
4. Explicitly copy the data from host to GPU memory

• Proposed designs
1. GPU-to-GPU communication
• NVIDIA GPUDirect RDMA (GDR)
• Pipeline through the host for large messages
2. Perform the computation on the GPU
• Efficient CUDA kernels

[Diagram: Node A and Node B, each with CPU, host memory, GPU, PCIe, and IB adapter. The existing path (1. GPU-to-host copy, 2. host-to-host communication, 3. CPU computation, 4. host-to-GPU copy) is expensive on both ends. The proposed GDR path (1. GPU-to-GPU communication, 2. GPU computation) is fast, and good for small data but relatively slow for large data, which motivates pipelining through the host.]

Ching-Hsiang Chu et al., "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, " IEEE/ACM CCGrid 2016

Page 29: Alternative and Extended Designs

Communication          | Computation | Design            | Algorithm             | Benefit
Host<->Host            | CPU         | BR-H-HH (default) | Binomial-Reduce       | Large scale, small messages
Host<->Host            | CPU         | RD-H-HH (default) | Recursive doubling    | Large scale, small messages
Host<->Host            | CPU         | GR-H-HH           | Gather-Reduce         | Small scale, small messages
Host<->Host            | GPU         | GR-HH             | Gather-Reduce         | Small scale, small messages
Host<->Device (GDR)    | GPU         | GR-HD / GR-DH     | Gather-Reduce         | Small scale, small messages
Device<->Device (GDR)  | GPU         | GR-DD             | Gather-Reduce         | Small scale, small messages
Device<->Device (GDR)  | GPU         | BR-DD             | Binomial-Reduce       | Large messages at any scale
Device<->Device (GDR)  | GPU         | BRB-DD            | Binomial-Reduce-Bcast | Large messages at any scale
Device<->Device (GDR)  | GPU         | RD-DD             | Recursive doubling    | Large messages at any scale
Host<->Device (GDR)    | GPU         | RD-HD / RD-DH     | Recursive doubling    | Large messages at any scale

Page 30: Proposed NVGroup Allreduce

• Grouping GPUs that are fully connected by NVLink

– Contention-free communication within the group

• Cooperative reduction kernels exploit load-compute-store primitives over NVLink

Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
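A hedged sketch of the load-compute-store idea within one fully connected group; the pointer-exchange layout, MAX_GPUS, and the omitted inter-GPU synchronization are assumptions for illustration, not the NV-Group implementation:

    /* Sketch: rank r's thread blocks own slice r of the buffer. They load
     * the partial values from every peer GPU over NVLink, compute the sum,
     * and store the result back to every peer (inter-GPU synchronization
     * between phases is omitted here for brevity). */
    #define MAX_GPUS 8
    struct PeerBufs { float *p[MAX_GPUS]; };   /* peer device pointers */

    __global__ void group_allreduce_slice(struct PeerBufs bufs, int ngpus,
                                          int my_rank, int slice_len) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= slice_len) return;
        int idx = my_rank * slice_len + i;      /* element this rank owns */

        float sum = 0.0f;
        for (int g = 0; g < ngpus; g++)
            sum += bufs.p[g][idx];              /* loads over NVLink */
        for (int g = 0; g < ngpus; g++)
            bufs.p[g][idx] = sum;               /* stores over NVLink */
    }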

Page 31: Preliminary Results – Allreduce Benchmark

[Charts, measured on Summit (#1; dual-socket IBM POWER9, 6 NVIDIA Volta V100 GPUs per node, 2-port InfiniBand EDR): (left) Allreduce bandwidth for 32 MB to 256 MB messages on 1,536 GPUs, where the proposed NVGroup design is 1.7X better than NCCL 2.4; (center) Allreduce latency for 4 B to 16 KB messages on 1,536 GPUs, where NVGroup is 1.6X better; (right) bandwidth for a 128 MB message on 24 to 1,536 GPUs against SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, and NCCL 2.4, where NVGroup is up to 1.7X better.]

Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)

Page 32: Preliminary Results – Distributed Deep Learning Training

• ResNet-50 training using the TensorFlow benchmark on a DGX-2 machine (16 Volta GPUs)

[Charts: (left) images per second on 1 to 16 GPUs for NCCL 2.4, MVAPICH2-GDR 2.3.1, and the ideal throughput; the proposed design reaches up to 8% higher throughput than NCCL 2.4. (right) scaling efficiency (%) on 1 to 16 GPUs for NCCL 2.4 vs. the proposed design.]

Scaling efficiency = (actual throughput / ideal throughput at scale) × 100%
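For instance, with purely illustrative numbers: if one GPU sustains 400 images per second, the ideal throughput on 16 GPUs is 6,400 images per second, so a measured 5,900 images per second corresponds to 5,900 / 6,400 × 100% ≈ 92% scaling efficiency.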

Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)

Page 33: Outline

• Introduction

• Problem Statement

• Detailed Description and Results

• Ongoing Work and Future Research Directions

• Broader Impact on the HPC Community

• Expected Contributions

Page 34: MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,000 organizations in 89 countries

– More than 553,000 (> 0.5 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (June ‘19 ranking)

• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China

• 16th, 556,104 cores (Oakforest-PACS) in Japan

• 19th, 367,024 cores (Stampede2) at TACC

• 31st, 241,108-core (Pleiades) at NASA and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

Partner in the 5th ranked TACC Frontera System

Empowering Top500 systems for over a decade

Page 35: Impact on the Community

• Accelerating GPU-enabled HPC applications worldwide

– MVAPICH2-GDR is widely used on many top-ranked GPU clusters worldwide, including Summit (#1), Sierra (#2), ABCI (#8), Lassen (#10), and more

• Enabling fast weather forecasting

– MeteoSwiss uses the COSMO numerical weather forecasting model to produce regional and local forecast products

– MeteoSwiss uses MVAPICH2-GDR to accelerate GPU communication

• Supporting scalable and reliable data dissemination

– DoD streaming applications use MVAPICH2-GDR to accelerate GPU-based broadcast operations

– Tutorial sessions: PETTT '17, PETTT '18

Page 36: Outline

• Introduction

• Problem Statement

• Detailed Description and Results

• Ongoing Work and Future Research Directions

• Broader Impact on the HPC Community

• Expected Contributions

Page 37: Expected Contributions

• Improving scale-out and scale-up performance by exploiting features of modern interconnects

– Enabling scalable broadcast using InfiniBand hardware multicast and GPUDirect RDMA

– Link-efficient schemes exploiting load-store primitives over GPU interconnects

• Efficient GPU-enabled communication middleware

– GPU-enabled MPI derived datatype processing

– GPU-enabled reduction operations

• Significant impact on the community

– An abstraction for accelerator-enabled communication middleware

– Benefits HPC and ML/DL workloads

– Broader outreach through MVAPICH2-GDR public releases

Page 38: Thank You!

Questions?

[email protected]

http://web.cse.ohio-state.edu/~chu.368