How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems?

Keynote Talk at HPCAC Stanford Conference (Feb ‘19) by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Increasing Usage of HPC, Big Data and Deep Learning

• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)

Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the Cloud!!
Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

[Figure: Spark, Hadoop, and Deep Learning jobs deployed side by side on a shared physical compute infrastructure]
HPC, Deep Learning and Cloud

• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Exploiting Accelerators
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC
  – Virtualization with SR-IOV and Containers
Parallel Programming Models Overview

• Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 operate directly on a single shared memory
• Distributed Memory Model, e.g., MPI (Message Passing Interface): each process has its own private memory, and data moves only via explicit messages (see the sketch below)
• Partitioned Global Address Space (PGAS) Model (e.g., Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories are presented as a logical shared memory

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
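
To make the message-passing model concrete, here is a minimal sketch of two-process communication in C with MPI. It assumes any MPI implementation (such as MVAPICH2) and its compiler wrapper (e.g., mpicc); the value being sent is arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    /* Distributed memory model in miniature: each process owns private
     * memory, and data moves only through explicit messages. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* exists only in rank 0's memory so far */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }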
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design stack, top to bottom]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models
  – Point-to-point Communication
  – Collective Communication
  – Energy-Awareness
  – Synchronization and Locks
  – I/O and File Systems
  – Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path)
• Multi-/Many-core Architectures
• Accelerators (GPU and FPGA)

Middleware co-design opportunities and challenges across these layers: performance, scalability, and resilience
Broad Challenges in Designing Runtimes for (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
  – Low memory footprint
• Scalable collective communication (see the non-blocking collective sketch after this list)
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128–1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for accelerators (GPGPUs and FPGAs)
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, …)
• Virtualization
• Energy-awareness
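
To illustrate the "non-blocking" collective item above: MPI-3 provides non-blocking collectives such as MPI_Iallreduce that let the runtime (potentially with offload) progress the operation while the application computes. A minimal sketch; independent_compute is a hypothetical placeholder for work that does not touch the buffers.

    #include <mpi.h>

    void independent_compute(void);  /* hypothetical work unrelated to the buffers */

    void overlapped_allreduce(double *in, double *out, int n)
    {
        MPI_Request req;

        /* Start the collective; a capable runtime progresses it in the
         * background (e.g., via offload) while we compute. */
        MPI_Iallreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        independent_compute();

        /* Results in 'out' are valid only after completion. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }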
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,950 organizations in 86 countries
– More than 522,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 14th, 556,104 cores (Oakforest-PACS) in Japan
• 17th, 367,024 cores (Stampede2) at TACC
• 27th, 241,108-core (Pleiades) at NASA and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
Startup Performance on KNL + Omni-Path

[Figure: MPI_Init and Hello World time (seconds) vs. number of processes. TACC Stampede2: 64 to 230K processes, annotated 22s (MPI_Init) and 57s (Hello World) at full scale. Oakforest-PACS: 64 to 64K processes, annotated 5.8s and 21s at 64K.]

• MPI_Init takes 22 seconds on 231,936 processes on 3,624 KNL nodes (Stampede2 – full scale)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node, MVAPICH2-2.3a
• Designs integrated with mpirun_rsh; also available for srun (the SLURM launcher)
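
For readers who want to reproduce this kind of measurement, the sketch below times MPI_Init and a minimal "Hello World" step; it is an illustrative reconstruction, not the exact benchmark run on these systems.

    #include <mpi.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* gettimeofday is used because MPI_Wtime() is not guaranteed to be
     * callable before MPI_Init. */
    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(int argc, char **argv)
    {
        double t0 = now();

        MPI_Init(&argc, &argv);
        double t_init = now() - t0;

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);   /* every rank has started up */
        double t_hello = now() - t0;

        if (rank == 0)
            printf("MPI_Init: %.2f s, Hello World: %.2f s\n", t_init, t_hello);

        MPI_Finalize();
        return 0;
    }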
Cooperative Rendezvous Protocols

• Use both sender and receiver CPUs to progress communication concurrently
• Dynamically select the rendezvous protocol based on communication primitives and sender/receiver availability (load balancing)
• Up to 2x improvement in large-message latency and bandwidth
• Up to 19% improvement for Graph500 at 1,536 processes

Platform: 2x14-core Broadwell 2680 (2.4 GHz), Mellanox EDR ConnectX-5 (100 Gbps)
Baseline: MVAPICH2X-2.3rc1, Open MPI v3.1.0

S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda, Cooperative Rendezvous Protocols for Improved Performance and Overlap, SC ‘18 (Best Student Paper Award Finalist)
[Figure: execution time in seconds vs. number of processes (28 to 1536) for Graph500, CoMD, and MiniGhost, comparing MVAPICH2, Open MPI, and the proposed design; annotated improvements of 19%, 16%, and 10%]
Will be available in the upcoming MVAPICH2-X 2.3rc2
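
The protocol itself lives inside the MPI library, but the pattern it accelerates is ordinary large-message code in which both peers also have work to do, as in this sketch (local_work, the message size, and the tag are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    void local_work(void);  /* hypothetical computation on other data */

    /* Large-message exchange whose data movement a rendezvous protocol
     * governs. With cooperative rendezvous, sender and receiver CPUs can
     * both help move the data, improving overlap with local_work(). */
    void exchange_with_peer(int peer, int n /* e.g., 4 MB */)
    {
        char *sbuf = malloc(n), *rbuf = malloc(n);  /* contents omitted */
        MPI_Request reqs[2];

        MPI_Irecv(rbuf, n, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, n, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        local_work();  /* overlap window */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        free(sbuf);
        free(rbuf);
    }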
Shared Address Space (XPMEM)-based Reduction Collectives

• Offloaded computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4MB Allreduce

[Figure: OSU_Reduce latency (log scale) vs. message size (16K to 4M) on Broadwell with 256 processes, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-2.3rc1; annotated points 73.2, 36.1, 37.9, and 16.8, with 4X (Reduce) and 1.8X (Allreduce) improvements at 4MB]

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
Available in MVAPICH2-X 2.3rc1
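
For context, OSU_Reduce-style numbers come from a latency loop like the stripped-down sketch below (the iteration count and message sizes are illustrative, not the OSU microbenchmark's actual parameters):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERS 100

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Sweep message sizes from 16K to 4M bytes, as in the plot. */
        for (size_t bytes = 16 * 1024; bytes <= 4 * 1024 * 1024; bytes *= 2) {
            int n = (int)(bytes / sizeof(double));
            double *in = malloc(bytes), *out = malloc(bytes);
            for (int i = 0; i < n; i++) in[i] = 1.0;

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int it = 0; it < ITERS; it++)
                MPI_Reduce(in, out, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            double avg = (MPI_Wtime() - t0) / ITERS;

            if (rank == 0)
                printf("%zu bytes: %.2f us\n", bytes, avg * 1e6);
            free(in);
            free(out);
        }

        MPI_Finalize();
        return 0;
    }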
Application-Level Benefits of XPMEM-Based Collectives

• Up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using Allreduce
• Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel

[Figure: execution time in seconds vs. number of processes, comparing Intel MPI, MVAPICH2, and MVAPICH2-XPMEM. CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28): annotated 20% and 9%. MiniAMR (Broadwell, ppn=16): annotated 27% and 15%.]
Efficient Zero-copy MPI Datatypes for Emerging Architectures

• New designs for efficient zero-copy based MPI derived datatype processing
• Efficient schemes mitigate datatype translation, packing, and exchange overheads
• Demonstrated benefits over prevalent MPI libraries for various application kernels
• To be available in the upcoming MVAPICH2-X release

[Figure: log-scale latency in milliseconds, comparing MVAPICH2X-2.3, Intel MPI (2018/2019), and MVAPICH2X-Opt. 3D-Stencil datatype kernel vs. number of processes on Broadwell (2x14 core): up to 5X. MILC datatype kernel vs. grid dimensions (x, y, z, t) on KNL 7250 in Flat-Quadrant mode (64-core): up to 19X. NAS-MG datatype kernel vs. grid dimensions on OpenPOWER (20-core): up to 3X.]

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. (DK) Panda, FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS ’19), May 2019.
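
To ground what "derived datatype processing" means: kernels like the 3D stencil describe non-contiguous halo faces with constructors such as MPI_Type_vector, and the library must either pack/unpack them or, as in these designs, move them zero-copy. A minimal sketch; the grid layout is a hypothetical example.

    #include <mpi.h>

    /* For a C array a[nx][ny][nz] of doubles, the face at fixed y is
     * non-contiguous: nx blocks of nz doubles, one block per x-plane.
     * A derived datatype lets MPI send it without manual packing. */
    MPI_Datatype make_y_face_type(int nx, int ny, int nz)
    {
        MPI_Datatype face;
        MPI_Type_vector(nx,       /* number of blocks              */
                        nz,       /* contiguous doubles per block  */
                        ny * nz,  /* stride between block starts   */
                        MPI_DOUBLE, &face);
        MPI_Type_commit(&face);
        return face;
    }

    /* Usage sketch for a halo exchange of the y=0 face:
     *   MPI_Send(&a[0][0][0], 1, face, neighbor, tag, MPI_COMM_WORLD);  */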
Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand

[Figure: P3DFFT time per loop in seconds vs. number of processes (112 to 448; lower is better, annotated 26%) and High Performance Linpack (HPL) performance in GFLOPS vs. number of processes (224 to 896; higher is better, annotated 27%), comparing MVAPICH2 Async, MVAPICH2 Default, and IMPI 2019 Async/Default]

• Up to 33% performance improvement in the P3DFFT application with 448 processes
• Up to 29% performance improvement in the HPL application with 896 processes
• Memory consumption = 69%
A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D. K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018.
Available in MVAPICH2-X 2.3rc1
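
From the application's point of view, the design is exercised by ordinary non-blocking calls; without asynchronous progress, transfers may only advance inside MPI calls, which is why codes historically sprinkled MPI_Test into compute loops, as in the sketch below. The MV2_ASYNC_PROGRESS variable named in the comment is an assumption based on the MVAPICH2-X user guide; verify against your release.

    #include <mpi.h>

    /* Without async progress, a large MPI_Isend may advance only inside
     * MPI calls, so the compute loop polls MPI_Test. With the new design
     * (e.g., MV2_ASYNC_PROGRESS=1 in MVAPICH2-X, per its documentation),
     * the transfer progresses on its own and the polling becomes cheap. */
    void send_with_overlap(const double *buf, int n, int peer)
    {
        MPI_Request req;
        int done = 0;

        MPI_Isend(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

        while (!done) {
            /* compute_one_chunk();  -- hypothetical unit of work */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }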
HPC, Deep Learning and Cloud

• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Exploiting Accelerators
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and Containers
Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions); see the gradient-averaging sketch below
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance quadrant placing cuDNN, MKL-DNN, NCCL1, NCCL2, gRPC, Hadoop, and MPI, with the desired design achieving both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
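
As a concrete instance of the large-message reduction challenge: data-parallel training averages multi-megabyte gradient buffers across all ranks every iteration. A minimal host-side sketch (function and buffer names are illustrative):

    #include <mpi.h>

    /* Data-parallel gradient averaging: each rank contributes its local
     * gradients and receives the global sum, then scales by 1/nranks.
     * In DL workloads 'count' is routinely millions of floats, which is
     * why large-message Allreduce performance dominates scale-out. */
    void average_gradients(float *grad, int count)
    {
        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        for (int i = 0; i < count; i++)
            grad[i] /= (float)nranks;
    }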
HPCAC-Stanford (Feb ‘19) 45Network Based Computing Laboratory
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)

• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce
• Up to 11% better performance on the RI2 cluster (16 GPUs): MVAPICH2-GDR 2.3 (MPI-Opt) is up to 11% faster than MVAPICH2 2.3 (basic CUDA support)
• Near-ideal 98% scaling efficiency

A. A. Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, Under Review, https://arxiv.org/abs/1810.11112
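
What "CUDA-Aware MPI" means in code: the application hands GPU device pointers directly to MPI, and a library such as MVAPICH2-GDR moves the data (using GPUDirect where available) without explicit host staging. A minimal sketch in C against the CUDA runtime API; error handling omitted.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* 'd_grad' is a device pointer from cudaMalloc. A CUDA-aware MPI
     * accepts it directly in MPI_Allreduce; no cudaMemcpy staging to a
     * host buffer is needed. */
    void average_gradients_gpu(float *d_grad, int count)
    {
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
        /* dividing by the number of ranks would run as a GPU kernel */
    }

    /* Allocation sketch:
     *   float *d_grad;
     *   cudaMalloc((void **)&d_grad, count * sizeof(float));  */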