Sunrise or Sunset: Exploring the Design Space of Big Data Software Stacks
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Panel Presentation at HPBDC '17
Q1: Are Big Data Software Stacks Mature or Not?
• Big Data software stacks such as Hadoop, Spark, and Memcached have been around for many years
– Hadoop – 11 years (Apache Hadoop 0.1.0 released in April 2006)
– Spark – 5 years (Apache Spark 0.5.1 released in June 2012)
– Memcached – 14 years (initial release of Memcached on May 22, 2003)
• Increasingly being used in production environments
• Optimized for commodity clusters with Ethernet and TCP/IP interfaces
• Not yet able to take full advantage of modern cluster and/or HPC technologies
Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase (a cache-aside sketch follows the figure below)
– Back-end data analytics (Offline): HDFS, MapReduce, Spark
[Figure: Internet-facing two-tier architecture — a front-end tier of Web servers with Memcached + DB (MySQL) and NoSQL DB (HBase) for data accessing and serving, and a back-end tier with HDFS, MapReduce, and Spark for data analytics apps/jobs.]
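As a concrete illustration of the front-end serving tier above, here is a minimal cache-aside sketch (an assumption-laden example, not from the slides): it assumes a Memcached server on localhost:11211 reached through the pymemcache client, and query_mysql() is a hypothetical stand-in for the MySQL lookup.

```python
# Cache-aside pattern for the front-end tier: check Memcached first, fall back to the DB.
# Hypothetical helper names; assumes a Memcached server on localhost:11211.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def query_mysql(user_id):
    # Placeholder for the real MySQL query in the serving tier.
    return f"user-{user_id}-record"

def get_user(user_id):
    key = f"user:{user_id}"
    value = cache.get(key)                 # 1. try the cache first
    if value is not None:
        return value                       # cache hit: no database round trip
    value = query_mysql(user_id)           # 2. cache miss: query MySQL
    cache.set(key, value, expire=300)      # 3. populate the cache for subsequent reads
    return value

print(get_user(42))
```

Reads served from Memcached avoid the database entirely, which is why this tier sits in front of MySQL/HBase in the figure.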
Who Are Using Hadoop?
• Focuses on large data and data analysis
• The Hadoop environment (e.g., HDFS, MapReduce, RPC, HBase) is gaining a lot of momentum
• http://wiki.apache.org/hadoop/PoweredBy
Spark Ecosystem
• Generalizes MapReduce to support new apps within the same engine
• Two key observations
– General task support with DAGs
– Multi-stage and interactive apps require faster data sharing across parallel jobs (see the caching sketch after the ecosystem figure below)
[Figure: Spark ecosystem — the Spark core with Spark Streaming (real-time), GraphX (graph), Spark SQL, MLlib (machine learning), BlinkDB, and deep learning frameworks (Caffe, TensorFlow, BigDL, etc.) on top, running on Standalone, Apache Mesos, or YARN resource managers.]
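To make the "faster data sharing across parallel jobs" observation concrete, here is a minimal PySpark sketch (illustrative only; the input path is an assumption): caching an intermediate dataset lets several downstream jobs reuse it from memory instead of recomputing it.

```python
# Minimal sketch of in-memory data sharing across multiple Spark jobs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-sharing-sketch").getOrCreate()

logs = spark.read.text("hdfs:///tmp/example-logs")            # hypothetical input path
errors = logs.filter(logs.value.contains("ERROR")).cache()    # keep the filtered set in memory

# Two separate actions (jobs) share the cached dataset instead of re-reading HDFS.
print(errors.count())
print(errors.filter(errors.value.contains("timeout")).count())

spark.stop()
```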
Who Are Using Spark?
• Focuses on large data and data analysis with in-memory techniques
• Apache Spark is gaining a lot of momentum
• http://spark.apache.org/powered-by.html
Q2: What Are the Main Driving Forces for New-Generation Big Data Software Stacks?
Increasing Usage of HPC, Big Data and Deep Learning
[Figure: Three overlapping circles — Big Data (Hadoop, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, BigDL, etc.), and HPC (MPI, RDMA, Lustre, etc.).]
Convergence of HPC, Big Data, and Deep Learning!!!
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications?
Bring HPC, Big Data processing, and Deep Learning into a "convergent trajectory"!
• What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g., Hadoop, Spark)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data and Deep Learning applications?
• How much performance benefit can be achieved through enhanced designs?
• How should benchmarks be designed for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters?
Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
Q3: What Opportunities Are There for the Academic Community in Exploring the Design Space of Big Data Software Stacks?
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: Layered stack — Applications; Big Data Middleware (HDFS, MapReduce, HBase, Spark, gRPC/TensorFlow, and Memcached); Programming Models (Sockets) and RDMA Protocols; a Communication and I/O Library providing Point-to-Point Communication, Threaded Models and Synchronization, QoS & Fault Tolerance, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), and Benchmarks; all over Networking Technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD, SSD, NVM, and NVMe-SSD).]
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 230 organizations from 30 countries
• More than 21,800 downloads from the project site
• Available for InfiniBand and RoCE; also runs on Ethernet
RDMA for Apache Hadoop 2.x Distribution
• High-performance design of Hadoop over RDMA-enabled interconnects
– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High-performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH, and HDP
– Easily configurable for different running modes (HHH, HHH-M, HHH-L, HHH-L-BB, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 1.1.0
– Based on Apache Hadoop 2.7.3
– Compliant with Apache Hadoop 2.7.1, HDP 2.5.0.3, and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• Different file systems with disks and SSDs, and Lustre
• http://hibd.cse.ohio-state.edu
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to provide better fault tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
RDMA for Apache Spark Distribution
• High-performance design of Spark over RDMA-enabled interconnects
– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Support for pre-connection, on-demand connection, and connection sharing
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB); a hypothetical launch sketch follows below
• Current release: 0.9.4
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• RAM disks, SSDs, and HDDs
– http://hibd.cse.ohio-state.edu
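The protocol/mode configurability above is exposed through ordinary Spark configuration. The sketch below shows how such a job might be launched from PySpark; note that the two spark.shuffle.rdma.* keys are hypothetical placeholders for illustration only, not the actual RDMA for Apache Spark options (see the user guide at http://hibd.cse.ohio-state.edu for the real ones).

```python
# Sketch: launching a shuffle-heavy Spark job with (hypothetical) RDMA-related settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rdma-spark-sketch")
    .config("spark.shuffle.rdma.enabled", "true")     # hypothetical key: enable RDMA shuffle
    .config("spark.shuffle.rdma.protocol", "ib")      # hypothetical key: native IB vs. RoCE vs. IPoIB
    .getOrCreate()
)

# groupBy forces a shuffle, which is the stage the RDMA-based shuffle design accelerates.
df = spark.range(0, 10_000_000)
print(df.groupBy((df.id % 100).alias("bucket")).count().count())

spark.stop()
```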
HiBD Packages on SDSC Comet and Chameleon Cloud
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email help@xsede.org (reference Comet as the machine and SDSC as the site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
– https://www.chameleoncloud.org/appliances/17/
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE '16, July 2016
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps
• RandomWriter – 3x improvement over IPoIB for 80-160 GB file size
• TeraGen – 4x improvement over IPoIB for 80-240 GB file size
[Charts: Execution time (s) vs. data size (GB) for RandomWriter (80-160 GB) and TeraGen (80-240 GB), IPoIB (EDR) vs. OSU-IB (EDR); reduced by 3x and 4x, respectively.]
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
Clusters with 8 nodes: one run with a total of 64 maps and 32 reduces, one with 64 maps and 14 reduces
• Sort – 61% improvement over IPoIB for 80-160 GB data
• TeraSort – 18% improvement over IPoIB for 80-240 GB data
[Charts: Execution time (s) vs. data size (GB) for Sort (80-160 GB) and TeraSort (80-240 GB), IPoIB (EDR) vs. OSU-IB (EDR); reduced by 61% and 18%, respectively.]
Design Overview of Spark with RDMA
• Design features
– RDMA-based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges Scala-based Spark with the communication library written in native code
[Figure: Architecture — Apache Spark benchmarks/applications/libraries/frameworks on Spark Core; the Shuffle Manager (Sort, Hash, Tungsten-Sort) and Block Transfer Service (Netty, NIO, RDMA-Plugin) use Netty/NIO servers and clients over the Java socket interface on 1/10/40/100 GigE or IPoIB networks, and an RDMA server/client via the Java Native Interface (JNI) to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, ...).]
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
– SortBy: total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: total time reduced by up to 74% over IPoIB (56Gbps)
[Charts: SortByTest and GroupByTest total time (sec) vs. data size (64-256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; up to 80% and 74% reductions, respectively.]
Performance Evaluation on SDSC Comet – HiBench PageRank
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
– 32 nodes / 768 cores: total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes / 1536 cores: total time reduced by 43% over IPoIB (56Gbps)
[Charts: PageRank total time (sec) for the Huge, BigData, and Gigantic data sizes on 32 worker nodes / 768 cores and 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 37% and 43% reductions, respectively.]
Evaluation with BigDL on RDMA-Spark
• VGG training model on the CIFAR-10 dataset
• Evaluated on the SDSC Comet supercomputer
• Initial results: RDMA-based Spark outperforms default Spark over IPoIB by a factor of 4.58x
[Chart: One-epoch time (sec) vs. number of cores (24-384), IPoIB vs. RDMA; 4.58x speedup.]
Design Overview of NVM and RDMA-aware HDFS (NVFS)
• Design features
– RDMA over NVM
– HDFS I/O with NVM: block access and memory access
– Hybrid design: NVM with SSD as hybrid storage for HDFS I/O (a placement sketch follows the reference below)
– Co-design with Spark and HBase (cost-effectiveness, use case)
[Figure: NVM- and RDMA-aware HDFS (NVFS) DataNode — applications and benchmarks (Hadoop MapReduce, Spark, HBase) co-designed for cost-effectiveness and use cases; DFSClient with RDMA sender, RDMA receivers, and RDMA replicator; writer/reader paths through NVFS-BlkIO and NVFS-MemIO to NVM, with SSDs alongside.]
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, International Conference on Supercomputing (ICS), June 2016
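The hybrid NVM + SSD design above is, at its core, a placement policy: keep latency-critical HDFS data in byte-addressable NVM and spill the rest to SSD. The sketch below is purely conceptual (mount points, capacity, and the "hot" criterion are assumptions, not NVFS internals):

```python
# Conceptual placement policy for a hybrid NVM + SSD DataNode (illustrative only).
NVM_DIR, SSD_DIR = "/mnt/pmem0/blocks", "/mnt/ssd0/blocks"   # assumed mount points
NVM_CAPACITY = 64 * 2**30                                     # assumed 64 GB of NVM
nvm_used = 0

def place_block(block_id: str, size: int, is_hot: bool) -> str:
    """Return the directory a new HDFS block should be written to."""
    global nvm_used
    # Hot data (e.g., intermediate job output) goes to NVM while space remains;
    # cold data, or overflow once NVM fills up, goes to the SSD tier.
    if is_hot and nvm_used + size <= NVM_CAPACITY:
        nvm_used += size
        return f"{NVM_DIR}/{block_id}"
    return f"{SSD_DIR}/{block_id}"

print(place_block("blk_0001", 128 * 2**20, is_hot=True))   # -> NVM path
print(place_block("blk_0002", 128 * 2**20, is_hot=False))  # -> SSD path
```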
Evaluation with Hadoop MapReduce
• TestDFSIO on SDSC Comet (32 nodes)
– Write: NVFS-MemIO gains 4x over HDFS
– Read: NVFS-MemIO gains 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
– Write: NVFS-MemIO gains 4x over HDFS
– Read: NVFS-MemIO gains 2x over HDFS
[Charts: TestDFSIO average throughput (MBps) for Write and Read with HDFS (56Gbps), NVFS-BlkIO (56Gbps), and NVFS-MemIO (56Gbps) on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes).]
Overview of RDMA-Hadoop-Virt Architecture
• Virtualization-aware modules in all four main Hadoop components:
– HDFS: virtualization-aware block management to improve fault tolerance
– YARN: extensions to the container allocation policy to reduce network traffic (a placement sketch follows the reference below)
– MapReduce: extensions to the map task scheduling policy to reduce network traffic
– Hadoop Common: topology detection module for automatic topology detection
• Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand
[Figure: Big Data applications (CloudBurst, MR-MS Polygraph, others) on virtual machines, bare-metal nodes, and containers; HDFS (virtualization-aware block management), YARN (container allocation policy extension), MapReduce (map task scheduling policy extension), HBase, and others on Hadoop Common with the topology detection module.]
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom, 2016
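The container-allocation and map-task-scheduling extensions above boil down to topology-aware placement: prefer running a task where its data already lives. A conceptual sketch of that preference order follows (all names and the topology map are hypothetical; this is not the actual YARN/MapReduce code):

```python
# Conceptual topology-aware placement: favor the data's VM, then its physical host.
def choose_container(candidates, data_vm, vm_to_host):
    """candidates: list of (container_id, vm_name); vm_to_host: detected topology map."""
    data_host = vm_to_host[data_vm]
    for cid, vm in candidates:            # 1. same VM as the data replica: no traffic at all
        if vm == data_vm:
            return cid
    for cid, vm in candidates:            # 2. co-located VM on the same physical host
        if vm_to_host[vm] == data_host:
            return cid
    return candidates[0][0]               # 3. otherwise accept cross-node traffic

topology = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB"}   # made-up topology
print(choose_container([("c1", "vm3"), ("c2", "vm2")], "vm1", topology))  # -> c2
```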
Evaluation with Applications
• 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
• 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
[Charts: Execution time for CloudBurst and Self-Join in Default Mode and Distributed Mode, RDMA-Hadoop vs. RDMA-Hadoop-Virt; 30% and 55% reductions in Distributed Mode.]
Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
– Unusually large message sizes (on the order of megabytes)
– Most communication based on GPU buffers
• How to address these newer requirements?
– GPU-specific communication libraries (NCCL): NVIDIA's NCCL library provides inter-GPU communication
– CUDA-Aware MPI (MVAPICH2-GDR): provides support for GPU-based communication (a broadcast sketch follows the figure below)
• Can we exploit CUDA-Aware MPI and NCCL to support Deep Learning applications?
[Figure: Hierarchical communication (knomial + NCCL ring) — knomial inter-node communication among node leaders, combined with an intra-node NCCL ring across the GPUs behind the CPU/PLX switches.]
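To make the CUDA-Aware MPI point concrete, here is a minimal mpi4py broadcast of a multi-megabyte buffer (an illustration, not MVAPICH2-GDR-specific code). With a CUDA-aware MPI build, a GPU-resident array (e.g., a CuPy array) could be passed in place of the NumPy buffer so the data need not be staged through host memory.

```python
# Minimal sketch: broadcasting a large model buffer, the message pattern DL frameworks need.
# Run with, e.g.: mpirun -np 4 python bcast_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

params = np.empty(64 * 1024 * 1024, dtype=np.uint8)   # 64 MB: "order of megabytes" message
if rank == 0:
    params[:] = 1                                      # root holds the up-to-date parameters

comm.Bcast(params, root=0)                             # every rank receives the same buffer
print(f"rank {rank}: first byte = {params[0]}")
```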
Efficient Broadcast: MVAPICH2-GDR and NCCL
• NCCL has some limitations
– Only works within a single node; thus, no scale-out to multiple nodes
– Degradation across the IOH (socket) for scale-up within a node
• We propose an optimized MPI_Bcast
– Communication of very large GPU buffers (on the order of megabytes)
– Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits (a two-level sketch follows below):
– CUDA-Aware MPI_Bcast in MV2-GDR
– The NCCL broadcast primitive
• Performance benefits: OSU micro-benchmarks and the Microsoft CNTK DL framework (25% avg. improvement)
[Chart: OSU micro-benchmark — time (seconds) vs. number of GPUs (2-64), MV2-GDR vs. MV2-GDR-Opt; 2.2x improvement.]
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
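As referenced above, the two-level idea can be sketched with plain MPI communicators: broadcast across nodes among one leader per node, then broadcast within each node. In the actual design the inter-node step is knomial and the intra-node step uses the NCCL ring; the sketch below substitutes ordinary MPI broadcasts for both levels and is illustrative only.

```python
# Hierarchical (two-level) broadcast sketch: inter-node among node leaders, then intra-node.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# One communicator per physical node; local rank 0 acts as that node's leader.
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED, key=rank)
is_leader = (node_comm.Get_rank() == 0)
leader_comm = world.Split(color=0 if is_leader else MPI.UNDEFINED, key=rank)

buf = np.empty(16 * 1024 * 1024, dtype=np.uint8)   # 16 MB payload
if rank == 0:
    buf[:] = 7                                     # global root fills the buffer

if is_leader:
    leader_comm.Bcast(buf, root=0)   # level 1: across nodes (knomial in MV2-GDR)
node_comm.Bcast(buf, root=0)         # level 2: within the node (NCCL ring in the real design)
print(f"rank {rank}: received = {bool((buf == 7).all())}")
```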
Large Message Optimized Collectives for Deep Learning
• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available with MVAPICH2-GDR 2.2GA
[Charts: Latency (ms) vs. message size (2-128 MB) for Reduce (192 GPUs), Allreduce (64 GPUs), and Bcast (64 GPUs); latency (ms) vs. number of GPUs for Reduce (64 MB, 128-192 GPUs), Allreduce (128 MB, 16-64 GPUs), and Bcast (128 MB, 16-64 GPUs).]
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
– Multi-GPU training within a single node
– Performance degradation for GPUs across different sockets
– Limited scale-out
• OSU-Caffe: MPI-based parallel training (a gradient-averaging sketch follows the reference below)
– Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
– Evaluated by training the GoogLeNet network on the ImageNet dataset (see chart: 128 GPUs)
[Chart: GoogLeNet (ImageNet) on 128 GPUs — training time (seconds) vs. number of GPUs (8-128) for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); one region is marked "Invalid use case".]
OSU-Caffe is publicly available from: http://hidl.cse.ohio-state.edu
A. A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP, 2017
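MPI-based parallel training, as referenced above, is data-parallel: each GPU computes gradients on its own mini-batch shard, and the runtime reduces them so every replica applies the same update. A framework-agnostic sketch of that step with mpi4py follows (illustrative only; OSU-Caffe performs this inside Caffe with optimized, CUDA-aware collectives).

```python
# Data-parallel training step sketch: average local gradients across all ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

weights = np.zeros(1_000_000, dtype=np.float32)                 # model replica on this rank
local_grad = np.random.rand(weights.size).astype(np.float32)    # gradient from the local mini-batch

global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)             # sum gradients from all ranks
global_grad /= size                                             # average

weights -= 0.01 * global_grad                                   # identical SGD update on every replica
```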
Open Challenges in Designing Communication and I/O Middleware for High-Performance Big Data Processing
• High-performance designs for Big Data middleware
– NVM-aware communication and I/O schemes for Big Data
– SATA-/PCIe-/NVMe-SSD support
– High-bandwidth memory support
– Threaded models and synchronization
– Locality-aware designs
• Fault tolerance / resiliency
– Migration support with virtual machines
– Data replication
• Efficient data access and placement policies
• Efficient task scheduling
• Fast deployment and automatic configurations on Clouds
• Optimization for Deep Learning applications
Sunrise or Sunset of Big Data Software?
Assuming 6:00 am as sunrise and 6:00 pm as sunset, we are at 8:00 am.
Thank You!
panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/