Page 1:

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Dhabaleswar K. (DK) Panda The Ohio State University

E-mail: [email protected] http://www.cse.ohio-state.edu/~panda

Open RG Big Data Webinar (Sept '15)

Page 2:

Introduction to Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics

• Provides groundbreaking opportunities for enterprise information management and decision making

• The amount of data is exploding; companies are capturing and digitizing more information than ever

• The rate of information growth appears to be exceeding Moore’s Law


• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
(Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)

• 5V's of Big Data – 3V's + Value, Veracity

Page 3:

• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers

– Front-end data accessing and serving (Online) • Memcached + DB (e.g. MySQL), HBase

– Back-end data analytics (Offline) • HDFS, MapReduce, Spark

Data Management and Processing on Modern Clusters

Page 4:

• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data analytics

• Components: Hadoop Common utilities (RPC, etc.), HDFS, MapReduce, YARN

• http://hadoop.apache.org

Overview of Apache Hadoop Architecture

Hadoop 1.x stack: MapReduce (cluster resource management & data processing) on top of the Hadoop Distributed File System (HDFS) and Hadoop Common/Core (RPC, ..)

Hadoop 2.x stack: MapReduce (data processing) and other models (data processing) on top of YARN (cluster resource management & job scheduling), the Hadoop Distributed File System (HDFS), and Hadoop Common/Core (RPC, ..)

Page 5:

HDFS and MapReduce in Apache Hadoop

• HDFS: Primary storage of Hadoop; highly reliable and fault-tolerant – NameNode stores the file system namespace – DataNodes store data blocks

• MapReduce: Computing framework of Hadoop; highly scalable – Map tasks read data from HDFS, operate on it, and write the intermediate data to local disk – Reduce tasks fetch the intermediate data via the shuffle, operate on it, and write the output to HDFS

• Adopted by many well-known organizations, e.g., Facebook and Yahoo!

• Developed in Java for platform independence and portability

• Uses Sockets/HTTP for communication!
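To make the map/shuffle/reduce flow above concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (class names and paths are illustrative; error handling is omitted):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map tasks read splits from HDFS and emit intermediate (word, 1) pairs,
  // which the framework spills to local disk before the shuffle.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        word.set(tok);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce tasks fetch intermediate data via the shuffle, aggregate it,
  // and write the final output back to HDFS.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```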

Page 6:

Spark Architecture Overview

• An in-memory data-processing framework – Iterative machine learning jobs – Interactive data analytics – Scala-based implementation – Runs standalone or on YARN/Mesos

• Scalable and communication intensive – Wide dependencies between Resilient Distributed Datasets (RDDs) – MapReduce-like shuffle operations to repartition RDDs (see the sketch below) – Sockets-based communication


http://spark.apache.org
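As a minimal illustration of a shuffle-inducing Spark job, the sketch below uses the standard Spark Java API; the input path, key choice, and application name are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ShuffleExample");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Build an RDD of (key, 1) pairs from a text file; only narrow dependencies so far.
    JavaPairRDD<String, Integer> pairs = sc.textFile(args[0])
        .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1));

    // reduceByKey repartitions the RDD by key -- the MapReduce-like shuffle
    // (a wide dependency) that the RDMA-based design targets.
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}
```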

Page 7:

• Three-layer architecture of Web 2.0 – Web Servers, Memcached Servers, Database Servers

• Memcached is a core component of Web 2.0 architecture

Overview of Web 2.0 Architecture and Memcached

[Diagram: clients on the Internet → Web servers → Memcached servers → Database servers]

Page 8:


Memcached Architecture

• Distributed caching layer – Aggregates spare memory from multiple nodes – General purpose

• Typically used to cache database queries and results of API calls

• Scalable model, but typical usage is very network intensive
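A typical cache-aside access from the web tier looks like the sketch below; it assumes the spymemcached Java client, and the host name, key, expiry, and database call are all placeholders:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAside {
  // Hypothetical database lookup standing in for a MySQL query.
  static String queryDatabase(String key) { return "value-for-" + key; }

  public static void main(String[] args) throws Exception {
    // Connect to one Memcached server; production deployments list many servers.
    MemcachedClient cache = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));

    String key = "user:42:profile";
    Object value = cache.get(key);          // network round trip to the caching layer
    if (value == null) {                    // cache miss: fall back to the database
      value = queryDatabase(key);
      cache.set(key, 300, value);           // populate the cache with a 300 s expiry
    }
    System.out.println(value);
    cache.shutdown();
  }
}
```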

Page 9:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 10:

High-End Computing (HEC): PetaFlop to ExaFlop


100-200 PFlops in 2016-2018

1 EFlops in 2020-2024?

Page 11:

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 over time; clusters now account for about 87% of the systems.]

Page 12:

• High End Computing (HEC) is growing dramatically – High Performance Computing

– Big Data Computing

• Technology Advancement – Multi-core/many-core technologies and accelerators

– Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

– Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)

– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Drivers for Modern HPC Clusters

[Systems pictured: Tianhe-2, Titan, Stampede, Tianhe-1A]

Page 13:


Trends of Networking Technologies in TOP500 Systems Percentage share of InfiniBand is steadily increasing

Interconnect Family – Systems Share


Courtesy: http://top500.org http://www.theplatform.net/2015/07/20/ethernet-will-have-to-work-harder-to-win-hpc/

Page 14:

• 259 IB Clusters (51%) in the June 2015 Top500 list

(http://www.top500.org)

• Installations in the Top 50 (24 systems):

Large-scale InfiniBand Installations

– 519,640 cores (Stampede) at TACC (8th)
– 185,344 cores (Pleiades) at NASA/Ames (11th)
– 72,800 cores Cray CS-Storm in US (13th)
– 72,800 cores Cray CS-Storm in US (14th)
– 265,440 cores SGI ICE at Tulip Trading Australia (15th)
– 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (16th)
– 72,000 cores (HPC2) in Italy (17th)
– 115,668 cores (Thunder) at AFRL/USA (19th)
– 147,456 cores (SuperMUC) in Germany (20th)
– 86,016 cores (SuperMUC Phase 2) in Germany (21st)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (22nd)
– 194,616 cores (Cascade) at PNNL (25th)
– 76,032 cores (Makman-2) at Saudi Aramco (28th)
– 110,400 cores (Pangea) in France (29th)
– 37,120 cores (Lomonosov-2) at Russia/MSU (31st)
– 57,600 cores (SwiftLucy) in US (33rd)
– 50,544 cores (Occigen) at France/GENCI-CINES (36th)
– 76,896 cores (Salomon) SGI ICE in Czech Republic (40th)
– 73,584 cores (Spirit) at AFRL/USA (42nd)
– and many more!

Page 15:

All interconnects and protocols in the OpenFabrics stack — Application / Middleware sits on top and reaches the network through either the Sockets or the Verbs interface; each column of the original diagram lists the protocol, adapter, and switch:

– 1/10/40/100 GigE: kernel-space TCP/IP (Ethernet driver) over Ethernet adapter and Ethernet switch
– 10/40 GigE-TOE: hardware-offloaded TCP/IP over Ethernet adapter and Ethernet switch
– IPoIB: kernel-space IPoIB over InfiniBand adapter and InfiniBand switch
– RSockets: user-space RSockets over InfiniBand adapter and InfiniBand switch
– SDP: SDP over InfiniBand adapter and InfiniBand switch
– iWARP: user-space TCP/IP (iWARP) over iWARP adapter and Ethernet switch
– RoCE: user-space RDMA over RoCE adapter and Ethernet switch
– IB Native: user-space RDMA (verbs) over InfiniBand adapter and InfiniBand switch

Page 16:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 17:

Wide Adoption of RDMA Technology

• Message Passing Interface (MPI) for HPC

• Parallel File Systems – Lustre

– GPFS

• Delivering excellent performance: – < 1.0 microsec latency

– 100 Gbps bandwidth

– 5-10% CPU utilization

• Delivering excellent scalability

Page 18:

• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Used by more than 2,450 organizations in 76 countries

– More than 289,000 downloads from the OSU site directly

– Empowering many TOP500 clusters (June ‘15 ranking) • 8th ranked 519,640-core cluster (Stampede) at TACC

• 11th ranked 185,344-core cluster (Pleiades) at NASA

• 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) → Stampede at TACC (8th in Jun '15, 462,462 cores, 5.168 PFlops)

MVAPICH2 Software

Page 19:


Latency & Bandwidth: MPI over IB with MVAPICH2

[Charts: Small Message Latency and Unidirectional Bandwidth for MPI over IB with MVAPICH2. Measured small-message latencies range from 1.05 to 1.26 us across TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, and ConnectX-4-EDR; peak unidirectional bandwidths of 3,387, 6,356, 11,497, and 12,465 MBytes/sec are observed across the same adapters. Platforms: 2.8 GHz deca-core (IvyBridge) Intel nodes with PCIe Gen3; ConnectX-4-EDR measured back-to-back, the others through an IB switch.]

Page 20:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Diagram: Applications → Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) → Programming Models (Sockets; other protocols?) → Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, benchmarks; upper-level changes?; RDMA protocol) → Networking Technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD, SSD and NVMe-SSD)]

Page 21:

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

• Current design: Application → Sockets → 1/10/40/100 GigE network

• Sockets not designed for high performance – Stream semantics often mismatch for upper layers – Zero-copy not available for non-blocking sockets

• Our approach: Application → OSU Design → Verbs interface → 10/40/100 GigE or InfiniBand

Page 22:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 23:

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache and HDP Hadoop distributions

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• RDMA for Memcached (RDMA-Memcached)

• OSU HiBD-Benchmarks (OHB)

– HDFS and Memcached Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• Users Base: 130 organizations from 20 countries

• More than 13,000 downloads from the project site

• RDMA for Apache HBase, Spark and CDH

The High-Performance Big Data (HiBD) Project

Page 24:


Different Modes of RDMA for Apache Hadoop 2.x

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.

• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.

• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Page 25:

• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components

– Enhanced HDFS with in-memory and heterogeneous storage

– High performance design of MapReduce over Lustre

– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop and HDP

– Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)

• Current release: 0.9.8

– Based on Apache Hadoop 2.7.1

– Compliant with Apache Hadoop 2.7.1 and HDP 2.3.0.0 APIs and applications

– Tested with • Mellanox InfiniBand adapters (DDR, QDR and FDR)

• RoCE support with Mellanox adapters

• Various multi-core platforms

• Different file systems with disks and SSDs and Lustre

– http://hibd.cse.ohio-state.edu

RDMA for Apache Hadoop 2.x Distribution

Page 26:

• High-Performance Design of Memcached over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and libMemcached components

– High performance design of SSD-Assisted Hybrid Memory

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)

• Current release: 0.9.3

– Based on Memcached 1.4.22 and libMemcached 1.0.18

– Compliant with libMemcached APIs and applications

– Tested with • Mellanox InfiniBand adapters (DDR, QDR and FDR) • RoCE support with Mellanox adapters • Various multi-core platforms • SSD

– http://hibd.cse.ohio-state.edu

RDMA for Memcached Distribution

Page 27:

• Micro-benchmarks for Hadoop Distributed File System (HDFS) – Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark

– Support benchmarking of • Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS

• Micro-benchmarks for Memcached – Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark

• Current release: 0.8

• http://hibd.cse.ohio-state.edu


OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached

Page 28:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 29:

• RDMA-based Designs and Performance Evaluation – HDFS – MapReduce – Spark

Acceleration Case Studies and In-Depth Performance Evaluation

Page 30:

Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

[Diagram: Applications → HDFS (Write / Others) → either the Java Socket Interface over 1/10/40/100 GigE or IPoIB networks, or the Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE ..)]

• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
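The JNI bridge mentioned above can be pictured as a thin Java wrapper around a native verbs-level library; the sketch below is purely illustrative (the library name and method signatures are invented, not the actual RDMA-for-Apache-Hadoop code):

```java
// Illustrative only: a Java-side JNI wrapper in the style of the design above.
// The native library ("rdmabridge") and these method signatures are hypothetical.
public class RdmaBridge {
  static {
    // Loads librdmabridge.so, the native communication library written in C
    // that talks to the verbs layer directly.
    System.loadLibrary("rdmabridge");
  }

  // Establish an RDMA connection to a remote DataNode (on-demand connection setup).
  public native long connect(String host, int port);

  // Post a block write over RDMA; returns the number of bytes transferred.
  public native int writeBlock(long connection, byte[] data, int offset, int length);

  // Tear down the connection and release verbs resources.
  public native void close(long connection);
}
```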

Page 31:

Communication Times in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes

[Chart: HDFS communication time (s) for 2–10 GB files with 10GigE, IPoIB (QDR), and OSU-IB (QDR); OSU-IB reduces communication time by about 30%.]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014

Page 32:

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede

• Cluster with 64 DataNodes (1K cores), single HDD per node – 64% improvement in throughput over IPoIB (FDR) for 256GB file size

– 37% improvement in latency over IPoIB (FDR) for 256GB file size

[Charts: aggregated throughput (MBps) and execution time (s) for 64–256 GB file sizes, IPoIB (FDR) vs. OSU-IB (FDR); throughput increased by 64% and execution time reduced by 37% at 256 GB.]

Page 33:

• RDMA-based Designs and Performance Evaluation – HDFS – MapReduce – Spark

Acceleration Case Studies and In-Depth Performance Evaluation

Page 34:

Design Overview of MapReduce with RDMA

[Diagram: Applications → MapReduce (Job Tracker, Task Tracker, Map, Reduce) → either the Java Socket Interface over 1/10/40/100 GigE or IPoIB networks, or the Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE ..)]

• Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features
– RDMA-based shuffle
– Prefetching and caching of map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Advanced overlapping of • map, shuffle, and merge • shuffle, merge, and reduce (see the sketch below)
– On-demand connection setup
– InfiniBand/RoCE support
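The advanced overlapping of shuffle and merge can be pictured as a producer-consumer pipeline; the following plain-Java sketch only illustrates the idea (thread counts and the fetch/merge bodies are placeholders, not the actual design):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class OverlappedShuffle {
  // Map-output segments fetched over the network land in this bounded queue.
  private final BlockingQueue<byte[]> fetched = new LinkedBlockingQueue<>(64);
  private final ExecutorService fetchers = Executors.newFixedThreadPool(4);

  public void run(java.util.List<String> mapOutputLocations) throws InterruptedException {
    // Producers: shuffle (fetch) tasks run concurrently.
    for (String loc : mapOutputLocations) {
      fetchers.submit(() -> {
        byte[] segment = fetchSegment(loc);   // placeholder for an RDMA/socket fetch
        fetched.put(segment);
        return null;
      });
    }
    // Consumer: the in-memory merge proceeds while fetches are still in flight,
    // which is the shuffle/merge overlap described above.
    int remaining = mapOutputLocations.size();
    while (remaining-- > 0) {
      mergeInMemory(fetched.take());          // placeholder for the merge step
    }
    fetchers.shutdown();
  }

  private byte[] fetchSegment(String location) { return new byte[0]; }
  private void mergeInMemory(byte[] segment) { /* merge-sort into the reduce input */ }
}
```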

Page 35:

• For 240GB Sort in 64 nodes (512 cores) – 40% improvement over IPoIB (QDR) with HDD used for HDFS

Performance Evaluation of Sort and TeraSort

[Charts: job execution time for Sort in the OSU cluster (60/120/240 GB on 16/32/64 nodes) and TeraSort on TACC Stampede (80/160/320 GB on 16/32/64 nodes), comparing IPoIB, UDA-IB, and OSU-IB.]

• For 320GB TeraSort in 64 nodes (1K cores) – 38% improvement over IPoIB (FDR) with HDD used for HDFS

Page 36:

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size

• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload

[Chart: normalized execution time for PUMA benchmarks – AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), InvertIndex (30GB) – with 10GigE, IPoIB (QDR), and OSU-IB (QDR).]

Page 37:

• RDMA-based Designs and Performance Evaluation – HDFS – MapReduce – Spark

Acceleration Case Studies and In-Depth Performance Evaluation

Page 38:

Design Overview of Spark with RDMA

• Design Features
– RDMA-based shuffle
– SEDA-based plugins
– Dynamic connection management and sharing
– Non-blocking and out-of-order data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Scala-based Spark with a communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014

Page 39:

Performance Evaluation on TACC Stampede - GroupByTest

• Intel SandyBridge + FDR

• Cluster with 8 HDD Nodes, single disk per node, 128 concurrent tasks – up to 83% over IPoIB (56Gbps)

• Cluster with 16 HDD Nodes, single disk per node, 256 concurrent tasks – up to 79% over IPoIB (56Gbps)

[Charts: GroupByTest time for 8 worker nodes, 128 cores (128M 128R) and 16 worker nodes, 256 cores (256M 256R).]

Page 40:

Performance Evaluation on TACC Stampede - SortByTest

• Intel SandyBridge + FDR, 16 Worker Nodes, 256 Cores, (256M 256R)

• RDMA-based design for Spark 1.4.0

• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and RAMDisk. For SortByKey Test:

– Shuffle time reduced by up to 77% over IPoIB (56Gbps)

– Total time reduced by up to 58% over IPoIB (56Gbps)

[Charts: SortByTest shuffle time and total time (sec) on 16 worker nodes for 16/32/64 GB data sizes, IPoIB vs. RDMA; shuffle time reduced by up to 77% and total time by up to 58%.]

Page 41:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 42:

Optimize Hadoop YARN MapReduce over Parallel File Systems

[Diagram: compute nodes run the App Master, Map, and Reduce tasks with a Lustre client; a separate Lustre setup provides MetaData Servers and Object Storage Servers.]

• HPC Cluster Deployment – Hybrid topological solution of Beowulf architecture with separate I/O nodes – Lean compute nodes with light OS; more memory space; small local storage – Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre

• MapReduce over Lustre – Local disk is used as the intermediate data directory – Lustre is used as the intermediate data directory

Page 43:

Design Overview of Shuffle Strategies for MapReduce over Lustre

• Design Features
– Two shuffle approaches: • Lustre read based shuffle • RDMA based shuffle
– Hybrid shuffle algorithm to take benefit from both shuffle approaches
– Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation (a conceptual sketch follows below)
– In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
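Conceptually, the per-request adaptation could be imagined as in the sketch below; the profiling fields and threshold policy are invented for illustration and do not come from the actual design:

```java
// Conceptual sketch of a hybrid shuffle selector: pick Lustre read or RDMA
// per shuffle request based on recently profiled Lustre read times.
public class HybridShuffleSelector {
  enum Approach { LUSTRE_READ, RDMA }

  // Exponentially weighted moving average of observed Lustre read latency (ms).
  private double avgLustreReadMs = 0.0;
  private final double alpha = 0.2;           // smoothing factor (illustrative)
  private final double rdmaEstimateMs;        // assumed cost of an RDMA fetch

  public HybridShuffleSelector(double rdmaEstimateMs) {
    this.rdmaEstimateMs = rdmaEstimateMs;
  }

  // Called after every Lustre read to keep the profile current.
  public void recordLustreRead(double observedMs) {
    avgLustreReadMs = alpha * observedMs + (1 - alpha) * avgLustreReadMs;
  }

  // Choose the approach expected to be cheaper for the next shuffle request.
  public Approach choose() {
    return (avgLustreReadMs > 0 && avgLustreReadMs < rdmaEstimateMs)
        ? Approach.LUSTRE_READ
        : Approach.RDMA;
  }
}
```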

[Diagram: Map 1–3 write to the intermediate data directory on Lustre; Reduce 1 and Reduce 2 obtain map output via Lustre read or RDMA, followed by in-memory merge/sort and reduce.]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.

Page 44:

• For 500GB Sort in 64 nodes – 44% improvement over IPoIB (FDR)

Performance Improvement of MapReduce over Lustre on TACC-Stampede


• For 640GB Sort in 128 nodes – 48% improvement over IPoIB (FDR)

[Charts: job execution time (sec) for 300–500 GB Sort on 64 nodes and for 20–640 GB Sort on clusters of 4–128 nodes, IPoIB (FDR) vs. OSU-IB (FDR); execution time reduced by 44% and 48%, respectively.]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.

• Local disk is used as the intermediate data directory

Page 45:

• For 80GB Sort in 8 nodes – 34% improvement over IPoIB (QDR)

Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon


• For 120GB TeraSort in 16 nodes – 25% improvement over IPoIB (QDR)

• Lustre is used as the intermediate data directory

[Charts: job execution time (sec) for 40–80 GB Sort and 40–120 GB TeraSort, comparing IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reductions of 34% (Sort) and 25% (TeraSort).]

Page 46:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 47:

Triple-H

Heterogeneous Storage

• Design Features – Two modes

• Standalone

• Lustre-Integrated

– Policies to efficiently utilize the heterogeneous storage devices

• RAM, SSD, HDD, Lustre

– Eviction/Promotion based on data usage pattern

– Hybrid Replication

– Lustre-Integrated mode:

• Lustre-based fault-tolerance
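As a toy illustration of usage-based placement and promotion across such tiers, consider the sketch below; the tier thresholds and method names are invented and do not reflect the actual Triple-H implementation:

```java
// Toy sketch: choose a storage tier for a block based on how hot it is.
// Access-count thresholds are arbitrary illustrative values.
public class PlacementPolicy {
  enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

  // Initial placement: hot data in RAM disk, warm on SSD, cold on HDD/Lustre.
  public Tier place(long accessCount, boolean lustreIntegrated) {
    if (accessCount > 100) return Tier.RAM_DISK;
    if (accessCount > 10)  return Tier.SSD;
    return lustreIntegrated ? Tier.LUSTRE : Tier.HDD;
  }

  // Promotion/eviction: move a block up when its usage grows, down when it cools off.
  public Tier adjust(Tier current, long recentAccesses) {
    if (recentAccesses > 100 && current != Tier.RAM_DISK) return Tier.RAM_DISK; // promote
    if (recentAccesses == 0 && current == Tier.RAM_DISK)  return Tier.SSD;      // evict
    return current;
  }
}
```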

Enhanced HDFS with In-memory and Heterogeneous Storage

[Diagram: Applications → data placement policies, hybrid replication, and eviction/promotion across RAM Disk, SSD, HDD, and Lustre.]

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015

Page 48:

• For 160GB TestDFSIO in 32 nodes – Write throughput: 7x improvement over IPoIB (FDR) – Read throughput: 2x improvement over IPoIB (FDR)

Performance Improvement on TACC Stampede (HHH)

• For 120GB RandomWriter in 32 nodes – 3x improvement over IPoIB (QDR)

[Charts: TestDFSIO total throughput (MBps) for write and read, and RandomWriter execution time (s) for cluster-size:data-size points 8:30, 16:60, and 32:120, IPoIB (FDR) vs. OSU-IB (FDR); write throughput increased by 7x, read throughput by 2x, and RandomWriter time reduced by 3x.]

Page 49:

• For 60GB Sort in 8 nodes

– 24% improvement over default HDFS

– 54% improvement over Lustre – 33% storage space saving compared to default HDFS

Performance Improvement on SDSC Gordon (HHH-L)

[Chart: execution time (s) for 20–60 GB Sort with HDFS-IPoIB (QDR), Lustre-IPoIB (QDR), and OSU-IB (QDR); reduced by 54% over Lustre.]

Storage space for 60GB Sort — Storage Used (GB): HDFS-IPoIB (QDR): 360; Lustre-IPoIB (QDR): 120; OSU-IB (QDR): 240

Page 50:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 51:

Memcached-RDMA Design

• Server and client perform a negotiation protocol – Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA and Sockets worker threads

• Native IB-verbs-level Design and evaluation with – Server : Memcached (http://memcached.org)

– Client : libmemcached (http://libmemcached.org)

– Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)

[Diagram: Sockets clients and RDMA clients contact the master thread, which assigns each one to a sockets worker thread or a verbs worker thread; all worker threads share the Memcached data structures (memory, slabs, items).]

Page 52:

[Charts: Memcached GET latency (us) vs. message size and throughput (thousands of transactions per second) vs. number of clients, OSU-IB (FDR) vs. IPoIB (FDR).]

• Memcached Get latency – 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us – 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us

• Memcached Throughput (4bytes) – 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s – Nearly 2X improvement in throughput


Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

Page 53:

• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks

• Memcached-RDMA can - improve query latency by up to 70% over IPoIB (32Gbps)

- improve throughput by up to 2X over IPoIB (32Gbps)

- No overhead in using hybrid mode when all data can fit in memory


Performance Benefits on SDSC-Gordon – OHB Latency & Throughput Micro-Benchmarks

[Charts: throughput (million transactions/sec) vs. number of clients and average latency (us) vs. message size, comparing IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); about 2X throughput improvement.]

Page 54:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• On-going and Future Activities • Conclusion and Q&A

Presentation Outline

Page 55:

• Illustration with Read-Cache-Read access pattern using modified mysqlslap load testing tool

• Memcached-RDMA can – improve query latency by up to 66% over IPoIB (32Gbps) – improve throughput by up to 69% over IPoIB (32Gbps)


Micro-benchmark Evaluation for OLDP workloads

[Charts: latency (sec) and throughput (Kq/s) vs. number of clients (64–400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps).]

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS’15

Page 56:

• Transactional workloads (example: TATP) – Up to 29% improvement in overall throughput compared to default Memcached running over IPoIB

• Web-oriented workloads (example: Twitter workload) – Up to 42% improvement in overall throughput compared to default Memcached running over IPoIB


Evaluation with Transactional and Web-oriented Workloads

Evaluation with TATP workload using modified OLTP-Bench

Evaluation with Twitter Workload using modified OLTP-Bench

[Charts: throughput (Kq/s) vs. number of clients for the TATP and Twitter workloads, Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps).]

Page 57:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 58:

HBase-RDMA Design Overview

• JNI Layer bridges Java based HBase with communication library written in native code

• Enables high performance RDMA communication, while supporting traditional socket interface

[Diagram: Applications → HBase → either the Java Socket Interface over 1/10/40/100 GigE or IPoIB networks, or the Java Native Interface (JNI) → OSU-IB Design → IB Verbs → RDMA-capable networks (IB, iWARP, RoCE ..)]

Page 59:


HBase Micro-benchmark (Single-Server-Multi-Client) Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE
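For reference, the Get and Put operations exercised here map to straightforward calls in the standard HBase Java client API; a minimal sketch follows (table, row, and column names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("usertable"))) {

      // Put: write one cell; this is the operation measured as Put latency.
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value"));
      table.put(put);

      // Get: read the cell back; this is the operation measured as Get latency.
      Get get = new Get(Bytes.toBytes("row-0001"));
      Result result = table.get(get);
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"))));
    }
  }
}
```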

[Charts: single-server multi-client HBase Get latency (us) and throughput (ops/sec) for 1–16 clients, comparing 10GigE, IPoIB (DDR), and OSU-IB (DDR).]

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12

Page 60:


HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark) – 64 clients: 2.0 ms; 128 Clients: 3.5 ms – 42% improvement over IPoIB for 128 clients

• HBase Put latency – 64 clients: 1.9 ms; 128 Clients: 3.5 ms – 40% improvement over IPoIB for 128 clients

[Charts: YCSB read latency and write latency (us) vs. number of clients (8–128), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR).]

Page 61:

• Overview of Modern Clusters, Interconnects and Protocols • Challenges for Accelerating Big Data Processing • The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark – Case studies with HDFS, MapReduce, and Spark

– RDMA-based MapReduce on HPC Clusters with Lustre

– Enhanced HDFS with In-memory and Heterogeneous Storage

• RDMA-based designs for Memcached and HBase – RDMA-based Memcached with Hybrid Memory – Case study with OLDP – RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing – OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

Page 62:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Diagram repeated from Page 20: Applications → Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) → Programming Models (Sockets; other protocols?) → Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, benchmarks; upper-level changes?; RDMA protocol) → Networking Technologies, Commodity Computing System Architectures, and Storage Technologies]

Page 63:

• The current benchmarks provide some insight into performance behavior

• However, do not provide any information to the designer/developer on: – What is happening at the lower-layer?

– Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will be its impact?

– Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?


Are the Current Benchmarks Sufficient for Big Data Management and Processing?

Page 64:

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to – Compare performance of different MPI libraries on various networks and systems – Validate low-level functionalities – Provide insights into the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to – MPI-2 one-sided – Collectives – GPU-aware data movement – OpenSHMEM (point-to-point and collectives) – UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack

Page 65:

Challenges in Benchmarking of RDMA-based Designs

[Diagram: the same Big Data software stack as on Page 20; current benchmarks target only the Applications / Big Data Middleware level (over Sockets), there are no benchmarks at the communication and I/O library / RDMA protocol level, and the correlation between the two levels is an open question.]

Page 66:

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

[Diagram: the same stack, now with applications-level benchmarks at the top and micro-benchmarks at the communication and I/O library level, connected in an iterative benchmarking process.]

Page 67:

• Evaluate the performance of standalone HDFS

• Five different benchmarks – Sequential Write Latency (SWL)

– Sequential or Random Read Latency (SRL or RRL)

– Sequential Write Throughput (SWT)

– Sequential Read Throughput (SRT)

– Sequential Read-Write Throughput (SRWT)


OSU HiBD Micro-Benchmark (OHB) Suite - HDFS

Parameters accepted by each benchmark:
– SWL: file name, file size, HDFS parameter
– SRL/RRL: file name, file size, HDFS parameter, random/sequential read, seek interval (RRL)
– SWT: file size, HDFS parameter, writers
– SRT: file size, HDFS parameter, readers
– SRWT: file size, HDFS parameter, readers, writers

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012.

Page 68:

OSU HiBD Micro-Benchmark (OHB) Suite - RPC

• Two different micro-benchmarks to evaluate the performance of standalone Hadoop RPC – Latency: Single Server, Single Client – Throughput: Single Server, Multiple Clients

• A simple script framework for job launching and resource monitoring

• Calculates statistics like Min, Max, Average

• Network configuration, Tunable parameters, DataType, CPU Utilization

Configurable parameters per component:
– lat_client / lat_server: network address, port, data type, min/max message size, number of iterations, handlers, verbose (as applicable to each component)
– thr_client / thr_server: the same set plus the number of clients (as applicable to each component)
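The latency benchmark's measurement loop is conceptually a timed request/response loop that reports min/max/average; the generic sketch below illustrates this (the rpcCall body is a placeholder, not the actual OHB code):

```java
// Generic latency-measurement loop in the spirit of the RPC micro-benchmark:
// time a fixed number of request/response iterations and report min/max/average.
public class LatencyLoop {
  // Placeholder for one Hadoop RPC round trip of the given message size.
  static void rpcCall(int msgSize) { /* issue request, wait for response */ }

  public static void main(String[] args) {
    int iterations = 1000, msgSize = 1024, warmup = 100;
    double min = Double.MAX_VALUE, max = 0, sum = 0;

    for (int i = 0; i < warmup; i++) rpcCall(msgSize);   // warm-up, not measured

    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      rpcCall(msgSize);
      double us = (System.nanoTime() - start) / 1000.0;  // microseconds
      min = Math.min(min, us);
      max = Math.max(max, us);
      sum += us;
    }
    System.out.printf("min=%.2f us, max=%.2f us, avg=%.2f us%n", min, max, sum / iterations);
  }
}
```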

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013.

Page 69:

• Evaluate the performance of stand-alone MapReduce

• Does not require or involve HDFS or any other distributed file system

• Considers various factors that influence the data shuffling phase – underlying network configuration, number of map and reduce tasks, intermediate shuffle data pattern, shuffle data size, etc.

• Three different micro-benchmarks based on intermediate shuffle data patterns

– MR-AVG micro-benchmark: intermediate data is evenly distributed among reduce tasks.

– MR-RAND micro-benchmark: intermediate data is pseudo-randomly distributed among reduce tasks.

– MR-SKEW micro-benchmark: intermediate data is unevenly distributed among reduce tasks.
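To make the three data patterns concrete, the toy generator below distributes synthetic records evenly, pseudo-randomly, or with a deliberate skew; it only illustrates the patterns and is not the actual OHB benchmark code:

```java
import java.util.Random;

// Toy illustration of even, random, and skewed intermediate-data patterns:
// count how many synthetic records each reduce partition would receive.
public class ShufflePatterns {
  public static int[] distribute(String pattern, int records, int reducers) {
    int[] perReducer = new int[reducers];
    Random rnd = new Random(42);
    for (int i = 0; i < records; i++) {
      int r;
      switch (pattern) {
        case "MR-AVG":  r = i % reducers; break;                 // evenly distributed
        case "MR-RAND": r = rnd.nextInt(reducers); break;        // pseudo-random
        default:        // "MR-SKEW": most records land on reducer 0
          r = (rnd.nextDouble() < 0.6) ? 0 : rnd.nextInt(reducers);
      }
      perReducer[r]++;
    }
    return perReducer;
  }

  public static void main(String[] args) {
    for (String p : new String[] {"MR-AVG", "MR-RAND", "MR-SKEW"}) {
      System.out.println(p + ": " + java.util.Arrays.toString(distribute(p, 1_000_000, 8)));
    }
  }
}
```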


OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014).

Page 70:

• Upcoming Releases of RDMA-enhanced Packages will support – CDH plugin

– Spark

– HBase

• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support

– MapReduce, RPC

• Exploration of other components (Threading models, QoS, Virtualization, Accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

Upcoming HiBD Releases and Future Activities

Page 71:

• Presented an overview of Big Data processing middleware

• Provided an overview of cluster technologies

• Discussed challenges in accelerating Big Data middleware

• Presented initial designs to take advantage of InfiniBand/RDMA for Hadoop, Spark, Memcached, and HBase

• Presented challenges in designing benchmarks

• Results are promising

• Many other open issues need to be solved

• Will enable the Big Data processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner


Concluding Remarks

Page 72:

Personnel Acknowledgments Current Students

– A. Augustine (M.S.)

– A. Awan (Ph.D.)

– A. Bhat (M.S.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

Past Students – P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist – S. Sur

Current Post-Doc – J. Lin

– D. Shankar

Current Programmer – J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.) – R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Li (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associates – K. Hamidouche

– X. Lu

Past Programmers – D. Bureddy

– H. Subramoni

Current Research Specialist – M. Arnold

Page 73:


International Workshop on High-Performance Big Data Computing (HPBDC)

HPBDC 2015 was held with the Int'l Conference on Distributed Computing Systems (ICDCS '15), Columbus, Ohio, USA, Monday, June 29th, 2015

Two Keynote Talks: Dan Stanzione (TACC) and Zhiwei Xu (ICT/CAS)

Two Invited Talks: Jianfeng Zhan (ICT/CAS), Raghunath Nambiar (Cisco)

Panel: Jianfeng Zhan (ICT/CAS)

Four Research Papers

http://web.cse.ohio-state.edu/~luxi/hpbdc2015

HPBDC 2016 will be held in conjunction with IPDPS ’16

http://web.cse.ohio-state.edu/~luxi/hpbdc2016

Submission: January 2016

Page 74:

[email protected]


Thank You!

The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/


Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2/MVAPICH2-X Project http://mvapich.cse.ohio-state.edu/