© 2012 Mellanox Technologies - Mellanox Confidential -
Accelerating Big Data with RDMA Solutions
HPC Advisory Council, June 2013
Leading Supplier of End-to-End Interconnect Solutions
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
• Host/Fabric Software
• ICs
• Switches/Gateways
• Adapter Cards
• Cables
[Figure: Virtual Protocol Interconnect diagram. Elements: Server / Compute, Switch / Gateway, Storage Front / Back-End. Link labels: 56G InfiniBand, 56G IB & FCoIB, 10/40/56GbE, 10/40/56GbE & FCoE, Fibre Channel]
Three Areas for Acceleration
Data Analytics
• Explore inefficiencies in existing analytics frameworks and systems
• Accelerate data processing to deliver faster results
Storage
• Explore ways to refine the dominant file system
• Take advantage of direct-attached disks to accelerate data access
Distributed Storage
• Leverage popular distributed storage systems with Big Data applications
• Reuse existing systems with Big Data frameworks
Motivation to Accelerate Data Analytics
Data Analysis Requires a Faster Network
• The Hadoop Map Reduce framework is a network-intensive workload
- Mapped data is shuffled between nodes in the cluster
• Data Replication
- A high-availability event triggers multiple terabytes of data movement
Provide Higher Data Value
• Expose the low-latency capabilities of SSDs
• Better server/CPU utilization
* Data Source: Intersect360 Research, 2012, IT and Data scientists survey
Big Data Applications Require a High-Bandwidth, Low-Latency Interconnect
Hadoop Framework
A scalable, fault-tolerant distributed system for data storage and processing
Hadoop has two main systems
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
• MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
Key values
• Flexibility – Store any data, run any analysis.
• Scalability – Start at 1TB/3 nodes, grow to petabytes/1000s of nodes.
• Economics – Cost per TB at a fraction of traditional options.
[Figure: Hadoop stack diagram. Labels: Hive, Pig, Map Reduce, HBase, HDFS™ (Hadoop Distributed File System), DISK x6]
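The MapReduce programming abstraction named on this slide can be pictured with a small single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the in-memory grouping only stands in for the cluster-wide shuffle in which mapped data moves between nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key (in a real cluster this step crosses the network)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data big network", "data moves"]))
```

In a distributed run, every key group built by `shuffle` may have to be gathered from many map nodes, which is why the shuffle stage is sensitive to interconnect bandwidth.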
Unstructured Data Accelerator - UDA
Plug-in architecture
• Open-source, latest GA version 3.1 (6/10/2013)
• Google code repository at: https://code.google.com/p/uda-plugin/
Accelerates Map Reduce Jobs
• Accelerated merge sort
Efficient Shuffle Provider
• Data transfer over RDMA
• Supports InfiniBand and Ethernet
Supported Hadoop Distributions
• Apache 3.0 – In the main trunk!
• Apache 2.0.3 – In the main trunk*!
• Apache Hadoop 1.0.x; 1.1.x
• Cloudera Distribution Hadoop 3 update 4 (CDH3u4)
• Cloudera Distribution Hadoop 4 (CDH4)
• Hortonworks HDP 1.1
Supported Hardware
• ConnectX®-3 VPI
• SwitchX-2 based systems
[Figure: Hadoop stack diagram. Labels: Hive, Pig, Map Reduce, HBase, HDFS™ (Hadoop Distributed File System), DISK x6]
Map Reduce Serialization
New Pipelined Data Flow
[Figure: timeline of the new algorithm, starting at "Time start". A map stage of Map tasks is followed by two Reduce tasks, each showing a header fetch and then shuffle and merge. Other labels: Shuffle, Merge, New Algorithm]
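The figure survives only as labels (header fetch, shuffle, merge), so the following is a guess at the idea rather than UDA's implementation. It shows one way shuffle and merge can overlap: a lazy k-way merge starts emitting records once the first chunk of every sorted map output has arrived and pulls further chunks on demand. The names `fetch_in_chunks` and `pipelined_merge` are hypothetical, and a Python list slice stands in for a network fetch.

```python
import heapq


def fetch_in_chunks(sorted_records, chunk_size, log):
    """Yield records from one sorted map output, one chunk at a time.

    The slice stands in for a remote fetch; `log` records the start
    offset of each chunk at the moment it is pulled.
    """
    for start in range(0, len(sorted_records), chunk_size):
        log.append(start)
        yield from sorted_records[start:start + chunk_size]


def pipelined_merge(map_outputs, chunk_size=2):
    """Return (lazy merged iterator, fetch log) over sorted map outputs."""
    log = []
    streams = [fetch_in_chunks(out, chunk_size, log) for out in map_outputs]
    # heapq.merge is lazy: it reads one record per stream up front and
    # then advances only the stream whose record was just emitted.
    return heapq.merge(*streams), log


if __name__ == "__main__":
    merged, log = pipelined_merge([[1, 4, 7, 9], [2, 3, 8], [5, 6]])
    first = next(merged)
    print("first record:", first, "chunks fetched so far:", len(log))
    print("remaining:", list(merged), "total chunks fetched:", len(log))
```

The design point is that merging does not wait for the whole shuffle to finish, so fetch time and merge time overlap.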
UDA - Software Architecture
[Figure: software architecture diagram. Hadoop (Java): JobTracker, TaskTracker with MapTask, TaskTracker with ReduceTask. UDA Plugin (C++): MOFSupplier with Data Engine and RDMA Server; NetMerger with RDMA Client and three Merging Threads. Hardware: RDMA NIC / HCA]
Double Map Reduce Performance with UDA
*TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster
~50% Disk Access; CPU Efficiency 2.5X
**1TB data set; 16x dual X5670 (Westmere) machines; 10x HDD. Base: vanilla GPHD1.2; UDA: GPHD1.2+UDA
~2X Faster Job Completion! Increase the Value of Data!
FDR InfiniBand
HiBench Benchmark Results
HiBench is a combined test suite from Intel
• Tests: I/O, Map Reduce, machine learning, clustering and search applications
A faster network provides between 15% and 100% performance improvement!
• Some applications are more I/O-bound than others
Cassandra, Initial Results
• Linearly scalable, column index database
• Enable 30% more queries
• Cut latency gaps by 50%
The Great Things in Hadoop Distributed File System
• HDFS is a block storage solution
• Block size can be modified to provide efficient solutions for very large files
• Inherent reliability: no need for a high-end storage solution to make sure the data is there!
• Tuned for Hadoop workloads: write once, read many
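The effect of block size on very large files can be shown with a little arithmetic. The sketch below is illustrative only: the helper `hdfs_footprint` is hypothetical, the 64 MB and 128 MB block sizes and the 10 GB file are example values, and the replication factor of 3 is an assumed default.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def hdfs_footprint(file_bytes, block_bytes, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    The last block is not padded, so raw usage is simply the file
    size multiplied by the replication factor.
    """
    blocks = (file_bytes + block_bytes - 1) // block_bytes  # ceiling division
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    for block_mb in (64, 128):
        blocks, raw = hdfs_footprint(10 * GB, block_mb * MB)
        print(f"{block_mb} MB blocks: {blocks} blocks, {raw / GB:.0f} GB raw")
```

Doubling the block size halves the number of blocks the metadata server has to track for the same file, while the raw footprint depends only on the replication factor.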
The Less Great Things in HDFS
• It’s hard to manage the different settings needed to match the right nodes to the right capabilities.
• Ingress and extraction of data require additional tools.
• Small files or latency-sensitive workloads
• Default 3x replication
• Metadata server failure
Considerations When Planning Capacity
• Growth rate
• Cost of storage
• Data retention
• Do you need real-time analytics?
• Value per byte?
• If it’s not hot, is it worth storing on high-performance storage?
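Growth rate and data retention can be combined into a rough sizing estimate. The following is a back-of-the-envelope sketch, not a vendor sizing method: the function names, the 25% free-space headroom, the replication factor of 3, and all input figures are assumptions chosen for illustration.

```python
def raw_capacity_tb(daily_ingest_tb, retention_days, replication=3, headroom=0.25):
    """Raw disk (TB) needed for retained data, its replicas, and free headroom."""
    logical_tb = daily_ingest_tb * retention_days
    return logical_tb * replication / (1.0 - headroom)


def projected_ingest_tb(daily_ingest_tb, yearly_growth, years):
    """Daily ingest (TB) after compounding yearly growth."""
    return daily_ingest_tb * (1.0 + yearly_growth) ** years


if __name__ == "__main__":
    today = raw_capacity_tb(daily_ingest_tb=0.5, retention_days=90)
    later = raw_capacity_tb(projected_ingest_tb(0.5, 0.5, 2), retention_days=90)
    print(f"raw capacity today: {today:.0f} TB, in two years: {later:.0f} TB")
```

Shortening retention for cold data, or moving it to cheaper storage, reduces the first factor directly, which is the trade-off the questions on this slide point at.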
HDFS Acceleration; Joint Project With Ohio State University
HDFS is the Hadoop File System
• The underlying file system for HBase and other NoSQL databases
More Drives Need Higher Throughput
SSD Solutions Need Higher Throughput
• Bounded by 1GbE and 10GbE
[Figure: Hadoop stack diagram. Labels: Hive, Pig, Map Reduce, HBase, HDFS™ (Hadoop Distributed File System), DISK x6]
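The claim that drive throughput is bounded by 1GbE and 10GbE can be checked with simple arithmetic. The sketch below ignores protocol overhead and uses decimal units; the per-drive rates of 100 MB/s for an HDD and 500 MB/s for an SSD are assumed round figures, not measurements from this deck, and the function names are hypothetical.

```python
def link_mb_per_s(gbit_per_s):
    """Line rate of a link in MB/s (decimal units, protocol overhead ignored)."""
    return gbit_per_s * 1000 / 8


def drives_to_saturate(gbit_per_s, drive_mb_per_s):
    """Number of drives streaming at drive_mb_per_s that fill the link."""
    return link_mb_per_s(gbit_per_s) / drive_mb_per_s


if __name__ == "__main__":
    for name, gbit in (("1GbE", 1), ("10GbE", 10), ("40GbE", 40)):
        hdds = drives_to_saturate(gbit, 100)  # assumed HDD streaming rate
        ssds = drives_to_saturate(gbit, 500)  # assumed SSD streaming rate
        print(f"{name}: {link_mb_per_s(gbit):.0f} MB/s, "
              f"about {hdds:.2f} HDDs or {ssds:.2f} SSDs")
```

Under these assumptions a single SSD already exceeds what a 1GbE link can carry, and a handful of SSDs per node exceeds 10GbE.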
Unlocking the Power of SSDs in Hadoop Environments
SSDs are becoming the de facto standard in HDFS deployments
• Read capability is a critical factor for application performance
E-DFSIO, part of Intel’s HiBench test suite, profiles aggregated throughput on the cluster
• A 1GbE network impedes any performance benefit from SSD deployment
[Figure: E-DFSIO, Showing the Power of SSD @ HDFS]
OrangeFS as Hadoop Storage Solution
Lustre as Hadoop Storage Solution
Source: Map/Reduce on Lustre, Hadoop Performance in HPC Environments, Nathan Rutman, Senior Architect, Networked Storage Solutions, Xyratex
Ceph as Hadoop Storage Solution
Generating a lot of interest since the Ceph kernel client was pulled into Linux kernel 2.6.34
• Object-based parallel file system
• Scalable metadata server
• Each file can specify its own striping strategy and object size
• Automatic rebalancing of data with minimal data movement
• Hadoop module for integrating Ceph has been in development since the 0.12 release
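A per-file striping strategy and object size determine which object holds a given byte of the file. The sketch below shows round-robin striping arithmetic in the style of Ceph's file layout (stripe unit, stripe count, object size). It is an illustration under assumptions, not code from Ceph: the function name `locate` and the tiny parameter values are hypothetical.

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a file byte offset to (object number, offset inside that object).

    Stripe units are dealt round-robin across `stripe_count` objects;
    once each object in the set holds `object_size` bytes, the layout
    moves on to the next set of objects.
    """
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # which stripe unit of the file
    stripe_no = block_no // stripe_count      # which row of stripe units
    stripe_pos = block_no % stripe_count      # which object within the set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


if __name__ == "__main__":
    # Tiny values so the layout is easy to follow by hand.
    for off in (0, 4, 8, 13, 16):
        print(off, locate(off, stripe_unit=4, stripe_count=2, object_size=8))
```

A wider stripe count spreads sequential reads of one file across more objects, which is why the layout is worth tuning per file for large analytic inputs.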
Benchmarks on Ceph are still WIP
• We are currently working on running benchmarks on Ceph – stay tuned!
Mellanox VPI Card
• MCX354A-FCBT
Mellanox Edge Switches
• MSX10xx; MSX60xx
Cloudera Certified – CDH3 and CDH4
Simple Building Block for Big Data Solution
E5-26x0 (Sandy Bridge) Machines
• Dual socket
• 4+ cores per socket
• 32GB+ of DRAM
Disk Drives
• At least 5 x 1TB, SAS, 10K RPM
Hadoop Configuration
• At least one Name Node + Job Tracker
• At least 4 Data Nodes
Installation
• Your selection of Hadoop distribution or other Big Data solution (such as Cassandra)
Networking
• ConnectX-3 VPI card, FDR, 40GbE and 10GbE
• SwitchX based systems: MSX6036F, MSX1036B and MSX1016
• Mellanox’s FDR, 40GbE and 10GbE cable solutions
http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Hadoop.pdf
Test Drive Your Big Data
EMC 1000-Node Analytic Platform
• Accelerates the industry's Hadoop development
• 24 petabytes of physical storage
- Half of every written word since the inception of mankind
• Mellanox VPI Solutions
Hadoop Acceleration
• 2X faster Hadoop job run-time
High Throughput, Low Latency, RDMA Critical for ROI
Thank You