Benchmarking “No One Size Fits All” Big Data Analytics
BigFrame Team
The Hong Kong Polytechnic University, Duke University, HP Labs
Feb 25, 2016
Analytics System Landscape
• MPP DB
  o Greenplum, SQL Server PDW, Teradata, etc.
• Columnar
  o Vertica, Redshift, Vectorwise, etc.
• MapReduce
  o Hadoop, Hive, HadoopDB, Tenzing, etc.
• Streaming
  o Storm, StreamBase, etc.
• Graph
  o Pregel, GraphLab, etc.
• Multi-tenancy
  o Mesos, YARN, etc.
What does this mean for Big Data Practitioners?
Gives them a lot of power!
Even the mighty may need a little help
Challenges for Practitioners
Which system to use for the app that I am developing?
• Features (e.g., graph data)
• Performance (e.g., claims like "System A is 50x faster than B")
• Resource efficiency
• Growth and scalability
• Multi-tenancy
App Developers, Data Scientists
Challenges for Practitioners
Which system to use for the app that I am developing?
Different parts of my app have different requirements
Compose "best of breed" systems, or use a "one size fits all" system?
App Developers, Data Scientists
Challenges for Practitioners
Which system to use for the app that I am developing?
Different parts of my app have different requirements
Managing many systems is hard!
App Developers, Data Scientists
System Admins, CIOs
Total Cost of Ownership (TCO)?
Need Benchmarks
One Approach
Categorize systems
Develop a benchmark per system category
Useful, But ...
• MPP DB, Columnar
  o TPC-H/TPC-DS, Berkeley Big Data Benchmark, etc.
• MapReduce
  o TeraSort, DFSIO, GridMix, HiBench, etc.
• Streaming
  o Linear Road, etc.
• Graph
  o Graph 500, PageRank, etc.
• ...
Problem: May miss the Big Picture
• Cannot capture the complexities and end-to-end behavior of big data applications and deployments:
  o Bottlenecks
  o Data conversion, transfer, & loading overheads
  o Storage costs & other parts of the data life-cycle
  o Resource management challenges
  o Total Cost of Ownership (TCO)
A Better Approach:
BigBench or Deep Analytics Pipeline:
• Application-driven
• Involves multiple types of data:
  o Structured
  o Semi-structured
  o Unstructured
• Involves multiple types of operators:
  o Relational operators: join, group by
  o Text analytics: sentiment analysis
  o Machine learning
Problem:
Give a man fish and you will feed him for a day.
Give him fishing gear and you will feed him for life.
--Anonymous
Benchmark ✗
Benchmark Generator ✓
BigFrame
A Benchmark Generator for Big Data Analytics
How a user uses BigFrame
• The user supplies a bigif (benchmark input format) to the BigFrame Interface.
• The Benchmark Generator turns the bigif into a bigspec (benchmark specification).
• The Benchmark Driver for the System Under Test (e.g., Hive/MapReduce, HBase) runs the benchmark and returns the result.
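The generator-then-driver flow above can be sketched in plain Python. Every function and field name here is hypothetical, invented for illustration; none of them come from the actual BigFrame codebase.

```python
# Hypothetical sketch of the BigFrame workflow: bigif -> Benchmark
# Generator -> bigspec -> Benchmark Driver -> result.
def generate_benchmark(bigif):
    """Benchmark Generator: turn a bigif into a bigspec (toy version)."""
    return {
        "data": "tables sized for volume=%s" % bigif["volume"],
        "queries": "%s query stream" % bigif["query_variety"],
    }

def run_benchmark(bigspec, system_under_test):
    """Benchmark Driver: run the bigspec on a system and report a result."""
    return "%s ran %s" % (system_under_test, bigspec["queries"])

bigif = {"volume": "1TB", "query_variety": "micro"}
bigspec = generate_benchmark(bigif)
print(run_benchmark(bigspec, "Hive"))  # Hive ran micro query stream
```

The point of the split is that the same bigif can be replayed against different Benchmark Drivers, one per system under test.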
bigspec: Benchmark Specification
What should be captured by the benchmark input format?
• The 3Vs
  o Volume
  o Velocity
  o Variety
bigif: BigFrame's Input Format
Benchmark Generation
bigif (benchmark input format) → Benchmark Generator → bigspec (benchmark specification)
bigif describes points in a discrete space of
{Data, Query} × {Variety, Volume, Velocity}
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
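As a concrete illustration of a point in that {Data, Query} × {Variety, Volume, Velocity} space, here is a toy bigif written as a Python dict. The keys and values are made up for the example and do not reflect BigFrame's real input format.

```python
# Hypothetical bigif: one point in the discrete benchmark space.
bigif = {
    "data_variety": ["relational", "text"],   # which data types to include
    "data_volume": "100GB",                   # initial data to load
    "data_velocity": "hourly-refresh",        # data refresh pattern
    "query_variety": "macro",                 # micro vs. application-level queries
    "query_velocity": "continuous",           # exploratory vs. continuous streams
    "metrics": ["latency", "throughput"],     # evaluation metrics
}

def describe(spec):
    """Summarize the chosen point in the benchmark space."""
    return "{variety} data at {volume}, {qv} queries".format(
        variety="+".join(spec["data_variety"]),
        volume=spec["data_volume"],
        qv=spec["query_variety"],
    )

print(describe(bigif))  # relational+text data at 100GB, macro queries
```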
Benchmark generation can be addressed as a search problem within a rich application domain
Application Domain Modeled Currently
• E-commerce: sales, promotions, recommendations
• Social media: sentiment & influence
[Schema diagram: Item, Web_sales, and Promotion tables]
Use Case 1: Exploratory BI
• Large volumes of relational data
• Mostly aggregations and few joins
• Can Spark's performance match that of an MPP DB?
BigFrame will generate a benchmark specification containing
relational data and (SQL-ish) queries
Data Variety = {Relational}
Query Variety = {Micro}
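The kind of (SQL-ish) micro query this specification would contain is a simple group-by aggregation over relational sales data. A minimal sketch using Python's built-in sqlite3, with an invented table and sample rows (prices kept as integer cents to avoid float rounding):

```python
import sqlite3

# Illustrative micro-benchmark query in the spirit of Use Case 1:
# a group-by aggregation, few joins. Schema and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_sales (item_id INTEGER, quantity INTEGER, price_cents INTEGER)")
conn.executemany(
    "INSERT INTO web_sales VALUES (?, ?, ?)",
    [(1, 2, 999), (1, 1, 999), (2, 5, 350)],
)

# Total revenue per item: the aggregation-heavy query shape the
# exploratory-BI spec targets.
rows = conn.execute(
    "SELECT item_id, SUM(quantity * price_cents) FROM web_sales "
    "GROUP BY item_id ORDER BY item_id"
).fetchall()
print(rows)  # [(1, 2997), (2, 1750)]
```

At benchmark scale the same query shape would run over far larger generated tables on Spark or an MPP DB rather than SQLite.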
Use Case 2: Complex BI
• Large volumes of relational data
• Even larger volumes of text data
• Combined analytics
Data Variety = {Relational, Text}
Query Variety = {Macro} (application-focused instead of micro-benchmark)
BigFrame will generate a benchmark specification that includes
sentiment analysis tasks over tweets
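A toy version of such a combined task: score tweet sentiment (text analytics), then aggregate scores per product (a relational group-by). The word lists and tweet data are invented for the example, and real sentiment analysis would use far more than word counting.

```python
import re

# Toy sentiment scorer over tweets, mixing text analytics with a
# relational aggregation, as in Use Case 2.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "terrible", "broken"}

def sentiment(tweet):
    """Score a tweet: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z]+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    ("product_1", "I love this, great battery"),
    ("product_1", "screen is broken, terrible"),
    ("product_2", "awesome value"),
]

# Relational step: group the per-tweet scores by product.
scores = {}
for product, text in tweets:
    scores[product] = scores.get(product, 0) + sentiment(text)
print(scores)  # {'product_1': 0, 'product_2': 1}
```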
Use Case 3: Dashboards
• Large volume and velocity of relational and text data
• Continuously updated dashboards
Data Velocity = Fast
Query Variety = Continuous (as opposed to Exploratory)
BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
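A minimal sketch of what "continuous query over data refreshes" means: the dashboard holds a running aggregate, and each refresh batch folds into the result instead of recomputing from scratch. The class name and sample batches are invented for illustration.

```python
# Hypothetical continuous query: a running aggregate whose result
# changes on every data refresh, as in the dashboard use case.
class RunningRevenue:
    """Incrementally maintained total, re-read after each refresh."""
    def __init__(self):
        self.total = 0.0

    def refresh(self, batch):
        # Fold a new batch of sales into the standing result.
        self.total += sum(batch)
        return self.total

dashboard = RunningRevenue()
print(dashboard.refresh([10.0, 5.0]))  # 15.0
print(dashboard.refresh([2.5]))        # 17.5
```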
Working with the community
• First release of BigFrame planned for August 2013
  o Open source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver) to drill down into system behavior during benchmarking
• Instantiate the BigFrame pipeline for more app domains
Take Away
• Benchmarks shape a field (for better or worse); they are how we determine the value of change.
-- David Patterson, University of California, Berkeley, 1994
• Benchmarks meet different needs for different people
  o End customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs