Demystifying Systems for Interactive and Real-time Analytics

The BigFrame Team

Duke University, Hong Kong Polytechnic University, and HP Labs

Mar 30, 2015

Page 1:

Demystifying Systems for Interactive and Real-time Analytics

The BigFrame Team

Duke University, Hong Kong Polytechnic University, and HP Labs

Page 2:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Page 3:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: Gamma, Aster, Netezza, DB2 PE, Teradata, SQL Server Parallel Data Warehouse, Greenplum

Page 4:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: HP Vertica, ParAccel, Redshift, Vectorwise

Page 5:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: Hadoop, Tenzing, Hive, Mahout, HadoopDB, Pig

Page 6:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: Dremel, Drill, Stinger, Impala, Spark, Dryad, SCOPE

Page 7:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: Cassandra, HBase, Bigtable, Druid, HANA, Spanner, Megastore, Splunk

Page 8:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: Storm, GraphLab, Streambase, Cassovary, GraphX, Solr, ElasticSearch, SciDB, Cloudera Search, MadLINQ, Pregel, HAMA

Page 9:

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Systems: Mesos, YARN, Serengeti, Cloud platforms

Page 10:

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

What does this mean for Big Data Practitioners?

Page 11:

Gives them a lot of power!

From: http://animeonly.org/Digital-Wallpapers/Digital-renders/Spiderman-95061p.html

Page 12:

Even the mighty may need a little help

Page 13:

Challenges for Practitioners

Which system to use for the app that I am developing?

• Features (e.g., graph data)

• Performance (e.g., claims like "System A is 50x faster than B")

• Resource efficiency

• Growth and scalability

• Multi-tenancy

App Developers, Data Scientists

Page 14:

Challenges for Practitioners

Different parts of my app have different requirements

Compose “best of breed” systems OR use a “one size fits all” system?

Managing many systems is hard!

System Admins

Which system to use for the app that I am developing?

App Developers, Data Scientists

Page 15:

Challenges for Practitioners

Managing many systems is hard!

Different parts of my app have different requirements

Total Cost of Ownership (TCO)?

CIO

System Admins

Which system to use for the app that I am developing?

App Developers, Data Scientists

Page 16:

Numbers make decisions easier

Page 17:

Need benchmarks

Page 18:

One Approach

Develop a benchmark per system category

Categorize systems

Page 19:

Useful, But …

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Existing benchmarks:

Star Schema Benchmark

TPC-H / TPC-DS

Counting triangles

Terasort

GridMix

SWIM

HiBench

DFSIO

MapReduce vs. Parallel DB / Hive Benchmark (in HiBench) / Berkeley Big Data Benchmark

Yahoo Cloud Serving Benchmark (YCSB)

YCSB Variants

CH-benCHmark

MulTe

Graph 500

PageRank

RDF Benchmarks

Information Extraction Benchmark

Linear Road

SS-DB

Page 20:

Problem #1 May Miss the Big Picture

Page 21:

Problem #1 May Miss the Big Picture

Cannot capture the complexities and end-to-end behavior of big data applications and deployments:

(i) Bottlenecks

(ii) Data conversion, transfer, & loading overheads

(iii) Storage costs & other parts of the data life-cycle

(iv) Resource management challenges

(v) Total Cost of Ownership (TCO)

Page 22:

Give a man a fish and you will feed him for a day.

Give him fishing gear and you will feed him for life.

-- Anonymous

Problem #2 Benchmark

Benchmark Generator

Page 23:

BigFrame: A Benchmark Generator for Big Data Analytics

Page 24:

How a user uses BigFrame

BigFrame Interface

bigif (benchmark input format)

Benchmark Generator

bspec (benchmark specification)

Benchmark Driver for System Under Test: run the benchmark

System Under Test: HBase, Hive, MapReduce

results
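The role of the Benchmark Driver (run the benchmark against the system under test and collect results) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name run_benchmark, the use of plain Python callables in place of HBase, Hive, or MapReduce, and the choice of median latency as the reported result. It is not BigFrame's actual driver.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_benchmark(queries: Dict[str, Callable[[], object]],
                  repetitions: int = 3) -> Dict[str, float]:
    """Run each query `repetitions` times; report median latency in seconds."""
    results: Dict[str, float] = {}
    for name, run_query in queries.items():
        latencies: List[float] = []
        for _ in range(repetitions):
            start = time.perf_counter()
            run_query()  # hypothetical call into the system under test
            latencies.append(time.perf_counter() - start)
        results[name] = statistics.median(latencies)
    return results


# Stand-in "system under test": plain Python callables, not a real engine.
toy_queries = {
    "q1_sum": lambda: sum(range(10_000)),
    "q2_sort": lambda: sorted(range(10_000, 0, -1)),
}
report = run_benchmark(toy_queries)
```

A real driver would replace each callable with a call that submits the query to the target system; the timing loop stays the same.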

Page 25:

bspec: Benchmark Specification

1. Data for initial load

2. Data refresh pattern

3. Query streams

4. Evaluation metrics

Axis: Time

System Under Test: HBase, Hive, MapReduce
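The four parts of a bspec can be pictured as one record. The sketch below is hypothetical: the class name BSpec, its field names, and the sample values are assumptions for illustration, not BigFrame's actual format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BSpec:
    """Hypothetical container mirroring the four parts of a bspec."""
    initial_load: Dict[str, int]            # 1. dataset name -> rows loaded up front
    refresh_pattern: List[Dict[str, int]]   # 2. refresh batches arriving over time
    query_streams: List[List[str]]          # 3. one list of query names per stream
    evaluation_metrics: List[str] = field(default_factory=list)  # 4. what to measure

    def total_queries(self) -> int:
        """Count the queries across all streams."""
        return sum(len(stream) for stream in self.query_streams)


# Sample values are made up; the dataset names echo the modeled app domain.
spec = BSpec(
    initial_load={"web_sales": 1_000_000, "tweets": 5_000_000},
    refresh_pattern=[{"at_second": 60, "rows": 10_000}],
    query_streams=[["q1", "q2"], ["q3"]],
    evaluation_metrics=["latency", "throughput"],
)
```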

Page 26:

What does the user (want to) specify?

BigFrame Interface

bigif (benchmark input format)

Page 27:

The 3Vs

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Volume, Variety, Velocity

Page 28:

bigif: BigFrame’s Input Format

Data Variety: Relational, text, array, graph

Data Volume: Small, medium, large

Data Velocity: At rest, slow, fast

Query Variety: Micro, Macro

Query Volume: Query concurrency & classes

Query Velocity: Exploratory, Continuous
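A bigif can be thought of as one choice per dimension. The sketch below checks such a choice against the values listed on this slide. The dictionary keys, the function name validate_bigif, and the modeling of Query Volume as a count of concurrent streams are assumptions made for illustration, not BigFrame's real input syntax.

```python
from typing import Any, Dict, List

# Allowed values per dimension, copied from the slide.
ALLOWED = {
    "data_variety": {"relational", "text", "array", "graph"},
    "data_volume": {"small", "medium", "large"},
    "data_velocity": {"at rest", "slow", "fast"},
    "query_variety": {"micro", "macro"},
    "query_velocity": {"exploratory", "continuous"},
}


def validate_bigif(bigif: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every dimension is valid."""
    problems: List[str] = []
    # Data variety is a set, since an app may mix several kinds of data.
    unknown = set(bigif["data_variety"]) - ALLOWED["data_variety"]
    if unknown:
        problems.append(f"data_variety: unknown values {sorted(unknown)}")
    for key in ("data_volume", "data_velocity", "query_variety", "query_velocity"):
        if bigif[key] not in ALLOWED[key]:
            problems.append(f"{key}: unknown value {bigif[key]!r}")
    # Assumption: query volume is modeled as a number of concurrent streams.
    if int(bigif["query_streams"]) < 1:
        problems.append("query_streams: must be at least 1")
    return problems


example = {
    "data_variety": {"relational", "text"},
    "data_volume": "large",
    "data_velocity": "fast",
    "query_variety": "macro",
    "query_velocity": "continuous",
    "query_streams": 4,
}
problems = validate_bigif(example)
```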

Page 29:

Benchmark Generation

bigif (benchmark input format)

Benchmark Generator

bspec (benchmark specification)

bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity}

1. Initial data to load

2. Data refresh pattern

3. Query streams

4. Evaluation metrics

Benchmark generation can be addressed as a search problem within a rich application domain
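One way to read "benchmark generation as a search problem" is as a search over a catalog of candidate queries for those that match the requested point. The sketch below uses a naive exhaustive filter. The slides do not describe BigFrame's actual search procedure, and the catalog, the query names, and the function search are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class QueryTemplate:
    name: str
    data_types: FrozenSet[str]  # data varieties the query touches
    variety: str                # "micro" or "macro"
    velocity: str               # "exploratory" or "continuous"


# Hypothetical catalog for an e-commerce and social media domain.
CATALOG = [
    QueryTemplate("sales_by_item", frozenset({"relational"}), "micro", "exploratory"),
    QueryTemplate("tweet_sentiment", frozenset({"text"}), "micro", "exploratory"),
    QueryTemplate("promotion_effect", frozenset({"relational", "text"}),
                  "macro", "exploratory"),
    QueryTemplate("live_sales_dashboard", frozenset({"relational"}),
                  "macro", "continuous"),
]


def search(data_variety: Iterable[str], query_variety: str,
           query_velocity: str) -> List[str]:
    """Keep templates whose data needs are covered and whose labels match."""
    wanted = frozenset(data_variety)
    return [
        t.name for t in CATALOG
        if t.data_types <= wanted
        and t.variety == query_variety
        and t.velocity == query_velocity
    ]


matches = search({"relational", "text"}, "macro", "exploratory")
```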

Page 30:

Application Domain Modeled Currently

E-commerce sales, promotions, recommendations

Social media sentiment & influence

Benchmark generation can be addressed as a search problem within a rich application domain

Page 31:

Application Domain Modeled Currently

Item

Customer

Web_sales

Promotion

Tweets

Relationships

Page 32:

Application Domain Modeled Currently

Item

Web_sales

Promotion

Page 33:

Application Domain Modeled Currently

Page 34:

Benchmark Generation

bigif (benchmark input format)

Benchmark Generator

bspec (benchmark specification)

bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity}

1. Initial data to load

2. Data refresh pattern

3. Query streams

4. Evaluation metrics

BigFrame can generate Data, Queries, and Arrival Patterns with the user-specified {Variety, Volume, Velocity} requirements from the application domain

Page 35:

Use Cases of BigFrame

Page 36:

Use Case I: Exploratory BI

• Large volumes of relational data

• Mostly aggregation and few joins

• Can Spark’s performance match that of an MPP DB?

Data Variety = {Relational}

Query Variety = Micro

BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries

Page 37:

Use Case II: Complex BI

• Large volumes of relational data

• Even larger volumes of text data

• Combined analytics

Data Variety = {Relational, Text}

Query Variety = Macro (application-focused instead of micro-benchmarking)

BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets

Page 38:

Use Case III: Dashboards

• Large volume and velocity of relational and text data

• Continuously-updated Dashboards

Data Velocity = Fast

Query Velocity = Continuous (as opposed to Exploratory)

BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
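A continuous query whose result changes on each data refresh can be sketched as an incrementally maintained aggregate. The class name, the (item, amount) schema, and the sample batches below are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sale = Tuple[str, float]  # (item, amount); illustrative schema


class ContinuousRevenueQuery:
    """Maintains revenue per item incrementally as refresh batches arrive."""

    def __init__(self) -> None:
        self.revenue: Dict[str, float] = defaultdict(float)

    def on_refresh(self, batch: Iterable[Sale]) -> Dict[str, float]:
        """Fold one refresh batch into the running totals and return a snapshot."""
        for item, amount in batch:
            self.revenue[item] += amount
        return dict(self.revenue)  # dashboard state after this refresh


query = ContinuousRevenueQuery()
snapshots: List[Dict[str, float]] = [
    query.on_refresh([("book", 10.0), ("pen", 2.0)]),
    query.on_refresh([("book", 5.0)]),
]
```

Each call to on_refresh plays the role of one data refresh, and the returned snapshot is what a dashboard would redraw.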

Page 39:

Use Case IV: Does One Size Fit All?

• Growing set of applications have to process relational, text, & graph data

• Compose “best of breed” systems or use a “one size fits all” system?

Data Variety = {Relational, Text, Graph}

Query Variety = Macro

BigFrame will generate a benchmark specification that includes composite workflows with relational, text, and graph analytics

Page 40:

Use Case V: Multi-tenancy and SLAs

• Big data deployments are increasingly multi-tenant and need to meet SLAs

Specified through Query Volume dimension

BigFrame can generate a benchmark specification containing a specified number of concurrent query streams with class labels for queries (e.g., Batch, Interactive, or Streaming)
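A specified number of concurrent query streams with class labels might look like the structure produced below. This is a sketch under assumptions: the function generate_streams, the query naming scheme, and the random assignment of labels are invented for illustration. Only the three class labels come from the slide.

```python
import random
from typing import Dict, List

QUERY_CLASSES = ["Batch", "Interactive", "Streaming"]  # labels named on the slide


def generate_streams(num_streams: int, queries_per_stream: int,
                     seed: int = 0) -> List[List[Dict[str, str]]]:
    """Build `num_streams` streams, each a list of labeled placeholder queries."""
    rng = random.Random(seed)  # seeded so a generated benchmark is repeatable
    streams = []
    for s in range(num_streams):
        stream = [
            {"query": f"s{s}_q{i}", "class": rng.choice(QUERY_CLASSES)}
            for i in range(queries_per_stream)
        ]
        streams.append(stream)
    return streams


streams = generate_streams(num_streams=4, queries_per_stream=3)
```

A driver could then run the four streams concurrently and report metrics per class, for example to check an SLA for Interactive queries while Batch queries run alongside.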

Page 41:

Working with the Community

• First release of BigFrame planned for August 2013

• With feedback from benchmark developers (BigBench)

• Open-source with extensibility APIs

• Benchmark Drivers for more systems

• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)

• Instantiate the BigFrame pipeline for more app domains

Page 42:

Take Away

• “Benchmarks shape a field (for better or worse) …”

-- David Patterson, Univ. of California, Berkeley

• Benchmarks meet different needs for different people

• End customers, application developers, system designers, system administrators, researchers, CIOs

• BigFrame helps users generate benchmarks that best meet their needs