The BigFrame Team Duke University, Hong Kong Polytechnic University, and HP Labs.

Post on 30-Mar-2015

Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

MPP DB: Gamma, Aster, Netezza, DB2 PE, Teradata, SQL Server Parallel Data Warehouse, Greenplum

Columnar: HP Vertica, ParAccel, Redshift, Vectorwise

MapReduce: Hadoop, Tenzing, Hive, Mahout, HadoopDB, Pig

Dataflow: Dremel, Drill, Stinger, Impala, Spark, Dryad, SCOPE

Mixed: Cassandra, HBase, Bigtable, Druid, HANA, Spanner, Megastore, Splunk

Streaming: Storm, Streambase

Graph: GraphLab, Cassovary, GraphX, Pregel, HAMA

Text Analytics: Solr, ElasticSearch, Cloudera Search

Array DB: SciDB, MadLINQ

Multi-tenant: Mesos, YARN, Serengeti, Cloud platforms

What does this mean for Big Data Practitioners?

Gives them a lot of power!

From: http://animeonly.org/Digital-Wallpapers/Digital-renders/Spiderman-95061p.html

Even the mighty may need a little help

Challenges for Practitioners

Which system to use for the app that I am developing?

• Features (e.g., graph data)

• Performance (e.g., claims like System A is 50x faster than B)

• Resource efficiency

• Growth and scalability

• Multi-tenancy

App Developers, Data Scientists

Different parts of my app have different requirements

Compose “best of breed” systems OR use a “one size fits all” system?

Managing many systems is hard!

System Admins

Total Cost of Ownership (TCO)?

CIO

Numbers make decisions easier

Need benchmarks

One Approach

Develop a benchmark per system category

Categorize systems

Useful, But …

Star Schema Benchmark
TPC-H / TPC-DS
Counting triangles
Terasort
GridMix
SWIM
HiBench
DFSIO
MapReduce vs. Parallel DB / Hive Benchmark (in HiBench) / Berkeley Big Data Benchmark
Yahoo Cloud Serving Benchmark (YCSB)
YCSB Variants
CH-benCHmark
MulTe
Graph 500
PageRank
RDF Benchmarks
Information Extraction Benchmark
Linear Road
SS-DB

Problem #1: May Miss the Big Picture

Per-category benchmarks cannot capture the complexities and end-to-end behavior of big data applications and deployments:

(i) Bottlenecks
(ii) Data conversion, transfer, & loading overheads
(iii) Storage costs & other parts of the data life-cycle
(iv) Resource management challenges
(v) Total Cost of Ownership (TCO)

Give a man a fish and you will feed him for a day.

Give him fishing gear and you will feed him for life.

-- Anonymous

Problem #2

Benchmark

Benchmark Generator

BigFrame: A Benchmark Generator for Big Data Analytics

How a user uses BigFrame

BigFrame Interface -> bigif (benchmark input format) -> Benchmark Generator -> bspec (benchmark specification) -> Benchmark Driver for System Under Test -> run the benchmark -> results

System Under Test: HBase, Hive, MapReduce
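The flow above can be sketched in a few lines of Python. This is only an illustration of the generator-then-driver shape, not BigFrame's actual API: `generate_bspec`, `run_benchmark`, the dictionary keys, and the query names are all hypothetical stand-ins.

```python
import time
from typing import Callable, Dict, List


def generate_bspec(bigif: Dict[str, str]) -> Dict[str, List[str]]:
    """Hypothetical generator: turn a bigif description into a tiny bspec."""
    if bigif.get("query_variety") == "micro":
        queries = ["q_aggregate"]   # placeholder name for a micro-benchmark query
    else:
        queries = ["q_workflow"]    # placeholder name for an application-level workflow
    return {"initial_load": ["web_sales"], "query_stream": queries}


def run_benchmark(bspec: Dict[str, List[str]],
                  execute: Callable[[str], None]) -> Dict[str, float]:
    """Hypothetical driver: run each query via `execute`, record wall-clock seconds."""
    results = {}
    for query in bspec["query_stream"]:
        start = time.perf_counter()
        execute(query)  # system-specific call (HBase, Hive, MapReduce, ...)
        results[query] = time.perf_counter() - start
    return results


executed = []  # stands in for a real system under test
bspec = generate_bspec({"query_variety": "micro"})
results = run_benchmark(bspec, executed.append)
```

Keeping the driver behind a single `execute` callable mirrors the idea that only the Benchmark Driver is specific to the system under test.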

bspec: Benchmark Specification

1. Data for initial load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics

[Figure: axis label Time; System Under Test: HBase, Hive, MapReduce]
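One way to picture the four bspec components is as a small record type. The sketch below is an assumption about shape only: the class names, field names, and the default metric names ("latency", "throughput") are hypothetical and do not come from BigFrame.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryStream:
    """One stream of queries issued to the system under test (hypothetical shape)."""
    name: str
    queries: List[str]


@dataclass
class BenchmarkSpec:
    """Hypothetical container mirroring the four bspec components."""
    initial_load: List[str]            # 1. data sets to load before the run
    refresh_pattern: List[float]       # 2. refresh times, in seconds from the start
    query_streams: List[QueryStream]   # 3. query streams
    metrics: List[str] = field(        # 4. evaluation metrics (names are assumptions)
        default_factory=lambda: ["latency", "throughput"]
    )


spec = BenchmarkSpec(
    initial_load=["web_sales", "tweets"],
    refresh_pattern=[60.0, 120.0],
    query_streams=[QueryStream("bi", ["SELECT COUNT(*) FROM web_sales"])],
)
```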

What does the user (want to) specify?

BigFrame Interface -> bigif (benchmark input format)

The 3Vs

Volume, Variety, Velocity

bigif: BigFrame’s Input Format

Data Variety: Relational, text, array, graph
Data Volume: Small, medium, large
Data Velocity: At rest, slow, fast
Query Variety: Micro, Macro
Query Volume: Query concurrency & classes
Query Velocity: Exploratory, Continuous
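The six dimensions can be written down as a validated record. This is a sketch under assumptions: the `Bigif` class, its field names, and the lowercase value spellings are hypothetical, and Query Volume is reduced here to a single concurrency count.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Allowed values, taken from the dimensions listed above (spelling is ours).
DATA_VARIETY = frozenset({"relational", "text", "array", "graph"})
DATA_VOLUME = ("small", "medium", "large")
DATA_VELOCITY = ("at_rest", "slow", "fast")
QUERY_VARIETY = ("micro", "macro")
QUERY_VELOCITY = ("exploratory", "continuous")


@dataclass(frozen=True)
class Bigif:
    """Hypothetical in-memory form of one bigif point."""
    data_variety: FrozenSet[str]
    data_volume: str
    data_velocity: str
    query_variety: str
    query_velocity: str
    query_concurrency: int = 1  # Query Volume: number of concurrent query streams

    def __post_init__(self):
        # Reject anything outside the discrete space.
        if not self.data_variety or not self.data_variety <= DATA_VARIETY:
            raise ValueError(f"bad data variety: {set(self.data_variety)}")
        checks = [
            (self.data_volume, DATA_VOLUME),
            (self.data_velocity, DATA_VELOCITY),
            (self.query_variety, QUERY_VARIETY),
            (self.query_velocity, QUERY_VELOCITY),
        ]
        for value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{value!r} not in {allowed}")
        if self.query_concurrency < 1:
            raise ValueError("query_concurrency must be at least 1")


# Relational data with micro queries; the velocity settings are assumed for illustration.
exploratory_bi = Bigif(frozenset({"relational"}), "large", "at_rest", "micro", "exploratory")
```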

Benchmark Generation

bigif (benchmark input format) -> Benchmark Generator -> bspec (benchmark specification)

bigif describes points in a discrete space of {Data, Query} X {Variety, Volume, Velocity}

bspec contains:
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics

Benchmark generation can be addressed as a search problem within a rich application domain
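To make the "discrete space" and "search problem" framing concrete, the sketch below enumerates the points and filters them by requirements. It assumes that Data Variety is any non-empty combination of the four data types and leaves out Query Volume (concurrency), which is not a small fixed set; the function names and value spellings are hypothetical. Under those assumptions the space has 15 x 3 x 3 x 2 x 2 = 540 points.

```python
from itertools import combinations, product

DATA_TYPES = ("relational", "text", "array", "graph")
DATA_VOLUMES = ("small", "medium", "large")
DATA_VELOCITIES = ("at_rest", "slow", "fast")
QUERY_VARIETIES = ("micro", "macro")
QUERY_VELOCITIES = ("exploratory", "continuous")


def data_varieties():
    """Yield every non-empty combination of data types (an assumption)."""
    for size in range(1, len(DATA_TYPES) + 1):
        for combo in combinations(DATA_TYPES, size):
            yield frozenset(combo)


def all_points():
    """Enumerate the discrete space as plain dictionaries."""
    for variety, volume, velocity, q_variety, q_velocity in product(
        data_varieties(), DATA_VOLUMES, DATA_VELOCITIES,
        QUERY_VARIETIES, QUERY_VELOCITIES,
    ):
        yield {
            "data_variety": variety,
            "data_volume": volume,
            "data_velocity": velocity,
            "query_variety": q_variety,
            "query_velocity": q_velocity,
        }


def search(**required):
    """Return the points whose dimensions match every required value."""
    return [
        point for point in all_points()
        if all(point[key] == value for key, value in required.items())
    ]


points = list(all_points())
dashboards = search(data_velocity="fast", query_velocity="continuous")
```

A real generator would also have to pick data and queries from the application domain for the chosen point; this sketch covers only the enumeration step.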

Application Domain Modeled Currently

E-commerce sales, promotions, recommendations

Social media sentiment & influence

Item, Customer, Web_sales, Promotion, Tweets, Relationships
BigFrame can generate Data, Queries, and Arrival Patterns with the user-specified {Variety, Volume, Velocity} requirements from the application domain

Use Cases of BigFrame

Use Case I: Exploratory BI

• Large volumes of relational data

• Mostly aggregation and few joins

• Can Spark’s performance match that of an MPP DB?

Data Variety = {Relational}

Query Variety = Micro

BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries

Use Case II: Complex BI

• Large volumes of relational data

• Even larger volumes of text data

• Combined analytics

Data Variety = {Relational, Text}

Query Variety = Macro (application-focused instead of micro-benchmarking)

BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets

Use Case III: Dashboards

• Large volume and velocity of relational and text data

• Continuously-updated Dashboards

Query Velocity = Continuous (as opposed to Exploratory)

Data Velocity = Fast

BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
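A continuous query whose result changes upon data refresh can be illustrated with an incrementally maintained aggregate. This is a toy sketch, not a query BigFrame generates: the `ContinuousSum` class and the sample rows are hypothetical.

```python
from collections import defaultdict


class ContinuousSum:
    """Maintains SUM(amount) GROUP BY item across refresh batches."""

    def __init__(self):
        self.totals = defaultdict(float)

    def refresh(self, batch):
        """Apply one refresh batch of (item, amount) rows; return the new result."""
        for item, amount in batch:
            self.totals[item] += amount
        return dict(self.totals)  # copy, so earlier results stay unchanged


query = ContinuousSum()
first = query.refresh([("book", 10.0), ("pen", 2.0)])
second = query.refresh([("book", 5.0)])  # the dashboard value for "book" changes
```

A benchmark driver for this use case would measure how quickly the system under test brings the result up to date after each refresh.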

Use Case IV: Does One Size Fit All?

• Growing set of applications have to process relational, text, & graph data

• Compose “best of breed” systems or use a “one size fits all” system?

Data Variety = {Relational, Text, Graph}

Query Variety = Macro

BigFrame will generate a benchmark specification that includes composite workflows with relational, text, and graph analytics

Use Case V: Multi-tenancy and SLAs

• Big data deployments are increasingly multi-tenant and need to meet SLAs

Specified through the Query Volume dimension

BigFrame can generate a benchmark specification containing a specified number of concurrent query streams with class labels for queries (e.g., Batch, Interactive, or Streaming)
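The idea of a specified number of concurrent query streams with class labels can be sketched as follows. Everything here is an assumption for illustration: `make_streams`, `LabeledQuery`, the random assignment of classes, and the placeholder query names are hypothetical and do not reflect how BigFrame assigns labels.

```python
import random
from dataclasses import dataclass
from typing import List

QUERY_CLASSES = ("Batch", "Interactive", "Streaming")  # class labels named on the slide


@dataclass
class LabeledQuery:
    stream_id: int
    query_class: str
    text: str


def make_streams(num_streams: int, queries_per_stream: int,
                 seed: int = 0) -> List[List[LabeledQuery]]:
    """Build `num_streams` query streams; each query gets a class label.

    Labels are drawn at random with a fixed seed so runs are repeatable.
    """
    rng = random.Random(seed)
    streams = []
    for stream_id in range(num_streams):
        stream = [
            LabeledQuery(stream_id, rng.choice(QUERY_CLASSES), f"query_{stream_id}_{i}")
            for i in range(queries_per_stream)
        ]
        streams.append(stream)
    return streams


streams = make_streams(num_streams=4, queries_per_stream=3)
```

With labels attached, a driver could report metrics per class and check each class against its own SLA.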

Working with the Community

• First release of BigFrame planned for August 2013

• With feedback from benchmark developers (BigBench)

• Open-source with extensibility APIs

• Benchmark Drivers for more systems

• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)

• Instantiate the BigFrame pipeline for more app domains

Take Away

• “Benchmarks shape a field (for better or worse) …”

-- David Patterson, Univ. of California, Berkeley

• Benchmarks meet different needs for different people

• End customers, application developers, system designers, system administrators, researchers, CIOs

• BigFrame helps users generate benchmarks that best meet their needs
