Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

Feb 24, 2016

Page 1: Approximate Queries on Very Large Data

Approximate Queries on Very Large Data

UC Berkeley

Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

Page 2: Approximate Queries on Very Large Data

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

Page 3: Approximate Queries on Very Large Data

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime) FROM very_big_log

AVG, COUNT, SUM, STDEV, PERCENTILE etc.

Page 4: Approximate Queries on Very Large Data

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'

FILTERS, GROUP BY clauses

Page 5: Approximate Queries on Very Large Data

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime) FROM very_big_log
         LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id
         WHERE src = 'hadoop'

JOINS, Nested Queries etc.

Page 6: Approximate Queries on Very Large Data

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime) FROM very_big_log
         LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id
         WHERE src = 'hadoop'

ML Primitives, User Defined Functions

Page 7: Approximate Queries on Very Large Data

100 TB on 1000 machines

Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second

Query Execution on Samples

Page 8: Approximate Queries on Very Large Data

Query Execution on Samples
What is the average buffering ratio in the table?

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

Average over the full table: 0.2325

Page 9: Approximate Queries on Very Large Data

Query Execution on Samples
What is the average buffering ratio in the table?

Uniform Sample

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Average over the sample: 0.19
Average over the full table: 0.2325

Page 10: Approximate Queries on Very Large Data

Query Execution on Samples
What is the average buffering ratio in the table?

Uniform Sample

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Average over the sample: 0.19 ± 0.05
Average over the full table: 0.2325

Page 11: Approximate Queries on Very Large Data

Query Execution on Samples
What is the average buffering ratio in the table?

Uniform Sample

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2

Average over the 1/2 sample: 0.22 ± 0.02
Average over the 1/4 sample: 0.19 ± 0.05
Average over the full table: 0.2325
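
The averages quoted on these slides can be checked by hand. The short Python sketch below is only an arithmetic check, not BlinkDB code; the variable names are made up for illustration, and the sample rows are copied as printed on the slides (their values do not all match the full table, so they are kept in separate lists).

```python
# Buffering ratios from the 12-row example table (ID -> Buff Ratio).
table = {1: 0.78, 2: 0.13, 3: 0.25, 4: 0.19, 5: 0.11, 6: 0.09,
         7: 0.18, 8: 0.15, 9: 0.13, 10: 0.49, 11: 0.19, 12: 0.10}

# Sample rows exactly as printed on the slides (hypothetical variable names).
sample_quarter = [0.13, 0.25, 0.19]                  # sampling rate 1/4
sample_half = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]   # sampling rate 1/2


def mean(values):
    """Plain average; with a uniform sample every row carries the same weight."""
    values = list(values)
    return sum(values) / len(values)


exact = mean(table.values())
est_quarter = mean(sample_quarter)
est_half = mean(sample_half)

print(f"full table: {exact:.4f}")
print(f"1/4 sample: {est_quarter:.2f}")
print(f"1/2 sample: {est_half:.2f}")
```

The larger sample lands closer to the exact answer, which is the trade-off the next slides plot.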

Page 12: Approximate Queries on Very Large Data

Speed/Accuracy Trade-off

[Figure: Error (y-axis) vs. Execution Time (Sample Size) (x-axis). Marked points: Interactive Queries at 2 sec; Time to Execute on Entire Dataset at 30 mins.]

Page 13: Approximate Queries on Very Large Data

Speed/Accuracy Trade-off

[Figure: Error (y-axis) vs. Execution Time (Sample Size) (x-axis). Marked points: Interactive Queries at 2 sec; Time to Execute on Entire Dataset at 30 mins. Additional label: Pre-Existing Noise.]

Page 14: Approximate Queries on Very Large Data

Sampling Vs. No Sampling

[Figure: Query Response Time (Seconds) vs. Fraction of full data]

Fraction of full data  Query Response Time (Seconds)
1                      1020
10^-1                  103
10^-2                  18
10^-3                  13
10^-4                  10
10^-5                  8

10x as response time is dominated by I/O

Page 15: Approximate Queries on Very Large Data

Sampling Vs. No Sampling

[Figure: Query Response Time (Seconds) vs. Fraction of full data, with Error Bars]

Fraction of full data  Query Response Time (Seconds)  Error Bars
1                      1020
10^-1                  103                            (0.02%)
10^-2                  18                             (0.07%)
10^-3                  13                             (1.1%)
10^-4                  10                             (3.4%)
10^-5                  8                              (11%)

Page 16: Approximate Queries on Very Large Data

What is BlinkDB?
A framework built on Shark and Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

Page 17: Approximate Queries on Very Large Data

What is BlinkDB?
A framework built on Shark and Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

Page 18: Approximate Queries on Very Large Data

Uniform Samples

[Diagram: nodes labeled 1, 2, 3, 4]

Page 19: Approximate Queries on Very Large Data

Uniform Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator U]

Page 20: Approximate Queries on Very Large Data

Uniform Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator U]

ID  City      Data
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

Page 21: Approximate Queries on Very Large Data

Uniform Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator U applied to the input table (ID, City, Data)]

U:
1. FILTER rand() < 1/3
2. Adds per-row Weights
3. (Optional) ORDER BY rand()

Page 22: Approximate Queries on Very Large Data

Uniform Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator U applied to the input table (ID, City, Data)]

Output of U:

ID  City      Data  Weight
2   NYC       0.13  1/3
8   NYC       0.25  1/3
6   Berkeley  0.09  1/3
11  NYC       0.19  1/3

Doesn't change Shark RDD Semantics
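
The three steps of the U operator can be sketched in a few lines of Python. This is an illustration only, not BlinkDB or Shark code: the function name `uniform_sample` and its parameters are hypothetical, and the weight stored per row is simply the sampling rate, as on the slide.

```python
import random

# The 12-row example table from the slides: (ID, City, Data).
TABLE = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]


def uniform_sample(rows, rate, seed=None, shuffle=False):
    """Keep each row independently with probability `rate` and append the weight."""
    rng = random.Random(seed)
    # Step 1: FILTER rand() < rate. Step 2: add the per-row weight column.
    sample = [row + (rate,) for row in rows if rng.random() < rate]
    # Step 3 (optional): ORDER BY rand().
    if shuffle:
        rng.shuffle(sample)
    return sample


sample = uniform_sample(TABLE, rate=1 / 3, seed=42)
for row in sample:
    print(row)
```

Because each row is kept independently, the sample size varies from run to run; the four rows on the slide are one possible outcome for a rate of 1/3.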

Page 23: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4]

Page 24: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator S]

Page 25: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator S]

ID  City      Data
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

Page 26: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and an operator S1]

S1: SPLIT
The input table (ID, City, Data) is shown twice, once for each output of the split.

Page 27: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and operators S1, S2 applied to the input table (ID, City, Data)]

S2: GROUP

City      Count
NYC       7
Berkeley  5

Page 28: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and operators S1, S2 applied to the input table (ID, City, Data)]

S2: GROUP

City      Count  Ratio
NYC       7      2/7
Berkeley  5      2/5

Page 29: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and operators S1, S2, S2 applied to the input table (ID, City, Data)]

City      Count  Ratio
NYC       7      2/7
Berkeley  5      2/5

S2: JOIN

Page 30: Approximate Queries on Very Large Data

Stratified Samples

[Diagram: nodes labeled 1, 2, 3, 4 and operators S1, S2, S2, U]

ID  City      Data  Weight
2   NYC       0.13  2/7
8   NYC       0.25  2/7
6   Berkeley  0.09  2/5
12  Berkeley  0.49  2/5

Doesn't change Shark RDD Semantics
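
The SPLIT, GROUP, JOIN and U steps shown on the preceding slides can be sketched as follows. This is a rough Python illustration under stated assumptions, not BlinkDB code: `stratified_sample`, `key_index` and `cap` are hypothetical names, the per-city ratio is assumed to be cap / count with cap = 2 (which reproduces the 2/7 and 2/5 on the slides), and the final filter is the same rand() < ratio test used for uniform samples.

```python
import random
from collections import Counter

# The 12-row example table from the slides: (ID, City, Data).
TABLE = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]


def stratified_sample(rows, key_index, cap, seed=None):
    """Sample each stratum at rate min(1, cap / stratum size)."""
    rng = random.Random(seed)
    # GROUP: count the rows in each stratum (here, per city).
    counts = Counter(row[key_index] for row in rows)
    # Ratio column: target rows per stratum divided by the stratum size.
    ratios = {key: min(1.0, cap / n) for key, n in counts.items()}
    sample = []
    for row in rows:
        # JOIN: look up the ratio for this row's stratum.
        ratio = ratios[row[key_index]]
        # U: keep the row with probability `ratio`, storing it as the weight.
        if rng.random() < ratio:
            sample.append(row + (ratio,))
    return counts, ratios, sample


counts, ratios, sample = stratified_sample(TABLE, key_index=1, cap=2, seed=3)
print(dict(counts))
print(ratios)
for row in sample:
    print(row)
```

Rare groups are sampled at a higher rate than common ones, so a query grouped by city still sees rows for every city. The exact two-rows-per-city result on the slide is one possible outcome of the random filter.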

Page 31: Approximate Queries on Very Large Data

What is BlinkDB?
A framework built on Shark and Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

Page 32: Approximate Queries on Very Large Data

Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

Page 33: Approximate Queries on Very Large Data

Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

Page 34: Approximate Queries on Very Large Data

Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

[Diagram: an aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) over a Sample of nodes 1 and 2, producing A ± ε]
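
For AVG, a closed-form error bar of this kind can be sketched in Python as below. The slides do not give the formula or the confidence level, so this is an assumption: the usual normal approximation, mean ± z · s / sqrt(n), with z = 1.96 for roughly 95% confidence. The function name `avg_with_error` is hypothetical.

```python
import math
import statistics


def avg_with_error(values, z=1.96):
    """Return (mean, half-width) using the normal (CLT) approximation.

    z = 1.96 corresponds to about 95% confidence; this choice is an assumption.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, z * std_err


# The three-row 1/4 sample from the earlier buffering-ratio example.
mean, err = avg_with_error([0.13, 0.25, 0.19])
print(f"{mean:.2f} ± {err:.3f}")
```

With only three rows the approximation is crude, and the result is not expected to match the illustrative ± values on the earlier slides.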

Page 35: Approximate Queries on Very Large Data

Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.

Page 36: Approximate Queries on Very Large Data

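Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.

[Diagram: the Sample passes through a resampling operator R; the aggregate A runs on each resample, giving A1, A2, …, A100, which yield B ± ε]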
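
A minimal Python sketch of the bootstrap idea in the diagram: resample the sample with replacement, rerun the aggregate on each resample, and use the spread of the results as the error estimate. This is an assumption-laden illustration, not BlinkDB's implementation; `bootstrap_error` is a hypothetical name, 100 resamples mirrors the A1 … A100 labels, and using the standard deviation of the resampled results as ε is one common choice.

```python
import random
import statistics


def bootstrap_error(values, aggregate, num_resamples=100, seed=None):
    """Return (aggregate on the sample, std. dev. of the aggregate over resamples)."""
    rng = random.Random(seed)
    n = len(values)
    # R: draw n rows with replacement; A_i: run the aggregate on each resample.
    estimates = [aggregate(rng.choices(values, k=n)) for _ in range(num_resamples)]
    return aggregate(values), statistics.stdev(estimates)


# Works for aggregates without a simple closed form, e.g. the median.
data = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09, 0.18, 0.15, 0.13, 0.49, 0.19, 0.10]
estimate, epsilon = bootstrap_error(data, statistics.median, seed=1)
print(f"{estimate:.3f} ± {epsilon:.3f}")
```

The aggregate is passed in as a plain function, which is why the same procedure applies to UDFs and nested queries.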

Page 37: Approximate Queries on Very Large Data

What is BlinkDB?
A framework built on Shark and Spark that …
- creates and maintains a variety of random and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

Page 38: Approximate Queries on Very Large Data

Kleiner's Diagnostics [KDD 13]

[Figure: Error (y-axis) vs. Sample Size (x-axis)]

More Data → Higher Accuracy
300 Data Points → 97% Accuracy

Page 39: Approximate Queries on Very Large Data

What is BlinkDB?
A framework built on Shark and Spark that …
- creates and maintains a variety of random and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

Page 40: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A]

Approximate Query on a Sample

Page 41: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A]

State  Age  Metric  Weight
CA     20   1971    1/4
CA     22   2819    1/4
MA     22   3819    1/4
MA     30   3091    1/4

Page 42: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, R, A]

State  Age  Metric  Weight
CA     20   1971    1/4
CA     22   2819    1/4
MA     22   3819    1/4
MA     30   3091    1/4

Resampling Operator

Page 43: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

Sample "Pushdown"

State  Age  Metric  Weight
CA     20   1971    1/4
CA     22   2819    1/4
MA     22   3819    1/4
MA     30   3091    1/4

Page 44: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

Metric  Weight
1971    1/4
3819    1/4

Sample "Pushdown"

Page 45: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

Metric  Weight
1971    1/4
3819    1/4

Resampling Operator

Page 46: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

Metric  Weight
1971    1/4
3819    1/4

Resampling Operator

Metric  Weight
1971    1/4
1971    1/4

Metric  Weight
3819    1/4
3819    1/4

Page 47: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R; outputs A, A1, …, An]

Metric  Weight
1971    1/4
3819    1/4

Metric  Weight
1971    1/4
1971    1/4

Metric  Weight
3819    1/4
3819    1/4

Page 48: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R; outputs A, A1, …, An]

Metric  Weight
1971    1/4
3819    1/4

Metric  Weight
1971    1/4
1971    1/4

Metric  Weight
3819    1/4
3819    1/4

Page 49: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

Metric  Weight
1971    1/4
3819    1/4

Leverage Poissonized Resampling to generate samples with replacement

Page 50: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R; output A1]

Metric  Weight  S1
1971    1/4     2
3819    1/4     1

Sample from a Poisson(1) Distribution

Page 51: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R; outputs A1, …, Ak]

Metric  Weight  S1  …  Sk
1971    1/4     2   …  1
3819    1/4     1   …  0

Construct all Resamples in Single Pass
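
The S1 … Sk columns can be read as per-row multiplicities: each row draws an independent Poisson(1) count for every resample, and all k resampled aggregates are accumulated during one scan. The Python sketch below illustrates this for AVG; it is not BlinkDB code. The names `poisson1` and `single_pass_resample_avgs` are hypothetical, the Poisson draw uses Knuth's multiplication method, and the per-row sampling weights are ignored because they are all equal in this example.

```python
import math
import random


def poisson1(rng):
    """Draw one Poisson(1) variate using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def single_pass_resample_avgs(values, num_resamples=100, seed=None):
    """Compute AVG on every Poissonized resample in one scan over `values`."""
    rng = random.Random(seed)
    sums = [0.0] * num_resamples
    counts = [0] * num_resamples
    for value in values:                  # the single pass over the sample
        for j in range(num_resamples):
            mult = poisson1(rng)          # S_j: copies of this row in resample j
            sums[j] += mult * value
            counts[j] += mult
    # A resample is empty if every multiplicity was 0; skip those.
    return [s / c for s, c in zip(sums, counts) if c > 0]


metrics = [1971, 2819, 3819, 3091]        # Metric column from the slides
avgs = single_pass_resample_avgs(metrics, num_resamples=100, seed=7)
print(len(avgs), min(avgs), max(avgs))
```

No resample is ever materialized: only two running totals per resample are kept, which is what makes the approach fit into a single pass.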

Page 52: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

SAMPLE:               S
BOOTSTRAP WEIGHTS:    S1 … S100
DIAGNOSTICS WEIGHTS:  Da1 … Da100, Db1 … Db100, Dc1 … Dc100

Page 53: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

SAMPLE:               S
BOOTSTRAP WEIGHTS:    S1 … S100
DIAGNOSTICS WEIGHTS:  Da1 … Da100, Db1 … Db100, Dc1 … Dc100

Additional Overhead: 200 bytes/row

Page 54: Approximate Queries on Very Large Data

Single Pass Execution

[Diagram: Sample, A, R]

SAMPLE:               S
BOOTSTRAP WEIGHTS:    S1 … S100
DIAGNOSTICS WEIGHTS:  Da1 … Da100, Db1 … Db100, Dc1 … Dc100

Embarrassingly Parallel

Page 55: Approximate Queries on Very Large Data

Single Pass Execution

[Figure: Response Time (s). Component shown: Query Execution]

Page 56: Approximate Queries on Very Large Data

Single Pass Execution

[Figure: Response Time (s). Component shown: Error Estimation Overhead]

Page 57: Approximate Queries on Very Large Data

Single Pass Execution

[Figure: Response Time (s). Component shown: Diagnostics Overhead]

Page 58: Approximate Queries on Very Large Data

BlinkDB Architecture

[Architecture diagram; components shown:]
- Command-line Shell, Thrift/JDBC
- Driver
- SQL Parser, Query Optimizer, Physical Plan, Execution
- SerDes, UDFs
- Metastore
- Hadoop/Spark/Presto
- Hadoop Storage (e.g., HDFS, HBase, Presto)

Page 59: Approximate Queries on Very Large Data

Getting Started
1. Alpha 0.1.1 released and available at http://blinkdb.org
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL
4. Compatible with Apache Hive, AMP Lab's Shark and Facebook's Presto (storage, serdes, UDFs, types, metadata)

Page 60: Approximate Queries on Very Large Data

Feature Roadmap
1. Support a majority of Hive Aggregates and User-Defined Functions
2. Runtime Correctness Tests
3. Integrating BlinkDB with Shark and Facebook's Presto as an experimental feature
4. Automatic Sample Management
5. Streaming Support

Page 61: Approximate Queries on Very Large Data

Summary
1. Approximate queries are an important means to achieve interactivity in processing large datasets
2. BlinkDB
   - returns approximate answers with error bars by executing queries on small samples of data
   - supports existing Hive/Shark/Presto queries
3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers

Thanks!