Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
Feb 24, 2016
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

Aggregates (AVG, COUNT, SUM, STDEV, PERCENTILE, etc.):
blinkdb> SELECT AVG(jobtime) FROM very_big_log

Filters and GROUP BY clauses:
blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'

Joins, nested queries, etc.:
blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'

ML primitives and user-defined functions:
blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
100 TB on 1000 machines
- Hard Disks: ½ - 1 Hour
- Memory: 1 - 5 Minutes
- ?: 1 second
Query Execution on Samples

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
What is the average buffering ratio in the table?
Answer on the full table: 0.2325
Uniform Sample (sampling rate 1/4):

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Answer on the sample: 0.19 (full table: 0.2325)
Answer on the sample with error bars: 0.19 +/- 0.05 (full table: 0.2325)
Uniform Sample (sampling rate 1/2):

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2

Answer on the larger sample: 0.22 +/- 0.02 (smaller sample: 0.19 +/- 0.05; full table: 0.2325)
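
The point estimates on these slides can be checked with a few lines of Python. This is only an arithmetic check, using the values as they are printed on the slides; the variable names are illustrative, and the error bars are not derived here.

```python
# Buffering ratios of all 12 rows, as listed in the full table on the slides.
full_table = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

# Sample values as printed on the two sample slides.
sample_quarter = [0.13, 0.25, 0.19]                    # sampling rate 1/4
sample_half = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]     # sampling rate 1/2


def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


exact = average(full_table)
est_quarter = average(sample_quarter)
est_half = average(sample_half)

print(f"full table: {exact:.4f}")
print(f"1/4 sample: {est_quarter:.2f}")
print(f"1/2 sample: {est_half:.2f}")
```

The larger sample lands closer to the full-table answer, which is the trade-off the next slides describe.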
Speed/Accuracy Trade-off
[Figure: error versus execution time (sample size). Marked on the time axis: interactive queries (2 sec) and time to execute on entire dataset (30 mins). Marked on the error axis: pre-existing noise.]
Sampling Vs. No Sampling
[Figure: query response time (seconds) versus fraction of full data, with error bars.]

Fraction of full data  Response Time (s)  Error Bars
1                      1020               -
10^-1                  103                0.02%
10^-2                  18                 0.07%
10^-3                  13                 1.1%
10^-4                  10                 3.4%
10^-5                  8                  11%

10x, as response time is dominated by I/O
What is BlinkDB?
A framework built on Shark and Spark that ...
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
Uniform Samples

Input table:

ID  City      Data
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
The uniform sampling operator (U):
1. FILTER rand() < 1/3
2. Adds per-row weights
3. (Optional) ORDER BY rand()
Resulting uniform sample:

ID  City      Data  Weight
2   NYC       0.13  1/3
8   NYC       0.25  1/3
6   Berkeley  0.09  1/3
11  NYC       0.19  1/3
Doesn’t change Shark RDD Semantics
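
The three steps of the uniform sampling operator can be sketched in plain Python. This is an illustration of the idea, not BlinkDB's implementation: `uniform_sample`, its parameters, and the dictionary row layout are assumptions made for the example.

```python
import random


def uniform_sample(rows, rate, seed=None, shuffle=True):
    """Keep each row with probability `rate` and tag it with a weight.

    rows: list of dicts (hypothetical row layout).
    rate: sampling rate, e.g. 1/3 as on the slide.
    """
    rng = random.Random(seed)
    sample = []
    for row in rows:
        # Step 1: FILTER rand() < rate
        if rng.random() < rate:
            # Step 2: add a per-row weight equal to the sampling rate
            sample.append({**row, "weight": rate})
    # Step 3 (optional): ORDER BY rand()
    if shuffle:
        rng.shuffle(sample)
    return sample


rows = [{"id": i, "data": i / 10} for i in range(1, 13)]
for row in uniform_sample(rows, rate=1 / 3, seed=42):
    print(row)
```

Because every row is kept independently, the sample size varies from run to run; only its expected size is `rate` times the table size.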
Stratified Samples

Input table:

ID  City      Data
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
The stratified sampling operator (S) is built from several steps.

SPLIT (S1): split the input table.
GROUP (S2): count the rows per city.

City      Count
NYC       7
Berkeley  5
A sampling ratio is then attached to each group:

City      Count  Ratio
NYC       7      2/7
Berkeley  5      2/5
JOIN: join the per-city counts and ratios (S2) with the table rows (S1).
U: sample the joined rows, using each city's ratio as the row weight.

Resulting stratified sample:

ID  City      Data  Weight
2   NYC       0.13  2/7
8   NYC       0.25  2/7
6   Berkeley  0.09  2/5
12  Berkeley  0.49  2/5
Doesn’t change Shark RDD Semantics
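
The GROUP, ratio, JOIN, and sampling steps can be sketched in Python on the 12-row table. This is a simplified illustration under stated assumptions: the per-city ratio is taken to be cap / count with a hypothetical cap of 2 (which reproduces the 2/7 and 2/5 on the slides), and the function name and tuple layout are made up for the example.

```python
import random
from collections import Counter

# (ID, City, Data) rows as listed in the input table on the slides.
TABLE = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]


def stratified_sample(rows, cap, seed=None):
    """Sample roughly `cap` rows per city (assumed ratio: cap / count)."""
    rng = random.Random(seed)
    # GROUP: number of rows per city
    counts = Counter(city for _, city, _ in rows)
    # Ratio per city, never above 1
    ratios = {city: min(1.0, cap / n) for city, n in counts.items()}
    sample = []
    for row_id, city, data in rows:
        ratio = ratios[city]          # JOIN: look up the row's city ratio
        if rng.random() < ratio:      # filter rand() < ratio
            sample.append((row_id, city, data, ratio))
    return counts, ratios, sample


counts, ratios, sample = stratified_sample(TABLE, cap=2, seed=0)
print(dict(counts))
print(ratios)
for row in sample:
    print(row)
```

Rows are kept independently, so each city contributes about `cap` rows on average rather than exactly `cap`.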
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
[Diagram: the aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) runs on the sample and returns A ± ε.]
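
For AVG, a closed-form error bar based on the Central Limit Theorem can be sketched as below. This is a textbook normal-approximation interval, not necessarily the exact formula BlinkDB uses: the function name is illustrative, z = 1.96 assumes a 95% confidence level, and no finite-population correction is applied.

```python
import math
import statistics


def mean_with_error(values, z=1.96):
    """Return (mean, half-width of the confidence interval).

    Uses the CLT approximation: mean +/- z * s / sqrt(n),
    where s is the sample standard deviation.
    """
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, z * std_err


mean, err = mean_with_error([0.13, 0.25, 0.19, 0.09, 0.18, 0.49])
print(f"{mean:.2f} +/- {err:.2f}")
```

With only a handful of rows the normal approximation is rough, so the output need not match the illustrative error bars on the earlier slides.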
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.
[Diagram: a resampling operator (R) draws resamples from the sample; the aggregate A is computed on each resample (A1, A2, ..., A100); the results are combined (B) into an error estimate ±ε.]
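
A minimal bootstrap sketch in Python, assuming 100 resamples as in the diagram and using the standard deviation of the resample estimates as the error. The function name and the choice of the median as the example aggregate are illustrative assumptions, not taken from the slides.

```python
import random
import statistics


def bootstrap_error(values, aggregate, num_resamples=100, seed=None):
    """Estimate the error of `aggregate(values)` with the bootstrap.

    Draws `num_resamples` resamples of the same size, with replacement,
    evaluates the aggregate on each, and returns
    (estimate on the sample, standard deviation across resamples).
    """
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(num_resamples):
        resample = rng.choices(values, k=n)    # sampling with replacement
        estimates.append(aggregate(resample))
    return aggregate(values), statistics.stdev(estimates)


data = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]
estimate, error = bootstrap_error(data, statistics.median, seed=1)
print(f"median = {estimate:.3f} +/- {error:.3f}")
```

Any function of the sample can be passed as `aggregate`, which is why the bootstrap applies where no closed form exists.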
Kleiner’s Diagnostics [KDD 13]
[Figure: error versus sample size.]
- More Data, Higher Accuracy
- 300 Data Points, 97% Accuracy
Single Pass Execution
Approximate Query on a Sample
[Diagram: the aggregate A runs on the sample.]

State  Age  Metric  Weight
CA     20   1971    1/4
CA     22   2819    1/4
MA     22   3819    1/4
MA     30   3091    1/4
Resampling Operator
[Diagram: a resampling operator (R) is added to the plan between the sample and the aggregate A.]
Sample “Pushdown”
After pushdown, the resampling operator (R) receives only these rows and columns:

Metric  Weight
1971    1/4
3819    1/4
Resampling Operator
R draws resamples with replacement from its two input rows, for example:

Resample 1:
Metric  Weight
1971    1/4
1971    1/4

Resample 2:
Metric  Weight
3819    1/4
3819    1/4
[Diagram: the aggregate runs on the sample (A) and on each resample (A1, ..., An).]
Leverage Poissonized Resampling to generate
samples with replacement
Sample from a Poisson (1) Distribution

Metric  Weight  S1
1971    1/4     2
3819    1/4     1
Construct all Resamples in Single Pass

Metric  Weight  S1  ...  Sk
1971    1/4     2   ...  1
3819    1/4     1   ...  0
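
Poissonized resampling can be sketched in Python as follows: each row gets one Poisson(1) count per resample, and every resample aggregate is then computed from those count columns in one pass over the rows. The function names are illustrative, the Poisson draw uses Knuth's simple multiplication method, and AVG is used as the example aggregate; none of this is taken from BlinkDB's code.

```python
import math
import random


def poisson1(rng):
    """Draw from Poisson(1) using Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def poissonized_weights(num_rows, num_resamples, seed=None):
    """One row of counts per data row: columns S1 ... Sk."""
    rng = random.Random(seed)
    return [[poisson1(rng) for _ in range(num_resamples)]
            for _ in range(num_rows)]


def resample_averages(values, weights):
    """Compute AVG on every resample in a single pass over the rows."""
    k = len(weights[0])
    totals = [0.0] * k
    counts = [0] * k
    for value, row_weights in zip(values, weights):
        for j, w in enumerate(row_weights):
            totals[j] += w * value   # a row with count w appears w times
            counts[j] += w
    # A resample can be empty when all of its counts are 0.
    return [t / c if c else None for t, c in zip(totals, counts)]


# Counts from the slide: 1971 -> (S1=2, Sk=1), 3819 -> (S1=1, Sk=0)
print(resample_averages([1971, 3819], [[2, 1], [1, 0]]))

metric = [1971, 2819, 3819, 3091]
print(resample_averages(metric, poissonized_weights(len(metric), 5, seed=3)))
```

Since each row's counts are drawn independently of all other rows, the work splits cleanly across partitions.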
Weight columns carried with each row of the sample:
- SAMPLE: S
- BOOTSTRAP WEIGHTS: S1 ... S100
- DIAGNOSTICS WEIGHTS: Da1 ... Da100, Db1 ... Db100, Dc1 ... Dc100

Additional Overhead: 200 bytes/row
Embarrassingly Parallel
[Figure: response time (s) for single pass execution, broken down into query execution, error estimation overhead, and diagnostics overhead.]
BlinkDB Architecture
[Diagram components:]
- Command-line Shell, Thrift/JDBC
- Driver
- SQL Parser, Query Optimizer, Physical Plan, Execution
- SerDes, UDFs
- Metastore
- Hadoop/Spark/Presto
- Hadoop Storage (e.g., HDFS, HBase, Presto)
Getting Started
1. Alpha 0.1.1 released and available at http://blinkdb.org
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL
4. Compatible with Apache Hive, AMP Lab’s Shark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
Feature Roadmap
1. Support a majority of Hive Aggregates and User-Defined Functions
2. Runtime Correctness Tests
3. Integrating BlinkDB with Shark and Facebook’s Presto as an experimental feature
4. Automatic Sample Management
5. Streaming Support
Summary
1. Approximate querying is an important means of achieving interactivity when processing large datasets
2. BlinkDB:
- returns approximate answers with error bars by executing queries on small samples of data
- supports existing Hive/Shark/Presto queries
3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers
Thanks!