BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Posted on 17-Jan-2016
Transcript
Motivation
Support interactive SQL-like aggregate queries over massive sets of data.

Feature
Most queries compute a global summary of the whole table.
blinkdb> SELECT AVG(jobtime)
         FROM very_big_log

Aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.
WHERE and GROUP BY semantics restrict queries to subsets of the table.
blinkdb> SELECT AVG(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'

FILTERS, GROUP BY clauses
Feature
Scanning 100 TB on 1000 machines takes roughly ½-1 hour from hard disks and 1-5 minutes from memory; interactive use demands about 1 second.
Query Execution on Samples

ID  City      Buff Ratio
 1  NYC       0.78
 2  NYC       0.13
 3  Berkeley  0.25
 4  NYC       0.19
 5  NYC       0.11
 6  Berkeley  0.09
 7  NYC       0.18
 8  NYC       0.15
 9  Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

What is the average buffering ratio in the table? 0.2325
Query Execution on Samples
What is the average buffering ratio in the table?

Uniform sample (sampling rate 1/4):

ID  City      Buff Ratio  Sampling Rate
 2  NYC       0.13        1/4
 6  Berkeley  0.25        1/4
 8  NYC       0.19        1/4

Sample estimate: 0.19 (true answer: 0.2325)
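The slide's arithmetic is easy to reproduce; a minimal Python sketch using the slide's numbers (the three sampled values are taken verbatim from the slide):

```python
# Reproducing the slide's arithmetic: the true average over all 12 rows
# versus the estimate from the 1/4 uniform sample (values from the slide).

full_table = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]   # all Buff Ratio values
sample = [0.13, 0.25, 0.19]                          # the three sampled rows

true_avg = sum(full_table) / len(full_table)   # 0.2325
estimate = sum(sample) / len(sample)           # 0.19

print(f"true average: {true_avg}, sample estimate: {estimate}")
```

The estimate is close to the truth while touching only a quarter of the rows.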
Query Execution on Samples
What is the average buffering ratio in the table?

Uniform sample (sampling rate 1/4):

ID  City      Buff Ratio  Sampling Rate
 2  NYC       0.13        1/4
 6  Berkeley  0.25        1/4
 8  NYC       0.19        1/4

Sample estimate: 0.19 +/- 0.05 (true answer: 0.2325)
Query Execution on Samples
What is the average buffering ratio in the table?

Uniform sample (sampling rate 1/2):

ID  City      Buff Ratio  Sampling Rate
 2  NYC       0.13        1/2
 3  Berkeley  0.25        1/2
 5  NYC       0.19        1/2
 6  Berkeley  0.09        1/2
 8  NYC       0.18        1/2
12  Berkeley  0.49        1/2

Sample estimate: 0.22 +/- 0.02 (true answer: 0.2325; the 1/4 sample gave 0.19 +/- 0.05)
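The shrinking error bars follow the standard closed-form behaviour for a sample mean: the half-width is proportional to 1/√n. A hedged sketch (the standard deviation `s` is an assumed constant for illustration, not a value from the slide):

```python
# Closed-form error bars shrink as 1/sqrt(n).  The standard deviation s of
# the buffering ratios is an assumed constant here, not the slide's value.

def error_bar(s, n, z=1.96):
    """95% confidence half-width for a sample mean: z * s / sqrt(n)."""
    return z * s / n ** 0.5

s = 0.18  # assumed population standard deviation (illustrative)
for n in (3, 6, 12, 48):
    print(n, round(error_bar(s, n), 4))

# 4x the rows buys only half the error, while scan time grows ~linearly in n.
```

This is exactly the speed/accuracy trade-off the next slide plots: error falls slowly (as 1/√n) while cost rises linearly with n.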
Speed/Accuracy Trade-off

[Figure: error vs. execution time (sample size). Executing on the entire dataset takes about 30 minutes; interactive queries need about 2 seconds. Error shrinks as the sample size grows, but flattens at a floor of pre-existing noise.]
Sampling Vs. No Sampling

[Figure: query response time (seconds) vs. fraction of full data (1 down to 10^-5). Sampling cuts response time from roughly 10^3 seconds on the full data down to 18, 13, 10, and 8 seconds on the smaller fractions; the first drop is only about 10x because response time is dominated by I/O. Error bars on the sampled runs: 0.02%, 0.07%, 1.1%, 3.4%, 11%.]
What is BlinkDB?
A framework built on Shark and Spark that ...
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
BlinkDB
• Background
• System Overview
• Sample Creation
• BlinkDB Runtime
• Implementation & Evaluation
Background
• One common assumption is that future queries will be similar to historical queries.
• The meaning of "similarity" can differ.
• The choice of model of past workloads is one of the key differences between BlinkDB and prior work.
Workload Taxonomy
System Overview
BlinkDB extends the Apache Hive framework by adding two major components:
(1) an offline sampling module that creates and maintains samples over time;
(2) a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries.
Supported Queries
• Standard SQL aggregate queries involving COUNT, AVG, SUM, and QUANTILE. Queries with these operations can be annotated with either an error bound or a time constraint.
• Nested or joined queries are not supported yet, but this is not a fundamental hindrance: it would be straightforward to extend BlinkDB to handle foreign-key joins between two sampled tables (or a self-join on one sampled table) where both tables have a stratified sample on the set of columns used for the join.
Sample Creation
Why are stratified samples useful? Samples carry storage costs, so only a limited number of them can be built.
Stratified Samples
• When is a uniform sample insufficient? A uniform sample may not contain any members of a rare subset at all, leading to a missing row in the final output of the query.
Stratified Samples for a Single Query
• An error or time bound e/t determines a required sample size n (how n is estimated at runtime is illustrated later).
• If uniform sampling is used, the expected number of sampled rows from a group x is n·|T_x|/|T|; when |T_x| is small, this can be very small or even zero.
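The missing-group effect is easy to quantify; a short sketch with illustrative (made-up) table and sample sizes:

```python
# Expected presence of a rare group in a uniform sample: a group x with
# |T_x| rows out of |T| contributes about n * |T_x| / |T| rows to a uniform
# sample of size n, and may be missed entirely.  Numbers are illustrative.

T = 1_000_000   # total rows |T|
Tx = 50         # rows in a rare group |T_x|
n = 10_000      # uniform sample size

expected_rows = n * Tx / T          # expected rows from the group
p_miss = (1 - Tx / T) ** n          # chance the group is absent from the sample

print(f"expected rows from the rare group: {expected_rows}")
print(f"probability the group is missed:   {p_miss:.3f}")
```

With these numbers the sample is expected to hold only half a row from the group, and the group vanishes from the answer about 60% of the time, which is why stratification is needed.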
Stratified Samples for a Single Query
This problem has been studied before [16]. Briefly, since error decreases at a decreasing rate as sample size increases, the optimal choice assigns an equal sample size to each group; moreover, the assignment of sample sizes is deterministic, with each group's sample capped at a common per-group size K.
[16] S. Lohr. Sampling: Design and Analysis. Thomson, 2009.
Optimizing a set of stratified samples for all queries sharing a QCS
• The required sample size n changes from query to query.
Column Selection Optimization
• Sparsity of the data: a stratified sample on ϕ is useful when the original table T contains many small groups under ϕ.
• Workload: a stratified sample is only useful when it benefits actual queries. A query has QCS q_j with some (unknown) probability p_j.
• Storage cost: the cost (in rows) of building a stratified sample on a set of columns ϕ (for simplicity, the per-group cap K is fixed).
• In practice, we set M = K = 100,000.
Let the overall storage capacity budget (in rows) be given. The goal is to select β column sets from among the m possible QCSs. A stratified sample on ϕ_i can also be useful by partially covering q_j.
• The size of this optimization problem increases exponentially with the number of columns in T, which looks worrying. In practice, however, it can be solved by applying simple optimizations, such as considering only column sets that actually occurred in past queries, or eliminating column sets that are unrealistically large.
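To make the selection problem concrete, here is a deliberately simplified greedy sketch, not the paper's actual formulation: pick column sets with the best workload benefit per stored row until the budget runs out. All QCSs, probabilities p_j, and costs below are invented for illustration.

```python
# Simplified greedy sketch of sample selection under a storage budget.
# `candidates` maps a QCS to (workload probability p_j, storage cost in rows);
# every number here is hypothetical.

candidates = {
    ("city",):           (0.40, 100_000),
    ("city", "browser"): (0.25, 400_000),
    ("os",):             (0.20,  80_000),
    ("city", "os"):      (0.15, 500_000),
}

def choose_samples(candidates, budget_rows):
    chosen, used = [], 0
    # greedily take the best benefit-per-row candidates that still fit
    for qcs, (p, cost) in sorted(candidates.items(),
                                 key=lambda kv: kv[1][0] / kv[1][1],
                                 reverse=True):
        if used + cost <= budget_rows:
            chosen.append(qcs)
            used += cost
    return chosen

print(choose_samples(candidates, budget_rows=600_000))
```

The paper solves a richer optimization (partial coverage, sparsity terms); this sketch only conveys the budgeted-selection shape of the problem.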
BlinkDB Runtime
• Selecting the sample candidates: if q_j ⊆ ϕ_i for some stored sample, we simply pick the smallest such ϕ_i; otherwise (no stored sample covers q_j), candidate samples are compared by their selectivity for the query.
• Selecting the right sample/size: build an Error Profile and a Latency Profile for every candidate using standard closed-form statistical error estimates [16].
Predictions rest on two observations:
1. For all standard SQL aggregates, the variance is proportional to ~1/n, and thus the standard deviation (the statistical error) is proportional to ~1/√n.
2. BlinkDB predicts n by assuming that latency scales linearly with input size, as is commonly observed for I/O-bound queries in parallel distributed execution environments.
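The two observations above combine into a toy ELP lookup. The coefficients `a` and `b` are hypothetical here; in practice they would be estimated per query:

```python
import math

# Toy Error-Latency Profile (ELP) under the two assumptions above:
# error(n) = a / sqrt(n) and latency(n) = b * n.
# a and b are assumed values, not from the paper.

a = 0.35   # error coefficient (assumed)
b = 2e-6   # seconds per input row (assumed)

def n_for_error(target_error):
    """Smallest sample size whose predicted error meets the bound."""
    return math.ceil((a / target_error) ** 2)

def n_for_latency(time_budget):
    """Largest sample size whose predicted latency fits the budget."""
    return math.floor(time_budget / b)

print("rows needed for 1% error:", n_for_error(0.01))
print("rows allowed in 2 seconds:", n_for_latency(2.0))
```

Given a query's error bound or time bound, the runtime can then pick the sample (and sample size) whose profile satisfies the constraint.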
Bias Correction
A stratified sample can simulate a uniform sample by tracking the sampling rate of every group and reweighting accordingly.
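The reweighting can be sketched as an inverse-sampling-rate weighted average; the rows and rates below are illustrative, not from the paper:

```python
# Bias correction sketch: a stratified sample over-represents small groups,
# so each sampled row is weighted by the inverse of its group's sampling
# rate before aggregating.  Data below is illustrative.

# (value, sampling rate of the row's group)
stratified_sample = [
    (0.13, 0.5), (0.19, 0.5), (0.18, 0.5),   # NYC rows, sampled at 1/2
    (0.25, 1.0), (0.09, 1.0), (0.49, 1.0),   # Berkeley rows, sampled at 1/1
]

def weighted_avg(rows):
    num = sum(v / rate for v, rate in rows)   # inverse-rate weighted sum
    den = sum(1 / rate for _, rate in rows)   # estimated population size
    return num / den

naive = sum(v for v, _ in stratified_sample) / len(stratified_sample)
corrected = weighted_avg(stratified_sample)
print(f"naive: {naive:.4f}, bias-corrected: {corrected:.4f}")
```

The naive mean over-weights the fully sampled group; the corrected estimate restores each group to its true share of the table.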
Implementation
• Enables queries with response time and error bounds.
• Creates or updates the set of uniform and multi-dimensional samples.
• Rewrites the query and iteratively assigns it an appropriately sized uniform or stratified sample.
• Modifies all pre-existing aggregation functions with statistical closed forms to return error bars and confidence intervals in addition to their result.
Sample Refresh
• Analyses based on multiple queries can become inaccurate: repeated queries on an unchanged biased sample will not converge.
• BlinkDB therefore periodically (typically daily) resamples from the original data to avoid correlation among the answers to queries that use the same sample.
Time Cost for Samples
• Uniform samples are generally created in a few hundred seconds.
• Creating stratified samples on a set of columns takes anywhere between 5 and 30 minutes, depending on the number of unique values to stratify on, which determines the number of reducers and the amount of data shuffled.
Evaluation: Workloads and Sample Storage Cost
• Figures 8(a) and 8(b) show the relative sizes of the sets of stratified samples created under 50%, 100%, and 200% storage budgets on the Conviva and TPC-H workloads, respectively.
• A storage budget of x% indicates that the cumulative size of all the samples will not exceed x/100 times the original data.
• The QCS choices change with the storage budget.
Response time improvement by sample
Error by different samples
Error Convergence
Time and error bound
Scaling Up
• Highly selective queries: those that operate on only a small fraction of the input data, consisting of one or more highly selective WHERE clauses (e.g., an average over rows with x = 2).
• Queries intended to crunch huge amounts of data (e.g., an average over all the data).
Conclusion
• BlinkDB is a parallel, sampling-based approximate query engine that supports ad-hoc queries with error and response time constraints.
• Two key ideas: (i) a multi-dimensional sampling strategy that builds and maintains a variety of samples; (ii) a run-time dynamic sample selection strategy that uses parts of a sample to estimate query selectivity and chooses the best samples for satisfying query constraints.
• BlinkDB answers a range of queries within 2 seconds on 17 TB of data with 90-98% accuracy.