A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Presented By: Vivek Tanneeru, Venkata Dinesh Jammula

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Feb 11, 2016


Varsha Beerjoo

Transcript
Page 1: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Surajit Chaudhuri
Gautam Das
Vivek Narasayya

Presented By:
Vivek Tanneeru
Venkata Dinesh Jammula

Page 2: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Outline
1. Introduction
2. Objective
3. Drawbacks of Previous Work
4. Related Work
5. Architecture for Approximate Query Processing
6. Classical Sampling Techniques
7. Special Case of a Fixed Workload
8. Lifting Workload to Query Distributions
9. Rationale for Stratified Sampling
10. Solution for Single-Table Selection Queries with Aggregation
11. Extensions for General Workload
12. Comparisons
13. Experimental Results
14. Summary
15. References

Page 3: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

1. Introduction

• Decision Support applications - OLAP and data mining for analyzing large databases

• Approximate answers to queries given accurately and efficiently benefit the scalability of these applications

• Workload information in picking samples of the data

Page 4: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

2. Objective

• Pre-compute a sample as an optimization problem

• Minimize error in estimation of aggregates

• Implemented on Microsoft SQL Server 2000, as an effective solution that can be deployed in a commercial DBMS

Page 5: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

3. Drawbacks of Previous work

• Lack of rigorous problem formulations leads to solutions that are difficult to evaluate theoretically

• Does not deal with uncertainty in expected workload

• Ignores the variance in data distribution of aggregated columns

Page 6: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

4. Related Work

• Weighted Sampling
• Outlier Index
• Congressional Sampling
• On the fly Sampling
• Histograms

Page 7: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

5. Architecture for Approximate Query Processing

Preliminaries:

• Consider queries with selections, foreign-key joins and GROUP BY, containing aggregation functions such as COUNT, SUM and AVG.

• Assume a pre-designated amount of storage space is available for selecting samples from the database

• Selecting samples can be randomized or deterministic

Page 8: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Architecture

Page 9: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Error Metrics

If the correct answer for query Q is y while the approximate answer is y’:

Relative error: E(Q) = |y - y’| / y
Squared error: SE(Q) = (|y - y’| / y)²

If the correct answer for the ith group is yi while the approximate answer is yi’, the squared error in answering a GROUP BY query Q with g groups is:

SE(Q) = (1/g) Σi ((yi – yi’) / yi)²

Given a probability distribution of queries pw:

• Mean squared error for the distribution: MSE(pw) = ΣQ pw(Q)*SE(Q), where pw(Q) is the probability of query Q
• Root mean squared error (L2): RMSE(pw) = √MSE(pw)

Other error metrics:

• L1 metric: the expected relative error over all queries in the workload
• L∞ metric: the max error over all queries
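The metrics above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and a query distribution is represented here (as an assumption) by a list of (probability, squared error) pairs.

```python
import math

def relative_error(y, y_approx):
    # E(Q) = |y - y'| / y, for a non-zero exact answer y
    return abs(y - y_approx) / abs(y)

def squared_error(y, y_approx):
    # SE(Q) = (|y - y'| / y)^2
    return relative_error(y, y_approx) ** 2

def group_by_squared_error(exact, approx):
    # SE(Q) for a GROUP BY query: mean of the per-group squared relative errors
    g = len(exact)
    return sum(squared_error(y, ya) for y, ya in zip(exact, approx)) / g

def rmse(distribution):
    # distribution: list of (p_w(Q), SE(Q)) pairs whose probabilities sum to 1
    mse = sum(p * se for p, se in distribution)
    return math.sqrt(mse)
```

For instance, an exact answer of 100 estimated as 90 has a relative error of 0.1, and a GROUP BY query whose two groups are each off by 10% has a squared error of about 0.01.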

Page 10: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

6. Classical Sampling Techniques

Uniform Sampling: for a population R = {y1, …, yn} with average y, sum Y and variance $S^2$, let μ be the mean of k values sampled uniformly from R.

LEMMA 1
(a) μ is an unbiased estimator for y, namely, $E[\mu] = y$;
(b) μ·n is an unbiased estimator for Y, namely, $E[\mu \cdot n] = Y$;
(c) the variance (i.e., squared error) in estimating y is $E[(\mu - y)^2] = S^2/k$;
(d) the variance in estimating Y is $E[(\mu \cdot n - Y)^2] = n^2 S^2/k$; and
(e) the relative squared error in estimating Y is $E[((\mu \cdot n - Y)/Y)^2] = n^2 S^2/(Y^2 k)$.
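As a rough illustration of parts (c) to (e), the sketch below evaluates the three formulas for a concrete population. It assumes that S² is the population variance (divisor n) and ignores any finite-population correction. The function name is hypothetical, and the example values are borrowed from the revenue table used later in the slides.

```python
from statistics import pvariance

def uniform_sampling_errors(population, k):
    """Expected squared errors of a uniform sample of size k (Lemma 1 c-e)."""
    n = len(population)
    Y = sum(population)
    S2 = pvariance(population)      # assumption: population variance
    var_avg = S2 / k                # (c) error in estimating the average y
    var_sum = n ** 2 * S2 / k       # (d) error in estimating the sum Y
    rel_sq_sum = var_sum / Y ** 2   # (e) relative squared error for Y
    return var_avg, var_sum, rel_sq_sum

# One large value among small ones makes uniform sampling very noisy.
print(uniform_sampling_errors([10, 10, 10, 1000], k=2))
```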

Page 11: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Classical Sampling Techniques

Stratified Sampling:

LEMMA 2
(a) μ is an unbiased estimator for y, namely, $E[\mu] = y$;
(b) μ·n is an unbiased estimator for Y, namely, $E[\mu \cdot n] = Y$;
(c) the variance in estimating y is $E[(\mu - y)^2] = \frac{1}{n^2} \sum_j n_j^2 S_j^2 / k_j$;
(d) the variance in estimating Y is $E[(\mu \cdot n - Y)^2] = \sum_j n_j^2 S_j^2 / k_j$; and
(e) the relative squared error in estimating Y is $E[((\mu \cdot n - Y)/Y)^2] = \frac{1}{Y^2} \sum_j n_j^2 S_j^2 / k_j$.
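The stratified formulas can be evaluated the same way. In this sketch each stratum is given as a list of values, `allocation` holds the k_j, and S_j² is taken to be the population variance of stratum j. These representations and the function name are assumptions made for illustration.

```python
from statistics import pvariance

def stratified_errors(strata, allocation):
    """Expected squared errors under stratified sampling (Lemma 2 c-e)."""
    n = sum(len(s) for s in strata)
    Y = sum(sum(s) for s in strata)
    # sum_j n_j^2 * S_j^2 / k_j
    var_sum = sum(len(s) ** 2 * pvariance(s) / k_j
                  for s, k_j in zip(strata, allocation))
    var_avg = var_sum / n ** 2
    rel_sq_sum = var_sum / Y ** 2
    return var_avg, var_sum, rel_sq_sum

# Isolating the large value in its own stratum removes all variance here.
print(stratified_errors([[10, 10, 10], [1000]], [1, 1]))
```

Compared with uniform sampling of the same four values, putting the large value in a stratum of its own drives the expected error to zero, because every stratum is internally constant.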

Page 12: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Classical Sampling Techniques

Neyman Allocation:

LEMMA 3 Given a population R = {y1, …, yn}, k and r, the optimal way to form r strata and allocate k samples among all strata is to first sort R and select strata boundaries so that $\sum_j n_j S_j$ is minimized, and then, for the jth stratum, to set the number of samples $k_j$ as

$k_j = k \cdot n_j S_j / \sum_i n_i S_i$
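A minimal sketch of the allocation half of Lemma 3 follows. It assumes the strata are already chosen, so it does not search for boundaries, and it takes S_j to be the population standard deviation of stratum j. The results are fractional and still need rounding to integers. The proportional fallback for the all-constant case is a choice made here, not something the slides specify.

```python
from statistics import pstdev

def neyman_allocation(strata, k):
    """Split k samples across strata in proportion to n_j * S_j."""
    weights = [len(s) * pstdev(s) for s in strata]
    total = sum(weights)
    if total == 0:
        # Every stratum is constant, so any split is optimal;
        # fall back to proportional allocation (an arbitrary choice).
        n = sum(len(s) for s in strata)
        return [k * len(s) / n for s in strata]
    return [k * w / total for w in weights]

# Strata of equal size: the one with 10x the spread gets 10x the samples.
print(neyman_allocation([[1, 3], [10, 30]], k=11))
```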

Page 13: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Classical Sampling Techniques

• Multivariate Stratified Sampling

• Weighted Sampling

• Error Estimation and Confidence Intervals

Page 14: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

7. Special Case: Fixed Workload

• Problem: FIXEDSAMP
Input: R, W, k
Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

Page 15: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Fundamental Regions

• Fundamental Regions: For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none.
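One simple way to obtain such a partition is to group records by the set of workload queries that select them: two records with the same "signature" can never be separated by a query in W. The sketch below assumes queries are given as Python predicates, a representation chosen purely for illustration. The slides do not say how the regions are computed in the actual system.

```python
from collections import defaultdict

def fundamental_regions(records, queries):
    """Group records by which workload queries select them."""
    regions = defaultdict(list)
    for rec in records:
        # One boolean per query: does this query select the record?
        signature = tuple(bool(q(rec)) for q in queries)
        regions[signature].append(rec)
    return list(regions.values())

# Hypothetical workload over product IDs 1..6
workload = [lambda pid: pid in (3, 4), lambda pid: pid >= 4]
print(fundamental_regions([1, 2, 3, 4, 5, 6], workload))
```

Grouping by signature gives the coarsest partition with the required property, so the number of regions is minimal for the given workload.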

Page 16: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Solution for FIXEDSAMP

Step 1. Identify Fundamental Regions
– Case A. r <= k
– Case B. r > k

Step 2. Pick Sample Records

Step 3. Assign values to additional columns

Page 17: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

8. Lifting Workload to Query Distributions

• Resilient to the situation when incoming query is “similar” but not identical to queries in the workload

• Pw : lifted workload, a probability distribution

• Pw (Q’) : related to the amount of similarity of Q’ to the workload

• Not concerned with syntactic similarity of query expressions

Page 18: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Lifted workload (Cont.)

• Two parameters δ (½ ≤ δ ≤1) and γ (0 ≤ γ ≤ ½) define the degree to which the workload “influences” the query distribution. For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.

• P{Q}(R’) is the probability of occurrence of any query that selects exactly the set of records R’.

Page 19: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Lifted workload (Cont.)

• n1, n2, n3, and n4 are the counts of records in the regions.
– n2 or n4 large (large overlap): P{Q}(R’) is high
– n1 or n3 large (small overlap): P{Q}(R’) is low

• We elaborate on this issue by analyzing the effects of (four) different boundary settings of these parameters.

1. δ → 1 and γ → 0: implies that incoming queries are identical to workload queries.
2. δ → 1 and γ → ½: implies that incoming queries are supersets of workload queries.
3. δ → ½ and γ → 0: implies that incoming queries are subsets of workload queries.
4. δ → ½ and γ → ½: implies that incoming queries are unrestricted.
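The behaviour described on these slides can be reproduced with a small sketch. It assumes that records are selected independently, so that the probability of a query selecting exactly R’ is a product of per-record terms. It also assumes a labelling of the four regions, since the original figure is not in the transcript: n1 = records in RQ but not R’, n2 = in both, n3 = in R’ but not RQ, n4 = in neither. The function name is hypothetical.

```python
def lifted_probability(n1, n2, n3, n4, delta, gamma):
    """Probability that a query lifted from Q selects exactly R'.

    Assumed region labels: n1 = |R_Q - R'|, n2 = |R_Q & R'|,
    n3 = |R' - R_Q|, n4 = records in neither set.
    """
    inside = (delta ** n2) * ((1 - delta) ** n1)    # records inside R_Q
    outside = (gamma ** n3) * ((1 - gamma) ** n4)   # records outside R_Q
    return inside * outside

# Large overlap (n2, n4 large) versus small overlap (n1, n3 large)
high = lifted_probability(0, 2, 0, 2, delta=0.9, gamma=0.1)
low = lifted_probability(2, 0, 2, 0, delta=0.9, gamma=0.1)
print(high > low)
```

With δ = 1 and γ = 0 the function returns 1 only when n1 = n3 = 0, which matches the first boundary setting: only queries identical to the workload query have non-zero probability.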

Page 20: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

9. Rationale for Stratified Sampling

Consider a population, i.e. a set of numbers R = {y1, …, yn}. Let the average be y, the sum be Y and the variance be $S^2$. Suppose we uniformly sample k numbers. Let the mean of the sample be μ.

The quantity μ is an unbiased estimator for y, i.e. $E[\mu] = y$, and the variance (i.e., squared error) in estimating y is $E[(\mu - y)^2] = S^2/k$.

Page 21: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Stratified Sampling (Cont… )

Product ID    Revenue
1             10
2             10
3             10
4             1000

Query Q1:
SELECT COUNT(*)
FROM R
WHERE PRODUCTID IN (3,4);

Population POPQ1 = {0,0,1,1}

Thus, a stratified sampling scheme partitions R into r strata containing n1, …, nr records (where Σnj = n), with k1, …, kr records uniformly sampled from each stratum (where Σkj = k).
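To make the rationale concrete, the sketch below rebuilds POPQ1 from the table and compares the expected squared error of the COUNT estimate under uniform sampling with that under stratified sampling. The two strata (products 1 and 2, products 3 and 4) and the sample sizes are choices made here for illustration, and the variance formulas are the ones from the classical sampling lemmas with population variance assumed.

```python
from statistics import pvariance

revenue = {1: 10, 2: 10, 3: 10, 4: 1000}   # Product ID -> Revenue (from the slide)

def count_population(product_ids, selected):
    # 1 for records that satisfy the predicate of Q1, 0 otherwise
    return [1 if pid in selected else 0 for pid in product_ids]

def count_error_uniform(pop, k):
    # n^2 * S^2 / k
    return len(pop) ** 2 * pvariance(pop) / k

def count_error_stratified(strata, allocation):
    # sum_j n_j^2 * S_j^2 / k_j
    return sum(len(s) ** 2 * pvariance(s) / k_j
               for s, k_j in zip(strata, allocation))

pop_q1 = count_population(sorted(revenue), {3, 4})   # [0, 0, 1, 1]
uniform = count_error_uniform(pop_q1, k=2)
stratified = count_error_stratified([pop_q1[:2], pop_q1[2:]], [1, 1])
print(pop_q1, uniform, stratified)
```

Each stratum is constant with respect to Q1, so one sample per stratum answers the COUNT query exactly, while two uniform samples do not.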

Page 22: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

10. Solution for single-table selection queries with Aggregation

• Stratification
a.) How many strata r to partition relation R into
b.) Which records from R belong to each stratum

• Allocation
How to divide k (the number of records available for the sample) into integers k1, …, kr across the r strata such that Σkj = k

• Sampling
Uniformly sample kj records from stratum Rj to form the final sample of k records

Page 23: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Solution for COUNT aggregate

• Stratification: From Lemma 1.
Lemma 1: For a workload W consisting of COUNT queries, the fundamental regions represent an optimal stratification.

• Allocation: We want to minimize the error over queries in pw. k1, …, kr are unknown variables such that Σkj = k.

From Equation (2) on an earlier slide,

$p_W(R') = \sum_{i=1}^{q} w_i \, p_{\{Q_i\}}(R')$

MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload:

Lemma 2: MSE(pW) = Σi wi MSE(p{Qi})
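Lemma 2 can be checked in a few lines from the definition of MSE on the Error Metrics slide. The derivation below assumes that Equation (2) expresses the lifted workload as the mixture of the per-query distributions with weights w_i, and it writes SE(R') (a shorthand introduced here) for the squared error of a query that selects exactly R'.

```latex
\begin{align*}
\mathrm{MSE}(p_W)
  &= \sum_{R'} p_W(R')\,\mathrm{SE}(R') \\
  &= \sum_{R'} \Big( \sum_{i=1}^{q} w_i \, p_{\{Q_i\}}(R') \Big)\,\mathrm{SE}(R') \\
  &= \sum_{i=1}^{q} w_i \sum_{R'} p_{\{Q_i\}}(R')\,\mathrm{SE}(R')
   = \sum_{i=1}^{q} w_i \,\mathrm{MSE}(p_{\{Q_i\}})
\end{align*}
```

The only step used is exchanging the two finite sums, so the result holds for any choice of weights.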

Page 24: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Allocation (cont…)

For any Q ∈ W, we express MSE(p{Q}) as a function of the kj’s.

Lemma 3: For a COUNT query Q in W,
Let ApproxMSE(p{Q}) =

Then,

Page 25: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Outline of Proof: Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the kj variables.

Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a function of n1,…,nr, δ, and γ.

αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj.

Now we can minimize MSE(pw).

Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k if kj = k * ( sqrt(αj) / Σi sqrt(αi) )

This provides a closed-form and computationally inexpensive solution to the allocation problem since αj depends only on δ, γ and the number of tuples in each fundamental region.
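The closed form in Lemma 4 is short enough to sketch directly. The α values below are made up for illustration (the slides do not give the formula for αj), and the result is fractional, so a real implementation would still need to round to integers.

```python
import math

def allocate_samples(alphas, k):
    """k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)  (Lemma 4)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

def objective(alphas, ks):
    """sum_j alpha_j / k_j, the quantity being minimized."""
    return sum(a / k_j for a, k_j in zip(alphas, ks))

alphas = [1.0, 4.0, 9.0]            # hypothetical region importances
best = allocate_samples(alphas, k=12)
print(best, objective(alphas, best), objective(alphas, [4, 4, 4]))
```

For these hypothetical values the allocation from Lemma 4 gives a lower objective than an even split of the same 12 samples, as expected.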

Page 26: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Solution for SUM aggregate

Stratification:
• Bucketing Technique

We further divide fundamental regions with large variance into a set of finer regions, each of which has significantly lower internal variance.

• Treat each finer region as a stratum
• From the optimal Neyman allocation technique, we have h*r finer strata
• A large h is desirable, but h is set to 6.

Page 27: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Cont…

Allocation:
• Like COUNT, we express an optimization problem with h*r unknowns k1, …, kh*r.
• Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values in each region) influence MSE(p{Q}).
• Let yj (Yj) be the average (sum) of the aggregate column values of all records in region Rj. Since the variance within each region is small, each value within the region can be approximated as simply yj. Thus to express MSE(p{Q}) as a function of the kj’s for a SUM query Q in W:

Page 28: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Pragmatic Issues

• Identifying Fundamental Regions
• Handling Large Number of Fundamental Regions
• Obtaining Integer Solutions
• Obtaining an Unbiased Estimator

Page 29: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Putting all together

Page 30: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

11. Extensions

• GROUP BY

• JOIN

• Other Extensions

Page 31: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

12. Comparisons

Weighted Sampling
• Records that are accessed more frequently have a greater chance of being included in the sample
• Assumes fixed workload

Outlier Indexing
• Outliers form their own stratum that is sampled in its entirety
• Assumes fixed workload

Page 32: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Comparisons (cont…)

Congressional Sampling• Allocation of samples between two strata

• To minimize MSE,

Page 33: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

13. Experimental Results

PREVIOUS WORKS: USAMP – uniform random sampling

WSAMP – weighted sampling

OTLIDX – outlier indexing combined with weighted sampling

CONG – Congressional sampling

Page 34: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Experimental Setup

• Databases: Used the popular TPC-R benchmark for experiments

• Workloads: Generated several workloads over the TPC-R schema using an automatic query generation program

• Parameters: Varied parameters such as
– Skew of the data
– Sampling fraction, between 0.1% and 10%
– Workload size, between 25 and 800 queries

• Error Metric: Report the average error over all queries in the workload

Page 35: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Training Set vs Test Set

The basic idea is to split the available workload into two sets:
– the training workload and
– the test workload

Training Set: The workload used to determine the sample

Test Set: The workload used to estimate the error
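The split itself can be sketched as follows. The slides do not say how the workload was divided, so the random shuffle, the `train_fraction` parameter and the fixed seed are all assumptions made for this illustration.

```python
import random

def split_workload(queries, train_fraction=0.5, seed=0):
    """Randomly split a workload into (training, test) lists."""
    rng = random.Random(seed)        # fixed seed for a repeatable split
    shuffled = list(queries)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical workload of 10 query identifiers
training, test = split_workload(range(10), train_fraction=0.7)
print(len(training), len(test))
```

The sample would then be built from `training` only, and the error reported on `test`.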

Page 36: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Results : Quality vs Sampling Fraction

Page 37: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Cont…

Page 38: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Cont…

Page 39: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Cont…

Page 40: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Quality vs Overlap between Training Set and Test Set

Page 41: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Quality vs Data Skew

Page 42: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Cont…

Page 43: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Cont…

Page 44: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

14. Summary

• A comprehensive solution to the problem of identifying samples for approximately answering aggregation queries

• Its implementation on a database system

• With a novel technique for lifting a workload, we make our solution robust enough to work well even for workloads that are similar but not identical to the given workload.

• Handles the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

Page 45: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

15. References

• Surajit Chaudhuri, Gautam Das, Vivek Narasayya: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001.

• Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Optimized Stratified Sampling for Approximate Query Processing. ACM Transactions on Database Systems (TODS), 32(2): 9 (2007)

Page 46: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Thank You

Questions?

Presented By:
Vivek Tanneeru
Venkata Jammula