A SAMPLING ALGEBRA FOR SCALABLE APPROXIMATE QUERY PROCESSING
By
SUPRIYA NIRKHIWALE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Supriya Nirkhiwale
To Vedaant and Kshitij
ACKNOWLEDGMENTS
Thanks to my advisor, Alin Dobra, for introducing me to an interesting set of problems. I
have learnt from him how the right abstraction makes problems simple and tractable. Thanks
to my husband Kshitij, my son Vedaant, my family, and my friends, without whom none of this would have been possible.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A SAMPLING ALGEBRA FOR SCALABLE APPROXIMATE QUERY PROCESSING
By
Supriya Nirkhiwale
May 2018
Chair: Alin V. Dobra
Major: Computer Engineering
As of 2005, sampling has been incorporated in all major databases. While efficient
sampling techniques are easily realizable, determining the accuracy of an estimate obtained
from the sample is still an unresolved problem. In the first part of this dissertation, we present
a theoretical framework that allows an elegant treatment of the problem. We base our work
on generalized uniform sampling (GUS), a class of sampling methods that subsumes a wide
variety of sampling techniques. We introduce a key notion of equivalence that allows GUS
sampling operators to commute with selection and join, and allows the derivation of confidence
intervals for SUM-like aggregates obtained by a very general class of queries.
The use of sampling for approximate query processing in large data warehousing
environments has another significant limitation: sampling large tables is expensive. For
applications like approximate exploration, it can be wasteful to compute a single sample to
estimate only one value. Resources are better utilized if a single sample can be used to obtain
multiple estimates from the data. So far, it has not been possible to achieve this because
multiple estimates from the same sample are correlated and the theory to compute these
correlations was missing. In the second part of this dissertation, we provide a theoretical
framework for a lightweight, add-on tool to any database for computing covariance between
two estimates that arise from a common set of base relation samples. This theory also makes
it possible to compute a covariance matrix between groups of data in a GROUPBY query or
variances of estimates like AVG that are functions of SUM-like aggregates.
We illustrate the theory through extensive examples and give indications on how to use it
to provide meaningful estimation in database systems.
CHAPTER 1
INTRODUCTION
Sampling has long been used by database practitioners to speed up query evaluation,
especially over very large data sets. For many years it was common to see SQL code of the
form “WHERE RAND() > 0.99”. Widespread use of this sort of code led to the inclusion of
the TABLESAMPLE clause in the SQL-2003 standard [1]. Since then, all major databases have
incorporated native support for sampling over relations. One such query, using the TPC-H
schema, is:
SELECT SUM(l_discount*(1.0-l_tax))
FROM lineitem TABLESAMPLE (10 PERCENT),
orders TABLESAMPLE (1000 ROWS)
WHERE l_orderkey = o_orderkey AND
l_extendedprice > 100.0;
The result of this query is obtained by taking a Bernoulli sample with p = .1 over
lineitem and joining it with a sample of size 1000 obtained without replacement (WOR),
from orders and evaluating the SUM aggregate.
In practice, there are two main reasons practitioners write such code. One is that sampling
is useful for debugging expensive queries. The query can be quickly evaluated over a sample as
a sanity check, before it is unleashed upon the full database.
The second reason is that the practitioner is interested in obtaining an idea as to what the
actual answer to the query would be, in less time than would be required to run the query over
the entire database. This might be useful as a prelude to running the query “for real”—the
user might want to see if the result is potentially interesting—or else the estimate might be
used in place of the actual answer. Often, this situation arises when the query in question
performs an aggregation, since it is fairly intuitive to most users that sampling can be used to
obtain a number that is a reasonable approximation of the actual answer.
The problem we consider here comes from the desire to use sampling as an approximation
methodology. In this case, the user is not actually interested in computing an aggregate such
as “SUM(l discount*(1.0-l tax))” over a sample of the database. Rather, s/he is interested
in estimating the answer to such a query over the entire database using the sample. This
presents two obvious problems:
• First, what SQL code should the practitioner write in order to compute an estimate for a particular aggregate?
• Second, how does the practitioner have any idea how accurate that estimate is?
Ideally, a database system would have built-in mechanisms that automatically provide
estimators for user supplied aggregate queries, and that automatically provide users with
accuracy guarantees. Along those lines, in this dissertation, we study how to automatically support such functionality for queries like the one above.
Presented with such a query, the database engine will use the user-specified sampling to
automatically compute two values lo and hi that can be used as a [0.05, 0.95] confidence
bound on the true answer to the query. That is, the user has asked the system to compute
values lo and hi such that there is a 5% chance that the true answer is less than lo, and
there is a 95% chance that the true answer is less than hi. In the general case, the user
should be able to specify any aggregate over any number of sampled base tables using any
sampling scheme, and the system would automatically figure out how to compute an estimate
of the desired confidence level. A database practitioner need have no idea how to compute
an estimate for the answer, nor does s/he need to have any idea how to compute confidence
bounds; the user only specifies the desired level of confidence, and the system does the rest.
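To make the bound computation concrete, the following is a minimal sketch of how lo and hi could be derived from an unbiased estimate and its variance under a normality assumption for the estimator; the function name and the numbers are illustrative, not the interface of an actual system.

from scipy.stats import norm

def one_sided_bounds(estimate, variance, lo_q=0.05, hi_q=0.95):
    # Under a normal approximation for the sampling estimator,
    # P(true answer < lo) ~= lo_q and P(true answer < hi) ~= hi_q.
    sd = variance ** 0.5
    lo = estimate + norm.ppf(lo_q) * sd   # norm.ppf(0.05) is negative
    hi = estimate + norm.ppf(hi_q) * sd
    return lo, hi

# Hypothetical numbers: an estimated SUM of 1.2e6 with variance 4.0e8.
print(one_sided_bounds(1.2e6, 4.0e8))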
Existing Work on Database Sampling. While there has been a lot of research on
implementing efficient sampling algorithms [72, 77], providing confidence intervals for the
sample estimate is understood only for a few restricted cases. The simplest is when only a
single relation is sampled. A slightly more complicated case was handled by the AQUA system
developed at Bell labs [5–7, 41]. AQUA considered correlated sampling where a fact table
in a star schema is sampled. These cases are relatively simple because when a single table is
sampled, classical sampling theory applies with a few easy modifications. Simultaneous work
on ripple joins and online aggregation [47–50, 52] extended the class of queries amenable to
analysis to include those queries where multiple tables are sampled with replacement and then
joined. See Chapter 3 for an extensive review of related work.
Unfortunately, the extension to other types of sampling is not straightforward, and to
date new formulas have been derived every time a new sampling scheme is considered (for example,
two-table without-replacement sampling [60]). Our goal is to provide a simple theory that
makes it possible to handle very general types of queries over virtually any uniform sampling
scheme: with replacement sampling, fixed-size without replacement sampling, Bernoulli
sampling, or whatever other sampling scheme is used. The ability to easily handle arbitrary
types of sampling is especially important given that the current SQL standard allows for a
somewhat mysterious SYSTEM sampling specification, whose exact implementation (and hence
its statistical properties) are left up to the database designers. Ideally, it should be easy for a
database designer to apply our theory to an arbitrary SYSTEM sampling implementation.
Generalized Uniform Sampling. One major reason that new theory and derivations
were previously required for each new type of sampling is that the usual analysis is tuple-based,
where the inclusion probability of each tuple in the output set is used as the basic building
block; computing expected values and variances requires intricate algebraic manipulations
of complicated summations. We use the notion of Generalized Uniform Sampling (GUS)
(see Definition 1) that subsumes many different sampling schemes (including all of the
aforementioned ones, as well as block-based variants thereof).
Our Contributions for SUM-like Aggregates. In Part II of this dissertation, we
develop an algebra over many common relational operators, as well as the GUS operator. This
makes it possible to take any query plan that contains one or more GUS operators and the
supported relational operators, and perform a statistical analysis of the accuracy of the result in
an algebraic fashion, working from the leaves up to the top of the plan.
No complicated algebraic manipulations over nested summations are required. This
algebra can form the basis for a lightweight tool for providing estimates and error bounds,
that should be easily integrable into any database system. The database need only feed the
tool the user-specified confidence levels, the set of tuples returned by the query, some simple
lineage information over those result tuples, and the query plan, and the tool can automatically
compute the desired error bounds.
The specific contributions we make are:
• We define the notion of Second Order Analytical equivalence (SOA equivalence), a key equivalence relationship between query plans that is strong enough to allow variance analysis but weak enough to ensure commutativity of sampling and relational operators.
• We define the GUS operator that emulates a wide class of sampling methods. This operator commutes with most relational operators under SOA-equivalence.
• We develop an algebra over GUS and relational operators that allows derivation of SOA-equivalent plans. These plans easily allow moment calculations that can be used to estimate error bounds.
• We describe how our theory can be used to add estimation capabilities to existing databases so that the required changes to the query optimizer and execution engine are minimal. Alternatively, the estimator can be implemented as an external tool.
Our work provides a straightforward analysis for the SUM aggregate. It can be easily
extended to COUNT by replacing the aggregated attribute with the constant 1 and applying the analysis
for SUM on this attribute. Though the analysis for AVERAGE presents a slightly non-linear
case, the analyses for SUM and COUNT lay a foundation for it. The confidence intervals can
be derived using a method for approximating probability distribution/variance such as the delta
method. The analyses for MIN, MAX and DISTINCT are extremely hard problems due
to their non-linearity. For example, DISTINCT requires an estimate of all the distinct values in
the data and the number of such values. It is thus beyond the scope of this dissertation.
While selections and joins are the highlight of our work, we show that SOA-equivalence
allows analysis for other database operators like cross-product (compaction), intersection
(concatenation) and union.
Multiple Correlated Aggregates Sharing the Same Sample. Practical AQP
applications often require support for a wide variety of estimates. In the first part of our
work, we focused on computing the variance of SUM-like estimates from query plans with
multiple joins. A natural question to ask would be: how can we generalize this framework to
accommodate:
• covariance between estimates
• non-linear estimates that can be constructed from multiple SUM-like aggregates, e.g., AVG, VAR.
• the GROUPBY clause. It is often more interesting to derive estimates for various groups in the data, and compute their covariances for further analysis.
In a practical approximate exploration setting for large data, we almost never ask a single
question or compute only a single estimate per sample. The warehoused data can be petabytes in
size, and a reasonable sample should be at least terabytes in size. The cost of obtaining
a sample of such a size can itself be significant. In some cases the underlying data itself is
a sample. It is prudent, or rather necessary, to reuse the obtained sample for computing the
required multiple estimates. These estimates are correlated since they are derived from the
same set of base relation samples. The user typically plugs in multiple estimates to generate
approximate answers to queries of interest. These approximate answers are incomplete without
[Figure 1-1. Queries on samples of TPC-H relations: query plans that join a Bernoulli(0.1) sample of lineitem, a SWOR(1000) sample of orders, a Bernoulli(0.1) sample of customer, and the nation relation (with selections σ1, σ2) to produce the estimates SUMloc, COUNTloc, COUNTocn, COUNTlocn, and GROUPBYlocn.]
an error bound which quantifies the quality of the approximation. Computing covariances
between the multiple estimates is crucial for obtaining the required error bound. Ignoring these
correlations often leads to misleading results.
Example 1. Fig 1-1 shows an example using the TPC-H schema [4], where multiple estimates
are derived from a shared set of samples: a Bernoulli(0.1) sample from lineitem,
a sample of 1000 tuples w/o replacement from 150,000 tuples of orders & a
Bernoulli(0.1) sample from customer. These 3 base relation samples are used to derive the
following five Horvitz-Thompson ([24, 53]) estimators:
• SUMloc as 15000 ∗ SUM(l_discount*(1.0-l_tax))
• COUNTloc as 15000 ∗ COUNT(*) from the join of the samples from lineitem, orders and customer
• COUNTocn as 1500 ∗ COUNT(orderkey) from the join of the samples from orders, customer and nation
• COUNTlocn as 15000 ∗ COUNT(*) & GROUPBYlocn as 15000 ∗ SUM(o_totalprice) GROUPBY n_name from the join of samples from lineitem, orders and customer and the nation table.
Note that the sample aggregates used in the above estimators have been scaled by appropriate
factors to obtain unbiased estimates of the corresponding population aggregates.
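The scale factors above are simply the inverses of the joint inclusion probabilities of the result tuples: for independently sampled base relations, the inclusion probability of a join result tuple is the product of the per-relation inclusion probabilities of the samples it touches. A minimal sketch, with the probabilities taken from Example 1 (the helper itself is illustrative):

# First-order inclusion probabilities of the base relation samples in Example 1.
inclusion_prob = {
    "lineitem": 0.1,            # Bernoulli(0.1)
    "orders":   1000 / 150000,  # 1000 tuples WOR out of 150,000
    "customer": 0.1,            # Bernoulli(0.1)
    "nation":   1.0,            # used in full, no sampling
}

def ht_scale(relations):
    # Inverse inclusion probability of a join-result tuple, assuming the
    # base relations are sampled independently of each other.
    p = 1.0
    for r in relations:
        p *= inclusion_prob[r]
    return 1.0 / p

print(ht_scale(["lineitem", "orders", "customer"]))            # 15000 (SUMloc, COUNTloc)
print(ht_scale(["orders", "customer", "nation"]))              # 1500  (COUNTocn)
print(ht_scale(["lineitem", "orders", "customer", "nation"]))  # 15000 (COUNTlocn, GROUPBYlocn)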
It is common practice for users to submit work units consisting of requests for multiple
estimates, where each estimate uses a subset of base relation samples. These estimates are
typically used in further analysis. For example, the user may want to approximate the average
price per lineitem by a function
f1(SUMloc, COUNTloc) = SUMloc / COUNTloc.
Approximations for other desired quantities can often involve much more complex functions.
For example, the average price of an order in a given nation can be approximated by
f2(SUMloc, COUNTloc, COUNTlocn, COUNTocn) = (SUMloc / COUNTloc) × (COUNTlocn / COUNTocn).
The variance of the approximation f1 is given by
V(f1(SUMloc, COUNTloc)) ≈ (SUMloc² / COUNTloc²) [ V(SUMloc)/SUMloc² + V(COUNTloc)/COUNTloc² − 2 Cov(SUMloc, COUNTloc)/(SUMloc COUNTloc) ],   (1–1)
and the variance of the approximation f2 (refer to Eq (3–3)) is a function of 6 covariance terms
- one for every pair of estimates from SUMloc, COUNTloc, COUNTlocn, COUNTocn, and 4 variance
terms.
The only way to avoid the covariance terms (by making them equal to zero) would
be to compute independent samples of lineitem, orders and customer per estimate
(12 independent base relation samples). This is clearly inefficient and wasteful. Another
example with multiple estimates is GROUPBYlocn, a set of 25 estimates (one for each nation).
By design, these estimates come from the same set of base relation samples and have
non-trivial covariances. The quality of an approximation can only be meaningfully and correctly
understood by knowing the pairwise covariances of various estimates used in the approximation.
As demonstrated by the above example, in modern AQP settings with multiple correlated
estimates, it is therefore crucial to compute pairwise covariance estimates, whether the
user needs to construct sophisticated functions of multiple estimates or to get accurate
simultaneous error bounds for the individual estimates (once covariances are computed, a
recipe for computing simultaneous confidence intervals is available in [95]). Previous work on
covariance computation focuses on specific individual settings, where this computation can be
a tedious, arduous process. There is a need for a clean and general algebraic approach (similar
to the approach that we develop for variance) to make covariance computation tenable in
practical settings.
Our Contributions for Covariance Computation. We address the covariance
computation problem in Part III of the dissertation. We provide a list of our specific
contributions below.
• We define the notion of Second Order Analytical Covariance equivalence (SOA-COV equivalence) between two pairs of query plans, a broad and non-trivial generalization of the notion of SOA-equivalence developed in the first part of the dissertation. This equivalence is strong enough to allow covariance analysis, and subsumes the framework in [76], as variance can be thought of as a self-covariance.
• We develop an algebra over GUS and relational operators that allows derivation of SOA-COV equivalent plans. These plans easily allow moment calculations to estimate covariance.
These analytical results form the basis for the development of a practical lightweight add-on
tool that
• computes the pairwise covariances between multiple estimates from a common sample.
• empowers the practitioner with the ability to compute error bounds for functions of sum-like aggregates, covariance matrices for GROUPBY queries, and simultaneous confidence intervals for multiple estimates.
• is based on theory which is independent of the number of joins involved, platform or schema, and uses only a single sample.
Structure of the dissertation. The rest of the document is organized as follows. In
Chapter 2, we provide a detailed overview of related work in approximate query processing.
In Chapter 3, we review concepts related to Generalized Uniform Sampling (GUS) methods
in detail. In Chapter 4, we introduce the notion of SOA-equivalence between query plans and
prove that GUS operators commute with a variety of relational operators in the SOA sense.
We also investigate interactions between GUS operators when applied to the same data. In
Chapter 5, we provide insights on how our theory can be used to implement a separate add-on
tool and how the performance of the variance estimation can be enhanced. Furthermore, we
test our implementation thoroughly, and provide accuracy and runtime analysis. In Chapter 6,
we propose to extend this theory to accommodate multiple estimates, e.g., as in GROUPBY
queries. We explore the general difficulties associated with this problem and outline the major
technical challenges. In Chapter 7, we provide a solution to this problem by introducing the
notion of SOA-COV equivalence between pairs of query plans. We develop an algebra which
allows us to transform a given pair of query plans to an analyzable pair of query plans, thereby
giving us the ability to compute the covariance between any pair of aggregates resulting from
these plans. In Chapter 8, we provide a thorough experimental testing of the SOA-COV based
theory and discuss issues relevant to implementation. In Chapter 9, we consider the problem
of estimating higher order moments of aggregate estimators, and develop the notions of
k-Generalized Uniform Sampling methods and kMA equivalence to provide a solution.
CHAPTER 2
RELATED WORK
The idea of using sampling in databases for deriving estimates for a single relation was
first studied by Shapiro et al. [79]. Since then, much research has focused on implementing
efficient sampling algorithms in databases [72, 77]. Providing confidence intervals on estimates
for SQL aggregate queries is a difficult problem with limited progress so far. The previous
literature can be roughly classified into the following areas.
2.1 Analytical Bounds
The problem of providing closed form analytical bounds for approximate database queries
has a roughly three decade long history. There has been a large body of research on using
sampling to provide quick answers to database queries, on database systems [8, 19, 52, 59, 61,
78], and data stream systems [13, 74]. Olken [77] studied the problem for specific sampling
methods for a single relation. This line of work ended abruptly when Chaudhuri et al. [20, 21]
proved that extracting IID samples from a join of two relations is infeasible.
Another line of research was the extension to the correlated sampling pioneered by the
AQUA system [6, 7, 41]. AQUA is applicable to a star schema, where the goal is sampling
from the fact table, and including all tuples in dimension tables that match selected fact table
tuples. The AQUA type of sampling has been incorporated in DB2 [43].
The reason confidence intervals can be provided for AQUA type sampling is the fact
that independent identically distributed (IID) samples are obtained from the set over which
the aggregate is computed. A straightforward use of the central limit theorem readily allows
computation of good estimates and confidence intervals. Indeed, it is widely believed [6, 7, 20,
21, 41, 77] that IID samples at the top of the query plan are required to provide any confidence
interval. This idea leads to the search for a sampling operator that commutes with database
operators. This endeavor proved to be very difficult from the beginning [20] when joins are
involved. To see why this is the case, consider a tuple t ∈ orders and two tuples u1, u2 in
lineitem that join with t (i.e. they have the same value for orderkey). Random selection
of tuples t, u1, u2 in the sample does not guarantee random selection of result tuples (t, u1)
and (t, u2). If t is not selected, neither tuple can exist, and thus sampling is correlated. A lot
of effort [20, 21] has been spent in finding practical ways to de-correlate the result tuples with
only limited success.
Substantial research has been devoted to deriving samples from input relations in advance
and using them to approximate answers to ad-hoc queries [8]. These methods may provide
significant benefit when queries, predicates or query columns are predictable/known in
advance. However, they offer limited support for joins. Any join has to be with small
dimension tables on foreign keys. Multiple joins over large tables are not supported.
Progress has been made using a different line of thought by Hellerstein and Hass [52] and
the generalization in [51] for the special case of sampling with replacement. The problem of
producing IID result samples is avoided by developing central limit theorem-like results for the
combination of relation level sampling with replacement. The theory was generalized first to
sampling without replacement for single join queries [60], then further generalized to arbitrary
uniform sampling over base relations and arbitrary SELECT-FROM-WHERE queries without
duplicate elimination in DBO [59], and finally to allow sampling across multiple relations
in Turbo-DBO [30]. Even though some simplification occurred through these theoretical
developments, they are mathematically heavy and hard to understand/interpret. Moreover,
the theory, especially DBO and Turbo-DBO, is tightly coupled with the systems developed to
exploit it.
Technically, one major problem in all the mathematics used to analyze sampling schemes
is the fact that the analyses use functions and summations over tuple domains, and not the
operators and algebras that the database community is used to. This makes the theory hard to
comprehend and apply. The fact that no database system picked up these ideas to provide a
confidence interval facility is a direct testament to these difficulties.
While recent progress has been made on generalized variance computation, the much
more daunting issue of computing pairwise covariances between multiple estimates has been
barely explored. The need for covariance computation arises in a wide variety of situations.
In the context of sampling from databases, covariance computation is crucial for obtaining
efficient simultaneous confidence intervals for multiple GROUPBY estimates. Kandula et al.
[62] extend the notion of SOA-equivalence between plans introduced in [76] to the notion of
Sampling Dominance between plans. They argue that the variance of a transformed plan need
not be exactly equal to the variance of the original plan, because obtaining an upper bound
for the variance (and hence the error) of the sampling based estimator may be good enough
for obtaining an error bound. Relaxing the notion of SOA-equivalence allows them to consider
query plans with non-GUS samplers. They construct non-GUS extensions of the Bernoulli
sampler called the Distinct sampler and the Universe sampler, and develop a framework for
transforming any plan with these samplers to a plan with sampling only on top (just before
aggregation) such that the error for the transformed plan is greater than or equal to that of
the original plan. As we demonstrate in Example 1, to compute the variance of a function of
multiple aggregate estimators (like AVG) or to compute a corresponding joint confidence region
for these estimators, all the pairwise covariances need to be computed (see (3–3)). This issue
is acknowledged in [62, Appendix C] in the context of AVG, but a framework to estimate the
covariance is not developed. In [95], the authors derive closed form estimators for the pairwise
covariances between aggregates from a GROUPBY query, for the specific case of sampling
without replacement. These covariances are then used to construct simultaneous confidence
intervals. Pansare et al. [78] develop a very sophisticated Bayesian framework to infer the
confidence bounds of approximate aggregate answers. However, this approach is limited to
simple group-by aggregate queries and does not provide a systematic way of quantifying
approximation quality.
2.2 Bootstrapping
Bootstrap [32, 34] is a popular resampling based statistical technique for obtaining
confidence/error bounds for a wide variety of estimates. This method is particularly useful in
settings where closed form variance estimates are not available (MIN, MAX, nested queries,
etc). The basic idea behind bootstrap is simple: obtain a large number of resamples from the
existing samples, compute the required estimate for each of the resamples, and use them to
approximate the sampling distribution of the original estimate, providing error bounds as a
byproduct. Though conceptually simple and powerful, this method requires repeated estimator
computation on resamples having a size comparable to that of the original dataset. This poses the obvious
challenges of computational efficiency and provides the basis for multiple lines of work.
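For readers unfamiliar with the technique, the following is a minimal sketch of the classical percentile bootstrap for a single sampled relation; the estimator, sampling fraction, and number of resamples are illustrative, and the sketch deliberately ignores the join and non-IID issues discussed below.

import numpy as np

def bootstrap_ci(sample, estimator, n_resamples=1000, alpha=0.10, seed=0):
    # Percentile-bootstrap confidence interval for estimator(sample).
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = sample[rng.integers(0, n, size=n)]
        stats.append(estimator(resample))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

values = np.random.default_rng(1).exponential(scale=10.0, size=5000)
# Scale a Bernoulli(0.1) sample SUM up by 1/0.1 to estimate the full-data SUM.
print(bootstrap_ci(values, estimator=lambda s: s.sum() / 0.1))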
Research in statistical methodology has focused on reducing the number of
Monte Carlo resamples required [33, 34] or reducing the size of resamples [14–16, 81].
The techniques for reducing the number of resamples introduce additional complexity of
implementation and still need repeated estimator computations on resamples having size
comparable to that of the original dataset. There is some computational interest in lowering
the size of the resamples with bootstrap variants, such as m-out-of-n sampling, where the size
of the resamples is smaller than that of the original sample. These techniques are sensitive to the
choice of parameters (size of resamples) [69] and the analytical correction requires the prior
knowledge of the convergence rate of the estimator, making them infeasible to automate.
The requirement of prior theoretical knowledge can be avoided by averaging the distributions
of the smaller resamples [69], but the automatic selection of resampling parameters in a
computationally efficient manner remains a challenge.
In the last decade, various approaches for using Bootstrap in AQP have been developed
in the literature. These approaches focus mainly on reducing the computational overhead
associated with repeated resampling. Pol et al. [80] developed a resampling tree data structure
for every base relation, that holds indicator random variables for inclusion of a tuple in the
sample. Laptev et al. [70] target MapReduce platforms and study how to overlap computation
across different bootstrap trials or bootstrap samples. The Analytic Bootstrap [96] provides a
probabilistic relational model for symbolically executing the bootstrap and develops new relational
operators that combine random variables. Since bootstrap is a simulation-based technique,
recent work [9, 68] demonstrates the need for diagnostic methods to identify when bootstrap
based techniques are unreliable.
While a vast amount of the bootstrap literature focuses on computational issues, a few
issues arise with its application.
Joins. The asymptotic theory for bootstrap is valid only under the assumption that the
elements of the sample are independent and identically distributed (IID). Pol and Jermaine [80]
use a resampling tree per base relation to simulate multiple instances of sampling with replacement
and estimate the accuracy of aggregates over joins. Even if the base relation samples are
IID, it is well known that joining two IID samples does not lead to an IID sample from the
relevant cross product space [21]. It is not clear if the asymptotic results for i.i.d. bootstrap
are applicable in the presence of joins and there are no known theoretical guarantees for
consistency. While empirical results suggest that these bootstrap-based methods work
efficiently, it is important to point out that the theoretical guarantees of bootstrap are lost.
Sampling generality. The IID assumption also restricts traditional bootstrap applications
to use sampling with replacement. This assumption does not hold for GUS methods that
subsume a wider class of generic sampling methods (e.g., Bernoulli). Some theoretical results
about validity of the bootstrap when elements of the sample are only independent (but not
necessarily identically distributed) are available, but hold under specific regularity assumptions
[73]. The assumption of independence itself does not hold for GUS methods such as sampling
without replacement. Another direct consequence is that bootstrap samples have to be computed a priori,
whereas GUS methods work both for a priori samples and for samples computed inline.
To summarize, the strength of the Bootstrap lies in its wide applicability, but in the
presence of sampling-induced correlations, it is theoretically and computationally preferable to
use (if available) a closed-form, non-simulation-based estimator with rigorous guarantees for
accuracy. This is the approach that we pursue in this dissertation.
2.3 Other Areas
Probabilistic databases. Much existing work in this area [11, 27, 83, 87, 91] uses
possible world semantics to model uncertain data and its query evaluation. Tuples in a
probabilistic database have binary uncertainty, i.e., they either exist or not with a certain
probability. Specifically, [27, 83] use semirings for modeling and querying probabilistic
databases, focusing on conjunctive queries with HAVING clauses. Many probabilistic databases
assume IID tuples [11, 27, 83, 87] or propose new query evaluation methods to handle
particular correlations [88, 89].
Sketches. Sketching is another common technique that is used to provide approximate
answers for aggregate queries over data streams. Sketching methods use randomized
algorithms that combine random seeds with data to produce random variables whose
distribution depends on the true aggregate value. One class of sketching techniques focuses on
accurately estimating the individual frequencies in a data stream [10, 25, 75]. These frequency
estimates can be used to compute join sizes, quantiles, heavy hitters, etc. Another class of
sketching techniques focuses on COUNT DISTINCT queries [35, 36, 94]. Dobra et al. [29]
develop sketching methods based on partitioning for join size estimation with multiple join
conditions. Dobra and Rusu [84, 86] provide a rigorous statistical analysis of various sketch
algorithms for join size estimation and perform extensive empirical evaluations. In [85], the
authors study a method that combines sampling and sketching and investigate the dependence
of the variance of the resulting estimators on these two components.
Wavelets and Histograms. A tool which has a rich history in signal processing and
statistics, but recently has generated a lot of interest in AQP applications is wavelets.
Wavelets provide an effective way of representing relational data in terms of appropriate
wavelet coefficients using linear transformations. In big data settings, one can obtain a
compressed/approximate representation of the data by only keeping a certain number of
wavelet coefficients, and setting the rest of them to zero. Developing meaningful methods
to choose the best wavelet coefficients to keep in the approximate representation has been
the main focus of current research in this area. See, for example, [18, 26, 37–40, 42, 44,
45, 64, 65, 74, 90]. Histograms also provide a well-established and well-studied approach
for summarizing data, in both databases and statistics. In the last three decades, several
methods for using histograms for AQP have been developed in the database community. See
[7, 17, 31, 46, 54, 56–58, 63, 67, 79, 82, 92, 93] to name just a few. For a rigorous analysis of
the theoretical properties of histograms, see [28, 55, 66].
See [24] for a detailed review and extensive references for sketches, histograms, wavelets,
and sampling-based methods. As discussed in [24, Chapter 6], each of these methods
is useful and has comparative advantages and disadvantages. In particular, sampling provides
a flexible approach that works for general-purpose queries and adapts much more easily to
high-dimensional data as compared to the other methods, thereby occupying a unique and
invaluable place in the toolbox for modern AQP.
CHAPTER 3
TECHNICAL PRELIMINARIES
The aim of this chapter is to review Generalized Uniform Sampling (GUS) methods, which
are a key ingredient in our theory. We start with a quick overview of other sampling methods
for databases. Then, we define GUS methods, and provide details on how to get estimates and
confidence intervals using these methods. We also review the multivariate delta method, which
allows us to estimate the variance of a function of several SUM-like aggregates in terms of the
individual variances and pairwise covariances.
3.1 Generalized Uniform Sampling
Definition 1 (GUS Sampling [30]). A randomized selection process G(a,b) which gives a sample
R′ from R = R1 × R2 × · · · × Rn is called a Generalized Uniform Sampling (GUS) method
if, for any given tuples t = (t1, ..., tn), t′ = (t′1, ..., t′n) ∈ R, P(t ∈ R′) is independent of
t, and P(t, t′ ∈ R′) depends only on {i : ti = t′i}. In such a case, the GUS parameters a,
b = {bT | T ⊆ {1 : n}} are defined as:
a = P[t ∈ R′]
bT = P[t ∈ R′ ∧ t′ ∈ R′ | ∀i ∈ T, ti = t′i, ∀j ∈ T^C, tj ≠ t′j].
This definition requires GUS sampling to behave like a randomized filter. In particular,
any GUS operator can be viewed as a selection process from the underlying data, a process
that can introduce correlations. The uniformity of GUS requires that the randomized filtering
is performed on the lineage of tuples and not on their content. As simple as the idea is, expressing
any sampling process in the form of GUS is a non-trivial task. Example 2 shows the calculation
of GUS parameters for a simple case.
Example 2. In this example, we show how the GUS definition above can be used to
characterize the estimation necessary for the query from Chapter 1. We denote by l_s the
Bernoulli sample with p = 0.1 from lineitem and by o_s the WOR sample of size 1000 from
orders. We assume that the cardinality of orders is 150000. Henceforth, for ease of exposition,
we will denote all base relations involved by their first letters. For example, lineitem will be
denoted by l.
Applying the definition above and the independence between sampling processes, we
can derive the parameters for this GUS as follows: For any tuple t ∈ lineitem and tuple
u ∈ orders:
a = P[(t ∈ l_s) ∧ (u ∈ o_s)] = 0.1 × (1000/150000) = 6.667 × 10⁻⁴
since the base relations are sampled independently from each other. For any tuples t, t ′ ∈
lineitem and u, u′ ∈ orders:
b∅ = P[(t, t′ ∈ l_s) ∧ (u, u′ ∈ o_s)]
= 0.1 × 0.1 × (1000/150000) × (999/149999)
= 4.44 × 10⁻⁷,
and
bo = P[t ∈ l_s] × P[t′ ∈ l_s | t ∈ l_s] × P[u ∈ o_s]
= 0.1 × 0.1 × (1000/150000) = 6.667 × 10⁻⁵.
Similarly,
bl = P[(t ∈ l_s) ∧ (u, u′ ∈ o_s)]
= P[t ∈ l_s] × P[u ∈ o_s] × P[u′ ∈ o_s | u ∈ o_s]
= 0.1 × (1000/150000) × (999/149999)
= 4.44 × 10⁻⁶.
The last term is
bl,o = P[(t ∈ l_s) ∧ (u ∈ o_s)] = 0.1 × (1000/150000) = 6.667 × 10⁻⁴.
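The arithmetic above can be packaged into a small helper. The following is a minimal sketch for exactly this two-relation case (an independent Bernoulli(p) sample of lineitem joined with a WOR sample of n out of N orders tuples); the function name and the dictionary keys are illustrative.

def gus_params_bernoulli_swor(p, n, N):
    # GUS parameters for the join of a Bernoulli(p) sample of lineitem with a
    # without-replacement sample of n out of N orders tuples, assuming the two
    # base relations are sampled independently of each other.
    q1 = n / N                          # one given orders tuple is sampled
    q2 = (n / N) * ((n - 1) / (N - 1))  # two distinct orders tuples are both sampled
    a = p * q1                          # P[t in l_s and u in o_s]
    b = {
        frozenset():           p * p * q2,  # t != t' and u != u'
        frozenset({"o"}):      p * p * q1,  # u = u', distinct lineitem tuples
        frozenset({"l"}):      p * q2,      # t = t', distinct orders tuples
        frozenset({"l", "o"}): p * q1,      # same result tuple: equals a
    }
    return a, b

a, b = gus_params_bernoulli_swor(p=0.1, n=1000, N=150000)
print(a)                      # ~6.667e-4
print(b[frozenset()])         # ~4.44e-7
print(b[frozenset({"o"})])    # ~6.667e-5
print(b[frozenset({"l"})])    # ~4.44e-6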
Notice that the GUS captures the entire estimation process, not only the two individual
sampling methods. The above analysis dealt with a simple join consisting of two base relations.
For more complex query plans, the derivation of GUS parameters would involve consideration
of all possible interactions between participating tuples. This will make the analysis highly
complex.
The analysis of any GUS sampling method for a SUM-like aggregate is given as follows.
Theorem 1 ([30]). Let f(t) be a function/property of t ∈ R, and R′ be the sample obtained
by a GUS method G(a,b). Then, the aggregate A = Σ_{t∈R} f(t) and the sampling estimate
X = (1/a) Σ_{t∈R′} f(t) have the property:
E[X] = A
σ²(X) = Σ_{S⊆{1:n}} (cS / a²) yS − y∅   (3–1)
with
yS = Σ_{ti∈Ri | i∈S} ( Σ_{tj∈Rj | j∈S^C} f(ti, tj) )²
cS = Σ_{T∈P(S)} (−1)^{|T|+|S|} bT.
The above theorem indicates that the GUS estimates of SUM-like aggregates are unbiased
and that the variance is simply a linear combination of properties of the data (the terms yS) and
properties of the GUS sampling method (the coefficients cS). Moreover, yS can be estimated from samples of
any GUS (see [30]). This result is not asymptotic; it gives the exact analysis even for very
small samples. Once the estimate and the variance are computed, confidence intervals can be
readily provided using either the normality assumption or the more conservative Chebychev
bound (see [30]).
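The unbiasedness claim E[X] = A is easy to check by simulation for a simple GUS instance such as Bernoulli sampling over synthetic data; the following minimal sketch is only such a sanity check, not part of the estimation tool.

import numpy as np

rng = np.random.default_rng(42)
f = rng.exponential(scale=5.0, size=100_000)   # f(t) for every tuple t of R
A = f.sum()                                     # true SUM-like aggregate
p = 0.01                                        # Bernoulli(p) is a GUS with a = p

estimates = []
for _ in range(500):
    keep = rng.random(f.size) < p               # randomized filter on lineage, not content
    estimates.append(f[keep].sum() / p)         # scale the sample sum up by 1/a

print(A, np.mean(estimates))                    # the two should be close (E[X] = A)
print(np.var(estimates))                        # empirical counterpart of sigma^2(X)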
3.2 Multivariate Delta Method
Let θ = (θ1, θ2, · · · , θk) be a vector of unknown parameters with corresponding unbiased
estimators A = (A1,A2, · · · ,Ak). Then, under appropriate assumptions (such as the
existence of a multivariate central limit theorem for A), the delta method shows that for any
continuously differentiable function g, E [g(A)] ≈ g(θ), and
Var(g(A)) ≈ Σ_{i=1..k} (∇ig(θ))² Var(Ai) + Σ_{i≠j} ∇ig(θ) ∇jg(θ) Cov(Ai, Aj).   (3–2)
Using the multivariate delta method, the variance of the approximation f2 in Example 1 is
given by
V(f2(SUMloc, COUNTloc, COUNTlocn, COUNTocn))
≈ (COUNTlocn² / (COUNTloc² COUNTocn²)) Var(SUMloc)
+ (SUMloc² / (COUNTloc² COUNTocn²)) Var(COUNTlocn)
+ (SUMloc² COUNTlocn² / (COUNTloc⁴ COUNTocn²)) Var(COUNTloc)
+ (SUMloc² COUNTlocn² / (COUNTloc² COUNTocn⁴)) Var(COUNTocn)
+ (2 SUMloc COUNTlocn / (COUNTloc² COUNTocn²)) Cov(SUMloc, COUNTlocn)
− (2 SUMloc COUNTlocn² / (COUNTloc³ COUNTocn²)) Cov(SUMloc, COUNTloc)
− (2 SUMloc COUNTlocn² / (COUNTloc² COUNTocn³)) Cov(SUMloc, COUNTocn)
− (2 SUMloc² COUNTlocn / (COUNTloc³ COUNTocn²)) Cov(COUNTlocn, COUNTloc)
− (2 SUMloc² COUNTlocn / (COUNTloc² COUNTocn³)) Cov(COUNTlocn, COUNTocn)
+ (2 SUMloc² COUNTlocn² / (COUNTloc³ COUNTocn³)) Cov(COUNTloc, COUNTocn).   (3–3)
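Deriving expressions such as (3–3) by hand is tedious and error-prone; the quadratic form in (3–2) can instead be evaluated numerically from the gradient of g at the estimated values. A minimal sketch, with placeholder estimates and a placeholder covariance matrix (the true inputs would come from the covariance theory developed in Part III):

import numpy as np

def delta_variance(g, estimates, cov):
    # Delta-method variance of g(A): grad(g)^T Cov grad(g), using a
    # central-difference approximation of the gradient at the estimates.
    x = np.asarray(estimates, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        h = 1e-6 * max(1.0, abs(x[i]))
        up, dn = x.copy(), x.copy()
        up[i] += h
        dn[i] -= h
        grad[i] = (g(up) - g(dn)) / (2 * h)
    return float(grad @ np.asarray(cov, dtype=float) @ grad)

# f2 = (SUMloc / COUNTloc) * (COUNTlocn / COUNTocn)
f2 = lambda v: (v[0] / v[1]) * (v[2] / v[3])

est = [5.0e6, 6.0e5, 6.0e5, 1.5e6]   # placeholder values of the four estimators
cov = np.diag([1e9, 4e6, 4e6, 9e6])  # placeholder covariance matrix
print(delta_variance(f2, est, cov))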
CHAPTER 4
ANALYSIS OF SAMPLING QUERY PLANS
The high-level goal of our research is to introduce a tool that computes the confidence
bounds of estimates based on sampling. Given a query plan with sampling operators
interspersed at various points, our tool transforms it to an analytically equivalent query
plan that has a particular structure: all relational operators except the final aggregate form
a subtree that is the input to a single GUS sampling operator. The GUS operator feeds the
aggregate operator that produces the final result. Note that this transformation is done solely
for the purpose of computing the confidence bounds of the result; it does not provide a better
alternative to the execution plan used as input. Once this transformation is accomplished,
Theorem 1 readily gives the desired analysis – the equivalence ensures that the analysis for the
special plan coincides with the analysis for the original plan.
A natural and convenient strategy to obtain the desired structure is to perform multiple
local transformations on the original query plan. These local transformations are based on a
notion of analytical equivalence, that we call Second Order Analytical (SOA) equivalence. They
allow both commutativity of relational and GUS operators, and consolidation of GUS operators.
Effectively, these local transformations allow a plan to be put in the special form in which there
is a single GUS operator just before the aggregate.
In this chapter, we first define the SOA-equivalence and then use it to provide equivalence
relationships that allow the plan transformations mentioned above. A more elaborate
example showcases the theory in the latter part of the chapter.
4.1 SOA-Equivalence
The main reason the previous attempts to design a sampling operator were not fully
successful is the requirement to ensure IID samples at the top of the plan. Having IID samples
makes the analysis easy since the Central Limit Theorem readily provides confidence intervals.
However, it is too restrictive to allow plans with multiple joins to be dealt with. It is important
to notice that the difficulty is not in executing query plans containing sampling but in analyzing
such query plans.
The fundamental question we ask in this section is: What is the least restrictive
requirement we can have and still produce useful estimates? Our main interest is in how
the requirement can be transformed into a notion of equivalence. This will enable us to talk
about equivalent plans, initially, but more usefully about equivalent expressions. The
key insight comes from the observation that it is enough to compute the expected value and
variance for any query plan. Then either the conservative Chebychev bounds or the optimistic¹
normal-distribution based bounds can be used to produce confidence intervals. Note that
confidence intervals are the end goal, and preserving the expected value and variance is enough to
guarantee the same confidence interval using both CLT and Chebychev methods.
Thus, for our purposes, two query plans are equivalent if their result has the same
expected value and variance. This equivalence relation between plans already allows significant
progress. It is an extension, to randomized plans, of the classic plan equivalence based on
obtaining the same answer. From an operational standpoint, though, plan equivalence is not sufficient
to provide interesting characterizations. The main problem is the fact that the equivalence
exists only between complete plans that compute aggregates. It is not clear what can be said
about intermediate results–the equivalent of non-aggregate relational algebra expressions.
The key to extend the equivalence of plans to equivalence of expressions is to first
design such an extension for the classic relational algebra. To this end, assume that we can
only use equality on numbers that are results of SUM-like aggregates but we cannot directly
compare sets. To ensure that two expressions are equivalent, we could require that they
produce the same answer using any SUM-aggregate. Indeed, if the expressions produce the
same relation/set, they must agree on any aggregate computation using these sets since
1 While the CLT does not apply due to the lack of IID samples, the distribution of most complex random variables made out of many loosely interacting parts tends to be normal.
aggregates are deterministic and, more importantly, do not depend on the order in which the
computation is performed. The SUM-aggregates are crucial for this definition since they form
a vector space. Aggregates At that sum the function ft(u) = δtu form the basis of this vector space;
agreement on these aggregates ensures set agreement. Extending these ideas to randomized
estimation, we obtain the following.
Definition 2 (SOA-equivalence). Given (possibly randomized) expressions E(R) and F(R), we say
E(R) ⇐⇒_SOA F(R)
if for any arbitrary SUM-aggregate Af(S) = Σ_{t∈S} f(t),
E[Af(E(R))] = E[Af(F(R))]
Var[Af(E(R))] = Var[Af(F(R))].
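SOA-equivalence can be checked empirically for small inputs by running two randomized expressions many times and comparing the mean and variance of a SUM aggregate over their outputs. The toy sketch below does this for a Bernoulli sampler commuted with a selection; the "plans" are stand-ins for illustration only, not the operators developed in this chapter.

import numpy as np

rng = np.random.default_rng(7)
R = rng.exponential(scale=3.0, size=20_000)       # f(t) for the tuples of a toy relation

def sample_then_select(p=0.05, threshold=2.0):
    kept = R[rng.random(R.size) < p]               # Bernoulli(p) sample ...
    return kept[kept > threshold]                  # ... followed by a selection

def select_then_sample(p=0.05, threshold=2.0):
    kept = R[R > threshold]                        # selection first ...
    return kept[rng.random(kept.size) < p]         # ... followed by Bernoulli(p)

def sum_moments(plan, runs=2000):
    sums = [plan().sum() for _ in range(runs)]
    return np.mean(sums), np.var(sums)

print(sum_moments(sample_then_select))             # mean and variance of the SUM aggregate
print(sum_moments(select_then_sample))             # should agree up to Monte Carlo noise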
From the above discussion, it immediately follows that SOA-equivalence generalizes set
equivalence and coincides with it for non-randomized expressions, as stated in the following
proposition.
Proposition 4.1. Given two relational algebra expressions E(R) and F (R) we have:
E(R) = F(R) if and only if E(R) ⇐⇒_SOA F(R).
The next proposition establishes that SOA-equivalence is indeed an equivalence relation
and can be manipulated like relational equivalence.
Proposition 4.2. SOA-equivalence is an equivalence relation, i.e., for any expressions E, F, H: E ⇐⇒_SOA E (reflexivity); E ⇐⇒_SOA F implies F ⇐⇒_SOA E (symmetry); and E ⇐⇒_SOA F together with F ⇐⇒_SOA H implies E ⇐⇒_SOA H (transitivity).
Figure 5-2. Plot of percentage of times the true value lies in the estimated confidence intervals vs desired confidence level.
the ten desired confidence levels, we compute the percentage of times the true value falls
within the corresponding confidence intervals. In Figure 5-2, we show a comparison between
the desired and achieved confidence levels for 4 different sampling strategies. The achieved
confidence levels are very close to the desired values, across the different sampling strategies
and confidence levels. This provides strong empirical evidence that the confidence intervals
obtained by using the theory in Section 4.3 are accurate and tight.
5.3.3 Running Time
The next goal is to evaluate the efficiency of the estimation process. We are especially
interested in evaluating the variance of the estimators. This study, performed with our research
prototype, should give the practitioner some idea of the overhead to expect.
Setup. Intuitively, the analysis overhead will depend on the sample size. To ensure that
we stress the analysis with large samples, we use the 1TB TPC-H instance and treat the
database as a sample of a 1PB database. More specifically, we assume that the 6 billion tuples
in lineitem are a Bernoulli sample from the 6 trillion tuples in the same relation at 1PB scale
(0.001 sampling fraction). Similarly, the 1.5 billion tuples in orders are a sample without
replacement from the 1.5 trillion tuples of the 1PB database and the 200 million tuples in part
are a Bernoulli sample (0.001 sampling fraction) from 200 billion tuples at 1PB scale. This
ensures that the sample sizes the analysis has to deal with can be in the billions – a very harsh
scenario for analysis indeed.
Since the database is the sample, there is no sampling needed in the execution of the
query – the tuples that the analysis has to make use of are the tuples that are aggregated by
the non-sampling version of the query. As described in Section 5.1, maintaining the estimator
is as easy as performing the aggregation in the non-sampling query but computing the variance
is much more involved. The technique we proposed is to sub-sample from the sample to limit
the number of tuples used to estimate the variance. In our experiments we studied the impact
of the query characteristics (various selection predicates) and sub-sampling size on the running
time. For each experiment, we measured three running times. First, the running time of the
non-sampling query (no statistical estimation, just the aggregate). Second, the running time
of the query processing and sub-sampling process. Sub-sampling is interleaved with the rest
of the processing and the two running times cannot be separated. Third, the time to perform
the analysis. The analysis is single-threaded and starts only once the sub-sample is completely
formed.
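A minimal sketch of the interleaved sub-sampling step: while the query result is streamed into the running aggregate, a small Bernoulli sub-sample of the result tuples is retained for the later variance analysis. The tuple format, the expected result size, and the target size are illustrative.

import random

def aggregate_with_subsample(result_tuples, expected_result_size, target=250_000):
    # Running SUM over the query result, plus a Bernoulli sub-sample of
    # roughly `target` result tuples kept aside for the variance analysis.
    keep_prob = min(1.0, target / max(1, expected_result_size))
    total, subsample = 0.0, []
    for t in result_tuples:                  # t = (aggregated value f(t), lineage ...)
        total += t[0]
        if random.random() < keep_prob:      # sub-sampling interleaved with processing
            subsample.append(t)
    return total, subsample

stream = ((random.random(), i) for i in range(1_000_000))   # toy (value, lineage) stream
total, sub = aggregate_with_subsample(stream, expected_result_size=1_000_000)
print(total, len(sub))                       # sub holds roughly 250K tuples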
Impact of Selectivity with Fixed Sub-Sampling. In our first experiment, we will vary
the selection predicate (thus indirectly the selectivity of the query) and set the range for the
sub-sampling at 100K-400K tuples (i.e. sub-sampling obtains 100K-400K tuples that are a
Bernoulli sample from the data provided for analysis). Results are depicted in Figure 5-3A.
We make three key observations. First, for this sub-sampling target, the analysis adds an
insignificant amount of extra effort (about 2% of the overall running time). Second, selectivity
of the query has no significant effect on the running time for either the non-sampling or for the
sampling query. Last, the running time of the sampling version of the query seems to be more
[Figure 5-3. Plots of running time (sec) and number of tuples (million) vs. the selection parameter. Panel A: sub-sampling range 100K-400K; panel B: no sub-sampling. Series shown: query processing + sub-sampling, analysis, sub-sample size, and query without analysis.]
stable than the running time of the non-sampling version.² It seems that, when the size of
sub-sample is below 500,000, the extra effort to perform sampling analysis is insignificant. We
show later that such sub-samples are good enough to produce stable variance estimates.
Impact of Selectivity with No Sub-Sampling. An unresolved question from the
previous experiment is what happens when no sub-sampling is performed, i.e. all the data
is used for analysis. The selectivity of the query will now control the number of tuples used
2 A similar behavior was noticed in [12]: the execution is more stable and somewhat faster at higher CPU loads.
for analysis and give an indication of the effort as a function of the size. Figure 5-3B depicts
result of such an experiment in which the selection predicate was varied. Results reveal that,
once the size of the sample exceeds 1M, the analysis cost becomes unmanageable and starts
to dominate. At the end of the spectrum (31M tuples), the analysis was 5 times slower than
the rest of the execution – this is clearly not acceptable in practice. As we hinted above,
targets of 100K-400K produce good enough estimates of the variance; there is no need to base
the variance analysis on millions of tuples, thus running time of analysis can be kept under
control. Sub-sampling is thus a crucial technique for applicability of sampling estimation to
large data.
5.3.4 Sub-Sample Size
This experiment sheds light on the influence of sub-sampling size on the estimate for the variance
and thus on the quality of the confidence intervals.
Setup. Since we would like to get samples from all over the data source, we use the 1TB
TPC-H instance as the data source and repeatedly derive samples from it. Remember that,
according to Section 5.1, any estimates of the terms yS can be used to analyze any of the
sampling methods. Sub-sampling is used to estimate the terms yS , but the entire sample is
used to compute the estimate. Subsampling leads to a substantial reduction in computation,
but also gives rise to wider confidence intervals. In many situations, the estimated aggregate
is several orders of magnitude larger than its estimated standard deviation (based on the
whole sample). In such cases, it is clear that any increase in the width of the confidence interval
due to subsampling will be extremely minor as compared to the estimated aggregate. Thus
a much smaller sub-sample has only a secondary influence on the confidence interval, while
leading to a substantial reduction in computation. The plot in Figure 5-4 shows this fact. In
this experiment we run around 250 instances of Query 1 for sub-sampling ranges of 10K-40K,
100K-400K and 1M-4M tuples each. In all cases, we calculate the fluctuation of the resultant
confidence interval widths with respect to the confidence interval width obtained from an
analysis without sub-sampling. In particular, we define the error as the ratio of the difference between
[Figure 5-4. Plot of the fluctuation of confidence interval widths obtained with sub-sampling, with respect to the true confidence interval, for sub-sampling targets of 10K-40K, 100K-400K, and 1M-4M tuples.]
the 5th and the 95th percentile values to the width of the confidence interval obtained without
sub-sampling. The plot in Figure 5-4 shows that this error is only 1% when 100K-400K tuples
are used.
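The fluctuation metric behind Figure 5-4 is straightforward to restate in code. A minimal sketch, assuming a list of confidence-interval widths collected over the repeated sub-sampled runs and the reference width from the analysis without sub-sampling (the toy numbers only illustrate the roughly 1% figure quoted above):

import numpy as np

def subsampling_error(ci_widths_with_subsampling, ci_width_full):
    # (95th percentile - 5th percentile) of the sub-sampled CI widths,
    # relative to the CI width obtained without sub-sampling.
    p5, p95 = np.percentile(ci_widths_with_subsampling, [5, 95])
    return (p95 - p5) / ci_width_full

widths = np.random.default_rng(3).normal(loc=100.0, scale=0.3, size=250)  # toy widths
print(subsampling_error(widths, ci_width_full=100.0))                     # ~0.01, i.e. ~1%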
Note on number of relations. As we have seen in this section, for sampling over
3 relations, good confidence intervals can be obtained with a mere 2% extra effort since
sub-samples of 100K tuples suffice. Since the analysis requires the computation of 2n terms
if n relations are sampled, the influence of the number of relations on the running time of the
analysis is of concern. In practice, these concerns can be easily addressed as follows: (a) the
computation of the yS terms from the sub-samples can be parallelized – on our system this
would result in a speedup of at least 32 (on 48 cores), (b) we noticed that foreign key joins
result in repeated values for certain terms – about half the values are repeated, (c) we see
no need to sample from more than 8 relations since there is no need to sample from small or
medium size relations. Notice that the parallelization alone would allow us to scale from 3 to
3 + 5 = 8 relations since 2⁵ = 32, the expected speedup.
CHAPTER 6
COVARIANCE BETWEEN MULTIPLE ESTIMATES: CHALLENGES
Having addressed the challenge of estimating variance for SUM-like aggregates, we
now focus on the more difficult problem of computing pairwise covariances between multiple
estimates. As mentioned in the introduction, previous work on covariance computation
(such as [59, 95]) has focused on deriving closed form expressions for covariances in special
settings, with specific assumptions on the type of schema, number of relations, and type of
estimates. The computation has to be done by hand on a case-by-case basis. Our goal is to
develop theory for a generalized solution, which is independent of platform, schema, number
of relations, is applicable for a wide class of sampling methods, and is also capable of being
automated. In the first part of this dissertation, we developed precisely such a theory for the
variance computation problem. Given the connection between covariance and variance, it is
natural to assume that the covariance computation problem can be solved through a mild and
straightforward extension of the previous theory. However, a closer look at the problem reveals
that this is not the case, and brings out key underlying challenges. The goal of this chapter is to
carefully and methodically lay out these challenges. We will first recall the major steps in the
variance computation strategy, and then examine the adequacy of this strategy for covariance
computation, starting with the simplest case and then moving on to more general and complex
settings.
6.1 Base Lemma for Covariance
In previous chapters, we dealt with a general single query plan which has sampling at
the bottom, and the estimate at the top. The strategy for computing the variance of such an
estimate was as follows.
• The notion of SOA-Equivalence allowed us to transform this plan into an analyzable plan for variance analysis. This analyzable plan had an equivalent sampling operator at the top of the plan, i.e., at the level of the estimate of interest.
• The transformation of a general query plan into an analyzable query plan was achieved by applying a series of algebraic rules based on SOA-Equivalence.
• The overall sampling method at the top of this analyzable plan was expressed as a GUS method, whose 2ⁿ parameters represented the sampling-based correlations (one term for every subset of the base relations) involved in the estimate.
• We plugged the parameters of this overall GUS method into the base lemma for variance (Theorem 1 in Chapter 3) to get the required variance.
The conceptually simplest covariance computation problem corresponds to two estimates
derived from a single query plan. Using the above strategy, we can get a SOA-Equivalent plan
with an overall GUS method on top and plug its parameters into the base lemma for
variance. However, even in this simplest multiple-estimate case, where the two estimates come
from the same set of base relation samples, the base lemma yields only the two variances and
provides no mechanism for computing their covariance. Challenge #1: Generalize the base
lemma to covariance.
Example 8. To compute the error bound for the estimate f1 using (1–1), we need to
compute V(SUMloc), V(COUNTloc) and Cov(SUMloc, COUNTloc). Let slineitem, sorders and
scustomer be Bernoulli(0.1), SWOR(1000) and Bernoulli(0.1) samples drawn from lineitem,
orders & customer, denoted by l, o & c respectively in Fig 6-1. The corresponding GUS
sampling methods are denoted by Gl, Go & Gc respectively. The variance terms V(SUMloc) &
V (COUNTloc) can be computed using the theory in [76] as follows.
Starting from the initial representation in Fig 6-1(a), where sampling is at the base, we
can use Proposition 4.6 on the joins to get a SOA-Equivalent plan with a single GUS on top.
This overall GUS, Gloc in Figure 6-1(c), can be used to compute the two variance terms by
applying Theorem 1. However, the existing results do not provide any mechanism to compute
the term Cov(SUMloc, COUNTloc). 2
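Although (1–1) is not reproduced in this chapter, the following Python sketch shows the standard first-order (delta-method) approximation on which an error bound for a ratio such as AVG = SUM/COUNT is typically based; it is included only to make explicit why the covariance term is needed alongside the two variances, and the function names are ours.

    import math

    def ratio_variance(sum_est, count_est, var_sum, var_count, cov_sum_count):
        """First-order (Taylor/delta-method) approximation of Var(SUM/COUNT)
        for correlated estimates of SUM and COUNT."""
        return (var_sum / count_est**2
                + (sum_est**2 / count_est**4) * var_count
                - 2 * (sum_est / count_est**3) * cov_sum_count)

    def avg_confidence_interval(sum_est, count_est, var_sum, var_count, cov, z=1.96):
        """Normal-approximation confidence interval for AVG = SUM/COUNT."""
        avg = sum_est / count_est
        half_width = z * math.sqrt(ratio_variance(sum_est, count_est,
                                                  var_sum, var_count, cov))
        return avg - half_width, avg + half_width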
6.2 Covariance Parameters
The case of two estimates from a single query plan is still not general enough for the
covariance computation problem. Sampling-based correlations will be present in any pair of
estimates that share samples of base relations. These estimates may come from different
queries, and the queries may be based on different sets of base relations. A more generalized

computation of two of these terms. In this single covariance example, using foreign keys for
optimization (without sub-sampling) reduces the analysis time, but the rate of increase with
selectivity is roughly the same. Sub-sampling (target sub-sample size between 100K and 400K
tuples) vastly reduces the analysis time and makes it uniform across all four values of the
selectivity parameter, resulting in a stable total runtime.
8.3.4 Foreign Key Optimization
Foreign key optimization is most useful in cases where a covariance matrix
needs to be computed, e.g., in GROUP-BY queries, where the number of y_S computations is
large (Section 8.1). We use the query in Example 16 and note the running times with and without
foreign key optimization. The query processing and subsampling time is the same in both
cases, but the analysis time dropped from 7 seconds to 0.33 seconds when foreign key optimization
is used.
CHAPTER 9
A SAMPLING ALGEBRA FOR GENERAL MOMENT MATCHING
In previous chapters, we showed that a GUS sampling operator commutes with relational
operators such as selection/cross product in a SOA-equivalent sense. In particular, aggregates
derived from two SOA-equivalent sampling plans have the same first two moments. Such
an equivalence is sufficient if the aim is to compute the variance of these aggregates, and
to construct confidence intervals based on the variance. However, such an equivalence is
not sufficient if one wishes to know deeper distributional properties of the aggregates, such
as skewness or kurtosis. For such endeavors, we need a class of sampling methods which
commute with relational operators in the sense that aggregates for all "equivalent" plans have
the same moments up to a given order. In this chapter, we develop such a class of sampling
methods. In particular, it will be shown that
• common sampling methods such as SWR, SWOR and Bernoulli sampling (and their combinations) are included in this class,
• these sampling methods commute with relational operators while preserving all moments up to a given order k,
• the moments for these methods can be expressed in a form which makes computation easier.
9.1 k-Generalized Uniform Sampling
Consider a database R = R_1 × R_2 × ··· × R_n. Let k ≥ 2 be an arbitrarily chosen positive
integer. To define our class of sampling methods, we first introduce the required notation.
• We say that S = (S_1, S_2, ..., S_{k−1}) is an ordered k-partition of V = {1, 2, ..., n} if S_1, S_2, ..., S_{k−1} are pairwise disjoint subsets of V. In this setting, we denote S_k = V \ (∪_{i=1}^{k−1} S_i).
• The collection of all ordered k-partitions of V is denoted by P_k(V).
• Let t^1, t^2, ..., t^k ∈ R be arbitrarily chosen. Then S({t^i}_{i=1}^k) is an ordered k-partition of V such that
S_m({t^i}_{i=1}^k) = { j ∈ V : there are exactly m distinct values among t^1_j, t^2_j, ..., t^k_j }
for 1 ≤ m ≤ k − 1.
With the above notation in hand, we define our desired class of sampling methods.
Definition 4. A randomized selection process which gives a sample R′ from R is called
a k-Generalized Uniform Sampling (k-GUS) method if, for any t^1, t^2, ..., t^k ∈ R,
P(t^1, t^2, ..., t^k ∈ R′) is a function of S({t^i}_{i=1}^k). In such a case, the k-GUS parameters
b = {b_T | T ∈ P_k(V)} are defined as

b_T = P(t^1, t^2, ..., t^k ∈ R′ | S({t^i}_{i=1}^k) = T),     (9–1)

and the sampling method is denoted by G_{k,b}.
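As a concrete illustration (our own example, not taken from the text), suppose each base relation R_j is sampled independently with a Bernoulli(q_j) method and R′ is the cross product of the per-relation samples. For tuples t^1, ..., t^k ∈ R, coordinate j contributes one independent inclusion event per distinct value, so P(t^1, ..., t^k ∈ R′) = ∏_j q_j^(m_j), where m_j is the number of distinct values among t^1_j, ..., t^k_j, i.e., the index of the block of T containing j. Since this probability depends on the tuples only through the partition T, the method is k-GUS. A minimal Python sketch of the parameter computation under this assumption:

    def bernoulli_kgus_parameter(partition, q):
        """k-GUS parameter b_T for independent per-relation Bernoulli sampling.
        partition maps relation index j to m, the number of distinct values in
        coordinate j among t^1, ..., t^k (i.e., j lies in block S_m of T);
        q maps relation index j to its Bernoulli sampling rate q_j.
        Each of the m distinct values of coordinate j must be retained
        independently, hence the factor q_j ** m."""
        b = 1.0
        for j, m in partition.items():
            b *= q[j] ** m
        return b

    # Example: n = 3 relations and k = 3 tuples; coordinate 1 shows one distinct
    # value, coordinate 2 shows two, coordinate 3 shows three.
    q = {1: 0.1, 2: 0.2, 3: 0.5}
    T = {1: 1, 2: 2, 3: 3}
    print(bernoulli_kgus_parameter(T, q))   # 0.1 * 0.2**2 * 0.5**3 = 0.0005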
Now, we establish some properties of the class of k-GUS methods. We first show that as k
increases, the class of k-GUS sampling methods becomes smaller.
Lemma 1. A (k+1)-GUS method is also a k-GUS method.
Proof. Consider a sampling method which gives a sample R′ from R. Suppose this
sampling method is a (k+1)-GUS method. Let t^1, t^2, ..., t^k ∈ R. Define t^{k+1} = t^1. By the
We now investigate interactions between k-GUS operators when applied to the same data.
Proposition 9.6. For any expression R and k-GUS methods G_{k,b_1} and G_{k,b_2} which are applied
independently,

G_{k,b_1}(G_{k,b_2}(R)) ⇐⇒_kMA G_{k,b}(R),

where b_T = b_{1,T} · b_{2,T}.
The proof follows immediately from the independence of the two k-GUS methods.
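In code, Proposition 9.6 amounts to a pointwise product of the two parameter maps. A minimal Python sketch (partition identifiers are assumed to be hashable keys shared by both maps):

    def compose_kgus(b1, b2):
        """Parameters of two independent k-GUS methods applied one after the
        other: b_T = b_{1,T} * b_{2,T} for every ordered k-partition T."""
        assert b1.keys() == b2.keys(), "both maps must cover the same partitions"
        return {T: b1[T] * b2[T] for T in b1}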
Using Propositions 9.4, 9.5 and 9.6, we can transform a wide variety of query plans to a
kMA-equivalent query plan with sampling at the top. This allows us to construct estimates of
quantities such as skewness (for k = 3) and kurtosis (for k = 4) for SUM-like aggregates.
In future work, we will explore extensions of the notion of kMA equivalence to pairs of
query plans (analogous to extending SOA equivalence to SOA-COV equivalence). Such an
extension will allow us to estimate the skewness, kurtosis, or other quantities involving higher
order moments for functions of multiple SUM-like aggregates. Another future line of research
will be to find compact and computation-friendly expressions for y_S for a general k, i.e., to
extend Lemma 9.1 beyond the case k = 3.
REFERENCES
[1] “SQL-2003 Standard.” 2003.
[2] “Grokit.” 2018.
URL https://github.com/tera-insights/grokit
[3] “SWI-Prolog.” 2018.
URL http://www.swi-prolog.org
[4] “TPC-H Benchmark.” 2018.
URL http://www.tpc.org/tpch
[5] Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. "Join synopses for approximate query answering." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999, 275–286.
[6] Acharya, Swarup, Gibbons, Phillip B., and Poosala, Viswanath. "Aqua: A fast decision support system using approximate query answers." Proc. of 25th Intl. Conf. on Very Large Data Bases. 1999, 754–755.
[7] Acharya, Swarup, Gibbons, Phillip B., Poosala, Viswanath, and Ramaswamy, Sridhar. "The Aqua approximate query answering system." Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, USA: ACM, 1999, 574–576.
URL http://doi.acm.org/10.1145/304182.304581
[8] Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., and Stoica, I. "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data." EuroSys (2013).
[9] Agarwal, Sameer, Milner, Henry, Kleiner, Ariel, Talwalkar, Ameet, Jordan, Michael, Madden, Samuel, Mozafari, Barzan, and Stoica, Ion. "Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[10] Alon, N., Matias, Y., and Szegedy, M. "The space complexity of approximating the frequency moments." ACM Symposium on Theory of Computing. 1996.
[11] Antova, Lyublena, Jansen, Thomas, Koch, Christoph, and Olteanu, Dan. "Fast and simple relational processing of uncertain data." 2008 IEEE 24th International Conference on Data Engineering. IEEE, 2008, 983–992.
[12] Arumugam, Subi, Dobra, Alin, Jermaine, Christopher M., Pansare, Niketan, and Perez, Luis. "The DataPath System: A Data-centric Analytic Processing Engine for Large Data Warehouses." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2010, 519–530.
[13] Babcock, Brian, Datar, Mayur, and Motwani, Rajeev. "Load Shedding in Data Stream Systems." Data Streams - Models and Algorithms. 2007. 127–147.
[14] Bickel, Peter J, Gotze, Friedrich, and van Zwet, Willem R. "Resampling fewer than n observations: gains, losses, and remedies for losses." Statistica Sinica (1997): 1–31.
[15] Bickel, Peter J and Sakov, Anat. "Extrapolation and the bootstrap." Sankhya: The Indian Journal of Statistics, Series A (2002): 640–652.
[16] Bickel, Peter J and Yahav, Joseph A. "Richardson extrapolation and the bootstrap." Journal of the American Statistical Association 83 (1988).402: 387–393.
[17] Buccafurri, F., Lax, G., Saccà, D., Pontieri, L., and Rosaci, D. "Enhancing histograms by tree-like bucket indices." The VLDB Journal 17 (2008): 1041–1061.
[18] Chakrabarti, K., Garofalakis, M. N., Rastogi, R., and Shim, K. "Approximate query processing using wavelets." The VLDB Journal 10 (2001).2-3: 199–223.
[19] Chaudhuri, S., Das, G., and Narasayya, V. "Optimized stratified sampling for approximate query processing." TODS (2007).
[20] Chaudhuri, Surajit and Motwani, Rajeev. "On Sampling and Relational Operators." IEEE Data Eng. Bull. 22 (1999).4: 41–46.
[21] Chaudhuri, Surajit, Motwani, Rajeev, and Narasayya, Vivek R. "On Random Sampling over Joins." SIGMOD Conference. 1999, 263–274.
[22] Cheney, James, Chiticariu, Laura, and Tan, Wang-Chiew. "Provenance in Databases: Why, How, and Where." Foundations and Trends in Databases 1 (2007).4: 379–474.
[23] Cheng, Yu, Qin, Chengjie, and Rusu, Florin. "GLADE: Big Data Analytics Made Easy." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, 697–700.
[24] Cormode, G., Garofalakis, M., Haas, P.J., and Jermaine, C. "Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches." Foundations and Trends in Databases 4 (2012).1-3: 1–294.
[25] Cormode, G. and Muthukrishnan, S. "An improved data stream summary: The count-min sketch and its applications." Journal of Algorithms 55 (2005).1: 58–75.
[26] C. Pang, Q. Zhang, D. Hansen, and A. Maeder. "Unrestricted wavelet synopses under maximum error bound." Proceedings of the International Conference on Extending Database Technology. 2009.
[27] Dalvi, Nilesh and Suciu, Dan. "Efficient query evaluation on probabilistic databases." The VLDB Journal 16 (2007): 523–544.
[28] Dobra, A. "Histograms revisited: When are histograms the best approximation method for aggregates over joins?" Proceedings of ACM Principles of Database Systems. 2005.
[29] Dobra, A., Garofalakis, M., Gehrke, J. E., and Rastogi, R. "Processing complex aggregate queries over data streams." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2002.
[30] Dobra, Alin, Jermaine, Chris, Rusu, Florin, and Xu, Fei. "Turbo-Charging Estimate Convergence in DBO." PVLDB 2 (2009).1: 419–430.
[31] Donjerkovic, D. and Ramakrishnan, R. "Probabilistic optimization of top N queries." Proceedings of the International Conference on Very Large Data Bases. 1999.
[32] Efron, B. "Bootstrap methods: another look at the jackknife." Annals of Statistics 7 (1979): 1–26.
[33] Efron, Bradley. "More Efficient Bootstrap Computations." Journal of the American Statistical Association 85 (1990): 79–89.
[34] Efron, Bradley and Tibshirani, Robert J. An introduction to the bootstrap. CRC press, 1994.
[35] Flajolet, P. and Martin, G. N. "Probabilistic counting algorithms for database applications." Journal of Computer and System Sciences 31 (1985): 182–209.
[36] Ganguly, S. "Counting distinct items over update streams." Theoretical Computer Science 378 (2007): 211–222.
[37] Garofalakis, M. and Gibbons, P. B. "Wavelet synopses with error guarantees." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2002.
[38] ———. "Probabilistic wavelet synopses." ACM Transactions on Database Systems 29 (2004).
[39] Garofalakis, M. and Kumar, A. "Deterministic wavelet thresholding for maximum-error metrics." Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 2004, 166–176.
[40] ———. "Wavelet synopses for general error metrics." ACM Transactions on Database Systems 30 (2005).
[41] Gibbons, Phillip B., Poosala, Viswanath, Acharya, Swarup, Bartal, Yair, Matias, Yossi, Muthukrishnan, S., Ramaswamy, Sridhar, and Suel, Torsten. "AQUA: System and techniques for approximate query answering." Tech. rep., 1998.
[42] Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. "One-pass wavelet decomposition of data streams." IEEE Transactions on Knowledge and Data Engineering 15 (2003).
[43] Gryz, Jarek, Guo, Junjie, Liu, Linqi, and Zuzarte, Calisto. "Query sampling in DB2 Universal Database." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2004, 839–843.
[44] Guha, S. and Harb, B. "Wavelet synopsis for data streams: Minimizing non-euclidean error." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2005.
[45] ———. "Approximation algorithms for wavelet transform coding of data streams." Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms. 2006.
[46] Guha, S., Koudas, N., and Shim, K. "Approximation and streaming algorithms for histogram construction problems." ACM Transactions on Database Systems 31 (2006): 396–438.
[47] Haas, Peter J. "Large-Sample and Deterministic Confidence Intervals for Online Aggregation." SSDBM. IEEE Computer Society Press, 1996, 51–63.
[48] Haas, Peter J. and Hellerstein, Joseph M. "Ripple joins for online aggregation." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999, 287–298.
[49] ———. "Online Query Processing." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2001, 623.
[50] Haas, Peter J., Naughton, Jeffrey F., Seshadri, S., and Swami, Arun N. "Selectivity and Cost Estimation for Joins Based on Random Sampling." Journal of Computer and System Sciences 52 (1996): 550–569.
[51] ———. "Selectivity and cost estimation for joins based on random sampling." J. Comput. Syst. Sci. 52 (1996): 550–569.
[52] Hellerstein, Joseph M., Haas, Peter J., and Wang, Helen J. "Online aggregation." ACM, 1997, 171–182.
URL http://doi.acm.org/10.1145/253262.253291
[53] Horvitz, D. G. and Thompson, D. J. "A generalization of sampling without replacement from a finite universe." Journal of the American Statistical Association 47 (1952): 663–685.
[54] Ioannidis, Y. E. "Approximations in database systems." Proceedings of the International Conference on Database Theory. 2003.
[55] Ioannidis, Y. E. and Christodoulakis, S. "On the propagation of errors in the size of join results." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1991.
[56] ———. "Optimal histograms for limiting worst-case error propagation in the size of join results." ACM Transactions on Database Systems 18 (1993).
[57] Ioannidis, Y. E. and Poosala, V. "Balancing histogram optimality and practicality for query result size estimation." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1995.
[58] Ioannidis, Y.E. and Poosala, V. "Histogram-based approximation of set-valued query-answers." Proceedings of the International Conference on Very Large Data Bases. 1999.
[59] Jermaine, Chris, Arumugam, Subramanian, Pol, Abhijit, and Dobra, Alin. "Scalable approximate query processing with the DBO engine." ACM Trans. Database Syst. 33 (2008).
[61] Joshi, S. and Jermaine, C. "Sampling-Based Estimators for Subset-Based Queries." PVLDB 1 (2009): 181–202.
[62] Kandula, Srikanth, Shanbhag, Anil, Vitorovic, Aleksandar, Olma, Matthaios, Grandl, Robert, Chaudhuri, Surajit, and Ding, Bolin. "Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big Data Clusters." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2016.
[63] Kanne, C.C. and Moerkotte, G. "Histograms reloaded: The merits of bucket diversity." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[64] Karras, P. and Manoulis, N. "One-pass wavelet synopses for maximum-error metrics." PVLDB. ACM, 2005.
[65] Karras, P., Sacharidis, D., and Manoulis, N. "Exploiting duality in summarization with deterministic guarantees." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007.
[66] Kaushik, R., Naughton, J. F., Ramakrishnan, R., and Chakaravarthy, V. T. "Synopses for query optimization: A space-complexity perspective." ACM Transactions on Database Systems 30 (2005): 1102–1127.
[67] Kempe, D., Dobra, A., and Gehrke, J. "Gossip-based computation of aggregate information." Proceedings of the IEEE Conference on Foundations of Computer Science. 2003.
[68] Kleiner, A., Talwalkar, A., Agarwal, S., Stoica, I., and Jordan, M. I. "A general bootstrap performance diagnostic." KDD (2013).
[69] Kleiner, Ariel, Talwalkar, Ameet, Sarkar, Purnamrita, and Jordan, Michael. "The big data bootstrap." arXiv preprint arXiv:1206.6415 (2012).
[70] Laptev, N., Zeng, K., and Zaniolo, C. "Early Accurate Results for Advanced Analytics on MapReduce." PVLDB 5 (2012).
[71] Li, Kun, Wang, Daisy Zhe, Dobra, Alin, and Dudley, Christopher. "UDA-GIST: An In-database Framework to Unify Data-parallel and State-parallel Analytics." PVLDB (2015): 557–568.
[72] Lipton, Richard J., Naughton, Jeffrey F., Schneider, Donovan A., and Seshadri, S. "Efficient sampling strategies for relational database operations." Theoretical Computer Science 116 (1993).1: 195–226.
[73] Liu, R.Y. and Singh, K. "Using i.i.d. bootstrap inference for general non-i.i.d. models." Journal of Statistical Planning and Inference 43 (1999): 67–75.
[74] Matias, Y. and Urieli, D. "Optimal workload-based weighted wavelet synopses." Theoretical Computer Science 371 (2007): 227–246.
[75] M. Charikar, K. Chen, and M. Farach-Colton. "Finding frequent items in data streams." International Colloquium on Automata, Languages and Programming. 2002.
[76] Nirkhiwale, Supriya, Dobra, Alin, and Jermaine, Chris. "A Sampling Algebra for Aggregate Estimation." PVLDB 6 (2013).14: 1798–1809.
[77] Olken, Frank. "Random Sampling from Databases." 1993.
[78] Pansare, Niketan, Borkar, Vinayak, Jermaine, Chris, and Condie, Tyson. "Online Aggregation for Large MapReduce Jobs." PVLDB 4 (2011): 1135–1145.
[79] Piatetsky-Shapiro, Gregory and Connell, Charles. "Accurate estimation of the number of tuples satisfying a condition." Proceedings of the 1984 ACM SIGMOD international conference on Management of data. ACM, 1984, 256–276.
[80] Pol, A. and Jermaine, C. "Relational confidence bounds are easy with the bootstrap." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2005.
[81] Politis, D.N., Romano, J.P., and Wolf, M. Subsampling. Springer, New York, 1999.
[82] Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. "Improved histograms for selectivity estimation of range predicates." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1996.
[83] Re, Christopher and Suciu, Dan. "The trichotomy of HAVING queries on a probabilistic database." PVLDB 18 (2009): 1091–1116.
[84] Rusu, F. and Dobra, A. "Statistical Analysis of Sketch Estimators." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2007.
[85] Rusu, F. and Dobra, A. "Sketches for size of join estimation." ACM Transactions on Database Systems 33 (2008).3.
[86] Rusu, Florin and Dobra, Alin. "Sketching Sampled Data Streams." Proceedings of IEEE ICDE. 2009.
[87] Sen, Prithviraj, Deshpande, Amol, and Getoor, Lise. "Read-once functions and query evaluation in probabilistic databases." Proceedings of the VLDB Endowment 3 (2010).1-2: 1068–1079.
[88] Tran, Thanh T, Peng, Liping, Diao, Yanlei, McGregor, Andrew, and Liu, Anna. "CLARO: modeling and processing uncertain data streams." The VLDB Journal 21 (2012).5: 651–676.
[89] Tran, Thanh TL, Diao, Yanlei, Sutton, Charles, and Liu, Anna. "Supporting user-defined functions on uncertain data." Proceedings of the VLDB Endowment 6 (2013).6: 469–480.
[90] Vitter, J. S. and Wang, M. "Approximate computation of multidimensional aggregates of sparse data using wavelets." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999.
[91] Wang, Daisy Zhe, Michelakis, Eirinaios, Garofalakis, Minos, and Hellerstein, Joseph M. "BayesStore: managing large, uncertain data repositories with probabilistic graphical models." Proceedings of the VLDB Endowment 1 (2008).1: 340–351.
[92] Wang, H. and Sevcik, K. C. "Utilizing histogram information." Proceedings of CASCON. 2001.
[93] ———. "Histograms based on the minimum description length principle." VLDB Journal 17 (2008).
[94] Whang, K. Y., Vander-Zanden, B. T., and Taylor, H. M. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems 15 (1990): 208.
[95] Xu, Fei, Jermaine, Christopher M., and Dobra, Alin. "Confidence bounds for sampling-based group by estimates." ACM Trans. Database Syst. 33 (2008).
[96] Zeng, Kai, Gao, Shi, Mozafari, Barzan, and Zaniolo, Carlo. "The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2014, 277–288.
BIOGRAPHICAL SKETCH
Supriya Nirkhiwale received her B.E. degree in electronics and telecommunications from
the Sri Govindram Sekseria Institute of Technology, Indore, India in 2006. She received her
master’s degree in electrical engineering in 2009 from Kansas State University and Ph.D. in
computer science in 2018 from the University of Florida. Her primary research is focused on
building theory and scalable frameworks for Approximate Query Processing in large databases. She
has been working as a data scientist for LexisNexis since 2014.