A SAMPLING ALGEBRA FOR SCALABLE APPROXIMATE QUERY PROCESSING
By
SUPRIYA NIRKHIWALE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Supriya Nirkhiwale
To Vedaant and Kshitij
ACKNOWLEDGMENTS
Thanks to my advisor, Alin Dobra, for introducing me to an interesting set of problems. I
have learnt from him how the right abstraction makes problems simple and tractable. Thanks
to my husband Kshitij, my son Vedaant, my family, and my friends, without whom none of this would have been possible.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A SAMPLING ALGEBRA FOR SCALABLE APPROXIMATE QUERY PROCESSING
By
Supriya Nirkhiwale
May 2018
Chair: Alin V. Dobra
Major: Computer Engineering
As of 2005, sampling has been incorporated in all major databases. While efficient
sampling techniques are easily realizable, determining the accuracy of an estimate obtained
from the sample is still an unresolved problem. In the first part of this dissertation, we present
a theoretical framework that allows an elegant treatment of the problem. We base our work
on generalized uniform sampling (GUS), a class of sampling methods that subsumes a wide
variety of sampling techniques. We introduce a key notion of equivalence that allows GUS
sampling operators to commute with selection and join, and allows the derivation of confidence
intervals for SUM-like aggregates obtained by a very general class of queries.
The use of sampling for approximate query processing in large data warehousing
environments has another significant limitation: sampling large tables is expensive. For
applications like approximate exploration, it can be wasteful to compute a single sample to
estimate only one value. Resources are better utilized if a single sample can be used to obtain
multiple estimates from the data. So far, it has not been possible to achieve this because
multiple estimates from the same sample are correlated and the theory to compute these
correlations was missing. In the second part of this dissertation, we provide a theoretical
framework for a lightweight, add-on tool to any database for computing covariance between
two estimates that arise from a common set of base relation samples. This theory also makes
it possible to compute a covariance matrix between groups of data in a GROUPBY query or
variances of estimates like AVG that are functions of SUM-like aggregates.
We illustrate the theory through extensive examples and give indications on how to use it
to provide meaningful estimation in database systems.
CHAPTER 1
INTRODUCTION
Sampling has long been used by database practitioners to speed up query evaluation,
especially over very large data sets. For many years it was common to see SQL code of the
form “WHERE RAND() > 0.99”. Widespread use of this sort of code led to the inclusion of
the TABLESAMPLE clause in the SQL-2003 standard [1]. Since then, all major databases have
incorporated native support for sampling over relations. One such query, using the TPC-H
schema, is:
SELECT SUM(l_discount*(1.0-l_tax))
FROM lineitem TABLESAMPLE (10 PERCENT),
orders TABLESAMPLE (1000 ROWS)
WHERE l_orderkey = o_orderkey AND
l_extendedprice > 100.0;
The result of this query is obtained by taking a Bernoulli sample with p = .1 over
lineitem and joining it with a sample of size 1000 obtained without replacement (WOR),
from orders and evaluating the SUM aggregate.
In practice, there are two main reasons practitioners write such code. One is that sampling
is useful for debugging expensive queries. The query can be quickly evaluated over a sample as
a sanity check, before it is unleashed upon the full database.
The second reason is that the practitioner is interested in obtaining an idea as to what the
actual answer to the query would be, in less time than would be required to run the query over
the entire database. This might be useful as a prelude to running the query “for real”—the
user might want to see if the result is potentially interesting—or else the estimate might be
used in place of the actual answer. Often, this situation arises when the query in question
performs an aggregation, since it is fairly intuitive to most users that sampling can be used to
obtain a number that is a reasonable approximation of the actual answer.
The problem we consider here comes from the desire to use sampling as an approximation
methodology. In this case, the user is not actually interested in computing an aggregate such
as “SUM(l discount*(1.0-l tax))” over a sample of the database. Rather, s/he is interested
in estimating the answer to such a query over the entire database using the sample. This
presents two obvious problems:
• First, what SQL code should the practitioner write in order to compute an estimate for a particular aggregate?
• Second, how does the practitioner have any idea how accurate that estimate is?
Ideally, a database system would have built-in mechanisms that automatically provide
estimators for user supplied aggregate queries, and that automatically provide users with
accuracy guarantees. Along those lines, in this dissertation, we study how to automatically support such functionality for queries like the one above.
Presented with such a query, the database engine will use the user-specified sampling to
automatically compute two values lo and hi that can be used as a [0.05, 0.95] confidence
bound on the true answer to the query. That is, the user has asked the system to compute
values lo and hi such that there is a 5% chance that the true answer is less than lo, and
there is a 95% chance that the true answer is less than hi. In the general case, the user
should be able to specify any aggregate over any number of sampled base tables using any
sampling scheme, and the system would automatically figure out how to compute an estimate
of the desired confidence level. A database practitioner need have no idea how to compute
an estimate for the answer, nor does s/he need to have any idea how to compute confidence
bounds; the user only specifies the desired level of confidence, and the system does the rest.
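To make the bound computation concrete, the following is a minimal sketch of how lo and hi could be derived from an unbiased estimate and its variance under a normality assumption for the estimator; the function name and the numbers are illustrative, not the interface of an actual system.

from scipy.stats import norm

def one_sided_bounds(estimate, variance, lo_q=0.05, hi_q=0.95):
    # Under a normal approximation for the sampling estimator,
    # P(true answer < lo) ~= lo_q and P(true answer < hi) ~= hi_q.
    sd = variance ** 0.5
    lo = estimate + norm.ppf(lo_q) * sd   # norm.ppf(0.05) is negative
    hi = estimate + norm.ppf(hi_q) * sd
    return lo, hi

# Hypothetical numbers: an estimated SUM of 1.2e6 with variance 4.0e8.
print(one_sided_bounds(1.2e6, 4.0e8))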
Existing Work on Database Sampling. While there has been a lot of research on
implementing efficient sampling algorithms [72, 77], providing confidence intervals for the
sample estimate is understood only for a few restricted cases. The simplest is when only a
single relation is sampled. A slightly more complicated case was handled by the AQUA system
developed at Bell labs [5–7, 41]. AQUA considered correlated sampling where a fact table
in a star schema is sampled. These cases are relatively simple because when a single table is
sampled, classical sampling theory applies with a few easy modifications. Simultaneous work
on ripple joins and online aggregation [47–50, 52] extended the class of queries amenable to
analysis to include those queries where multiple tables are sampled with replacement and then
joined. See Chapter 3 for an extensive review of related work.
Unfortunately, the extension to other types of sampling is not straightforward, and to
date new formulas have been derived every time a new sampling scheme is considered (for example,
two-table without-replacement sampling [60]). Our goal is to provide a simple theory that
makes it possible to handle very general types of queries over virtually any uniform sampling
scheme: with replacement sampling, fixed-size without replacement sampling, Bernoulli
sampling, or whatever other sampling scheme is used. The ability to easily handle arbitrary
types of sampling is especially important given that the current SQL standard allows for a
somewhat mysterious SYSTEM sampling specification, whose exact implementation (and hence
its statistical properties) are left up to the database designers. Ideally, it should be easy for a
database designer to apply our theory to an arbitrary SYSTEM sampling implementation.
Generalized Uniform Sampling. One major reason that new theory and derivations
were previously required for each new type of sampling is that the usual analysis is tuple-based,
where the inclusion probability of each tuple in the output set is used as the basic building
block; computing expected values and variances requires intricate algebraic manipulations
of complicated summations. We use the notion of Generalized Uniform Sampling (GUS)
(see Definition 1) that subsumes many different sampling schemes (including all of the
aforementioned ones, as well as block-based variants thereof).
Our Contributions for SUM-like Aggregates. In Part II of this dissertation, we
develop an algebra over many common relational operators, as well as the GUS operator. This
makes it possible to take any query plan that contains one or more GUS operators and the
supported relational operators, and perform a statistical analysis of the accuracy of the result in
an algebraic fashion, working from the leaves up to the top of the plan.
No complicated algebraic manipulations over nested summations are required. This
algebra can form the basis for a lightweight tool for providing estimates and error bounds,
that should be easily integrable into any database system. The database need only feed the
tool the user-specified confidence levels, the set of tuples returned by the query, some simple
lineage information over those result tuples, and the query plan, and the tool can automatically
compute the desired error bounds.
The specific contributions we make are:
• We define the notion of Second Order Analytical equivalence (SOA equivalence), a key equivalence relationship between query plans that is strong enough to allow variance analysis but weak enough to ensure commutativity of sampling and relational operators.
• We define the GUS operator that emulates a wide class of sampling methods. This operator commutes with most relational operators under SOA-equivalence.
• We develop an algebra over GUS and relational operators that allows derivation of SOA-equivalent plans. These plans easily allow moment calculations that can be used to estimate error bounds.
• We describe how our theory can be used to add estimation capabilities to existing databases so that the required changes to the query optimizer and execution engine are minimal. Alternatively, the estimator can be implemented as an external tool.
Our work provides a straightforward analysis for the SUM aggregate. It can be easily
extended to COUNT by replacing the aggregated attribute with the constant 1 and applying the analysis
for SUM on this attribute. Though the analysis for AVERAGE presents a slightly non-linear
case, the analyses for SUM and COUNT lay a foundation for it. The confidence intervals can
be derived using a method for approximating probability distribution/variance such as the delta
method. The analyses for MIN, MAX and DISTINCT are extremely hard problems due
to their non-linearity. For example, DISTINCT requires an estimate of all the distinct values in
the data and the number of such values. It is thus beyond the scope of this dissertation.
While selections and joins are the highlight of our work, we show that SOA-equivalence
allows analysis for other database operators like cross-product (compaction), intersection
(concatenation) and union.
Multiple Correlated Aggregates Sharing the Same Sample. Practical AQP
applications often require support for a wide variety of estimates. In the first part of our
work, we focused on computing the variance of SUM-like estimates from query plans with
multiple joins. A natural question to ask would be: how can we generalize this framework to
accommodate:
• covariance between estimates
• non-linear estimates that can be constructed from multiple SUM-like aggregates, e.g., AVG, VAR.
• the GROUPBY clause. It is often more interesting to derive estimates for various groups in the data, and compute their covariances for further analysis.
In a practical approximate exploration setting for large data, we almost never ask a single
question or compute only a single estimate per sample. The warehoused data can be petabytes in
size, and a reasonable sample should be at least terabytes in size. The cost of obtaining
a sample of such a size can itself be significant. In some cases the underlying data itself is
a sample. It is prudent, or rather necessary, to reuse the obtained sample for computing the
required multiple estimates. These estimates are correlated since they are derived from the
same set of base relation samples. The user typically plugs in multiple estimates to generate
approximate answers to queries of interest. These approximate answers are incomplete without
[Figure 1-1. Queries on samples of TPC-H relations: query plans that join a Bernoulli(0.1) sample of lineitem, a SWOR(1000) sample of orders, a Bernoulli(0.1) sample of customer, and the nation relation (with selections σ1, σ2) to produce the estimates SUMloc, COUNTloc, COUNTocn, COUNTlocn, and GROUPBYlocn.]
an error bound which quantifies the quality of the approximation. Computing covariances
between the multiple estimates is crucial for obtaining the required error bound. Ignoring these
correlations often leads to misleading results.
Example 1. Fig 1-1 shows an example using the TPC-H schema [4], where multiple estimates
are derived from a shared set of samples: a Bernoulli(0.1) sample from lineitem,
a sample of 1000 tuples w/o replacement from 150,000 tuples of orders & a
Bernoulli(0.1) sample from customer. These 3 base relation samples are used to derive the
following five Horvitz-Thompson ([24, 53]) estimators:
• SUMloc as 15000 ∗ SUM(l_discount*(1.0-l_tax))
• COUNTloc as 15000 ∗ COUNT(*) from the join of the samples from lineitem, orders and customer
• COUNTocn as 1500 ∗ COUNT(orderkey) from the join of the samples from orders, customer and nation
• COUNTlocn as 15000 ∗ COUNT(*) & GROUPBYlocn as 15000 ∗ SUM(o_totalprice) GROUPBY n_name from the join of samples from lineitem, orders and customer and the nation table.
Note that the sample aggregates used in the above estimators have been scaled by appropriate
factors to obtain unbiased estimates of the corresponding population aggregates.
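The scale factors above are simply the inverses of the joint inclusion probabilities of the result tuples: for independently sampled base relations, the inclusion probability of a join result tuple is the product of the per-relation inclusion probabilities of the samples it touches. A minimal sketch, with the probabilities taken from Example 1 (the helper itself is illustrative):

# First-order inclusion probabilities of the base relation samples in Example 1.
inclusion_prob = {
    "lineitem": 0.1,            # Bernoulli(0.1)
    "orders":   1000 / 150000,  # 1000 tuples WOR out of 150,000
    "customer": 0.1,            # Bernoulli(0.1)
    "nation":   1.0,            # used in full, no sampling
}

def ht_scale(relations):
    # Inverse inclusion probability of a join-result tuple, assuming the
    # base relations are sampled independently of each other.
    p = 1.0
    for r in relations:
        p *= inclusion_prob[r]
    return 1.0 / p

print(ht_scale(["lineitem", "orders", "customer"]))            # 15000 (SUMloc, COUNTloc)
print(ht_scale(["orders", "customer", "nation"]))              # 1500  (COUNTocn)
print(ht_scale(["lineitem", "orders", "customer", "nation"]))  # 15000 (COUNTlocn, GROUPBYlocn)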
It is common practice for users to submit work units consisting of requests for multiple
estimates, where each estimate uses a subset of base relation samples. These estimates are
typically used in further analysis. For example, the user may want to approximate the average
price per lineitem by a function
f1(SUMloc, COUNTloc) = SUMloc / COUNTloc.
Approximations for other desired quantities can often involve much more complex functions.
For example, the average price of an order in a given nation can be approximated by
f2(SUMloc, COUNTloc, COUNTlocn, COUNTocn) = (SUMloc / COUNTloc) × (COUNTlocn / COUNTocn).
The variance of the approximation f1 is given by
V(f1(SUMloc, COUNTloc)) ≈ (SUMloc² / COUNTloc²) [ V(SUMloc)/SUMloc² + V(COUNTloc)/COUNTloc² − 2 Cov(SUMloc, COUNTloc)/(SUMloc COUNTloc) ],   (1–1)
and the variance of the approximation f2 (refer to Eq (3–3)) is a function of 6 covariance terms
- one for every pair of estimates from SUMloc, COUNTloc, COUNTlocn, COUNTocn, and 4 variance
terms.
The only way to avoid the covariance terms (by making them equal to zero) would
be to compute independent samples of lineitem, orders and customer per estimate
(12 independent base relation samples). This is clearly inefficient and wasteful. Another
example with multiple estimates is GROUPBYlocn, a set of 25 estimates (one for each nation).
By design, these estimates come from the same set of base relation samples and have
non-trivial covariances. The quality of an approximation can only be meaningfully and correctly
understood by knowing the pairwise covariances of various estimates used in the approximation.
As demonstrated by the above example, in modern AQP settings with multiple correlated
estimates, it is therefore crucial to compute pairwise covariance estimates, whether the
user needs to construct sophisticated functions of multiple estimates or to get accurate
simultaneous error bounds for the individual estimates (once covariances are computed, a
recipe for computing simultaneous confidence intervals is available in [95]). Previous work on
covariance computation focuses on specific individual settings, where this computation can be
a tedious, arduous process. There is a need for a clean and general algebraic approach (similar
to the approach that we develop for variance) to make covariance computation tenable in
practical settings.
Our Contributions for Covariance Computation. We address the covariance
computation problem in Part III of the dissertation. We provide a list of our specific
contributions below.
• We define the notion of Second Order Analytical Covariance equivalence (SOA-COV equivalence) between two pairs of query plans, a broad and non-trivial generalization of the notion of SOA-equivalence developed in the first part of the dissertation. This equivalence is strong enough to allow covariance analysis, and subsumes the framework in [76], as variance can be thought of as a self-covariance.
• We develop an algebra over GUS and relational operators that allows derivation of SOA-COV equivalent plans. These plans easily allow moment calculations to estimate covariance.
These analytical results form the basis for the development of a practical lightweight add-on
tool that
• computes the pairwise covariances between multiple estimates from a common sample.
• empowers the practitioner with the ability to compute error bounds for functions of sum-like aggregates, covariance matrices for GROUPBY queries, and simultaneous confidence intervals for multiple estimates.
• is based on theory which is independent of the number of joins involved, platform or schema, and uses only a single sample.
Structure of the dissertation. The rest of the document is organized as follows. In
Chapter 2, we provide a detailed overview of related work in approximate query processing.
In Chapter 3, we review concepts related to Generalized Uniform Sampling (GUS) methods
in detail. In Chapter 4, we introduce the notion of SOA-equivalence between query plans and
prove that GUS operators commute with a variety of relational operators in the SOA sense.
We also investigate interactions between GUS operators when applied to the same data. In
Chapter 5, we provide insights on how our theory can be used to implement a separate add-on
tool and how the performance of the variance estimation can be enhanced. Furthermore, we
test our implementation thoroughly, and provide accuracy and runtime analysis. In Chapter 6,
we propose to extend this theory to accommodate multiple estimates, e.g., as in GROUPBY
queries. We explore the general difficulties associated with this problem and outline the major
technical challenges. In Chapter 7, we provide a solution to this problem by introducing the
notion of SOA-COV equivalence between pairs of query plans. We develop an algebra which
allows us to transform a given pair of query plans to an analyzable pair of query plans, thereby
giving us the ability to compute the covariance between any pair of aggregates resulting from
these plans. In Chapter 8, we provide a thorough experimental testing of the SOA-COV based
theory and discuss issues relevant to implementation. In Chapter 9, we consider the problem
of estimating higher order moments of aggregate estimators, and develop the notions of
k-Generalized Uniform Sampling methods and kMA equivalence to provide a solution.
CHAPTER 2
RELATED WORK
The idea of using sampling in databases for deriving estimates for a single relation was
first studied by Shapiro et al. [79]. Since then, much research has focused on implementing
efficient sampling algorithms in databases [72, 77]. Providing confidence intervals on estimates
for SQL aggregate queries is a difficult problem with limited progress so far. The previous
literature can be roughly classified into the following areas.
2.1 Analytical Bounds
The problem of providing closed form analytical bounds for approximate database queries
has a roughly three decade long history. There has been a large body of research on using
sampling to provide quick answers to database queries, on database systems [8, 19, 52, 59, 61,
78], and data stream systems [13, 74]. Olken [77] studied the problem for specific sampling
methods for a single relation. This line of work ended abruptly when Chaudhuri et al. [20, 21]
proved that extracting IID samples from a join of two relations is infeasible.
Another line of research was the extension to the correlated sampling pioneered by the
AQUA system [6, 7, 41]. AQUA is applicable to a star schema, where the goal is sampling
from the fact table, and including all tuples in dimension tables that match selected fact table
tuples. The AQUA type of sampling has been incorporated in DB2 [43].
The reason confidence intervals can be provided for AQUA type sampling is the fact
that independent identically distributed (IID) samples are obtained from the set over which
the aggregate is computed. A straightforward use of the central limit theorem readily allows
computation of good estimates and confidence intervals. Indeed, it is widely believed [6, 7, 20,
21, 41, 77] that IID samples at the top of the query plan are required to provide any confidence
interval. This idea leads to the search for a sampling operator that commutes with database
operators. This endeavor proved to be very difficult from the beginning [20] when joins are
involved. To see why this is the case, consider a tuple t ∈ orders and two tuples u1, u2 in
lineitem that join with t (i.e. they have the same value for orderkey). Random selection
of tuples t, u1, u2 in the sample does not guarantee random selection of result tuples (t, u1)
and (t, u2). If t is not selected, neither tuple can exist, and thus sampling is correlated. A lot
of effort [20, 21] has been spent in finding practical ways to de-correlate the result tuples with
only limited success.
Substantial research has been devoted to deriving samples from input relations in advance
and using them to approximate answers to ad-hoc queries [8]. These methods may provide
significant benefit when queries, predicates or query columns are predictable/known in
advance. However, they offer limited support for joins. Any join has to be with small
dimension tables on foreign keys. Multiple joins over large tables are not supported.
Progress has been made using a different line of thought by Hellerstein and Hass [52] and
the generalization in [51] for the special case of sampling with replacement. The problem of
producing IID result samples is avoided by developing central limit theorem-like results for the
combination of relation level sampling with replacement. The theory was generalized first to
sampling without replacement for single join queries [60], then further generalized to arbitrary
uniform sampling over base relations and arbitrary SELECT-FROM-WHERE queries without
duplicate elimination in DBO [59], and finally to allow sampling across multiple relations
in Turbo-DBO [30]. Even though some simplification occurred through these theoretical
developments, they are mathematically heavy and hard to understand/interpret. Moreover,
the theory, especially DBO and Turbo-DBO, is tightly coupled with the systems developed to
exploit it.
Technically, one major problem in all the mathematics used to analyze sampling schemes
is the fact that the analyses use functions and summations over tuple domains, and not the
operators and algebras that the database community is used to. This makes the theory hard to
comprehend and apply. The fact that no database system picked up these ideas to provide a
confidence interval facility is a direct testament to these difficulties.
While recent progress has been made on generalized variance computation, the much
more daunting issue of computing pairwise covariances between multiple estimates has been
barely explored. The need for covariance computation arises in a wide variety of situations.
In the context of sampling from databases, covariance computation is crucial for obtaining
efficient simultaneous confidence intervals for multiple GROUPBY estimates. Kandula et al.
[62] extend the notion of SOA-equivalence between plans introduced in [76] to the notion of
Sampling Dominance between plans. They argue that the variance of a transformed plan need
not be exactly equal to the variance of the original plan, because obtaining an upper bound
for the variance (and hence the error) of the sampling based estimator may be good enough
for obtaining an error bound. Relaxing the notion of SOA-equivalence allows them to consider
query plans with non-GUS samplers. They construct non-GUS extensions of the Bernoulli
sampler called the Distinct sampler and the Universe sampler, and develop a framework for
transforming any plan with these samplers to a plan with sampling only on top (just before
aggregation) such that the error for the transformed plan is greater than or equal to that of
the original plan. As we demonstrate in Example 1, to compute the variance of a function of
multiple aggregate estimators (like AVG) or to compute a corresponding joint confidence region
for these estimators, all the pairwise covariances need to be computed (see (3–3)). This issue
is acknowledged in [62, Appendix C] in the context of AVG, but a framework to estimate the
covariance is not developed. In [95], the authors derive closed form estimators for the pairwise
covariances between aggregates from a GROUPBY query, for the specific case of sampling
without replacement. These covariances are then used to construct simultaneous confidence
intervals. Pansare et al. [78] develop a very sophisticated Bayesian framework to infer the
confidence bounds of approximate aggregate answers. However, this approach is limited to
simple group-by aggregate queries and does not provide a systematic way of quantifying
approximation quality.
2.2 Bootstrapping
Bootstrap [32, 34] is a popular resampling based statistical technique for obtaining
confidence/error bounds for a wide variety of estimates. This method is particularly useful in
settings where closed form variance estimates are not available (MIN, MAX, nested queries,
etc). The basic idea behind bootstrap is simple: obtain a large number of resamples from the
existing samples, compute the required estimate for each of the resamples, and use them to
approximate the sampling distribution of the original estimate, providing error bounds as a
byproduct. Though conceptually simple and powerful, this method requires repeated estimator
computation on resamples having a size comparable to that of the original dataset. This poses the obvious
challenges of computational efficiency and provides the basis for multiple lines of work.
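For readers unfamiliar with the technique, the following is a minimal sketch of the classical percentile bootstrap for a single sampled relation; the estimator, sampling fraction, and number of resamples are illustrative, and the sketch deliberately ignores the join and non-IID issues discussed below.

import numpy as np

def bootstrap_ci(sample, estimator, n_resamples=1000, alpha=0.10, seed=0):
    # Percentile-bootstrap confidence interval for estimator(sample).
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = sample[rng.integers(0, n, size=n)]
        stats.append(estimator(resample))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

values = np.random.default_rng(1).exponential(scale=10.0, size=5000)
# Scale a Bernoulli(0.1) sample SUM up by 1/0.1 to estimate the full-data SUM.
print(bootstrap_ci(values, estimator=lambda s: s.sum() / 0.1))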
Research in statistical methodology has focused on reducing the number of
Monte Carlo resamples required [33, 34] or reducing the size of resamples [14–16, 81].
The techniques for reducing the number of resamples introduce additional complexity of
implementation and still need repeated estimator computations on resamples having size
comparable to that of the original dataset. There is some computational interest in lowering
the size of the resamples with bootstrap variants, such as m-out-of-n sampling, where the size
of the resamples is smaller than that of the original sample. These techniques are sensitive to the
choice of parameters (size of resamples) [69] and the analytical correction requires the prior
knowledge of the convergence rate of the estimator, making them infeasible to automate.
The requirement of prior theoretical knowledge can be avoided by averaging the distributions
of the smaller resamples [69], but the automatic selection of resampling parameters in a
computationally efficient manner remains a challenge.
In the last decade, various approaches for using Bootstrap in AQP have been developed
in the literature. These approaches focus mainly on reducing the computational overhead
associated with repeated resampling. Pol et al. [80] developed a resampling tree data structure
for every base relation, that holds indicator random variables for inclusion of a tuple in the
sample. Laptev et al. [70] target MapReduce platforms and study how to overlap computation
across different bootstrap trials or bootstrap samples. The Analytic Bootstrap [96] provides a
probabilistic relational model for symbolically executing the bootstrap and develops new relational
operators that combine random variables. Since bootstrap is a simulation-based technique,
recent work [9, 68] demonstrates the need for diagnostic methods to identify when bootstrap
based techniques are unreliable.
While a vast amount of the bootstrap literature focuses on computational issues, a few
issues arise with its application.
Joins. The asymptotic theory for bootstrap is valid only under the assumption that the
elements of the sample are independent and identically distributed (IID). Pol and Jermaine [80]
use a resampling tree per base relation to simulate multiple instances of sampling with replacement
and estimate the accuracy of aggregates over joins. Even if the base relation samples are
IID, it is well known that joining two IID samples does not lead to an IID sample from the
relevant cross product space [21]. It is not clear if the asymptotic results for i.i.d. bootstrap
are applicable in the presence of joins and there are no known theoretical guarantees for
consistency. While empirical results suggest that these bootstrap-based methods work
efficiently, it is important to point out that the theoretical guarantees of bootstrap are lost.
Sampling generality. The IID assumption also restricts traditional bootstrap applications
to use sampling with replacement. This assumption does not hold for GUS methods that
subsume a wider class of generic sampling methods (e.g., Bernoulli). Some theoretical results
about validity of the bootstrap when elements of the sample are only independent (but not
necessarily identically distributed) are available, but hold under specific regularity assumptions
[73]. The assumption of independence itself does not hold for GUS methods such as sampling
without replacement. Another direct consequence is that bootstrap samples have to be computed a priori,
whereas GUS methods work both for a priori samples and for samples computed inline.
To summarize, the strength of the Bootstrap lies in its wide applicability, but in the
presence of sampling-induced correlations, it is theoretically and computationally preferable to
use (if available) a closed-form, non-simulation-based estimator with rigorous guarantees for
accuracy. This is the approach that we pursue in this dissertation.
2.3 Other Areas
Probabilistic databases. Much existing work in this area [11, 27, 83, 87, 91] uses
possible world semantics to model uncertain data and its query evaluation. Tuples in a
probabilistic database have binary uncertainty, i.e., they either exist or not with a certain
probability. Specifically, [27, 83] use semirings for modeling and querying probabilistic
databases, focusing on conjunctive queries with HAVING clauses. Many probabilistic databases
assume IID tuples [11, 27, 83, 87] or propose new query evaluation methods to handle
particular correlations [88, 89].
Sketches. Sketching is another common technique that is used to provide approximate
answers for aggregate queries over data streams. Sketching methods use randomized
algorithms that combine random seeds with data to produce random variables whose
distribution depends on the true aggregate value. One class of sketching techniques focuses on
accurately estimating the individual frequencies in a data stream [10, 25, 75]. These frequency
estimates can be used to compute join sizes, quantiles, heavy hitters, etc. Another class of
sketching techniques focuses on COUNT DISTINCT queries [35, 36, 94]. Dobra et al. [29]
develop sketching methods based on partitioning for join size estimation with multiple join
conditions. Dobra and Rusu [84, 86] provide a rigorous statistical analysis of various sketch
algorithms for join size estimation and perform extensive empirical evaluations. In [85], the
authors study a method that combines sampling and sketching and investigate the dependence
of the variance of the resulting estimators on these two components.
Wavelets and Histograms. A tool which has a rich history in signal processing and
statistics, but recently has generated a lot of interest in AQP applications is wavelets.
Wavelets provide an effective way of representing relational data in terms of appropriate
wavelet coefficients using linear transformations. In big data settings, one can obtain a
compressed/approximate representation of the data by only keeping a certain number of
wavelet coefficients, and setting the rest of them to zero. Developing meaningful methods
to choose the best wavelet coefficients to keep in the approximate representation has been
the main focus of current research in this area. See, for example, [18, 26, 37–40, 42, 44,
45, 64, 65, 74, 90]. Histograms also provide a well-established and well-studied approach
for summarizing data, in both databases and statistics. In the last three decades, several
methods for using histograms for AQP have been developed in the database community. See
[7, 17, 31, 46, 54, 56–58, 63, 67, 79, 82, 92, 93] to name just a few. For a rigorous analysis of
the theoretical properties of histograms, see [28, 55, 66].
See [24] for a detailed review and extensive references for sketches, histograms, wavelets,
and sampling-based methods. As discussed in [24, Chapter 6], each of these methods
is useful and has comparative advantages and disadvantages. In particular, sampling provides
a flexible approach that works for general-purpose queries and adapts much more easily to
high-dimensional data as compared to the other methods, thereby occupying a unique and
invaluable place in the toolbox for modern AQP.
CHAPTER 3
TECHNICAL PRELIMINARIES
The aim of this chapter is to review Generalized Uniform Sampling (GUS) methods, which
are a key ingredient in our theory. We start with a quick overview of other sampling methods
for databases. Then, we define GUS methods, and provide details on how to get estimates and
confidence intervals using these methods. We also review the multivariate delta method, which
allows us to estimate the variance of a function of several SUM-like aggregates in terms of the
individual variances and pairwise covariances.
3.1 Generalized Uniform Sampling
Definition 1 (GUS Sampling [30]). A randomized selection process G(a,b) which gives a sample
R′ from R = R1 × R2 × · · · × Rn is called a Generalized Uniform Sampling (GUS) method
if, for any given tuples t = (t1, ..., tn), t′ = (t′1, ..., t′n) ∈ R, P(t ∈ R′) is independent of
t, and P(t, t′ ∈ R′) depends only on {i : ti = t′i}. In such a case, the GUS parameters a,
b = {bT | T ⊆ {1 : n}} are defined as:
a = P[t ∈ R′]
bT = P[t ∈ R′ ∧ t′ ∈ R′ | ∀i ∈ T, ti = t′i, ∀j ∈ T^C, tj ≠ t′j].
This definition requires GUS sampling to behave like a randomized filter. In particular,
any GUS operator can be viewed as a selection process from the underlying data, a process
that can introduce correlations. The uniformity of GUS requires that the randomized filtering
is performed on the lineage of tuples and not on their content. As simple as the idea is, expressing
any sampling process in the form of GUS is a non-trivial task. Example 2 shows the calculation
of GUS parameters for a simple case.
Example 2. In this example, we show how the GUS definition above can be used to
characterize the estimation necessary for the query from Chapter 1. We denote by l_s the
Bernoulli sample with p = 0.1 from lineitem and by o_s the WOR sample of size 1000 from
orders. We assume that the cardinality of orders is 150000. Henceforth, for ease of exposition,
we will denote all base relations involved by their first letters. For example, lineitem will be
denoted by l.
Applying the definition above and the independence between sampling processes, we
can derive the parameters for this GUS as follows: For any tuple t ∈ lineitem and tuple
u ∈ orders:
a = P[(t ∈ l_s) ∧ (u ∈ o_s)] = 0.1 × (1000/150000) = 6.667 × 10⁻⁴
since the base relations are sampled independently from each other. For any tuples t, t ′ ∈
lineitem and u, u′ ∈ orders:
b∅ = P[(t, t′ ∈ l_s) ∧ (u, u′ ∈ o_s)]
= 0.1 × 0.1 × (1000/150000) × (999/149999)
= 4.44 × 10⁻⁷,
and
bo = P[t ∈ l_s] × P[t′ ∈ l_s | t ∈ l_s] × P[u ∈ o_s]
= 0.1 × 0.1 × (1000/150000) = 6.667 × 10⁻⁵.
Similarly,
bl = P[(t ∈ l_s) ∧ (u, u′ ∈ o_s)]
= P[t ∈ l_s] × P[u ∈ o_s] × P[u′ ∈ o_s | u ∈ o_s]
= 0.1 × (1000/150000) × (999/149999)
= 4.44 × 10⁻⁶.
The last term is
bl,o = P[(t ∈ l_s) ∧ (u ∈ o_s)] = 0.1 × (1000/150000) = 6.667 × 10⁻⁴.
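The arithmetic above can be packaged into a small helper. The following is a minimal sketch for exactly this two-relation case (an independent Bernoulli(p) sample of lineitem joined with a WOR sample of n out of N orders tuples); the function name and the dictionary keys are illustrative.

def gus_params_bernoulli_swor(p, n, N):
    # GUS parameters for the join of a Bernoulli(p) sample of lineitem with a
    # without-replacement sample of n out of N orders tuples, assuming the two
    # base relations are sampled independently of each other.
    q1 = n / N                          # one given orders tuple is sampled
    q2 = (n / N) * ((n - 1) / (N - 1))  # two distinct orders tuples are both sampled
    a = p * q1                          # P[t in l_s and u in o_s]
    b = {
        frozenset():           p * p * q2,  # t != t' and u != u'
        frozenset({"o"}):      p * p * q1,  # u = u', distinct lineitem tuples
        frozenset({"l"}):      p * q2,      # t = t', distinct orders tuples
        frozenset({"l", "o"}): p * q1,      # same result tuple: equals a
    }
    return a, b

a, b = gus_params_bernoulli_swor(p=0.1, n=1000, N=150000)
print(a)                      # ~6.667e-4
print(b[frozenset()])         # ~4.44e-7
print(b[frozenset({"o"})])    # ~6.667e-5
print(b[frozenset({"l"})])    # ~4.44e-6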
Notice that the GUS captures the entire estimation process, not only the two individual
sampling methods. The above analysis dealt with a simple join consisting of two base relations.
For more complex query plans, the derivation of GUS parameters would involve consideration
of all possible interactions between participating tuples. This will make the analysis highly
complex.
The analysis of any GUS sampling method for a SUM-like aggregate is given as follows.
Theorem 1 ([30]). Let f(t) be a function/property of t ∈ R, and R′ be the sample obtained
by a GUS method G(a,b). Then, the aggregate A = Σ_{t∈R} f(t) and the sampling estimate
X = (1/a) Σ_{t∈R′} f(t) have the property:
E[X] = A
σ²(X) = Σ_{S⊆{1:n}} (cS / a²) yS − y∅   (3–1)
with
yS = Σ_{ti∈Ri | i∈S} ( Σ_{tj∈Rj | j∈S^C} f(ti, tj) )²
cS = Σ_{T∈P(S)} (−1)^{|T|+|S|} bT.
The above theorem indicates that the GUS estimates of SUM-like aggregates are unbiased
and that the variance is simply a linear combination of properties of the data (the terms yS) and
properties of the GUS sampling method (the coefficients cS). Moreover, yS can be estimated from samples of
any GUS (see [30]). This result is not asymptotic; it gives the exact analysis even for very
small samples. Once the estimate and the variance are computed, confidence intervals can be
readily provided using either the normality assumption or the more conservative Chebychev
bound (see [30]).
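The unbiasedness claim E[X] = A is easy to check by simulation for a simple GUS instance such as Bernoulli sampling over synthetic data; the following minimal sketch is only such a sanity check, not part of the estimation tool.

import numpy as np

rng = np.random.default_rng(42)
f = rng.exponential(scale=5.0, size=100_000)   # f(t) for every tuple t of R
A = f.sum()                                     # true SUM-like aggregate
p = 0.01                                        # Bernoulli(p) is a GUS with a = p

estimates = []
for _ in range(500):
    keep = rng.random(f.size) < p               # randomized filter on lineage, not content
    estimates.append(f[keep].sum() / p)         # scale the sample sum up by 1/a

print(A, np.mean(estimates))                    # the two should be close (E[X] = A)
print(np.var(estimates))                        # empirical counterpart of sigma^2(X)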
3.2 Multivariate Delta Method
Let θ = (θ1, θ2, · · · , θk) be a vector of unknown parameters with corresponding unbiased
estimators A = (A1,A2, · · · ,Ak). Then, under appropriate assumptions (such as the
existence of a multivariate central limit theorem for A), the delta method shows that for any
continuously differentiable function g, E [g(A)] ≈ g(θ), and
Var(g(A)) ≈ Σ_{i=1..k} (∇ig(θ))² Var(Ai) + Σ_{i≠j} ∇ig(θ) ∇jg(θ) Cov(Ai, Aj).   (3–2)
Using the multivariate delta method, the variance of the approximation f2 in Example 1 is
given by
V(f2(SUMloc, COUNTloc, COUNTlocn, COUNTocn))
≈ (COUNTlocn² / (COUNTloc² COUNTocn²)) Var(SUMloc)
+ (SUMloc² / (COUNTloc² COUNTocn²)) Var(COUNTlocn)
+ (SUMloc² COUNTlocn² / (COUNTloc⁴ COUNTocn²)) Var(COUNTloc)
+ (SUMloc² COUNTlocn² / (COUNTloc² COUNTocn⁴)) Var(COUNTocn)
+ (2 SUMloc COUNTlocn / (COUNTloc² COUNTocn²)) Cov(SUMloc, COUNTlocn)
− (2 SUMloc COUNTlocn² / (COUNTloc³ COUNTocn²)) Cov(SUMloc, COUNTloc)
− (2 SUMloc COUNTlocn² / (COUNTloc² COUNTocn³)) Cov(SUMloc, COUNTocn)
− (2 SUMloc² COUNTlocn / (COUNTloc³ COUNTocn²)) Cov(COUNTlocn, COUNTloc)
− (2 SUMloc² COUNTlocn / (COUNTloc² COUNTocn³)) Cov(COUNTlocn, COUNTocn)
+ (2 SUMloc² COUNTlocn² / (COUNTloc³ COUNTocn³)) Cov(COUNTloc, COUNTocn).   (3–3)
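Deriving expressions such as (3–3) by hand is tedious and error-prone; the quadratic form in (3–2) can instead be evaluated numerically from the gradient of g at the estimated values. A minimal sketch, with placeholder estimates and a placeholder covariance matrix (the true inputs would come from the covariance theory developed in Part III):

import numpy as np

def delta_variance(g, estimates, cov):
    # Delta-method variance of g(A): grad(g)^T Cov grad(g), using a
    # central-difference approximation of the gradient at the estimates.
    x = np.asarray(estimates, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        h = 1e-6 * max(1.0, abs(x[i]))
        up, dn = x.copy(), x.copy()
        up[i] += h
        dn[i] -= h
        grad[i] = (g(up) - g(dn)) / (2 * h)
    return float(grad @ np.asarray(cov, dtype=float) @ grad)

# f2 = (SUMloc / COUNTloc) * (COUNTlocn / COUNTocn)
f2 = lambda v: (v[0] / v[1]) * (v[2] / v[3])

est = [5.0e6, 6.0e5, 6.0e5, 1.5e6]   # placeholder values of the four estimators
cov = np.diag([1e9, 4e6, 4e6, 9e6])  # placeholder covariance matrix
print(delta_variance(f2, est, cov))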
CHAPTER 4
ANALYSIS OF SAMPLING QUERY PLANS
The high-level goal of our research is to introduce a tool that computes the confidence
bounds of estimates based on sampling. Given a query plan with sampling operators
interspersed at various points, our tool transforms it to an analytically equivalent query
plan that has a particular structure: all relational operators except the final aggregate form
a subtree that is the input to a single GUS sampling operator. The GUS operator feeds the
aggregate operator that produces the final result. Note that this transformation is done solely
for the purpose of computing the confidence bounds of the result; it does not provide a better
alternative to the execution plan used as input. Once this transformation is accomplished,
Theorem 1 readily gives the desired analysis – the equivalence ensures that the analysis for the
special plan coincides with the analysis for the original plan.
A natural and convenient strategy to obtain the desired structure is to perform multiple
local transformations on the original query plan. These local transformations are based on a
notion of analytical equivalence, that we call Second Order Analytical (SOA) equivalence. They
allow both commutativity of relational and GUS operators, and consolidation of GUS operators.
Effectively, these local transformations allow a plan to be put in the special form in which there
is a single GUS operator just before the aggregate.
In this chapter, we first define the SOA-equivalence and then use it to provide equivalence
relationships that allow the plan transformations mentioned above. A more elaborate
example showcases the theory in the latter part of the chapter.
4.1 SOA-Equivalence
The main reason the previous attempts to design a sampling operator were not fully
successful is the requirement to ensure IID samples at the top of the plan. Having IID samples
makes the analysis easy since the Central Limit Theorem readily provides confidence intervals.
However, it is too restrictive to allow plans with multiple joins to be dealt with. It is important
to notice that the difficulty is not in executing query plans containing sampling but in analyzing
such query plans.
The fundamental question we ask in this section is: What is the least restrictive
requirement we can have and still produce useful estimates? Our main interest is in how
the requirement can be transformed into a notion of equivalence. This will enable us to talk
about equivalent plans, initially, but more usefully about equivalent expressions. The
key insight comes from the observation that it is enough to compute the expected value and
variance for any query plan. Then either the conservative Chebychev bounds or the optimistic¹
normal-distribution based bounds can be used to produce confidence intervals. Note that
confidence intervals are the end goal, and preserving the expected value and variance is enough to
guarantee the same confidence interval using both CLT and Chebychev methods.
Thus, for our purposes, two query plans are equivalent if their result has the same
expected value and variance. This equivalence relation between plans already allows significant
progress. It is an extension, to randomized plans, of the classic plan equivalence based on
obtaining the same answer. From an operational standpoint, though, plan equivalence is not sufficient
to provide interesting characterizations. The main problem is the fact that the equivalence
exists only between complete plans that compute aggregates. It is not clear what can be said
about intermediate results–the equivalent of non-aggregate relational algebra expressions.
The key to extend the equivalence of plans to equivalence of expressions is to first
design such an extension for the classic relational algebra. To this end, assume that we can
only use equality on numbers that are results of SUM-like aggregates but we cannot directly
compare sets. To ensure that two expressions are equivalent, we could require that they
produce the same answer using any SUM-aggregate. Indeed, if the expressions produce the
same relation/set, they must agree on any aggregate computation using these sets since
1 While the CLT does not apply due to the lack of IID samples, the distribution of most complex random variables made out of many loosely interacting parts tends to be normal.
aggregates are deterministic and, more importantly, do not depend on the order in which the
computation is performed. The SUM-aggregates are crucial for this definition since they form
a vector space. Aggregates At that sum the function ft(u) = δtu form the basis of this vector space;
agreement on these aggregates ensures set agreement. Extending these ideas to randomized
estimation, we obtain the following.
Definition 2 (SOA-equivalence). Given (possibly randomized) expressions E(R) and F(R), we say
E(R) ⇐⇒_SOA F(R)
if for any arbitrary SUM-aggregate Af(S) = Σ_{t∈S} f(t),
E[Af(E(R))] = E[Af(F(R))]
Var[Af(E(R))] = Var[Af(F(R))].
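SOA-equivalence can be checked empirically for small inputs by running two randomized expressions many times and comparing the mean and variance of a SUM aggregate over their outputs. The toy sketch below does this for a Bernoulli sampler commuted with a selection; the "plans" are stand-ins for illustration only, not the operators developed in this chapter.

import numpy as np

rng = np.random.default_rng(7)
R = rng.exponential(scale=3.0, size=20_000)       # f(t) for the tuples of a toy relation

def sample_then_select(p=0.05, threshold=2.0):
    kept = R[rng.random(R.size) < p]               # Bernoulli(p) sample ...
    return kept[kept > threshold]                  # ... followed by a selection

def select_then_sample(p=0.05, threshold=2.0):
    kept = R[R > threshold]                        # selection first ...
    return kept[rng.random(kept.size) < p]         # ... followed by Bernoulli(p)

def sum_moments(plan, runs=2000):
    sums = [plan().sum() for _ in range(runs)]
    return np.mean(sums), np.var(sums)

print(sum_moments(sample_then_select))             # mean and variance of the SUM aggregate
print(sum_moments(select_then_sample))             # should agree up to Monte Carlo noise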
From the above discussion, it immediately follows that SOA-equivalence generalizes set
equivalence and coincides with it for non-randomized expressions, as stated in the following
proposition.
Proposition 4.1. Given two relational algebra expressions E(R) and F (R) we have:
E(R) = F(R) if and only if E(R) ⇐⇒_SOA F(R).
The next proposition establishes that SOA-equivalence is indeed an equivalence relation
and can be manipulated like relational equivalence.
Proposition 4.2. SOA-equivalence is an equivalence relation, i.e., for any expressions E, F, H: E ⇐⇒_SOA E (reflexivity); E ⇐⇒_SOA F implies F ⇐⇒_SOA E (symmetry); and E ⇐⇒_SOA F together with F ⇐⇒_SOA H implies E ⇐⇒_SOA H (transitivity).
Figure 5-2. Plot of percentage of times the true value lies in the estimated confidence intervals vs desired confidence level.
the ten desired confidence levels, we compute the percentage of times the true value falls
within the corresponding confidence intervals. In Figure 5-2, we show a comparison between
the desired and achieved confidence levels for 4 different sampling strategies. The achieved
confidence levels are very close to the desired values, across the different sampling strategies
and confidence levels. This provides strong empirical evidence that the confidence intervals
obtained by using the theory in Section 4.3 are accurate and tight.
5.3.3 Running Time
The next goal is to evaluate the efficiency of the estimation process. We are especially
interested in evaluating the variance of the estimators. This study, performed with our research
prototype, should give the practitioner some idea of the overhead to expect.
Setup. Intuitively, the analysis overhead will depend on the sample size. To ensure that
we stress the analysis with large samples, we use the 1TB TPC-H instance and treat the
database as a sample of a 1PB database. More specifically, we assume that the 6 billion tuples
in lineitem are a Bernoulli sample from the 6 trillion tuples in the same relation at 1PB scale
(0.001 sampling fraction). Similarly, the 1.5 billion tuples in orders are a sample without
replacement from the 1.5 trillion tuples of the 1PB database and the 200 million tuples in part
are a Bernoulli sample (0.001 sampling fraction) from 200 billion tuples at 1PB scale. This
ensures that the sample sizes the analysis has to deal with can be in the billions – a very harsh
scenario for analysis indeed.
Since the database is the sample, there is no sampling needed in the execution of the
query – the tuples that the analysis has to make use of are the tuples that are aggregated by
the non-sampling version of the query. As described in Section 5.1, maintaining the estimator
is as easy as performing the aggregation in the non-sampling query but computing the variance
is much more involved. The technique we proposed is to sub-sample from the sample to limit
the number of tuples used to estimate the variance. In our experiments we studied the impact
of the query characteristics (various selection predicates) and sub-sampling size on the running
time. For each experiment, we measured three running times. First, the running time of the
non-sampling query (no statistical estimation, just the aggregate). Second, the running time
of the query processing and sub-sampling process. Sub-sampling is interleaved with the rest
of the processing and the two running times cannot be separated. Third, the time to perform
the analysis. The analysis is single-threaded and starts only once the sub-sample is completely
formed.
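A minimal sketch of the interleaved sub-sampling step: while the query result is streamed into the running aggregate, a small Bernoulli sub-sample of the result tuples is retained for the later variance analysis. The tuple format, the expected result size, and the target size are illustrative.

import random

def aggregate_with_subsample(result_tuples, expected_result_size, target=250_000):
    # Running SUM over the query result, plus a Bernoulli sub-sample of
    # roughly `target` result tuples kept aside for the variance analysis.
    keep_prob = min(1.0, target / max(1, expected_result_size))
    total, subsample = 0.0, []
    for t in result_tuples:                  # t = (aggregated value f(t), lineage ...)
        total += t[0]
        if random.random() < keep_prob:      # sub-sampling interleaved with processing
            subsample.append(t)
    return total, subsample

stream = ((random.random(), i) for i in range(1_000_000))   # toy (value, lineage) stream
total, sub = aggregate_with_subsample(stream, expected_result_size=1_000_000)
print(total, len(sub))                       # sub holds roughly 250K tuples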
Impact of Selectivity with Fixed Sub-Sampling. In our first experiment, we will vary
the selection predicate (thus indirectly the selectivity of the query) and set the range for the
sub-sampling at 100K-400K tuples (i.e. sub-sampling obtains 100K-400K tuples that are a
Bernoulli sample from the data provided for analysis). Results are depicted in Figure 5-3A.
We make three key observations. First, for this sub-sampling target, the analysis adds an
insignificant amount of extra effort (about 2% of the overall running time). Second, selectivity
of the query has no significant effect on the running time for either the non-sampling or for the
sampling query. Last, the running time of the sampling version of the query seems to be more
[Figure 5-3. Plots of running time (sec) and number of tuples (million) vs. the selection parameter. Panel A: sub-sampling range 100K-400K; panel B: no sub-sampling. Series shown: query processing + sub-sampling, analysis, sub-sample size, and query without analysis.]
stable than the running time of the non-sampling version.² It seems that, when the size of
sub-sample is below 500,000, the extra effort to perform sampling analysis is insignificant. We
show later that such sub-samples are good enough to produce stable variance estimates.
Impact of Selectivity with No Sub-Sampling. An unresolved question from the
previous experiment is what happens when no sub-sampling is performed, i.e. all the data
is used for analysis. The selectivity of the query will now control the number of tuples used
2 A similar behavior was noticed in [12]: the execution is more stable and somewhat faster at higher CPU loads.
for analysis and give an indication of the effort as a function of the size. Figure 5-3B depicts
result of such an experiment in which the selection predicate was varied. Results reveal that,
once the size of the sample exceeds 1M, the analysis cost becomes unmanageable and starts
to dominate. At the end of the spectrum (31M tuples), the analysis was 5 times slower than
the rest of the execution – this is clearly not acceptable in practice. As we hinted above,
targets of 100K-400K produce good enough estimates of the variance; there is no need to base
the variance analysis on millions of tuples, thus running time of analysis can be kept under
control. Sub-sampling is thus a crucial technique for applicability of sampling estimation to
large data.
5.3.4 Sub-Sample Size
This experiment sheds light on the influence of sub-sampling size on the estimate for the variance
and thus on the quality of the confidence intervals.
Setup. Since we would like to get samples from all over the data source, we use the 1TB
TPC-H instance as the data source and repeatedly derive samples from it. Remember that,
according to Section 5.1, any estimates of the terms yS can be used to analyze any of the
sampling methods. Sub-sampling is used to estimate the terms yS , but the entire sample is
used to compute the estimate. Subsampling leads to a substantial reduction in computation,
but also gives rise to wider confidence intervals. In many situations, the estimated aggregate
is several orders of magnitude larger than its estimated standard deviation (based on the
whole sample). In such cases, it is clear that any increase in the width of the confidence interval
due to subsampling will be extremely minor as compared to the estimated aggregate. Thus
a much smaller sub-sample has only a secondary influence on the confidence interval, while
leading to a substantial reduction in computation. The plot in Figure 5-4 shows this fact. In
this experiment we run around 250 instances of Query 1 for sub-sampling ranges of 10K-40K,
100K-400K and 1M-4M tuples each. In all cases, we calculate the fluctuation of the resultant
confidence interval widths with respect to the confidence interval width obtained from an
analysis without sub-sampling. In particular, we define the error as the ratio of the difference between
[Figure 5-4. Plot of the fluctuation of confidence interval widths obtained with sub-sampling, with respect to the true confidence interval, for sub-sampling targets of 10K-40K, 100K-400K, and 1M-4M tuples.]
the 5th and the 95th percentile values to the width of the confidence interval obtained without
sub-sampling. The plot in Figure 5-4 shows that this error is only 1% when 100K-400K tuples
are used.
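The fluctuation metric behind Figure 5-4 is straightforward to restate in code. A minimal sketch, assuming a list of confidence-interval widths collected over the repeated sub-sampled runs and the reference width from the analysis without sub-sampling (the toy numbers only illustrate the roughly 1% figure quoted above):

import numpy as np

def subsampling_error(ci_widths_with_subsampling, ci_width_full):
    # (95th percentile - 5th percentile) of the sub-sampled CI widths,
    # relative to the CI width obtained without sub-sampling.
    p5, p95 = np.percentile(ci_widths_with_subsampling, [5, 95])
    return (p95 - p5) / ci_width_full

widths = np.random.default_rng(3).normal(loc=100.0, scale=0.3, size=250)  # toy widths
print(subsampling_error(widths, ci_width_full=100.0))                     # ~0.01, i.e. ~1%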
Note on number of relations. As we have seen in this section, for sampling over
3 relations, good confidence intervals can be obtained with a mere 2% extra effort since
sub-samples of 100K tuples suffice. Since the analysis requires the computation of 2n terms
if n relations are sampled, the influence of the number of relations on the running time of the
analysis is of concern. In practice, these concerns can be easily addressed as follows: (a) the
computation of the yS terms from the sub-samples can be parallelized – on our system this
would result in a speedup of at least 32 (on 48 cores), (b) we noticed that foreign key joins
result in repeated values for certain terms – about half the values are repeated, (c) we see
no need to sample from more than 8 relations since there is no need to sample from small or
medium size relations. Notice that the parallelization alone would allow us to scale from 3 to
3 + 5 = 8 relations since 2⁵ = 32, the expected speedup.
CHAPTER 6
COVARIANCE BETWEEN MULTIPLE ESTIMATES: CHALLENGES
Having addressed the challenge of estimating variance for SUM-like aggregates, we
now focus on the more difficult problem of computing pairwise covariances between multiple
estimates. As mentioned in the introduction, previous work on covariance computation
(such as [59, 95]) has focused on deriving closed form expressions for covariances in special
settings, with specific assumptions on the type of schema, number of relations, and type of
estimates. The computation has to be done by hand on a case-by-case basis. Our goal is to
develop theory for a generalized solution, which is independent of platform, schema, number
of relations, is applicable for a wide class of sampling methods, and is also capable of being
automated. In the first part of this dissertation, we developed precisely such a theory for the
variance computation problem. Given the connection between covariance and variance, it is
natural to assume that the covariance computation problem can be solved through a mild and
straightforward extension of the previous theory. However, a closer look at the problem reveals
that this is not the case, and brings out key underlying challenges. The goal of this chapter is to
carefully and methodically lay out these challenges. We will first recall the major steps in the
variance computation strategy, and then examine the adequacy of this strategy for covariance
computation, starting with the simplest case and then moving on to more general and complex
settings.
6.1 Base Lemma for Covariance
In previous chapters, we dealt with a general single query plan which has sampling at
the bottom, and the estimate at the top. The strategy for computing the variance of such an
estimate was as follows.
• The notion of SOA-Equivalence allowed us to transform this plan into an analyzable plan for variance analysis. This analyzable plan had an equivalent sampling operator at the top of the plan, i.e., at the level of the estimate of interest.
• The transformation of a general query plan into an analyzable query plan was achieved by applying a series of algebraic rules based on SOA-Equivalence.
• The overall sampling method at the top of this analyzable plan was expressed as a GUS method, whose 2ⁿ parameters represented the sampling-based correlations (one term for every subset of the base relations) involved in the estimate.
• We plugged the parameters of this overall GUS method into the base lemma for variance (Theorem 1 in Chapter 3) to get the required variance.
The conceptually simplest covariance computation problem corresponds to two estimates
derived from a single query plan. Using the above strategy, we can get a SOA-Equivalent plan
with an overall GUS method on top and plug its parameters into the base lemma for
variance. However, even in this simplest multiple-estimate case, where the two estimates come
from the same set of base relation samples, the base lemma yields only the two variances and
provides no mechanism for computing their covariance. Challenge #1: Generalize the base
lemma to covariance.
Example 8. To compute the error bound for the estimate f1 using (1–1), we need to
compute V(SUMloc), V(COUNTloc) and Cov(SUMloc, COUNTloc). Let slineitem, sorders and
scustomer be Bernoulli(0.1), SWOR(1000) and Bernoulli(0.1) samples drawn from lineitem,
orders & customer, denoted by l, o & c respectively in Fig 6-1. The corresponding GUS
sampling methods are denoted by Gl, Go & Gc respectively. The variance terms V(SUMloc) &
V (COUNTloc) can be computed using the theory in [76] as follows.
Starting from the initial representation in Fig 6-1(a), where sampling is at the base, we
can use Proposition 4.6 on the joins to get a SOA-Equivalent plan with a single GUS on top.
This overall GUS, Gloc in Figure 6-1(c), can be used to compute the two variance terms by
applying Theorem 1. However, the existing results do not provide any mechanism to compute
the term Cov(SUMloc, COUNTloc). 2
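Although (1–1) is not reproduced in this chapter, the following Python sketch shows the standard first-order (delta-method) approximation on which an error bound for a ratio such as AVG = SUM/COUNT is typically based; it is included only to make explicit why the covariance term is needed alongside the two variances, and the function names are ours.

    import math

    def ratio_variance(sum_est, count_est, var_sum, var_count, cov_sum_count):
        """First-order (Taylor/delta-method) approximation of Var(SUM/COUNT)
        for correlated estimates of SUM and COUNT."""
        return (var_sum / count_est**2
                + (sum_est**2 / count_est**4) * var_count
                - 2 * (sum_est / count_est**3) * cov_sum_count)

    def avg_confidence_interval(sum_est, count_est, var_sum, var_count, cov, z=1.96):
        """Normal-approximation confidence interval for AVG = SUM/COUNT."""
        avg = sum_est / count_est
        half_width = z * math.sqrt(ratio_variance(sum_est, count_est,
                                                  var_sum, var_count, cov))
        return avg - half_width, avg + half_width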
6.2 Covariance Parameters
The case of two estimates from a single query plan is still not general enough for the
covariance computation problem. Sampling-based correlations will be present in any pair of
estimates that share samples of base relations. These estimates may come from different
queries, and the queries may be based on different sets of base relations. A more generalized

computation of two of these terms. In this single covariance example, using foreign keys for
optimization (without sub-sampling) reduces the analysis time, but the rate of increase with
selectivity is roughly the same. Sub-sampling (target sub-sample size between 100K and 400K
tuples) vastly reduces the analysis time and makes it uniform across all four values of the
selectivity parameter, resulting in a stable total runtime.
8.3.4 Foreign Key Optimization
Foreign key optimization is most useful in cases where a covariance matrix
needs to be computed, e.g., in GROUP-BY queries, where the number of y_S computations is
large (Section 8.1). We use the query in Example 16 and note the running times with and without
foreign key optimization. The query processing and subsampling time is the same in both
cases, but the analysis time dropped from 7 seconds to 0.33 seconds when foreign key optimization
is used.
CHAPTER 9
A SAMPLING ALGEBRA FOR GENERAL MOMENT MATCHING
In previous chapters, we showed that a GUS sampling operator commutes with relational
operators such as selection/cross product in a SOA-equivalent sense. In particular, aggregates
derived from two SOA-equivalent sampling plans have the same first two moments. Such
an equivalence is sufficient if the aim is to compute the variance of these aggregates, and
to construct confidence intervals based on the variance. However, such an equivalence is
not sufficient if one wishes to know deeper distributional properties of the aggregates, such
as skewness or kurtosis. For such endeavors, we need a class of sampling methods which
commute with relational operators in the sense that aggregates for all "equivalent" plans have
the same moments up to a given order. In this chapter, we develop such a class of sampling
methods. In particular, it will be shown that
• common sampling methods such as SWR, SWOR and Bernoulli sampling (and their combinations) are included in this class,
• these sampling methods commute with relational operators while preserving all moments up to a given order k,
• the moments for these methods can be expressed in a form which makes computation easier.
9.1 k-Generalized Uniform Sampling
Consider a database R = R_1 × R_2 × ··· × R_n. Let k ≥ 2 be an arbitrarily chosen positive
integer. To define our class of sampling methods, we first introduce the required notation.
• We say that S = (S_1, S_2, ..., S_{k−1}) is an ordered k-partition of V = {1, 2, ..., n} if S_1, S_2, ..., S_{k−1} are pairwise disjoint subsets of V. In this setting, we denote S_k = V \ (∪_{i=1}^{k−1} S_i).
• The collection of all ordered k-partitions of V is denoted by P_k(V).
• Let t^1, t^2, ..., t^k ∈ R be arbitrarily chosen. Then S({t^i}_{i=1}^k) is an ordered k-partition of V such that
S_m({t^i}_{i=1}^k) = { j ∈ V : there are exactly m distinct values among t^1_j, t^2_j, ..., t^k_j }
for 1 ≤ m ≤ k − 1.
With the above notation in hand, we define our desired class of sampling methods.
Definition 4. A randomized selection process which gives a sample R′ from R is called
a k-Generalized Uniform Sampling (k-GUS) method if, for any t^1, t^2, ..., t^k ∈ R,
P(t^1, t^2, ..., t^k ∈ R′) is a function of S({t^i}_{i=1}^k). In such a case, the k-GUS parameters
b = {b_T | T ∈ P_k(V)} are defined as

b_T = P(t^1, t^2, ..., t^k ∈ R′ | S({t^i}_{i=1}^k) = T),     (9–1)

and the sampling method is denoted by G_{k,b}.
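As a concrete illustration (our own example, not taken from the text), suppose each base relation R_j is sampled independently with a Bernoulli(q_j) method and R′ is the cross product of the per-relation samples. For tuples t^1, ..., t^k ∈ R, coordinate j contributes one independent inclusion event per distinct value, so P(t^1, ..., t^k ∈ R′) = ∏_j q_j^(m_j), where m_j is the number of distinct values among t^1_j, ..., t^k_j, i.e., the index of the block of T containing j. Since this probability depends on the tuples only through the partition T, the method is k-GUS. A minimal Python sketch of the parameter computation under this assumption:

    def bernoulli_kgus_parameter(partition, q):
        """k-GUS parameter b_T for independent per-relation Bernoulli sampling.
        partition maps relation index j to m, the number of distinct values in
        coordinate j among t^1, ..., t^k (i.e., j lies in block S_m of T);
        q maps relation index j to its Bernoulli sampling rate q_j.
        Each of the m distinct values of coordinate j must be retained
        independently, hence the factor q_j ** m."""
        b = 1.0
        for j, m in partition.items():
            b *= q[j] ** m
        return b

    # Example: n = 3 relations and k = 3 tuples; coordinate 1 shows one distinct
    # value, coordinate 2 shows two, coordinate 3 shows three.
    q = {1: 0.1, 2: 0.2, 3: 0.5}
    T = {1: 1, 2: 2, 3: 3}
    print(bernoulli_kgus_parameter(T, q))   # 0.1 * 0.2**2 * 0.5**3 = 0.0005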
Now, we establish some properties of the class of k-GUS methods. We first show that as k
increases, the class of k-GUS sampling methods becomes smaller.
Lemma 1. A (k+1)-GUS method is also a k-GUS method.
Proof. Consider a sampling method which gives a sample R′ from R. Suppose this
sampling method is a (k+1)-GUS method. Let t^1, t^2, ..., t^k ∈ R. Define t^{k+1} = t^1. By the
We now investigate interactions between k-GUS operators when applied to the same data.
Proposition 9.6. For any expression R and k-GUS methods G_{k,b_1} and G_{k,b_2} which are applied
independently,

G_{k,b_1}(G_{k,b_2}(R)) ⇐⇒_kMA G_{k,b}(R),

where b_T = b_{1,T} · b_{2,T}.
The proof follows immediately from the independence of the two k-GUS methods.
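In code, Proposition 9.6 amounts to a pointwise product of the two parameter maps. A minimal Python sketch (partition identifiers are assumed to be hashable keys shared by both maps):

    def compose_kgus(b1, b2):
        """Parameters of two independent k-GUS methods applied one after the
        other: b_T = b_{1,T} * b_{2,T} for every ordered k-partition T."""
        assert b1.keys() == b2.keys(), "both maps must cover the same partitions"
        return {T: b1[T] * b2[T] for T in b1}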
Using Propositions 9.4, 9.5 and 9.6, we can transform a wide variety of query plans to a
kMA-equivalent query plan with sampling at the top. This allows us to construct estimates of
quantities such as skewness (for k = 3) and kurtosis (for k = 4) for SUM-like aggregates.
In future work, we will explore extensions of the notion of kMA equivalence to pairs of
query plans (analogous to extending SOA equivalence to SOA-COV equivalence). Such an
extension will allow us to estimate the skewness, kurtosis, or other quantities involving higher
order moments for functions of multiple SUM-like aggregates. Another future line of research
will be to find compact and computation-friendly expressions for y_S for a general k, i.e., to
extend Lemma 9.1 beyond the case k = 3.
REFERENCES
[1] “SQL-2003 Standard.” 2003.
[2] “Grokit.” 2018.
URL https://github.com/tera-insights/grokit
[3] “SWI-Prolog.” 2018.
URL http://www.swi-prolog.org
[4] “TPC-H Benchmark.” 2018.
URL http://www.tpc.org/tpch
[5] Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. "Join synopses for approximate query answering." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999, 275–286.
[6] Acharya, Swarup, Gibbons, Phillip B., and Poosala, Viswanath. "Aqua: A fast decision support system using approximate query answers." Proc. of 25th Intl. Conf. on Very Large Data Bases. 1999, 754–755.
[7] Acharya, Swarup, Gibbons, Phillip B., Poosala, Viswanath, and Ramaswamy, Sridhar. "The Aqua approximate query answering system." Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, USA: ACM, 1999, 574–576.
URL http://doi.acm.org/10.1145/304182.304581
[8] Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., and Stoica, I. "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data." EuroSys (2013).
[9] Agarwal, Sameer, Milner, Henry, Kleiner, Ariel, Talwalkar, Ameet, Jordan, Michael, Madden, Samuel, Mozafari, Barzan, and Stoica, Ion. "Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[10] Alon, N., Matias, Y., and Szegedy, M. "The space complexity of approximating the frequency moments." ACM Symposium on Theory of Computing. 1996.
[11] Antova, Lyublena, Jansen, Thomas, Koch, Christoph, and Olteanu, Dan. "Fast and simple relational processing of uncertain data." 2008 IEEE 24th International Conference on Data Engineering. IEEE, 2008, 983–992.
[12] Arumugam, Subi, Dobra, Alin, Jermaine, Christopher M., Pansare, Niketan, and Perez, Luis. "The DataPath System: A Data-centric Analytic Processing Engine for Large Data Warehouses." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2010, 519–530.
[13] Babcock, Brian, Datar, Mayur, and Motwani, Rajeev. "Load Shedding in Data Stream Systems." Data Streams - Models and Algorithms. 2007. 127–147.
[14] Bickel, Peter J, Gotze, Friedrich, and van Zwet, Willem R. "Resampling fewer than n observations: gains, losses, and remedies for losses." Statistica Sinica (1997): 1–31.
[15] Bickel, Peter J and Sakov, Anat. "Extrapolation and the bootstrap." Sankhya: The Indian Journal of Statistics, Series A (2002): 640–652.
[16] Bickel, Peter J and Yahav, Joseph A. "Richardson extrapolation and the bootstrap." Journal of the American Statistical Association 83 (1988).402: 387–393.
[17] Buccafurri, F., Lax, G., Saccà, D., Pontieri, L., and Rosaci, D. "Enhancing histograms by tree-like bucket indices." The VLDB Journal 17 (2008): 1041–1061.
[18] Chakrabarti, K., Garofalakis, M. N., Rastogi, R., and Shim, K. "Approximate query processing using wavelets." The VLDB Journal 10 (2001).2-3: 199–223.
[19] Chaudhuri, S., Das, G., and Narasayya, V. "Optimized stratified sampling for approximate query processing." TODS (2007).
[20] Chaudhuri, Surajit and Motwani, Rajeev. "On Sampling and Relational Operators." IEEE Data Eng. Bull. 22 (1999).4: 41–46.
[21] Chaudhuri, Surajit, Motwani, Rajeev, and Narasayya, Vivek R. "On Random Sampling over Joins." SIGMOD Conference. 1999, 263–274.
[22] Cheney, James, Chiticariu, Laura, and Tan, Wang-Chiew. "Provenance in Databases: Why, How, and Where." Foundations and Trends in Databases 1 (2007).4: 379–474.
[23] Cheng, Yu, Qin, Chengjie, and Rusu, Florin. "GLADE: Big Data Analytics Made Easy." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, 697–700.
[24] Cormode, G., Garofalakis, M., Haas, P.J., and Jermaine, C. "Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches." Foundations and Trends in Databases 4 (2012).1-3: 1–294.
[25] Cormode, G. and Muthukrishnan, S. "An improved data stream summary: The count-min sketch and its applications." Journal of Algorithms 55 (2005).1: 58–75.
[26] C. Pang, Q. Zhang, D. Hansen, and A. Maeder. "Unrestricted wavelet synopses under maximum error bound." Proceedings of the International Conference on Extending Database Technology. 2009.
[27] Dalvi, Nilesh and Suciu, Dan. "Efficient query evaluation on probabilistic databases." The VLDB Journal 16 (2007): 523–544.
[28] Dobra, A. "Histograms revisited: When are histograms the best approximation method for aggregates over joins?" Proceedings of ACM Principles of Database Systems. 2005.
[29] Dobra, A., Garofalakis, M., Gehrke, J. E., and Rastogi, R. "Processing complex aggregate queries over data streams." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2002.
[30] Dobra, Alin, Jermaine, Chris, Rusu, Florin, and Xu, Fei. "Turbo-Charging Estimate Convergence in DBO." PVLDB 2 (2009).1: 419–430.
[31] Donjerkovic, D. and Ramakrishnan, R. "Probabilistic optimization of top N queries." Proceedings of the International Conference on Very Large Data Bases. 1999.
[32] Efron, B. "Bootstrap methods: another look at the jackknife." Annals of Statistics 7 (1979): 1–26.
[33] Efron, Bradley. "More Efficient Bootstrap Computations." Journal of the American Statistical Association 85 (1990): 79–89.
[34] Efron, Bradley and Tibshirani, Robert J. An introduction to the bootstrap. CRC press, 1994.
[35] Flajolet, P. and Martin, G. N. "Probabilistic counting algorithms for database applications." Journal of Computer and System Sciences 31 (1985): 182–209.
[36] Ganguly, S. "Counting distinct items over update streams." Theoretical Computer Science 378 (2007): 211–222.
[37] Garofalakis, M. and Gibbons, P. B. "Wavelet synopses with error guarantees." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2002.
[38] ———. "Probabilistic wavelet synopses." ACM Transactions on Database Systems 29 (2004).
[39] Garofalakis, M. and Kumar, A. "Deterministic wavelet thresholding for maximum-error metrics." Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 2004, 166–176.
[40] ———. "Wavelet synopses for general error metrics." ACM Transactions on Database Systems 30 (2005).
[41] Gibbons, Phillip B., Poosala, Viswanath, Acharya, Swarup, Bartal, Yair, Matias, Yossi, Muthukrishnan, S., Ramaswamy, Sridhar, and Suel, Torsten. "AQUA: System and techniques for approximate query answering." Tech. rep., 1998.
[42] Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. "One-pass wavelet decomposition of data streams." IEEE Transactions on Knowledge and Data Engineering 15 (2003).
[43] Gryz, Jarek, Guo, Junjie, Liu, Linqi, and Zuzarte, Calisto. "Query sampling in DB2 Universal Database." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2004, 839–843.
[44] Guha, S. and Harb, B. "Wavelet synopsis for data streams: Minimizing non-euclidean error." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2005.
[45] ———. "Approximation algorithms for wavelet transform coding of data streams." Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms. 2006.
[46] Guha, S., Koudas, N., and Shim, K. "Approximation and streaming algorithms for histogram construction problems." ACM Transactions on Database Systems 31 (2006): 396–438.
[47] Haas, Peter J. "Large-Sample and Deterministic Confidence Intervals for Online Aggregation." SSDBM. IEEE Computer Society Press, 1996, 51–63.
[48] Haas, Peter J. and Hellerstein, Joseph M. "Ripple joins for online aggregation." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999, 287–298.
[49] ———. "Online Query Processing." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2001, 623.
[50] Haas, Peter J., Naughton, Jeffrey F., Seshadri, S., and Swami, Arun N. "Selectivity and Cost Estimation for Joins Based on Random Sampling." Journal of Computer and System Sciences 52 (1996): 550–569.
[51] ———. "Selectivity and cost estimation for joins based on random sampling." J. Comput. Syst. Sci. 52 (1996): 550–569.
[52] Hellerstein, Joseph M., Haas, Peter J., and Wang, Helen J. "Online aggregation." ACM, 1997, 171–182.
URL http://doi.acm.org/10.1145/253262.253291
[53] Horvitz, D. G. and Thompson, D. J. "A generalization of sampling without replacement from a finite universe." Journal of the American Statistical Association 47 (1952): 663–685.
[54] Ioannidis, Y. E. "Approximations in database systems." Proceedings of the International Conference on Database Theory. 2003.
[55] Ioannidis, Y. E. and Christodoulakis, S. "On the propagation of errors in the size of join results." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1991.
[56] ———. "Optimal histograms for limiting worst-case error propagation in the size of join results." ACM Transactions on Database Systems 18 (1993).
[57] Ioannidis, Y. E. and Poosala, V. "Balancing histogram optimality and practicality for query result size estimation." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1995.
[58] Ioannidis, Y.E. and Poosala, V. "Histogram-based approximation of set-valued query-answers." Proceedings of the International Conference on Very Large Data Bases. 1999.
[59] Jermaine, Chris, Arumugam, Subramanian, Pol, Abhijit, and Dobra, Alin. "Scalable approximate query processing with the DBO engine." ACM Trans. Database Syst. 33 (2008).
[61] Joshi, S. and Jermaine, C. "Sampling-Based Estimators for Subset-Based Queries." PVLDB 1 (2009): 181–202.
[62] Kandula, Srikanth, Shanbhag, Anil, Vitorovic, Aleksandar, Olma, Matthaios, Grandl, Robert, Chaudhuri, Surajit, and Ding, Bolin. "Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big Data Clusters." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2016.
[63] Kanne, C.C. and Moerkotte, G. "Histograms reloaded: The merits of bucket diversity." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[64] Karras, P. and Manoulis, N. "One-pass wavelet synopses for maximum-error metrics." PVLDB. ACM, 2005.
[65] Karras, P., Sacharidis, D., and Manoulis, N. "Exploiting duality in summarization with deterministic guarantees." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007.
[66] Kaushik, R., Naughton, J. F., Ramakrishnan, R., and Chakaravarthy, V. T. "Synopses for query optimization: A space-complexity perspective." ACM Transactions on Database Systems 30 (2005): 1102–1127.
[67] Kempe, D., Dobra, A., and Gehrke, J. "Gossip-based computation of aggregate information." Proceedings of the IEEE Conference on Foundations of Computer Science. 2003.
[68] Kleiner, A., Talwalkar, A., Agarwal, S., Stoica, I., and Jordan, M. I. "A general bootstrap performance diagnostic." KDD (2013).
[69] Kleiner, Ariel, Talwalkar, Ameet, Sarkar, Purnamrita, and Jordan, Michael. "The big data bootstrap." arXiv preprint arXiv:1206.6415 (2012).
[70] Laptev, N., Zeng, K., and Zaniolo, C. "Early Accurate Results for Advanced Analytics on MapReduce." PVLDB 5 (2012).
[71] Li, Kun, Wang, Daisy Zhe, Dobra, Alin, and Dudley, Christopher. "UDA-GIST: An In-database Framework to Unify Data-parallel and State-parallel Analytics." PVLDB (2015): 557–568.
[72] Lipton, Richard J., Naughton, Jeffrey F., Schneider, Donovan A., and Seshadri, S. "Efficient sampling strategies for relational database operations." Theoretical Computer Science 116 (1993).1: 195–226.
[73] Liu, R.Y. and Singh, K. "Using i.i.d. bootstrap inference for general non-i.i.d. models." Journal of Statistical Planning and Inference 43 (1999): 67–75.
[74] Matias, Y. and Urieli, D. "Optimal workload-based weighted wavelet synopses." Theoretical Computer Science 371 (2007): 227–246.
[75] M. Charikar, K. Chen, and M. Farach-Colton. "Finding frequent items in data streams." International Colloquium on Automata, Languages and Programming. 2002.
[76] Nirkhiwale, Supriya, Dobra, Alin, and Jermaine, Chris. "A Sampling Algebra for Aggregate Estimation." PVLDB 6 (2013).14: 1798–1809.
[77] Olken, Frank. "Random Sampling from Databases." 1993.
[78] Pansare, Niketan, Borkar, Vinayak, Jermaine, Chris, and Condie, Tyson. "Online Aggregation for Large MapReduce Jobs." PVLDB 4 (2011): 1135–1145.
[79] Piatetsky-Shapiro, Gregory and Connell, Charles. "Accurate estimation of the number of tuples satisfying a condition." Proceedings of the 1984 ACM SIGMOD international conference on Management of data. ACM, 1984, 256–276.
[80] Pol, A. and Jermaine, C. "Relational confidence bounds are easy with the bootstrap." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2005.
[81] Politis, D.N., Romano, J.P., and Wolf, M. Subsampling. Springer, New York, 1999.
[82] Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. "Improved histograms for selectivity estimation of range predicates." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1996.
[83] Re, Christopher and Suciu, Dan. "The trichotomy of HAVING queries on a probabilistic database." PVLDB 18 (2009): 1091–1116.
[84] Rusu, F. and Dobra, A. "Statistical Analysis of Sketch Estimators." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2007.
[85] Rusu, F. and Dobra, A. "Sketches for size of join estimation." ACM Transactions on Database Systems 33 (2008).3.
[86] Rusu, Florin and Dobra, Alin. "Sketching Sampled Data Streams." Proceedings of IEEE ICDE. 2009.
[87] Sen, Prithviraj, Deshpande, Amol, and Getoor, Lise. "Read-once functions and query evaluation in probabilistic databases." Proceedings of the VLDB Endowment 3 (2010).1-2: 1068–1079.
[88] Tran, Thanh T, Peng, Liping, Diao, Yanlei, McGregor, Andrew, and Liu, Anna. "CLARO: modeling and processing uncertain data streams." The VLDB Journal 21 (2012).5: 651–676.
[89] Tran, Thanh TL, Diao, Yanlei, Sutton, Charles, and Liu, Anna. "Supporting user-defined functions on uncertain data." Proceedings of the VLDB Endowment 6 (2013).6: 469–480.
[90] Vitter, J. S. and Wang, M. "Approximate computation of multidimensional aggregates of sparse data using wavelets." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999.
[91] Wang, Daisy Zhe, Michelakis, Eirinaios, Garofalakis, Minos, and Hellerstein, Joseph M. "BayesStore: managing large, uncertain data repositories with probabilistic graphical models." Proceedings of the VLDB Endowment 1 (2008).1: 340–351.
[92] Wang, H. and Sevcik, K. C. "Utilizing histogram information." Proceedings of CASCON. 2001.
[93] ———. "Histograms based on the minimum description length principle." VLDB Journal 17 (2008).
[94] Whang, K. Y., Vander-Zanden, B. T., and Taylor, H. M. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems 15 (1990): 208.
[95] Xu, Fei, Jermaine, Christopher M., and Dobra, Alin. "Confidence bounds for sampling-based group by estimates." ACM Trans. Database Syst. 33 (2008).
[96] Zeng, Kai, Gao, Shi, Mozafari, Barzan, and Zaniolo, Carlo. "The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing." Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2014, 277–288.
BIOGRAPHICAL SKETCH
Supriya Nirkhiwale received her B.E. degree in electronics and telecommunications from
the Sri Govindram Sekseria Institute of Technology, Indore, India in 2006. She received her
master’s degree in electrical engineering in 2009 from Kansas State University and Ph.D. in
computer science in 2018 from the University of Florida. Her primary research is focused on
building theory and scalable frameworks for Approximate Query Processing in large databases. She
has been working as a data scientist for LexisNexis since 2014.